Parallel dataset, aligned from United Nations' Universal Declaration of Human Rights (UDHR)

thammegowda/014-udhr-dataset

UDHR Dataset

The goal of this project is to create a multi-parallel dataset by aligning the translations of the Universal Declaration of Human Rights (UDHR). The UDHR has been translated into more than 500 languages, which makes it a valuable resource for developing and testing CL/NLP tools. For example, Unicode uses this corpus to test encoding: https://unicode.org/udhr/translations.html

Current state of the dataset:

View on Google Drive

Downloading docs

We use the XML files, which are properly encoded in Unicode.

mkdir -p data/xmls
cd data/xmls
wget https://unicode.org/udhr/assemblies/udhr_xml.zip
unzip udhr_xml.zip
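
After unzipping, it can help to confirm which languages actually arrived. The following is a minimal sketch, assuming the extracted files follow the `udhr_<code>.xml` naming used in the examples in this README; `language_codes` is a hypothetical helper and not part of this repository.

```python
from pathlib import Path


def language_codes(xml_dir):
    """Return sorted language codes taken from file names shaped like udhr_<code>.xml.

    Assumption: every translation is stored as udhr_<code>.xml directly in xml_dir.
    """
    directory = Path(xml_dir)
    if not directory.is_dir():
        return []
    prefix = "udhr_"
    # p.stem drops the ".xml" suffix; slicing drops the "udhr_" prefix.
    return sorted(p.stem[len(prefix):] for p in directory.glob("udhr_*.xml"))


if __name__ == "__main__":
    codes = language_codes("data/xmls")
    print(f"{len(codes)} translations found")
```

Run from the repository root, this prints how many `udhr_*.xml` files are present in `data/xmls`.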

Setup

Parse: XML to TSV

mkdir -p data/tsvs

# data/xmls/udhr_eng.xml -> data/tsvs/udhr_eng.tsv
for i in data/xmls/udhr_*.xml; do echo "$i"
   ./udhr_parser.py -i "$i" -o "${i//xml/tsv}"
done
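
To illustrate what an XML-to-TSV step of this kind involves, here is a minimal sketch. It is not the code in `udhr_parser.py`. The XML layout (`<article number="...">` elements containing `<para>` children, in the `http://www.unhchr.ch/udhr` namespace), the segment ID scheme `art<N>.p<M>`, and the two-column TSV layout are all assumptions made for illustration; the real files and the real parser may differ.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature input; the structure of the real UDHR files is assumed here.
SAMPLE_XML = """<udhr xmlns="http://www.unhchr.ch/udhr" key="eng">
  <title>Universal Declaration of Human Rights</title>
  <article number="1">
    <title>Article 1</title>
    <para>All human beings are born free and equal
      in dignity and rights.</para>
  </article>
  <article number="3">
    <title>Article 3</title>
    <para>Everyone has the right to life, liberty and security of person.</para>
  </article>
</udhr>"""


def local_name(tag):
    """Strip an XML namespace, turning '{uri}para' into 'para'."""
    return tag.rsplit("}", 1)[-1]


def xml_to_rows(xml_text):
    """Yield (segment_id, text) pairs, one per <para> inside each <article>."""
    root = ET.fromstring(xml_text)
    for article in root.iter():
        if local_name(article.tag) != "article":
            continue
        number = article.get("number", "?")
        paras = [el for el in article.iter() if local_name(el.tag) == "para"]
        for idx, para in enumerate(paras, start=1):
            # Collapse internal whitespace so each segment fits on one TSV line.
            text = " ".join("".join(para.itertext()).split())
            yield f"art{number}.p{idx}", text


def rows_to_tsv(rows):
    """Serialize rows as tab-separated text, one row per line."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


tsv_text = rows_to_tsv(xml_to_rows(SAMPLE_XML))
print(tsv_text, end="")
```

Giving every segment a stable ID is the design choice that matters here: if each language's TSV uses the same IDs, a later alignment step can join the files on that column.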

Romanize

git clone git@github.com:isi-nlp/uroman.git
mkdir -p data/romanized/
# data/tsvs/udhr_eng.tsv -> data/romanized/udhr_eng.tsv
for i in data/tsvs/udhr_*.tsv; do echo "$i"
   uroman/bin/uroman.pl < "$i" > "${i/tsvs/romanized}"
done

Run Aligner

$ python udhr_align.py -i data/tsvs -o UDHR-align.v1
$ ll -1 UDHR-align.v1.*
   -rw-r--r-- 1 tg staff 6.3M Oct 28 04:05 UDHR-align.v1.tsv
   -rw-r--r-- 1 tg staff 2.3M Oct 28 04:05 UDHR-align.v1.xlsx
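
The idea behind a multi-parallel alignment can be sketched as a join of per-language segments on a shared segment ID. This is only an illustration under assumptions and not the logic of `udhr_align.py`: the `align` function, the segment IDs, and the output layout (one `id` column followed by one column per language) are hypothetical, and the actual columns of `UDHR-align.v1.tsv` are not documented here.

```python
def align(segments_by_lang):
    """Join {lang: {segment_id: text}} into (header, rows), one column per language.

    Segments missing in a language are left as empty strings.
    """
    langs = sorted(segments_by_lang)
    # Keep segment IDs in first-seen order across languages.
    ids, seen = [], set()
    for lang in langs:
        for seg_id in segments_by_lang[lang]:
            if seg_id not in seen:
                seen.add(seg_id)
                ids.append(seg_id)
    header = ["id"] + langs
    rows = [
        [seg_id] + [segments_by_lang[lang].get(seg_id, "") for lang in langs]
        for seg_id in ids
    ]
    return header, rows


# Hypothetical toy input; the Spanish side deliberately lacks the second segment.
segments = {
    "eng": {
        "art1.p1": "All human beings are born free and equal in dignity and rights.",
        "art3.p1": "Everyone has the right to life, liberty and security of person.",
    },
    "spa": {
        "art1.p1": "Todos los seres humanos nacen libres e iguales en dignidad y derechos",
    },
}

header, rows = align(segments)
print("\t".join(header))
for row in rows:
    print("\t".join(row))
```

Leaving gaps as empty cells keeps every row the same width, so the result can be written to TSV or a spreadsheet without special handling.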
