GitHub - thammegowda/014-udhr-dataset: Parallel dataset, aligned from United Nations' Universal Declaration of Human Rights (UDHR)

UDHR Dataset

The goal of this project is to create a multi-parallel dataset by aligning Universal Declaration of Human Rights (UDHR) documents. The UDHR has been translated into more than 500 languages, which makes it a valuable resource for developing and testing CL/NLP tools. For example, Unicode uses this corpus to test encoding: https://unicode.org/udhr/translations.html

Current state of the dataset: View on Google Drive

Downloading docs

We use the XML files, which are properly encoded in Unicode.

From Unicode: https://unicode.org/udhr/translations.html
Bulk download links: https://unicode.org/udhr/downloads.html

mkdir -p data/xmls
cd data/xmls
wget https://unicode.org/udhr/assemblies/udhr_xml.zip
unzip udhr_xml.zip

Setup

Python 3.7+
xmltodict: install with pip install xmltodict
uroman: https://github.com/isi-nlp/uroman

Parse: XML to TSV

mkdir -p data/tsvs

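# data/xmls/udhr_eng.xml -> data/tsvs/udhr_eng.tsv
# ${i//xml/tsv} replaces every "xml" in the path, so both the directory and the extension change
for i in data/xmls/udhr_*.xml; do echo "$i";
   ./udhr_parser.py -i "$i" -o "${i//xml/tsv}";
done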
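To give a rough idea of what the XML-to-TSV step involves, here is a minimal sketch. It is not the code in udhr_parser.py: it uses the standard library's xml.etree.ElementTree instead of xmltodict, and the element names (article, para), the number attribute, the row key format and the sample document are all assumptions made for illustration. The real UDHR files may also declare an XML namespace, which this sketch does not handle.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified sample; real UDHR XML files are larger
# and may use a namespace and extra elements.
SAMPLE_XML = """<udhr>
  <title>Universal Declaration of Human Rights</title>
  <article number="1">
    <title>Article 1</title>
    <para>All human beings are born free
      and equal in dignity and rights.</para>
  </article>
  <article number="2">
    <title>Article 2</title>
    <para>Everyone is entitled to all the rights and freedoms set forth in this Declaration.</para>
  </article>
</udhr>"""


def xml_to_rows(xml_text):
    """Return (key, text) pairs, one per paragraph of each article."""
    root = ET.fromstring(xml_text)
    rows = []
    for article in root.iter("article"):
        number = article.get("number", "")
        for idx, para in enumerate(article.iter("para"), start=1):
            # Collapse internal whitespace so each record stays on one TSV line.
            text = " ".join("".join(para.itertext()).split())
            rows.append((f"art{number}.p{idx}", text))  # assumed key format
    return rows


def rows_to_tsv(rows):
    """Serialize rows as tab-separated lines."""
    return "\n".join(f"{key}\t{text}" for key, text in rows)


if __name__ == "__main__":
    print(rows_to_tsv(xml_to_rows(SAMPLE_XML)))
```

Keying each row by article and paragraph position is one simple way to give later steps something to align on; the actual parser may choose differently.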

Romanize

git clone git@github.com:isi-nlp/uroman.git
mkdir -p data/romanized/
# data/tsvs/udhr_eng.tsv -> data/romanized/udhr_eng.tsv
for i in data/tsvs/udhr_*.tsv; do echo "$i";
   uroman/bin/uroman.pl < "$i" > "${i/tsvs/romanized}"
done
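The shell loops above derive each output path by string substitution on the input path. For readers who prefer to script these steps in Python, the same mapping can be sketched with pathlib. The helper name map_path is hypothetical and is not part of this repository.

```python
from pathlib import Path


def map_path(src, dst_dir, suffix=None):
    """Map an input file to the same file name under dst_dir.

    If suffix is given (for example ".tsv"), the extension is replaced too.
    """
    src = Path(src)
    name = src.name if suffix is None else src.with_suffix(suffix).name
    return Path(dst_dir) / name


if __name__ == "__main__":
    # Parse step: data/xmls/udhr_eng.xml -> data/tsvs/udhr_eng.tsv
    print(map_path("data/xmls/udhr_eng.xml", "data/tsvs", ".tsv"))
    # Romanize step: data/tsvs/udhr_eng.tsv -> data/romanized/udhr_eng.tsv
    print(map_path("data/tsvs/udhr_eng.tsv", "data/romanized"))
```

Unlike the shell substitution, this only touches the directory and the extension, so a language code that happens to contain "xml" or "tsvs" would not be rewritten.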

Run Aligner

python udhr_align.py -i data/tsvs -o UDHR-align.v1

$ ll -1 UDHR-align.v1.*
   -rw-r--r-- 1 tg staff 6.3M Oct 28 04:05 UDHR-align.v1.tsv
   -rw-r--r-- 1 tg staff 2.3M Oct 28 04:05 UDHR-align.v1.xlsx
$
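As an illustration of what "multi-parallel" means here, the sketch below joins per-language records on a shared key and keeps only the keys present in every language. This is an assumed, simplified strategy, not a description of what udhr_align.py does; the function name align, the key format and the placeholder texts are all hypothetical.

```python
def align(per_language):
    """Join per-language {key: text} dicts into one multi-parallel table.

    Returns (header, rows). Only keys found in every language are kept,
    in the order they appear in the alphabetically first language.
    """
    langs = sorted(per_language)
    first = per_language[langs[0]]
    shared = [k for k in first if all(k in per_language[lang] for lang in langs)]
    header = ["key"] + langs
    rows = [[k] + [per_language[lang][k] for lang in langs] for k in shared]
    return header, rows


if __name__ == "__main__":
    # Placeholder texts stand in for real UDHR paragraphs.
    data = {
        "eng": {"art1.p1": "eng text 1", "art2.p1": "eng text 2"},
        "deu": {"art1.p1": "deu text 1", "art2.p1": "deu text 2"},
        "fra": {"art1.p1": "fra text 1"},  # art2.p1 missing, so that row is dropped
    }
    header, rows = align(data)
    print("\t".join(header))
    for row in rows:
        print("\t".join(row))
```

Dropping incomplete rows keeps every column filled; an alternative design would keep all keys and leave empty cells for missing translations.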
