Neural Machine Translation of Cuneiform

Progress! I managed to get a BLEU score of 12.38 translating Sumerian transliterations into English! The model was trained on 13,000 unique sentence pairs on the Google AutoML Translation platform.
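
For anyone unfamiliar with the metric, below is a minimal sketch of how a corpus-level BLEU score is computed: clipped n-gram precision for n = 1 to 4, combined by geometric mean and scaled by a brevity penalty. It assumes whitespace tokenisation, a single reference per sentence and no smoothing. The tokenisation and smoothing that AutoML uses are not stated here, so this sketch will not necessarily reproduce the 12.38 figure. The function names and example sentences are made up for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale, one reference per hypothesis."""
    matches = [0] * max_n  # clipped n-gram matches, per n-gram order
    totals = [0] * max_n   # n-grams produced by the system, per n-gram order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Counter intersection clips each count by the reference count.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())

    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # no smoothing in this sketch

    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


# Hypothetical sentences, not taken from the dataset.
reference = ["the king built the temple for enlil"]
perfect = corpus_bleu(["the king built the temple for enlil"], reference)
partial = corpus_bleu(["the king built the temple for enlil today"], reference)
unrelated = corpus_bleu(["one sheep was delivered"], reference)
```

An exact match scores 100, a translation sharing no 4-grams with the reference scores 0, and everything else lands in between, which gives some feel for where 12.38 sits.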

This project was inspired by the Ashurbanipal exhibition at the British Museum and a book on cuneiform writing from my excellent friend Ellie Winter. I found that pretty much all the cuneiform ever transliterated and translated is stored in a single digital library initiative (the CDLI). Roughly 80,000 lines of cuneiform have been translated, leaving almost 1.7 million lines with no English translation.

This repository is a store of my efforts towards neural machine translation of languages written in cuneiform, including Sumerian, Akkadian, Hittite and others. It appears that a group of researchers has been funded to work on machine translation of these languages, but so far no papers have been published other than the one announcing the funding of the project and giving a high-level description of the NLP pipeline they plan to build. However, it appears they planned to use more classical, hardcoded NLP techniques based on large quantities of human knowledge and dissection of the transliterated text. I saw the opportunity to use more flexible and modern techniques, such as unsupervised word tokenisation and word embeddings, in addition to transfer learning and attention-based sequence models, to attack the problem in a knowledge-agnostic fashion.

I am looking to build a powerful NMT system to shine a light on the vast piles of untranslated cuneiform text using the latest research in low-resource translation systems. 80,000 lines might sound like a lot, but typical production systems use tens of millions of sentences!

Work completed so far:

Wrote script to grab the most recently updated cuneiform data dump from CDLI
Explored object catalogue statistics with the awesome pandas-profiling
Parsed the ATF file of object transliterations and translations
Married transliterations and translations with catalogue metadata (language, provenance, genre, etc.)
Built single language plain text corpuses of Akkadian, Sumerian and Hittite transliterations
Used SentencePiece by Google for unsupervised text tokenisation of the different languages
Explored OpenNMT and trained a demo English to German translation model
Found out what was happening to the ~30,000 lines of translated text that disappear when only lines following transliterations are saved; need to sift through translations in multiple languages to pull out the English ones
Created matched transliteration, translation file pairs
Hit 12.38 BLEU for Sumerian -> English translation, woop woop!
Researched word embeddings for use in machine translation; they look promising, especially for low-resource languages. Loads of good ideas about using multilingual embeddings and encoders to squeeze the most out of the limited datasets
Found tools for aligning word embeddings across languages
Built multilingual corpus and split out undetermined language corpus
Built SentencePiece encodings from the entire cuneiform transliterated dataset of 1.6 million lines
Tokenised the monolingual Sumerian corpus with the learnt SentencePiece model
Built GloVe embeddings of tokenised Sumerian
Added distance.py script for exploring word embeddings, enter a word and the 100 words with the highest cosine similarity will be displayed
Built high-performance FastText embeddings for Sumerian after some experimentation with training hyperparameters
Started work on seq2seq model, built infrastructure to convert language pair corpuses to tokenised arrays
Currently debugging sequence model architecture
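
The parsing and pairing steps in the list above can be sketched roughly as follows. This is not the code in atf_parser.py. It is a minimal illustration that assumes the usual ATF conventions: a line starting with "&" opens a new object, a numbered line such as "1. ..." is a transliteration, and "#tr.en:" marks an English translation of the line before it. It also shows one way translations in several languages could be handled so that the English line is not lost when it is not the first one after the transliteration. The function name and the sample text are invented for the example.

```python
import re

# Assumed ATF conventions (see the lead-in); real dumps contain more cases.
LINE_RE = re.compile(r"^\d+'?\.\s+(.*)$")            # e.g. "1. lugal-e"
TRANSLATION_RE = re.compile(r"^#tr\.(\w+):\s*(.*)$")  # e.g. "#tr.en: the king"


def extract_pairs(atf_text, language="en"):
    """Yield (object_id, transliteration, translation) triples."""
    object_id = None
    pending = None  # latest transliterated line still waiting for a translation
    for raw in atf_text.splitlines():
        line = raw.strip()
        if line.startswith("&"):
            # "&P000001 = Some designation" -> "P000001"
            object_id = line[1:].split("=")[0].strip()
            pending = None
            continue
        match = LINE_RE.match(line)
        if match:
            pending = match.group(1)
            continue
        match = TRANSLATION_RE.match(line)
        if match and pending is not None:
            if match.group(1) == language and match.group(2):
                yield object_id, pending, match.group(2)
                pending = None
            # Other languages are skipped but `pending` is kept, so a later
            # "#tr.en:" line for the same transliteration is still paired.


# Invented sample text, for illustration only.
SAMPLE = """&P000001 = Hypothetical Tablet 1
#atf: lang sux
@tablet
@obverse
1. lugal-e
#tr.de: der Koenig
#tr.en: the king
2. e2 mu-du3
#tr.en: built the house
3. x x x
"""

pairs = list(extract_pairs(SAMPLE))
for object_id, source, target in pairs:
    print(object_id, source, target, sep="\t")
```

Writing the second and third fields of each triple to two line-aligned files would give matched transliteration and translation file pairs of the kind mentioned above.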

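The nearest-neighbour lookup described for distance.py comes down to ranking every other word by cosine similarity to the query word. A minimal sketch, assuming the embeddings have already been loaded into a plain dict of word to vector (the loading step, the function names and the toy vectors here are all hypothetical and not taken from the script):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat zero vectors as similar to nothing
    return dot / (norm_u * norm_v)


def nearest_neighbours(word, embeddings, top_k=100):
    """Return up to top_k (word, similarity) pairs, most similar first."""
    query = embeddings[word]
    scored = [
        (other, cosine_similarity(query, vector))
        for other, vector in embeddings.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy two-dimensional vectors; real embeddings have far more dimensions.
toy_embeddings = {
    "lugal": [1.0, 0.0],
    "ensi2": [0.9, 0.1],
    "e2": [0.0, 1.0],
}

neighbours = nearest_neighbours("lugal", toy_embeddings)
for other, score in neighbours:
    print(f"{other}\t{score:.3f}")
```

For a vocabulary of real size, normalising all vectors once and using a matrix product would be much faster than this pure-Python loop.
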
Folders and files:

CDLI_data
embedding_vis
embeddings
firebase
language_pairs
monolingual_corpuses
scripts
sp_encodings
tokenised_corpuses
xnmt_experiments
.gitignore
CDLI_README.md
README.md
atf_parser.py
cloud_translate.py
create_language_corpuses.py
distance.py
embedding_visualiser.py
explore.ipynb
find_translation_repetitions.py
process_transliterations.py
sp_encoder_creation.py
translation_trainer.py


Laurence-Cullen/cuneiform
