Machine translation and word embeddings of cuneiform corpuses

Laurence-Cullen/cuneiform

Neural Machine Translation of Cuneiform

Progress! Managed to get a 12.38 BLEU score translating from Sumerian transliterations to English! The model was trained on 13,000 unique sentence pairs using the Google AutoML Translation platform.
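For readers unfamiliar with the metric, here is a minimal sketch of how a corpus-level BLEU score on a 0-100 scale can be computed. It assumes whitespace tokenisation and a single reference translation per sentence; the function names are made up for illustration, and the exact tokenisation and scoring used by the AutoML platform may differ, so this is not expected to reproduce the 12.38 figure.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple of tokens) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU (0-100), one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp_tokens, n)
            ref_counts = ngram_counts(ref_tokens, n)
            # Clipped matches: an n-gram is credited at most as many
            # times as it occurs in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    # No smoothing: any n-gram order with zero matches gives a score of 0.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty punishes output shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)
```

Calling `corpus_bleu(model_outputs, human_translations)` with two equal-length lists of strings returns a single score for the whole test set. Scores from different tokenisers are not directly comparable.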

Inspired by the Ashurbanipal exhibition at the British Museum and a book on cuneiform writing from my excellent friend Ellie Winter, I found that pretty much all the cuneiform ever transliterated and translated is stored in a single digital library, the Cuneiform Digital Library Initiative (CDLI). About 80,000 lines of cuneiform have been translated, leaving almost 1.7 million lines for which no English translations have been generated.

This repository is a store of my efforts towards neural machine translation of languages written in cuneiform, including Sumerian, Akkadian, Hittite and others. It appears that a group of several researchers has been funded to work on machine translation of these languages, but so far no papers have been published other than the one announcing the funding of the project and a high level description of the NLP pipeline they plan to build. However, it appears they planned to use more classical and hardcoded NLP techniques based on large quantities of human knowledge and dissection of the transliterated text. I saw the opportunity to use more flexible and modern techniques, such as unsupervised word tokenisation and word embeddings in addition to transfer learning and attention based sequence models, to attack the problem in a knowledge agnostic fashion.

I am looking to build a powerful NMT system to shine a light on the vast piles of untranslated cuneiform text using the latest research in low resource translation systems. 80,000 lines might sound like a lot, but typical production systems use tens of millions of sentences!

Work completed so far:

  • Wrote script to grab the most recently updated cuneiform data dump from CDLI
  • Explored object catalogue statistics with the awesome pandas-profiling
  • Parsed ATF file of object transliteration and translation
  • Married transliterations and translations with catalogue metadata (language, provenance, genre, etc.)
  • Built single language plain text corpuses of Akkadian, Sumerian and Hittite transliterations
  • Used SentencePiece by Google for unsupervised text tokenisation of different languages
  • Explored OpenNMT and trained a demo English to German translation model
  • Found out what was happening to the ~30,000 lines of translated text that disappear when only lines following transliterations are saved; need to sift through translations in multiple languages to pull out the English ones
  • Created matched transliteration, translation file pairs
  • Hit 12.38 BLEU for Sumerian -> English translation, woop woop!
  • Researched word embeddings for use in machine translation; they look promising, especially for low resource languages. Loads of good ideas about using multilingual embeddings and encoders to squeeze the most out of the limited datasets
  • Found tools for aligning word embeddings across languages
  • Built multilingual corpus and split out undetermined language corpus
  • Built SentencePiece encodings from the entire cuneiform transliterated dataset of 1.6 million lines
  • Tokenised monolingual Sumerian corpus with learnt SentencePiece model
  • Built GloVe embeddings of tokenised Sumerian
  • Added distance.py script for exploring word embeddings: enter a word and the 100 words with the highest cosine similarity will be displayed
  • Built high performance FastText embeddings for Sumerian after some experimentation with training hyperparameters
  • Started work on seq2seq model, built infrastructure to convert language pair corpuses to tokenised arrays
  • Currently debugging sequence model architecture
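
The nearest-neighbour lookup mentioned for distance.py in the list above can be sketched in a few lines of plain Python. This is an illustration of the idea, not the script from this repository: the function names are hypothetical, and the toy vectors below are invented for the example rather than taken from the trained GloVe or FastText embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat a zero vector as similar to nothing.
    return dot / (norm_a * norm_b)


def nearest_words(query, embeddings, top_n=100):
    """Return the top_n (word, similarity) pairs closest to the query word."""
    query_vector = embeddings[query]
    scored = [
        (word, cosine_similarity(query_vector, vector))
        for word, vector in embeddings.items()
        if word != query  # Skip the query word itself.
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy 3-dimensional vectors, invented purely for illustration.
toy_embeddings = {
    "lugal": [0.9, 0.1, 0.0],
    "ensi2": [0.8, 0.2, 0.1],
    "udu": [0.0, 0.9, 0.4],
    "masz2": [0.1, 0.8, 0.5],
}

for word, score in nearest_words("lugal", toy_embeddings, top_n=3):
    print(f"{word}\t{score:.3f}")
```

With real embeddings the dictionary would hold one vector per token in the vocabulary, so a vectorised implementation would be a better fit than this pure Python loop.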
