The project is designed to test various models for translating Swedish sentences into English. It uses Keras to define the models, with TensorFlow as the backend. Various models are tested, along with different tokenizing options and optimizers. The results are scored using the BLEU unigram method, which counts occurrences of individual words but does not consider word order. The highest score achieved so far is 0.23 out of a maximum of 1.00. The low scores are probably due to the limited training set.
The development of this project assumes Python 3.6, and the easiest way to set up the correct packages is via Anaconda:
The models train much faster with an Nvidia GPU. This is standard for performant neural-net training with TensorFlow, and requires the CUDA drivers (amongst others) to be installed.
Download and install via https://www.anaconda.com/download/
conda create --name nmt-keras python=3.6
source activate nmt-keras
You may need to activate extra channels:
conda config --add channels conda-forge
conda install <package-name>
Ensure that you are running the 64-bit version of Python.
You may need to increase the virtual memory available to the OS via the Advanced Settings > Performance Settings > Advanced > Virtual memory settings.
brew install hunspell
cd ~/Library/Spelling
wget http://cgit.freedesktop.org/libreoffice/dictionaries/plain/en/en_US.aff
wget http://cgit.freedesktop.org/libreoffice/dictionaries/plain/en/en_US.dic
wget https://cgit.freedesktop.org/libreoffice/dictionaries/plain/sv_SE/sv_SE.aff
wget https://cgit.freedesktop.org/libreoffice/dictionaries/plain/sv_SE/sv_SE.dic
Download https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/hunpos/hunpos-1.0-win.zip to thirdparty\hunpos-win
wget http://stp.lingfil.uu.se/~bea/resources/hunpos/suc-suctags.model.gz
gunzip suc-suctags.model.gz
wget https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/hunpos/en_wsj.model.gz
gunzip en_wsj.model.gz
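One way to sanity-check the downloaded models from Python (an assumption on my part; the project may drive hunpos differently) is NLTK's HunposTagger wrapper. The paths below are placeholders for wherever the hunpos-tag binary and models were unpacked:

```python
from nltk.tag.hunpos import HunposTagger

# Placeholder paths: adjust to wherever hunpos-tag(.exe) and the models live.
tagger = HunposTagger('thirdparty/en_wsj.model',
                      path_to_bin='thirdparty/hunpos-win/hunpos-tag.exe')
print(tagger.tag(['The', 'duck', 'swims', '.']))   # prints (token, tag) pairs
tagger.close()
```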
Use pip to install the remaining Python dependencies from requirements.txt:
pip install -r requirements.txt
Download content at http://www.manythings.org/anki/swe-eng.zip
Most of the steps above should work on Mac/Linux by running the init.sh script.
Pyphen is a pure Python module to hyphenate text using included or external Hunspell hyphenation dictionaries.
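To see what kind of splits it produces, Pyphen can be tried directly. A minimal sketch; the en_US and sv_SE dictionary codes are assumed to be among those bundled with Pyphen:

```python
import pyphen

# 'en_US' and 'sv_SE' are assumed to be among Pyphen's bundled dictionaries.
dic_en = pyphen.Pyphen(lang='en_US')
dic_sv = pyphen.Pyphen(lang='sv_SE')

print(dic_en.inserted('dimensionality'))   # e.g. 'di-men-sion-al-i-ty'
print(dic_sv.inserted('sjukhuset'))        # hyphenation points for a Swedish word
```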
On Mac, install with brew install graphviz
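Graphviz is only needed for plotting model architectures with Keras's plot_model utility. A quick way to confirm the toolchain works (this assumes the pydot package is also installed; the tiny model below is just a throwaway example, not one of the project's models):

```python
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import plot_model

# Build a trivial model just to exercise the plotting pipeline.
model = Sequential()
model.add(Dense(8, input_shape=(4,), activation='relu'))
model.add(Dense(1, activation='sigmoid'))

# Writes a PNG of the layer graph; fails if graphviz/pydot are missing.
plot_model(model, to_file='model.png', show_shapes=True)
```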
# Quick check that the TensorFlow installation works
import tensorflow as tf
hello = tf.constant('Hello, TensorFlow!')
sess = tf.Session()            # TensorFlow 1.x session API
print(sess.run(hello))
print(tf.VERSION)              # prints the installed TensorFlow version
Many of the assumptions this project makes for Swedish are similar to those of a German-to-English translation model, since Swedish also makes heavy use of noun compounding, although Swedish morphology is much simpler than German's.
Schiller [2005] finds that in a German newspaper corpus, 5.5% of all tokens and 43% of all types were compounds. For example, the German word for eye drops is Augentropfen, consisting of Auge (eye), Tropfen (drops), and a linking n in the middle. A Schulbuch (school book) consists of Schule (school) and Buch (book), but the final e of Schule is removed when it is used as the first part of a compound.
In a few languages, such as German, Dutch, Hungarian, Greek, and the Scandinavian languages, the resulting compounds are written as a single word, without any special characters or whitespace in between. By far the most frequent type of compound [Baroni et al., 2002] consists of two nouns, but adjectives and verbs form compounds as well.
Koehn and Knight [2003] learn splitting rules from both monolingual and parallel corpora. They generate all possible splits of a given word and take the one that maximizes the geometric mean of the word frequencies of its parts. They find that this often leads both to over-splitting words into more common parts (e.g. Freitag (Friday) into frei (free) and Tag (day)) and to not splitting some words that should be split, because the compound is more frequent than the geometric mean of the frequencies of its parts.
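A minimal sketch of that geometric-mean heuristic (not Koehn and Knight's implementation; word_freq is assumed to be a unigram frequency table built from the corpus, and only a single split point is considered for brevity):

```python
def best_split(word, word_freq, min_part_len=3):
    """Pick the split of `word` whose parts have the highest geometric
    mean of corpus frequencies; fall back to the unsplit word."""
    best_parts, best_score = [word], float(word_freq.get(word, 0))
    # Only a single split point is tried here; Koehn and Knight consider
    # all possible segmentations.
    for i in range(min_part_len, len(word) - min_part_len + 1):
        parts = [word[:i], word[i:]]
        freqs = [word_freq.get(p, 0) for p in parts]
        if 0 in freqs:
            continue
        score = (freqs[0] * freqs[1]) ** 0.5   # geometric mean of the two parts
        if score > best_score:
            best_parts, best_score = parts, score
    return best_parts


# Toy frequencies: "frei" and "tag" are common, so "freitag" gets (over-)split,
# illustrating the failure mode described above.
word_freq = {'freitag': 10, 'frei': 500, 'tag': 800}
print(best_split('freitag', word_freq))   # -> ['frei', 'tag']
```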
Soricut and Och [2015] use vector representations of words to uncover morphological processes in an unsupervised manner. Their method is language-agnostic, and can be applied to rare words or out-of-vocabulary tokens (OOVs). Morphological transformations (e.g. rained = rain + ed) are learned from the word vectors themselves.
Much of the example code online over-simplifies the task in order to achieve any useful results. This often involves lower-casing all text, removing punctuation and accents, etc. For this project I wanted to see what was possible with commodity GPUs while preserving punctuation and case.
I consider this important since, for example, commas can change the semantics and exclamation marks the emphasis, and capitalization normally marks proper nouns. Therefore, part of the preparation phase is to normalize the input and lowercase the first word of a sentence only if it is not a proper noun. This reduces the word space, since otherwise many words would appear twice in the training set: once mid-sentence, e.g. eating in "I like eating", and once capitalized, e.g. Eating in "Eating is a necessity", which could roughly double the vocabulary and, with it, the model complexity and training required.
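A minimal sketch of that first-word normalization (not the project's exact code); known_proper_nouns is assumed to be built elsewhere, e.g. by a POS tagger or by comparing cased vs. lowercased frequencies in the corpus:

```python
def normalize_first_word(sentence, known_proper_nouns):
    words = sentence.split()
    if not words:
        return sentence
    # Lowercase the sentence-initial word unless it is a known proper noun, so
    # "Eating is a necessity" and "I like eating" share the token "eating".
    if words[0] not in known_proper_nouns:
        words[0] = words[0].lower()
    return ' '.join(words)


print(normalize_first_word('Eating is a necessity', {'Tom', 'Stockholm'}))
# -> 'eating is a necessity'
```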
This project contains several tokenizer options in tokenizers.py:
Tokenizer class | Description |
---|---|
SimpleLines | Tokenize on spaces and punctuation, keeping the punctuation but not the spaces. |
Hyphenate | Tokenize as SimpleLines, then use a hyphenation library to break longer words into sub-parts to reduce the dimensionality. |
Word2Phrase | Tokenize as above, but combine popular phrases into a single token, e.g. "good-bye" and "Eiffel Tower" become single tokens. |
ReplaceProper | Tokenize, but replace proper nouns with a placeholder so the model is not polluted with a potentially unlimited set of named entities. |
PosTag | Tokenize as above, but run a part-of-speech tagger on each sentence. This results in tokens like "duck.VB" and "duck.NN" for verbs and nouns. |
LetterByLetter | Tokenize the sentence into individual letters, preserving all spaces and punctuation. This gives a very small input space but should be possible for an attention-based model. |
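As an illustration of the simplest option, here is a rough regex-based approximation of the SimpleLines behaviour described above (a sketch only; the actual class in tokenizers.py may differ):

```python
import re

# Keep words and punctuation as tokens; whitespace is dropped.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def simple_lines(sentence):
    return TOKEN_RE.findall(sentence)

print(simple_lines("Hej, hur mår du?"))
# -> ['Hej', ',', 'hur', 'mår', 'du', '?']
```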
- Initial code was from https://machinelearningmastery.com/develop-neural-machine-translation-system-keras/
- Parallel Corpora, Parallel Worlds: Selected Papers from a Symposium on Parallel and Comparable Corpora at Uppsala University, Sweden, 22-23 April, 1999
- Anne Schiller. 2005. German compound analysis with wfsc. In International Workshop on Finite-State Methods and Natural Language Processing, pages 239–246. Springer.
The results will probably be better with training data taken from the larger Europarl, Wikipedia or GlobalVoices corpora:
- Europarl
- Wikipedia
- Global Voices
- EMEA, a parallel corpus based on documents from the European Medicines Agency
Other avenues to explore:
- Add another tokenization option to denote context. Perhaps add a pre-token such as H for heading, W for written sentence, S for spoken sentence, L for label (like a button).
- Deal with contractions. See https://github.com/kootenpv/contractions
- Test an unsupervised sub-word tokenizer with something like SentencePiece to create a fixed word space (vocab size) using sub-words (see the sketch after this list).
- Test a syllable-based tokenizer (English-specific) or a naive language-general approach of splitting after every vowel (or run of vowels). For scripts that do not write vowels (e.g. Arabic) or are not alphabetic (e.g. Chinese), just split after each character. See the Master's thesis by Jonathan Oberländer, "Splitting Word Compounds".
- Test a suffix-tokenization method. We could extract a list of suffixes for each language from Wiktionary (a sister project of Wikipedia): take all page titles in the "Category:&lt;language&gt; prefixes" and "Category:&lt;language&gt; suffixes" categories and remove the dash at the beginning of each page title.
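For the sub-word option mentioned above, SentencePiece can be trained directly on a raw one-sentence-per-line file. A sketch only; swe.txt and the vocabulary size of 8000 are placeholders, not project settings:

```python
import sentencepiece as spm

# Train an unsupervised sub-word model on a one-sentence-per-line text file.
# 'swe.txt' and vocab_size=8000 are placeholders.
spm.SentencePieceTrainer.Train('--input=swe.txt --model_prefix=swe_sp --vocab_size=8000')

sp = spm.SentencePieceProcessor()
sp.Load('swe_sp.model')
print(sp.EncodeAsPieces('Jag tycker om att äta.'))   # sub-word pieces for a Swedish sentence
```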