The project is designed to test various models for translating Swedish sentences into English. It uses Keras to define the models, with TensorFlow as the backend. Various models are tested, along with different tokenizing options and optimizers. The results are scored using the BLEU unigram method, which counts occurrences of individual words but does not consider word order. The highest score achieved so far is 0.23 out of a maximum of 1.00. The low scores are probably due to the limited training set.
The development of this project assumes Python 3.6, and the easiest way to set up the correct packages is via Anaconda:
The models train much faster with an Nvidia GPU. This is standard for performant neural-net training with TensorFlow, and requires the CUDA drivers (amongst others) to be installed.
Download and install via https://www.anaconda.com/download/
conda create --name nmt-keras python=3.6
source activate nmt-keras
You may need to activate extra channels:
conda config --add channels conda-forge
conda install <package-name>
Ensure that you are running the 64-bit version of Python.
You may need to increase the virtual memory available to the OS via the Advanced Settings > Performance Settings > Advanced > Virtual memory settings.
brew install hunspell
cd ~/Library/Spelling
wget http://cgit.freedesktop.org/libreoffice/dictionaries/plain/en/en_US.aff
wget http://cgit.freedesktop.org/libreoffice/dictionaries/plain/en/en_US.dic
wget https://cgit.freedesktop.org/libreoffice/dictionaries/plain/sv_SE/sv_SE.aff
wget https://cgit.freedesktop.org/libreoffice/dictionaries/plain/sv_SE/sv_SE.dic
Download https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/hunpos/hunpos-1.0-win.zip to thirdparty\hunpos-win
wget http://stp.lingfil.uu.se/~bea/resources/hunpos/suc-suctags.model.gz
gunzip suc-suctags.model.gz
wget https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/hunpos/en_wsj.model.gz
gunzip en_wsj.model.gz
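One way to sanity-check the downloaded models from Python (an assumption on my part; the project may drive hunpos differently) is NLTK's HunposTagger wrapper. The paths below are placeholders for wherever the hunpos-tag binary and models were unpacked:

```python
from nltk.tag.hunpos import HunposTagger

# Placeholder paths: adjust to wherever hunpos-tag(.exe) and the models live.
tagger = HunposTagger('thirdparty/en_wsj.model',
                      path_to_bin='thirdparty/hunpos-win/hunpos-tag.exe')
print(tagger.tag(['The', 'duck', 'swims', '.']))   # prints (token, tag) pairs
tagger.close()
```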
Use pip to install the remaining Python dependencies from requirements.txt:
pip install -r requirements.txt
Download content at http://www.manythings.org/anki/swe-eng.zip
Most of the steps above should work on Mac/Linux by running the init.sh script.
Pyphen is a pure Python module to hyphenate text using included or external Hunspell hyphenation dictionaries.
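To see what kind of splits it produces, Pyphen can be tried directly. A minimal sketch; the en_US and sv_SE dictionary codes are assumed to be among those bundled with Pyphen:

```python
import pyphen

# 'en_US' and 'sv_SE' are assumed to be among Pyphen's bundled dictionaries.
dic_en = pyphen.Pyphen(lang='en_US')
dic_sv = pyphen.Pyphen(lang='sv_SE')

print(dic_en.inserted('dimensionality'))   # e.g. 'di-men-sion-al-i-ty'
print(dic_sv.inserted('sjukhuset'))        # hyphenation points for a Swedish word
```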
On Mac, install with brew install graphviz
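Graphviz is only needed for plotting model architectures with Keras's plot_model utility. A quick way to confirm the toolchain works (this assumes the pydot package is also installed; the tiny model below is just a throwaway example, not one of the project's models):

```python
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import plot_model

# Build a trivial model just to exercise the plotting pipeline.
model = Sequential()
model.add(Dense(8, input_shape=(4,), activation='relu'))
model.add(Dense(1, activation='sigmoid'))

# Writes a PNG of the layer graph; fails if graphviz/pydot are missing.
plot_model(model, to_file='model.png', show_shapes=True)
```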
# Quick check that the TensorFlow installation works
import tensorflow as tf
hello = tf.constant('Hello, TensorFlow!')
sess = tf.Session()            # TensorFlow 1.x session API
print(sess.run(hello))
print(tf.VERSION)              # prints the installed TensorFlow version
Many of the assumptions this project makes for Swedish are similar to those of a German-to-English translation model, since Swedish also makes heavy use of noun compounding, although Swedish morphology is much simpler than German's.
Schiller [2005] finds that in a German newspaper corpus, 5.5% of all tokens and 43% of all types were compounds. For example, the German word for eye drops is Augentropfen, consisting of Auge (eye), Tropfen (drops), and a linking n in the middle. A Schulbuch (school book) consists of Schule (school) and Buch (book), but the final e of Schule is removed when it is used as the first part of a compound.
In a few languages, such as German, Dutch, Hungarian, Greek, and the Scandinavian languages, the resulting compounds are written as a single word, without any special characters or whitespace in between. By far the most frequent type of compound [Baroni et al., 2002] consists of two nouns, but adjectives and verbs form compounds as well.
Koehn and Knight [2003] learn splitting rules from both monolingual and parallel corpora. They generate all possible splits of a given word and take the one that maximizes the geometric mean of the word frequencies of its parts. They find that this often leads both to over-splitting words into more common parts (e.g. Freitag (Friday) into frei (free) and Tag (day)) and to not splitting some words that should be split, because the compound is more frequent than the geometric mean of the frequencies of its parts.
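A minimal sketch of that geometric-mean heuristic (not Koehn and Knight's implementation; word_freq is assumed to be a unigram frequency table built from the corpus, and only a single split point is considered for brevity):

```python
def best_split(word, word_freq, min_part_len=3):
    """Pick the split of `word` whose parts have the highest geometric
    mean of corpus frequencies; fall back to the unsplit word."""
    best_parts, best_score = [word], float(word_freq.get(word, 0))
    # Only a single split point is tried here; Koehn and Knight consider
    # all possible segmentations.
    for i in range(min_part_len, len(word) - min_part_len + 1):
        parts = [word[:i], word[i:]]
        freqs = [word_freq.get(p, 0) for p in parts]
        if 0 in freqs:
            continue
        score = (freqs[0] * freqs[1]) ** 0.5   # geometric mean of the two parts
        if score > best_score:
            best_parts, best_score = parts, score
    return best_parts


# Toy frequencies: "frei" and "tag" are common, so "freitag" gets (over-)split,
# illustrating the failure mode described above.
word_freq = {'freitag': 10, 'frei': 500, 'tag': 800}
print(best_split('freitag', word_freq))   # -> ['frei', 'tag']
```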
Soricut and Och [2015] use vector representations of words to uncover morphological processes in an unsupervised manner. Their method is language-agnostic, and can be applied to rare words or out-of-vocabulary tokens (OOVs). Morphological transformations (e.g. rained = rain + ed) are learned from the word vectors themselves.
Much of the example code online over-simplifies the task in order to achieve any useful results. This often involves lower-casing all text, removing punctuation and accents, etc. For this project I wanted to see what was possible with commodity GPUs while preserving punctuation and case.
I consider this important since, for example, commas can change the semantics and exclamation marks the emphasis, and capitalization normally marks proper nouns. Therefore, part of the preparation phase is to normalize the input and lowercase the first word of a sentence only if it is not a proper noun. This reduces the word space, since otherwise many words would appear twice in the training set: once mid-sentence, e.g. eating in "I like eating", and once capitalized, e.g. Eating in "Eating is a necessity", which could roughly double the vocabulary and, with it, the model complexity and training required.
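A minimal sketch of that first-word normalization (not the project's exact code); known_proper_nouns is assumed to be built elsewhere, e.g. by a POS tagger or by comparing cased vs. lowercased frequencies in the corpus:

```python
def normalize_first_word(sentence, known_proper_nouns):
    words = sentence.split()
    if not words:
        return sentence
    # Lowercase the sentence-initial word unless it is a known proper noun, so
    # "Eating is a necessity" and "I like eating" share the token "eating".
    if words[0] not in known_proper_nouns:
        words[0] = words[0].lower()
    return ' '.join(words)


print(normalize_first_word('Eating is a necessity', {'Tom', 'Stockholm'}))
# -> 'eating is a necessity'
```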
This project contains several tokenizer options in tokenizers.py:
Tokenizer class | Description |
---|---|
SimpleLines | Tokenize on spaces and punctuation, keeping the punctuation but not the spaces. |
Hyphenate | Tokenize as SimpleLines, then use a hyphenation library to break longer words into sub-parts to reduce the dimensionality. |
Word2Phrase | Tokenize as above, but combine popular phrases into a single token, e.g. "good-bye" and "Eiffel Tower" become single tokens. |
ReplaceProper | Tokenize, but replace proper nouns with a placeholder so the model is not polluted with a potentially unlimited set of named entities. |
PosTag | Tokenize as above, but run a part-of-speech tagger on each sentence. This results in tokens like "duck.VB" and "duck.NN" for verbs and nouns. |
LetterByLetter | Tokenize the sentence into individual letters, preserving all spaces and punctuation. This gives a very small input space but should be possible for an attention-based model. |
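As an illustration of the simplest option, here is a rough regex-based approximation of the SimpleLines behaviour described above (a sketch only; the actual class in tokenizers.py may differ):

```python
import re

# Keep words and punctuation as tokens; whitespace is dropped.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def simple_lines(sentence):
    return TOKEN_RE.findall(sentence)

print(simple_lines("Hej, hur mår du?"))
# -> ['Hej', ',', 'hur', 'mår', 'du', '?']
```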
- Initial code was from https://machinelearningmastery.com/develop-neural-machine-translation-system-keras/
- Parallel Corpora, Parallel Worlds: Selected Papers from a Symposium on Parallel and Comparable Corpora at Uppsala University, Sweden, 22-23 April, 1999
- Anne Schiller. 2005. German compound analysis with wfsc. In International Workshop on Finite-State Methods and Natural Language Processing, pages 239–246. Springer.
The results will probably be better with training data taken from the larger Europarl, Wikipedia or GlobalVoices corpora:
- Europarl
- Wikipedia
- Global Voices
- EMEA, a parallel corpus based on documents from the European Medicines Agency
Other avenues to explore:
- Add another tokenization option to denote context. Perhaps add a pre-token such as H for heading, W for written sentence, S for spoken sentence, L for label (like a button).
- Deal with contractions. See https://github.com/kootenpv/contractions
- Test an unsupervised sub-word tokenizer with something like SentencePiece to create a fixed word space (vocab size) using sub-words (see the sketch after this list).
- Test a syllable-based tokenizer (English-specific) or a naive language-general approach of splitting after every vowel (or run of vowels). For scripts that do not write vowels (e.g. Arabic) or are not alphabetic (e.g. Chinese), just split after each character. See the Master's thesis by Jonathan Oberländer, "Splitting Word Compounds".
- Test a suffix-tokenization method. We could extract a list of suffixes for each language from Wiktionary (a sister project of Wikipedia): take all page titles in the "Category:&lt;language&gt; prefixes" and "Category:&lt;language&gt; suffixes" categories and remove the dash at the beginning of each page title.
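For the sub-word option mentioned above, SentencePiece can be trained directly on a raw one-sentence-per-line file. A sketch only; swe.txt and the vocabulary size of 8000 are placeholders, not project settings:

```python
import sentencepiece as spm

# Train an unsupervised sub-word model on a one-sentence-per-line text file.
# 'swe.txt' and vocab_size=8000 are placeholders.
spm.SentencePieceTrainer.Train('--input=swe.txt --model_prefix=swe_sp --vocab_size=8000')

sp = spm.SentencePieceProcessor()
sp.Load('swe_sp.model')
print(sp.EncodeAsPieces('Jag tycker om att äta.'))   # sub-word pieces for a Swedish sentence
```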