This repository contains all data and documentation for building a neural machine translation system for English to Vietnamese.
The IWSLT'15 English-Vietnamese data from the Stanford NLP group is used.
For all experiments, the corpus was split into training, development, and test sets:
Data set | Sentences | Download |
---|---|---|
Training | 133,317 | via GitHub or located in data/train-en-vi.tgz |
Development | 1,553 | via GitHub or located in data/dev-2012-en-vi.tgz |
Test | 1,268 | via GitHub or located in data/test-2013-en-vi.tgz |
An NMT system for English-Vietnamese is built with the tensor2tensor library. This problem was officially added with this pull request.
The following steps were tested with tensor2tensor version 1.13.4.
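If tensor2tensor is not yet available, it can be installed from PyPI; this is a minimal sketch that assumes a working TensorFlow 1.x (GPU) environment is already set up:
pip install tensor2tensor==1.13.4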
First, we create the initial directory structure:
mkdir -p t2t_data t2t_datagen t2t_train t2t_output
In the next step, the training and development datasets are downloaded and prepared:
t2t-datagen --data_dir=t2t_data --tmp_dir=t2t_datagen/ \
--problem=translate_envi_iwslt32k
Then the training step can be started:
t2t-trainer --data_dir=t2t_data --problem=translate_envi_iwslt32k \
--model=transformer --hparams_set=transformer_base --output_dir=t2t_output
The number of GPUs used for training can be specified with the --worker_gpu option.
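For example, the training run above could be started on two GPUs like this (the GPU count is only illustrative and should match the available hardware):
t2t-trainer --data_dir=t2t_data --problem=translate_envi_iwslt32k \
--model=transformer --hparams_set=transformer_base --output_dir=t2t_output \
--worker_gpu=2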
We use checkpoint averaging with the built-in t2t-avg-all tool:
t2t-avg-all --model_dir t2t_output/ --output_dir t2t_avg
In the next step, the test dataset is downloaded and extracted:
wget "https://github.com/stefan-it/nmt-en-vi/raw/master/data/test-2013-en-vi.tgz"
tar -xzf test-2013-en-vi.tgz
Then the decoding step for the test dataset can be started:
t2t-decoder --data_dir=t2t_data --problem=translate_envi_iwslt32k \
--model=transformer --decode_hparams="beam_size=4,alpha=0.6" \
--decode_from_file=tst2013.en --decode_to_file=system.output \
--hparams_set=transformer_base --output_dir=t2t_avg
The BLEU score can be calculated with the built-in t2t-bleu tool:
t2t-bleu --translation=system.output --reference=tst2013.vi
The following results can be achieved using the (normal) Transformer model. Training was done on an NVIDIA RTX 2080 Ti for 50k steps.
Model | BLEU (Beam Search) |
---|---|
Luong & Manning (2015) | 23.30 |
Sequence-to-sequence model with attention | 26.10 |
Neural Phrase-based Machine Translation, Huang et al. (2017) | 27.69 |
Neural Phrase-based Machine Translation + LM, Huang et al. (2017) | 28.07 |
Transformer (Base) | 28.54 (cased) |
Transformer (Base) | 29.44 (uncased) |
To reproduce the reported results, a pretrained model can be downloaded using:
wget https://schweter.eu/cloud/nmt-en-vi/envi-model.avg-250000.tar.xz
The pretrained model has a compressed file size of 553 MB. After the download, the archive must be extracted with:
tar -xJf envi-model.avg-250000.tar.xz
All necessary files are located in the t2t_export folder.
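The extracted contents can be inspected with a simple directory listing; a TensorFlow checkpoint typically consists of an .index file, a .meta graph file, and one or more .data-* shards (the exact shard names may differ):
ls t2t_export/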
The pretrained model can be used by passing the --checkpoint_path command-line argument to the t2t-decoder tool. For example, the complete command for decoding the test dataset with the pretrained model is:
t2t-decoder --data_dir=t2t_data --problem=translate_envi_iwslt32k \
--model=transformer --decode_hparams="beam_size=4,alpha=0.6" \
--decode_from_file=tst2013.en --decode_to_file=system.output \
--hparams_set=transformer_base \
--checkpoint_path t2t_export/model.ckpt-250000
This repository was mentioned and cited in the NeurIPS paper Adaptive Methods for Nonconvex Optimization by Zaheer et al. (2018).
Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC).