This is the implementation of the approaches described in the paper:
Emanuele Bugliarello, Sabrina J. Mielke, Antonios Anastasopoulos, Ryan Cotterell and Naoaki Okazaki. It’s Easier to Translate out of English than into it: Measuring Neural Translation Difficulty by Cross-Mutual Information. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, July 2020.
You can clone this repository, including its submodules, by running:
git clone --recurse-submodules git@github.com:e-bug/nmt-difficulty
The requirements can be installed by setting up a conda environment:
conda env create -f environment.yml
Then activate the environment:
source activate nmt
The pre-processing steps to generate our data sets are as follows:
cd scripts/data
./download_data.sh
./preprocess_data.sh
./binarize.sh (for Fairseq)
You may want to update the default data directories used in the provided files.
Scripts for training and evaluating each model are provided in scripts/experiments.
You can run these scripts for each experiment by entering its directory (e.g. experiments/en2de) and running the corresponding script (e.g. ./test.sh).
Note that we trained our models on an SGE cluster, but we also provide the associated Bash file (e.g. train_mt.sh).
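The enter-the-directory-then-run pattern described above can be automated. The following is a minimal Python sketch and is not part of the repository: the helper names plan_runs and run_all are hypothetical, and the root directory you pass in (for instance scripts/experiments) is an assumption about your local layout.

```python
import subprocess
from pathlib import Path


def plan_runs(root, script="test.sh"):
    """Return (directory, command) pairs for every experiment directory
    directly under `root` that contains `script`.

    Hypothetical helper for illustration; not provided by the repository.
    """
    runs = []
    for exp_dir in sorted(Path(root).iterdir()):
        if exp_dir.is_dir() and (exp_dir / script).is_file():
            runs.append((exp_dir, ["bash", script]))
    return runs


def run_all(root, script="test.sh"):
    """Run `script` from inside each experiment directory, stopping at the
    first failure."""
    for exp_dir, cmd in plan_runs(root, script):
        # cwd mirrors "entering the directory" before launching the script.
        subprocess.run(cmd, cwd=exp_dir, check=True)
```

For example, run_all("scripts/experiments", "test.sh") would execute test.sh in every experiment directory that has one. Directories without the requested script are skipped rather than treated as errors, which is a design choice of this sketch.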
- experiments/: contains the following scripts to train and evaluate each model:
  - train.sh: train the LM/MT model
  - valid.sh: validate the model
  - test.sh: test the model
  - xmi_valid.sh (MT only): evaluate XMI on the validation set
  - xmi_test.sh (MT only): evaluate XMI on the test set
- fairseq-0.6.2/: our code is based on Fairseq (version 0.6.2). Here, we introduce two files to evaluate our approximation of the cross-entropy of a model.
- results/: collects CSV files aggregating the values of each evaluated metric
- scripts/: main scripts, divided into the following subdirectories (you may want to update data and checkpoints directories in these files):
  - data/: contains scripts for data generation
  - experiments/: contains scripts for training and evaluating models
  - results/: contains scripts to generate the CSV files in results/ as well as correlation coefficients and our bar plot.
- tools/: third-party software (i.e. Moses and BPE)
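For readers who want to see what the XMI scripts estimate, the paper defines cross-mutual information as the difference between the cross-entropy of a target-side language model and that of the translation model: XMI(S -> T) = H_LM(T) - H_MT(T | S). The following is a minimal Python sketch of that arithmetic and is not the code shipped in fairseq-0.6.2/. The function names are hypothetical, and the input format (one natural-log probability per held-out sentence) and per-sentence averaging are assumptions made for illustration.

```python
import math


def cross_entropy_bits(sentence_logprobs):
    """Average negative log-likelihood per sentence, converted to bits.

    `sentence_logprobs` holds one natural-log probability per held-out
    sentence (an assumed input format, not the repository's).
    """
    if not sentence_logprobs:
        raise ValueError("need at least one sentence")
    return -sum(sentence_logprobs) / (len(sentence_logprobs) * math.log(2))


def xmi_bits(lm_logprobs, mt_logprobs):
    """XMI(S -> T) = H_LM(T) - H_MT(T | S).

    Both lists must score the same target sentences: the first under the
    language model, the second under the translation model given the source.
    """
    if len(lm_logprobs) != len(mt_logprobs):
        raise ValueError("LM and MT scores must cover the same sentences")
    return cross_entropy_bits(lm_logprobs) - cross_entropy_bits(mt_logprobs)
```

As a toy illustration, if the language model assigns each of two sentences probability 0.25 and the translation model assigns each probability 0.5, the cross-entropies are 2 bits and 1 bit, so the source sentence contributes 1 bit of information about the target.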
This work is licensed under the MIT license. See LICENSE for details.
Third-party software and data sets are subject to their respective licenses.
If you find our code/models or ideas useful in your research, please consider citing the paper:
@inproceedings{bugliarello-etal-2020-easier,
title = "It{'}s Easier to Translate out of {E}nglish than into it: {M}easuring Neural Translation Difficulty by Cross-Mutual Information",
author = "Bugliarello, Emanuele and
Mielke, Sabrina J. and
Anastasopoulos, Antonios and
Cotterell, Ryan and
Okazaki, Naoaki",
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.acl-main.149",
pages = "1640--1649",
}