Sampling-Based Minimum Bayes-Risk Decoding for Neural Machine Translation
git clone git@github.com:Roxot/mbr-nmt.git
cd mbr-nmt
pip install .
mbr-nmt translate -s sample_filename -n num_samples -u utility -o output_filename [-c candidates_file]
Because sentences are not translated in order, the output of mbr-nmt translate
should be ordered correctly first. This can be done using mbr-nmt convert
as follows:
mbr-nmt convert -f mbr-nmt -i mbr_nmt_translate_output -o translations_file
mbr-nmt convert -f fairseq -i fairseq-output -o samples-300.en --merge-subwords --detruecase --detokenize
mbr-nmt translate -s samples-300.en -n 300 -u beer -o translations.en.mbr
mbr-nmt convert -f mbr-nmt -i translations.en.mbr -o translations.en
The samples file should contain an equal amount of translation samples per input sentence that you wish to translate. Translation samples should be separated by a newline. Example:
samples-3.en:
<input-1-sample-1>
<input-1-sample-2>
<input-1-sample-3>
<input-2-sample-1>
<input-2-sample-2>
<input-2-sample-3>
...
We provide support for parsing the output of fairseq when sampling to convert the fairseq output into a format readable by mbr-nmt translate
using mbr-nmt convert
. See mbr-nmt convert --help
for more info. Example:
mbr-nmt convert -f fairseq -i fairseq_output_file -o samples_file --merge-subwords --detruecase --detokenize
Optionally, one can provide a separate list of candidates from which translations are to be selected. If not set, unique samples will be used as candidates instead. The candidates file is more flexible than the samples file in that it supports different number of candidates per input sentence. Each set of candidates must be preceded by the line NC=<number_of_candidates>
and candidates should be separated by a newline. Example:
candidates.en
NC=2
<input-1-candidate-1>
<input-1-candidate-2>
NC=3
<input-2-candidate-1>
<input-2-candidate-2>
<input-2-candidate-3>
...
See mbr-nmt translate --help
for a full list of utilities currently supported. Some utilities require to be installed first.
Follow the installation instructions at https://github.com/stanojevic/beer and set the $BEER_HOME
environment variable. Make sure that $BEER_HOME
is visible to mbr-nmt translate