We propose a template-based framework that can handle both lexical and structural constraints for neural machine translation. Our implementation is based on fairseq at commit d5f7b50
(Mon Sep 21 13:45:35 2020 -0700). The main modification is to generate.py, which we extend to support prefix-based inference.
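Prefix-based inference means that the decoder is forced to reproduce the given target-side prefix (the constraint list up to [SEP]) before it starts decoding freely. The snippet below is only a minimal sketch of this idea, not the actual patch to generate.py; the function name and tensor shapes are assumptions.

# Minimal sketch of prefix-forced decoding (not the actual generate.py patch).
# Assumption: at step t, if the target-side input still provides a prefix token,
# we overwrite the model's distribution so that token is the only possible choice.
import torch

def force_prefix(lprobs: torch.Tensor, step: int, prefix_tokens: torch.Tensor, pad_idx: int) -> torch.Tensor:
    """lprobs: (batch, vocab) log-probs at the current step.
    prefix_tokens: (batch, prefix_len) tokens that must start every hypothesis."""
    if prefix_tokens is None or step >= prefix_tokens.size(1):
        return lprobs                              # past the forced prefix: decode freely
    forced = prefix_tokens[:, step]                # token to force at this step
    mask = forced.ne(pad_idx)                      # some sentences may have shorter prefixes
    lprobs[mask] = float("-inf")                   # rule out every other token
    lprobs[mask, forced[mask]] = 0.0               # log(1) = 0 for the forced token
    return lprobs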
We use WMT17 En-Zh and WMT20 En-De as our training sets, and evaluate on the En-Zh and En-De alignment test sets. We follow Chen et al. (2021) to extract the constraints.
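To give a feel for the constraint-extraction step, here is a toy sketch that samples constraint pairs from one-to-one word alignments. The input format and the sampling choices are assumptions made for illustration; the actual procedure follows Chen et al. (2021).

# Toy sketch of constraint extraction from word alignments (illustration only).
import random

def extract_constraints(src_tokens, tgt_tokens, align_pairs, max_num=3):
    """align_pairs: list of (i, j) with src_tokens[i] aligned to tgt_tokens[j]."""
    # keep only one-to-one alignment links so each source word maps to a single target word
    src_counts, tgt_counts = {}, {}
    for i, j in align_pairs:
        src_counts[i] = src_counts.get(i, 0) + 1
        tgt_counts[j] = tgt_counts.get(j, 0) + 1
    candidates = [(i, j) for i, j in align_pairs if src_counts[i] == 1 and tgt_counts[j] == 1]
    random.shuffle(candidates)
    # sample up to max_num constraint pairs, e.g. [("italy", "意大利"), ("approach", "方式")]
    return [(src_tokens[i], tgt_tokens[j]) for i, j in candidates[:max_num]]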
We use the script prepare_template.py to convert natural language sentences into template-based ones.
Prepare the template-based training data:
python prepare_template.py en zh train.bpe train.bpe.const.en-zh t2s
Prepare the template-based inference data:
python prepare_template.py en zh test.bpe test.bpe.const.en-zh t2s.infer
Here is an example:
Filename | Example |
---|---|
en | this approach , also adopted by italy , contributes to reducing costs and increasing the overall volume of funds for beneficiaries . |
zh | 这种 方式 也 被 意大利 所 采用 。 它 有助于 降低成本 , 增加 受益人 收到 的 汇款 总额 。 |
const.en-zh | italy ||| 意大利 ||| approach ||| 方式 ||| increasing ||| 增加 ||| |
t2s.en | [C0] italy [C1] approach [C2] increasing [SEP] [S0] [C1] [S1] [C0] [S2] [C2] [S3] [SEP] [S0] this [S1] , also adopted by [S2] , contributes to reducing costs and [S3] the overall volume of funds for beneficiaries . |
t2s.zh | [C0] 意大利 [C1] 方式 [C2] 增加 [SEP] [T0] [C1] [T1] [C0] [T2] [C2] [T3] [SEP] [T0] 这种 [T1] 也 被 [T2] 所 采用 。 它 有助于 降低成本 , [T3] 受益人 收到 的 汇款 总额 。 |
t2s.infer.en | [C0] italy [C1] approach [C2] increasing [SEP] [S0] [C1] [S1] [C0] [S2] [C2] [S3] [SEP] [S0] this [S1] , also adopted by [S2] , contributes to reducing costs and [S3] the overall volume of funds for beneficiaries . |
t2s.infer.zh | [C0] 意大利 [C1] 方式 [C2] 增加 [SEP] |
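For reference, the following simplified sketch reproduces the source-side conversion illustrated above (calling it with slot="T" on the target tokens gives the target side). The real prepare_template.py additionally handles BPE pieces, multi-token constraints, and corner cases; this version assumes single-token constraints.

# Simplified sketch of the template conversion shown in the example above.
def to_template(tokens, constraints, slot="S"):
    """tokens: source (or target) token list; constraints: constraint strings,
    indexed in the same [C0], [C1], ... order on both sides."""
    # locate each constraint in the sentence (first occurrence; single-token constraints only)
    positions = sorted((tokens.index(c), idx) for idx, c in enumerate(constraints))
    const_part = " ".join(f"[C{i}] {c}" for i, c in enumerate(constraints))
    template, slots, prev = [], [], 0
    for k, (pos, idx) in enumerate(positions):
        template += [f"[{slot}{k}]", f"[C{idx}]"]
        slots.append(" ".join([f"[{slot}{k}]"] + tokens[prev:pos]))
        prev = pos + 1
    template.append(f"[{slot}{len(positions)}]")
    slots.append(" ".join([f"[{slot}{len(positions)}]"] + tokens[prev:]))
    return f"{const_part} [SEP] {' '.join(template)} [SEP] {' '.join(slots)}"

Running this on the English tokens with constraints ["italy", "approach", "increasing"] reproduces the t2s.en line in the table.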
fairseq-preprocess \
-s $src -t $tgt \
--joined-dictionary --srcdict dict.${src}-${tgt}.txt \
--trainpref train.bpe.t2s \
--validpref valid.bpe.t2s \
--destdir train-const-bin \
--workers 128
CUDA_VISIBLE_DEVICES=0,1,2,3 fairseq-train train-const-bin \
--fp16 --seed 32 --ddp-backend no_c10d \
-s $src -t $tgt \
--lr-scheduler inverse_sqrt --lr 0.0007 \
--warmup-init-lr 1e-07 --warmup-updates 4000 \
--max-update 200000 \
--weight-decay 0.0 --clip-norm 0.0 --dropout 0.1 \
--max-tokens 8192 --update-freq 1 \
--arch transformer --share-all-embeddings \
--optimizer adam --adam-betas '(0.9, 0.98)' \
--save-dir ckpts \
--tensorboard-logdir logs \
--criterion label_smoothed_cross_entropy \
--label-smoothing 0.1 \
--no-progress-bar --log-format simple --log-interval 10 \
--no-epoch-checkpoints \
--save-interval-updates 1000 --keep-interval-updates 5 \
|& tee -a logs/train.log
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 fairseq-train train-const-bin \
--fp16 --seed 32 --ddp-backend no_c10d \
-s $src -t $tgt \
--lr-scheduler inverse_sqrt --lr 0.0005 \
--warmup-init-lr 1e-07 --warmup-updates 4000 \
--max-update 300000 \
--weight-decay 0.0 --clip-norm 0.0 --dropout 0.1 \
--max-tokens 4096 --update-freq 1 \
--arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
--optimizer adam --adam-betas '(0.9, 0.98)' \
--save-dir ckpts \
--tensorboard-logdir logs \
--criterion label_smoothed_cross_entropy \
--label-smoothing 0.1 \
--no-progress-bar --log-format simple --log-interval 10 \
--no-epoch-checkpoints \
--save-interval-updates 1000 --keep-interval-updates 5 \
|& tee -a logs/train.log
fairseq-preprocess \
-s $src -t $tgt \
--joined-dictionary --srcdict dict.${src}-${tgt}.txt \
--testpref test.bpe.t2s.infer \
--destdir infer-const-bin
genout=test
CUDA_VISIBLE_DEVICES=0 fairseq-generate infer-const-bin \
-s $src -t $tgt \
--gen-subset test \
--path ${src}-${tgt}.pt \
--lenpen 1.0 --beam 4 \
--batch-size 128 \
--prefix-size 1024 \
> $genout
We use the script restore_template.py to convert the template-based model output into natural language sentences.
grep ^H $genout |\
sed 's/H-//g' | sort -k 1 -n -t ' ' | awk -F'\t' '{print $3}' > $genout.sort
python restore_template.py test.bpe.t2s.infer.$tgt $genout.sort $genout.bpe
cat $genout.bpe |\
sed -r 's/(@@ )|(@@ ?$)//g' | tee $genout.tok |\
perl mosesdecoder/scripts/tokenizer/detokenizer.perl -l $tgt > $genout.detok
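Conceptually, restoring a hypothesis means filling the target-side template with the constraint translations and the generated slot contents. The sketch below illustrates this for a single well-formed output line; the real restore_template.py also takes the inference prefix file as input and guards against malformed hypotheses.

# Simplified sketch of what restore_template.py does for one hypothesis line.
import re

def restore(hyp):
    """hyp: '[C0] c0 ... [SEP] [T0] [C0] [T1] ... [SEP] [T0] span0 [T1] span1 ...'"""
    const_part, template, slot_part = [p.strip() for p in hyp.split("[SEP]")]
    fillers = {}
    # collect the [C*] and [T*] fillers from the first and third segments
    for part in (const_part, slot_part):
        pieces = re.split(r"(\[[CT]\d+\])", part)
        for tag, text in zip(pieces[1::2], pieces[2::2]):
            fillers[tag] = text.strip()
    # rewrite the template by replacing every tag with its filler
    out = [fillers.get(tok, tok) for tok in template.split()]
    return " ".join(w for w in out if w)

Applied to the t2s.zh example above, this returns the original zh sentence.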
Calculate BLEU:
# English and German
cat $genout.detok | sacrebleu test.detok.$tgt | tee $genout.bleu
# Chinese
cat $genout.detok | sacrebleu -tok zh test.detok.$tgt | tee $genout.bleu
Calculate Exact Match, Window Overlap, and 1-TERm using evaluate_term_plain.py, which is adapted from this repository:
python terminology_evaluation/evaluate_term_plain.py \
--language $tgt \
--source test.$src \
--target_reference test.$tgt \
--const test.const.${src}-${tgt} \
--hypothesis $genout.tok
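As a rough illustration of what Exact Match measures (the share of target-side terms that literally appear in the hypothesis), here is a toy version. The reported numbers come from evaluate_term_plain.py; this sketch ignores tokenization details and does not cover Window Overlap or 1-TERm.

# Toy sketch of the Exact-Match term accuracy (illustration only).
def exact_match(constraint_file, hyp_file):
    total = matched = 0
    with open(constraint_file) as cf, open(hyp_file) as hf:
        for cons, hyp in zip(cf, hf):
            fields = [f.strip() for f in cons.split("|||") if f.strip()]
            tgt_terms = fields[1::2]      # source ||| target ||| source ||| target ...
            for term in tgt_terms:
                total += 1
                matched += term in hyp    # substring match against the tokenized hypothesis
    return matched / max(total, 1)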
We conduct structurally constrained translation experiments on the dataset provided by Hashimoto et al. (2019).
We use the script prepare_template.py to convert natural language sentences into template-based ones.
Prepare the template-based training data:
python prepare_template.py en zh train.tag.spm t2s
Prepare the template-based inference data:
python prepare_template.py en zh dev.tag.spm t2s.infer
Here is an example:
Filename | Example |
---|---|
en | Let’s look at how faceting works with the number and chart widgets that we added to our <ph> classic designer </ph> <ph> dashboard </ph> . |
zh | 让 我们 查看 多 面 化 如何 使用 添加到 <ph> 经 典 设计 器 </ph> <ph> 仪表板 </ph> 的 数字 和 图表 小部件 。 |
t2s.en | [S0] <ph> [S1] </ph> [S2] <ph> [S3] </ph> [S4] [SEP] [S0] Let’s look at how faceting works with the number and chart widgets that we added to our [S1] classic designer [S2] [S3] dashboard [S4] . |
t2s.zh | [T0] <ph> [T1] </ph> [T2] <ph> [T3] </ph> [T4] [SEP] [T0] 让 我们 查看 多 面 化 如何 使用 添加到 [T1] 经 典 设计 器 [T2] [T3] 仪表板 [T4] 的 数字 和 图表 小部件 。 |
t2s.infer.en | [S0] <ph> [S1] </ph> [S2] <ph> [S3] </ph> [S4] [SEP] [S0] Let ’ s look at how faceting works with the number and chart widgets that we added to our [S1] classic designer [S2] [S3] dashboard [S4] . |
t2s.infer.zh | [T0] |
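Analogously to the lexical case, the structural conversion splits a tagged sentence on the tag tokens and turns the tag skeleton into the template. The sketch below illustrates the idea on whitespace-tokenized text; the real prepare_template.py operates on the SentencePiece output and covers nesting and other edge cases.

# Simplified sketch of the structural (tag-based) template conversion.
def to_tag_template(tokens, tags=("<ph>", "</ph>"), slot="S"):
    skeleton, slots, buf, k = [], [], [], 0
    for tok in tokens + [None]:            # None is a sentinel that closes the last slot
        if tok is None or tok in tags:
            skeleton.append(f"[{slot}{k}]")
            slots.append(" ".join([f"[{slot}{k}]"] + buf))
            buf, k = [], k + 1
            if tok is not None:
                skeleton.append(tok)       # keep the tag itself in the skeleton
        else:
            buf.append(tok)
    return f"{' '.join(skeleton)} [SEP] {' '.join(slots)}"

Running this on the English tokens above reproduces the t2s.en line; using slot="T" on the target tokens gives the t2s.zh line.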
fairseq-preprocess \
-s $src -t $tgt \
--joined-dictionary --srcdict dict.${src}-${tgt}.txt \
--trainpref train.tag.spm.t2s \
--destdir train-bin \
--workers 128
CUDA_VISIBLE_DEVICES=0,1,2,3 fairseq-train train-bin \
--fp16 --seed 32 --ddp-backend no_c10d \
-s $src -t $tgt \
--lr-scheduler cosine \
--lr 1e-07 --max-lr 7e-4 \
--warmup-init-lr 1e-07 --warmup-updates 8000 \
--lr-shrink 1 --lr-period-updates 32000 \
--max-update 40000 \
--weight-decay 0.001 --clip-norm 0.0 \
--dropout 0.2 --attention-dropout 0.2 --activation-dropout 0.2 \
--max-tokens 8192 --update-freq 1 \
--arch transformer --share-all-embeddings \
--encoder-embed-dim 256 --decoder-embed-dim 256 \
--encoder-ffn-embed-dim 1024 --decoder-ffn-embed-dim 1024 \
--encoder-attention-heads 4 --decoder-attention-heads 4 \
--encoder-layers 6 --decoder-layers 6 \
--optimizer adam --adam-betas '(0.9, 0.98)' \
--save-dir ckpts \
--tensorboard-logdir logs \
--criterion label_smoothed_cross_entropy \
--label-smoothing 0.2 \
--no-progress-bar --log-format simple --log-interval 10 \
--no-epoch-checkpoints \
--save-interval-updates 500 --keep-interval-updates 5 \
|& tee -a logs/train.log
fairseq-preprocess \
-s $src -t $tgt \
--joined-dictionary --srcdict dict.${src}-${tgt}.txt \
--testpref dev.tag.spm.t2s.infer \
--destdir infer-bin
genout=output
CUDA_VISIBLE_DEVICES=0 fairseq-generate infer-bin \
-s $src -t $tgt \
--gen-subset test \
--path ${src}-${tgt}.pt \
--lenpen 1.0 --beam 4 \
--prefix-size 1024 \
--batch-size 128 > $genout
We use the script restore_template.py to convert the template-based model output into natural language sentences.
grep ^H $genout |\
sed 's/H-//g' | sort -k 1 -n -t ' ' | awk -F'\t' '{print $3}' > $genout.sort
python restore_template.py $genout.sort $genout.spm
cat $genout.spm | spm_decode --model tag.spm.model > $genout.txt
Prepare the json file using convert2json.py, which is released by this repository:
python convert2json.py \
--input $genout.txt --output $genout.json \
--lang $tgt --type translation --split dev
Perform the evaluation using evaluate.py, which is released by this repository:
python evaluate.py --target ${src}${tgt}_${tgt}_dev.json \
--translation $genout.json | tee $genout.res
Please email wangshuo.thu@gmail.com if you have any questions, suggestions, or bug reports :)
Please cite as:
@inproceedings{Wang:2022:TemplateNMT,
title = {A Template-based Method for Constrained Neural Machine Translation},
author = {Wang, Shuo and Li, Peng and Tan, Zhixing and Tu, Zhaopeng and Sun, Maosong and Liu, Yang},
booktitle = {Proceedings of EMNLP 2022},
year = {2022},
}