This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

AssertionError in training unsupervised MT #201

Open
lihongzheng-nlp opened this issue Sep 16, 2019 · 7 comments

@lihongzheng-nlp

Following the instructions, I trained unsupervised MT as follows:
python train.py --exp_name unsupMT_zh-en --dump_path ./dumped/ --reload_model 'best-valid_mlm_ppl.pth,best-valid_mlm_ppl.pth' --data_path path/to/data --lgs 'zh-en' --ae_steps 'zh,en' --word_dropout 0.1 --word_blank 0.1 --word_shuffle 3 --lambda_ae '0:1,100000:0.1,300000:0' --encoder_only false --emb_dim 512 --n_layers 6 --n_heads 8 --dropout 0.1 --attention_dropout 0.1 --gelu_activation true --batch_size 32 --bptt 256 --optimizer adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001 --epoch_size 200000 --stopping_criterion 'valid_zh-en_mt_bleu,10' --validation_metrics 'valid_zh-en_mt_bleu'

When I add the parameter --eval_bleu true, the following AssertionError is raised:
Traceback (most recent call last):
  File "train.py", line 323, in <module>
    check_data_params(params)
  File "XLM/src/data/loader.py", line 320, in check_data_params
    assert params.eval_bleu is False or len(params.mt_steps + params.bt_steps) > 0
AssertionError
But without --eval_bleu true it ran successfully until the end of epoch 0, when it gave another AssertionError:
WARNING - 09/16/19 15:22:29 - 0:11:28 - Metric "valid_zh-en_mt_bleu" not found in scores!
Traceback (most recent call last):
  File "train.py", line 327, in <module>
    main(params)
  File "train.py", line 306, in main
    trainer.end_epoch(scores)
  File "XLM/src/trainer.py", line 598, in end_epoch
    assert metric in scores, metric
AssertionError: valid_zh-en_mt_bleu

So what's the problem? How can I run successfully with the parameter --eval_bleu true, without hitting the AssertionError or the warning Metric "valid_zh-en_mt_bleu" not found in scores? Thank you very much!

@glample
Contributor

glample commented Sep 16, 2019

The problem is that you are not training your model to do MT here, only auto-encoding with --ae_steps 'zh,en'. You need to add --mt_steps 'zh-en,zh-en' to do MT if you have parallel datasets, and --bt_steps 'zh-en-zh,en-zh-en' for the back-translation.

Please check https://github.com/facebookresearch/XLM#train-on-unsupervised-mt-from-a-pretrained-model

AssertionError: valid_zh-en_mt_bleu means that you told the script to save the best model based on the valid_zh-en_mt_bleu metric: --validation_metrics 'valid_zh-en_mt_bleu'. But if you don't evaluate BLEU (--eval_bleu false), this metric will not exist at the end of each epoch, so the trainer cannot find it and raises an error.
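The two assertions above boil down to a simple consistency rule between the flags. A minimal sketch (not the actual XLM code; `check_params` and its argument names are hypothetical) of the rule:

```python
# Sketch of the two consistency checks described above, assuming a
# simplified view of XLM's argparse parameters.

def check_params(eval_bleu, mt_steps, bt_steps, validation_metrics, scores):
    # 1) --eval_bleu true only makes sense if some MT step is trained,
    #    i.e. mt_steps or bt_steps is non-empty:
    assert eval_bleu is False or len(mt_steps + bt_steps) > 0
    # 2) every validation metric must actually be produced at evaluation
    #    time, otherwise end_epoch() cannot find it:
    for metric in validation_metrics:
        assert metric in scores, metric

# AE-only run: no mt/bt steps, so eval_bleu must stay False.
check_params(False, [], [], [], {})

# With back-translation, BLEU is evaluated and the metric exists:
check_params(True, [], [("zh", "en", "zh")],
             ["valid_zh-en_mt_bleu"],
             {"valid_zh-en_mt_bleu": 12.3})
```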

@lihongzheng-nlp
Author

lihongzheng-nlp commented Sep 17, 2019

@glample Thanks for your help! After adding both --mt_steps 'zh-en,zh-en' and --bt_steps 'zh-en-zh,en-zh-en', it raised the following error:
  File "XLM/src/data/loader.py", line 267, in check_data_params
    assert len(params.mt_steps) == len(set(params.mt_steps))
AssertionError
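This assertion is a duplicate-step check: each direction may appear at most once in --mt_steps, and 'zh-en,zh-en' lists the same pair twice. A minimal sketch of the rule (the helper `has_duplicates` is a hypothetical name, not XLM's API):

```python
# Each training step may appear only once; a list with repeats has a
# smaller set of unique elements than its length.

def has_duplicates(steps):
    return len(steps) != len(set(steps))

# '--mt_steps zh-en,zh-en' parses to the same pair twice -> rejected:
assert has_duplicates([("zh", "en"), ("zh", "en")])

# '--mt_steps zh-en,en-zh' trains both directions without repetition:
assert not has_duplicates([("zh", "en"), ("en", "zh")])
```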

When replacing them with --mt_steps 'zh-en' --bt_steps 'zh-en-zh', or with just --mt_steps 'zh-en', it raised the following error:
INFO - 09/17/19 09:52:56 - 0:00:02 - ============ Parallel data (en-zh)
Traceback (most recent call last):
  File "train.py", line 327, in <module>
    main(params)
  File "train.py", line 230, in main
    data = load_data(params)
  File "/XLM/src/data/loader.py", line 343, in load_data
    load_para_data(params, data)
  File "/XLM/src/data/loader.py", line 192, in load_para_data
    src_data = load_binarized(src_path, params)
  File "/XLM/src/data/loader.py", line 66, in load_binarized
    assert os.path.isfile(path), path
AssertionError: data/zh-en-0814/fast_bpe/train.en-zh.en.pth

But I don't have the file train.en-zh.en.pth under the data directory. I processed the data following the official instructions; the processed data includes the following files:
├── codes
├── test.en.pth -> test.zh-en.en.pth
├── test.zh-en.en
├── test.zh-en.en.pth
├── test.zh-en.zh
├── test.zh-en.zh.pth
├── test.zh.pth -> test.zh-en.zh.pth
├── train.en
├── train.en.pth
├── train.zh
├── train.zh.pth
├── valid.en.pth -> val.zh-en.en.pth
├── valid.zh-en.en
├── valid.zh-en.zh
├── valid.zh.pth -> val.zh-en.zh.pth
├── val.zh-en.en.pth
├── val.zh-en.zh.pth
├── vocab.en
├── vocab.zh
└── vocab.zh-en

I was wondering whether the parameter --bt_steps is always necessary during training. Could you give me some more detailed guidance on the above errors so I can train the model successfully? I'm just a little confused by so many parameters. Thank you!
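For context on the missing train.en-zh.en.pth: XLM's loader sorts the language pair alphabetically when building parallel-data filenames, which is why it asks for en-zh even though --lgs is 'zh-en'. A hedged sketch of that naming convention (the `para_path` helper is a hypothetical illustration, but its output matches the path in the error above):

```python
import os

# Hypothetical sketch of how parallel-data filenames are derived:
# sort the language pair alphabetically, then use the pattern
# <split>.<l1>-<l2>.<lang>.pth.

def para_path(data_path, split, src, tgt, lang):
    l1, l2 = sorted([src, tgt])  # ('zh', 'en') -> ('en', 'zh')
    return os.path.join(data_path, f"{split}.{l1}-{l2}.{lang}.pth")

# The mt_step 'zh-en' therefore looks for:
print(para_path("data/zh-en-0814/fast_bpe", "train", "zh", "en", "en"))
# -> data/zh-en-0814/fast_bpe/train.en-zh.en.pth
```

So the monolingual files (train.en.pth, train.zh.pth) in the listing above are enough for ae_steps and bt_steps, but mt_steps additionally needs binarized parallel files named with the sorted pair.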

@aconneau

aconneau commented Sep 20, 2019

--mt_steps 'zh-en,zh-en' -> replace with --mt_steps 'zh-en,en-zh' to train both directions. Back-translation (bt_steps) is an essential part of the UnsupMT algorithm, so yes, it is necessary. Please refer to the related papers for its impact on the quality of UnsupMT.

When you say you downloaded the data following the official instructions, I think you're confusing the XNLI preprocessing with the preprocessing for UnsupervisedMT; they are in two different sections of the README. We didn't provide en-zh preprocessing scripts for UnsupervisedMT. If you're doing it yourself, please adapt the files in your path to match the files you would get from, for example, "./get-data-nmt.sh --src en --tgt ro". If you don't have parallel datasets and are doing UnsupMT, remove the "mt_steps".

@lihongzheng-nlp
Author

@aconneau Thanks for your detailed information. I'm sure I downloaded the data for UnsupervisedMT with ./get-data-nmt.sh --src en --tgt fr --reload_codes codes_enfr --reload_vocab vocab_enfr. Maybe I need to go through the whole pipeline from scratch.
By the way, I have been following your excellent work. Thank you!

@JxuHenry

JxuHenry commented Nov 1, 2019

@VictorLi2017 Sir, have you trained the zh-en model? What BLEU did you get?

@conquerSelf

(Quoting @aconneau's reply above.)

Hello! I followed your steps to train unsupervised MT after finishing XLM pre-training, but I hit the following AssertionError:
Traceback (most recent call last):
  File "train.py", line 337, in <module>
    main(params)
  File "train.py", line 290, in main
    trainer.mt_step(lang, lang, params.lambda_ae)
  File "/data/zsj/cgj/xlm_umt/XLM/src/trainer.py", line 851, in mt_step
    enc1 = self.encoder('fwd', x=x1, lengths=len1, langs=langs1, causal=False)
  File "/home/zsj/anaconda3/envs/cgj1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zsj/anaconda3/envs/cgj1/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/zsj/anaconda3/envs/cgj1/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/zsj/anaconda3/envs/cgj1/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/home/zsj/anaconda3/envs/cgj1/lib/python3.6/site-packages/torch/_utils.py", line 394, in reraise
    raise self.exc_type(msg)
AssertionError: Caught AssertionError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/zsj/anaconda3/envs/cgj1/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/zsj/anaconda3/envs/cgj1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/zsj/cgj/xlm_umt/XLM/src/model/transformer.py", line 326, in forward
    return self.fwd(**kwargs)
  File "/data/zsj/cgj/xlm_umt/XLM/src/model/transformer.py", line 346, in fwd
    assert lengths.size(0) == bs
AssertionError
I have tried several ways to solve it but failed. Can you help me solve this problem? Thank you so much!
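One common cause of this particular assertion (assert lengths.size(0) == bs) is wrapping the model in nn.DataParallel, which scatters every input tensor along dim 0: XLM's transformer takes x of shape (slen, bs), so x gets split along the sequence axis while lengths (shape (bs,)) gets split along the batch axis, and the shapes no longer agree on each replica. A torch-free sketch of the invariant (check_batch is a hypothetical helper for illustration, not XLM code):

```python
# The transformer's fwd() expects x of shape (slen, bs) and lengths of
# shape (bs,): one length entry per sentence in the batch.

def check_batch(x_shape, lengths_shape):
    slen, bs = x_shape
    assert lengths_shape[0] == bs, "lengths must have one entry per sentence"

# On a single GPU the shapes agree (e.g. bptt 256, batch_size 32):
check_batch((256, 32), (32,))

# After a naive dim-0 split across 2 GPUs, a replica would see
# x of (128, 32) but lengths of (16,), and the assertion would fire:
# check_batch((128, 32), (16,))  # AssertionError
```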

@hcd7434

hcd7434 commented Nov 25, 2020

(Quoting the original post above.)

What was the MLM ppl when you trained the zh-en pre-trained model, and is the BLEU you obtained from supervised NMT or from UNMT?
