
Issue with MT model training from pretrained model (Vector size mismatch) #144

Open
athuljayson opened this issue Jul 29, 2019 · 8 comments


@athuljayson

I tried to train an unsupervised MT model using the command below, taken from the repo README.

python train.py --exp_name unsupMT_enfr --dump_path ./dumped/ --reload_model '../models/mlm_enfr_1024.pth,../models/mlm_enfr_1024.pth' --data_path ./data/processed/en-fr/ --lgs 'en-fr' --ae_steps 'en,fr' --bt_steps 'en-fr-en,fr-en-fr' --word_shuffle 3 --word_dropout 0.1 --word_blank 0.1 --lambda_ae '0:1,100000:0.1,300000:0' --encoder_only false --emb_dim 1024 --n_layers 6 --n_heads 8 --dropout 0.1 --attention_dropout 0.1 --gelu_activation true --tokens_per_batch 2000 --batch_size 32 --bptt 256 --optimizer adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001 --epoch_size 200000 --eval_bleu true --stopping_criterion 'valid_en-fr_mt_bleu,10' --validation_metrics 'valid_en-fr_mt_bleu'  

I also created the binarised en-fr dataset by following the documentation.

However, I am getting the following error:

size mismatch for embeddings.weight: copying a param with shape torch.Size([64139, 1024]) from checkpoint, the shape in current model is torch.Size([60374, 1024]).
size mismatch for pred_layer.proj.weight: copying a param with shape torch.Size([64139, 1024]) from checkpoint, the shape in current model is torch.Size([60374, 1024]).
size mismatch for pred_layer.proj.bias: copying a param with shape torch.Size([64139]) from checkpoint, the shape in current model is torch.Size([60374]).

Any help with the same would be much appreciated. Thanks

@glample
Contributor

glample commented Jul 29, 2019

Hi,

It seems that the binarized dataset you are using was created without the BPE codes. Did you download the provided en-fr BPE codes and preprocess the dataset with them? Be careful: if you initially preprocessed the dataset without them and downloaded them afterwards, the script will not do anything, as it will already see a .pth file and assume there is nothing left to do. If you still have issues, I would recommend removing everything (except the downloaded raw corpora) and restarting the preprocessing step.
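
If it helps to confirm the diagnosis, you can compare the vocabulary size of the binarized data with the embedding size stored in the pretrained checkpoint. This is only a sketch: the key names ('dico', 'model', 'embeddings.weight') are assumptions about how these .pth files are laid out and may differ in your version, and it should be run from the repo root so the pickled Dictionary class can be unpickled.

import torch

# Paths taken from the command at the top of this issue; adjust to your setup.
data = torch.load('./data/processed/en-fr/train.en.pth')
ckpt = torch.load('../models/mlm_enfr_1024.pth', map_location='cpu')

# Assumption: the binarized file stores its dictionary under 'dico' and the
# checkpoint stores the model weights under 'model'.
data_vocab = len(data['dico'])
ckpt_vocab = ckpt['model']['embeddings.weight'].shape[0]

print('binarized data vocab:', data_vocab)   # e.g. 60374 in the error above
print('pretrained ckpt vocab:', ckpt_vocab)  # e.g. 64139 in the error above
# If these differ, the data was binarized without the released BPE codes/vocab
# and the preprocessing step needs to be redone.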

@athuljayson
Author

@glample Thanks! That helped. However, I am still facing GPU memory issues while starting training.

RuntimeError: CUDA out of memory. Tried to allocate 32.00 MiB (GPU 1; 5.94 GiB total capacity; 5.37 GiB already allocated; 1.31 MiB free; 52.46 MiB cached)

Do you reckon it will be possible to train the MT model using a couple of GTX 1060s (6 GB VRAM each)?
Also, what are the lowest values of batch_size and tokens_per_batch that I can set without hurting model performance?

@glample
Contributor

glample commented Jul 30, 2019

I'm afraid 6 GB is not enough. You can try, but I'm not sure it will fit even with a batch size of 1. Otherwise, you would need to use fewer layers or a smaller dimension than the pretrained models (6 layers, dimension 1024).
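
For a rough sense of the numbers, here is a back-of-envelope sketch (the per-layer formulas are generic transformer approximations, not this repo's exact layout):

# Approximate parameter count for the config in this issue.
emb_dim, n_layers, vocab = 1024, 6, 64139

embeddings = vocab * emb_dim            # shared input embeddings
enc_layer  = 12 * emb_dim ** 2          # self-attention + FFN, roughly
dec_layer  = 16 * emb_dim ** 2          # plus cross-attention, roughly
pred_layer = vocab * emb_dim            # output projection

params = embeddings + n_layers * (enc_layer + dec_layer) + pred_layer
print('~%.0fM parameters' % (params / 1e6))                  # about 300M

# fp32 weights + gradients + Adam's two moment buffers is roughly
# 16 bytes per parameter, before counting any activations.
print('~%.1f GB before activations' % (params * 16 / 1e9))   # about 5 GB

Activations then scale with tokens_per_batch, so even very small batches leave little headroom on a 6 GB card.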

@athuljayson
Author

athuljayson commented Jul 30, 2019

Oh, I see. Then it looks like I will have to do the training on a cloud platform.

May I ask what the recommended amount of GPU RAM is for training an MT model from the provided pretrained model?

@glample
Contributor

glample commented Jul 30, 2019

16 GB is good. 32 GB is even better if you can (larger batches work better, and you can fit larger batches with 32 GB).

@athuljayson
Author

@glample On a related note, is it possible to run train.py on a CPU machine with 32 GB RAM? If so, what parameters need to be set?

I checked out the code but couldn't find any way to start training with the CPU.

@glample
Contributor

glample commented Jul 31, 2019

Hi,

I never tried, but I expect that the model would be too slow to run on CPU. I don't think you can train anything on CPU in a reasonable amount of time.

@Tikquuss

I'm stuck on this error; please help me get past it.

I am running the unsupervised machine translation from a pretrained cross-lingual language model in Google Colab. The language model trains successfully, but the MT training breaks with the following error:

Illegal division by zero at /content/XLM/src/evaluation/multi-bleu.perl line 154, line 10.
WARNING - 04/22/20 10:51:57 - 0:02:45 - Impossible to parse BLEU score!

By looking at the hypothesis files (hyp0.?-?.valid.txt, hyp0.?-?.test.txt ... in dumped/unsupMT?-?/???????/hypotheses), which are supposed to contain the translations produced by the MT model at that point, I noticed that they are all empty, whereas the reference files (ref.?-?.valid.txt, ref.?-?.test.txt), which contain the target translations, are not.

This is consistent with the error, because lines 153-155 of multi-bleu.perl contain the following (with empty hypothesis files, $length_translation is 0, which is exactly the division by zero):

if ($length_translation<$length_reference) {
    $brevity_penalty = exp(1-$length_reference/$length_translation);
}

So the question is why the hypothesis files are empty.
I've done some digging in train.py and src/trainer.py with no luck.
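
In case it helps others, here is a small sanity check that confirms the problem before multi-bleu.perl is even called (the file paths are hypothetical and follow the pattern above; substitute your own --exp_name and run id):

import glob

def n_tokens(path):
    with open(path) as f:
        return sum(len(line.split()) for line in f)

# Hypothetical dump location; adjust to your experiment name and run id.
hyp_files = sorted(glob.glob('dumped/unsupMT_enfr/*/hypotheses/hyp0.*.valid.txt'))
ref_files = sorted(glob.glob('dumped/unsupMT_enfr/*/hypotheses/ref.*.valid.txt'))

for path in hyp_files + ref_files:
    print(path, n_tokens(path), 'tokens')
# If every hyp* file reports 0 tokens while the ref* files do not, the model is
# generating empty translations and BLEU cannot be computed.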
