
Issue with MT model training from pretrained model (Vector size mismatch) #144

Open
athuljayson opened this issue Jul 29, 2019 · 8 comments


@athuljayson

I tried to train an unsupervised MT model using the command below, taken from the repo README.

python train.py --exp_name unsupMT_enfr --dump_path ./dumped/ --reload_model '../models/mlm_enfr_1024.pth,../models/mlm_enfr_1024.pth' --data_path ./data/processed/en-fr/ --lgs 'en-fr' --ae_steps 'en,fr' --bt_steps 'en-fr-en,fr-en-fr' --word_shuffle 3 --word_dropout 0.1 --word_blank 0.1 --lambda_ae '0:1,100000:0.1,300000:0' --encoder_only false --emb_dim 1024 --n_layers 6 --n_heads 8 --dropout 0.1 --attention_dropout 0.1 --gelu_activation true --tokens_per_batch 2000 --batch_size 32 --bptt 256 --optimizer adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001 --epoch_size 200000 --eval_bleu true --stopping_criterion 'valid_en-fr_mt_bleu,10' --validation_metrics 'valid_en-fr_mt_bleu'  

I also created the binarised en-fr dataset by following the documentation.

However, I am getting the following error:

size mismatch for embeddings.weight: copying a param with shape torch.Size([64139, 1024]) from checkpoint, the shape in current model is torch.Size([60374, 1024]).
size mismatch for pred_layer.proj.weight: copying a param with shape torch.Size([64139, 1024]) from checkpoint, the shape in current model is torch.Size([60374, 1024]).
size mismatch for pred_layer.proj.bias: copying a param with shape torch.Size([64139]) from checkpoint, the shape in current model is torch.Size([60374]).

Any help with the same would be much appreciated. Thanks

@glample
Contributor

glample commented Jul 29, 2019

Hi,

It seems that the binarized dataset you are using was created without the BPE codes. Did you download the provided en-fr BPE codes and preprocess the dataset with them? Be careful: if you initially preprocessed the dataset without them and downloaded them afterwards, the script will not do anything, as it will already see a .pth file and assume there is nothing left to do. If you still have issues, I would recommend removing everything (except the downloaded raw corpora) and restarting the preprocessing step.
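
If it helps to confirm the diagnosis, you can compare the vocabulary size of the binarized data with the embedding size stored in the pretrained checkpoint. This is only a sketch: the key names ('dico', 'model', 'embeddings.weight') are assumptions about how these .pth files are laid out and may differ in your version, and it should be run from the repo root so the pickled Dictionary class can be unpickled.

import torch

# Paths taken from the command at the top of this issue; adjust to your setup.
data = torch.load('./data/processed/en-fr/train.en.pth')
ckpt = torch.load('../models/mlm_enfr_1024.pth', map_location='cpu')

# Assumption: the binarized file stores its dictionary under 'dico' and the
# checkpoint stores the model weights under 'model'.
data_vocab = len(data['dico'])
ckpt_vocab = ckpt['model']['embeddings.weight'].shape[0]

print('binarized data vocab:', data_vocab)   # e.g. 60374 in the error above
print('pretrained ckpt vocab:', ckpt_vocab)  # e.g. 64139 in the error above
# If these differ, the data was binarized without the released BPE codes/vocab
# and the preprocessing step needs to be redone.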

@athuljayson
Author

@glample Thanks! That helped. However, I am still facing GPU memory issues while starting training.

RuntimeError: CUDA out of memory. Tried to allocate 32.00 MiB (GPU 1; 5.94 GiB total capacity; 5.37 GiB already allocated; 1.31 MiB free; 52.46 MiB cached)

Do you reckon it will be possible to train the MT model using a couple of GTX 1060s (6 GB VRAM each)?
Also, what are the lowest values of batch_size and tokens_per_batch that I can set without hurting model performance?

@glample
Contributor

glample commented Jul 30, 2019

I'm afraid 6 GB is not enough. You can try, but I'm not sure it will fit even with a batch size of 1. Otherwise, you would need to use fewer layers or a smaller dimension than the pretrained models (6 layers, dimension 1024).
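
For a rough sense of the numbers, here is a back-of-envelope sketch (the per-layer formulas are generic transformer approximations, not this repo's exact layout):

# Approximate parameter count for the config in this issue.
emb_dim, n_layers, vocab = 1024, 6, 64139

embeddings = vocab * emb_dim            # shared input embeddings
enc_layer  = 12 * emb_dim ** 2          # self-attention + FFN, roughly
dec_layer  = 16 * emb_dim ** 2          # plus cross-attention, roughly
pred_layer = vocab * emb_dim            # output projection

params = embeddings + n_layers * (enc_layer + dec_layer) + pred_layer
print('~%.0fM parameters' % (params / 1e6))                  # about 300M

# fp32 weights + gradients + Adam's two moment buffers is roughly
# 16 bytes per parameter, before counting any activations.
print('~%.1f GB before activations' % (params * 16 / 1e9))   # about 5 GB

Activations then scale with tokens_per_batch, so even very small batches leave little headroom on a 6 GB card.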

@athuljayson
Author

athuljayson commented Jul 30, 2019

Oh, I see. Then it looks like I will have to do the training on a cloud platform.

May I ask what the recommended amount of GPU RAM is for training an MT model from the provided pretrained model?

@glample
Contributor

glample commented Jul 30, 2019

16 GB is good. 32 GB is even better if you can (larger batches work better, and you can fit larger batches with 32 GB).

@athuljayson
Author

@glample On a related note, is it possible to run train.py on a CPU machine with 32 GB RAM? If so, what parameters need to be set?

I checked out the code but couldn't find any way to start training with the CPU.

@glample
Contributor

glample commented Jul 31, 2019

Hi,

I never tried, but I expect that the model would be too slow to run on CPU. I don't think you can train anything on CPU in a reasonable amount of time.

@Tikquuss

I'm stuck on this error; please help me get past it.

I am running the unsupervised machine translation from a pretrained cross-lingual language model in Google Colab. The language model trains successfully, but the MT training breaks with the following error:

Illegal division by zero at /content/XLM/src/evaluation/multi-bleu.perl line 154, line 10.
WARNING - 04/22/20 10:51:57 - 0:02:45 - Impossible to parse BLEU score!

By looking at the hypothesis files (hyp0.?-?.valid.txt, hyp0.?-?.test.txt ... in dumped/unsupMT?-?/???????/hypotheses), which are supposed to contain the translations produced by the MT model at that point, I noticed that they are all empty, whereas the reference files (ref.?-?.valid.txt, ref.?-?.test.txt), which contain the target translations, are not.

This is consistent with the error, because lines 153-155 of multi-bleu.perl contain the following (with empty hypothesis files, $length_translation is 0, which is exactly the division by zero):

if ($length_translation<$length_reference) {
    $brevity_penalty = exp(1-$length_reference/$length_translation);
}

So the question is why the hypothesis files are empty.
I've done some digging in train.py and src/trainer.py with no luck.
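
In case it helps others, here is a small sanity check that confirms the problem before multi-bleu.perl is even called (the file paths are hypothetical and follow the pattern above; substitute your own --exp_name and run id):

import glob

def n_tokens(path):
    with open(path) as f:
        return sum(len(line.split()) for line in f)

# Hypothetical dump location; adjust to your experiment name and run id.
hyp_files = sorted(glob.glob('dumped/unsupMT_enfr/*/hypotheses/hyp0.*.valid.txt'))
ref_files = sorted(glob.glob('dumped/unsupMT_enfr/*/hypotheses/ref.*.valid.txt'))

for path in hyp_files + ref_files:
    print(path, n_tokens(path), 'tokens')
# If every hyp* file reports 0 tokens while the ref* files do not, the model is
# generating empty translations and BLEU cannot be computed.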
