Is the objective normal? #8

I ran the code on my dataset, and the objective I got is 32.5354% after 67 iterations. Is this normal? How should I fine-tune the parameters?

Comments
That depends entirely on your dataset. It seems a bit low compared to what I usually get, but it could be reasonable in your case. The only way to know is to evaluate your embeddings somehow, although manually checking the nearest neighbors of a few words is enough to verify that the system is learning something. The mapping method itself does not have any hyperparameters, so there is nothing to explore there. However, you may want to tune the hyperparameters of the embeddings themselves, try different normalization options, or play with the training corpus and dictionary, all of which could make a considerable difference.
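A minimal sketch of that nearest-neighbor sanity check, assuming word2vec-style text embeddings (a "count dim" header line followed by one "word v1 v2 ..." row per word, the format vecmap reads); the file name and query word are placeholders:

    import numpy as np

    def load_embeddings(path):
        # word2vec text format: header "count dim", then "word v1 v2 ..." per line
        with open(path, encoding='utf-8', errors='surrogateescape') as f:
            count, dim = map(int, f.readline().split())
            words = []
            vectors = np.empty((count, dim), dtype=np.float32)
            for i in range(count):
                word, vec = f.readline().rstrip().split(' ', 1)
                words.append(word)
                vectors[i] = np.array(vec.split(), dtype=np.float32)
        return words, vectors

    def nearest_neighbors(query, words, vectors, k=5):
        # cosine similarity of the query word against the whole vocabulary
        normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
        sims = normed @ normed[words.index(query)]
        best = np.argsort(-sims)[1:k + 1]  # index 0 is the query word itself
        return [(words[i], float(sims[i])) for i in best]

    words, vectors = load_embeddings('train.en.vectors')  # placeholder path
    print(nearest_neighbors('dog', words, vectors))

If the top neighbors of common words look sensible, the embeddings are at least learning something useful.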
I manually built a dictionary containing several word pairs for the translation test. The coverage is 100% but the accuracy is 0. Why is the accuracy 0?
I obviously can't know if you don't give more details. What was your training setup (language pair, corpus, embeddings, dictionary...)? What commands did you run to learn the mapping and evaluate it?
The language pair is English to Chinese, and the corpus contains 200w (2,000,000) sentences. The dictionary only contains five word pairs. I ran the command "python3 eval_translation.py train.en.txt.remBlank.tok.bpe.lf.50.mono.vectors.normalized.mapped train.zh.seg.txt.remBlank.bpe.lf.50.mono.vectors.normalized.mapped -d test_dic"
The test_dic is:
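For reference, eval_translation.py expects a plain-text dictionary with one whitespace-separated source-target word pair per line; the following fragment is purely illustrative, not the file from this thread:

    hello 你好
    world 世界
    cat 猫
    dog 狗
    book 书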
So the embeddings were trained on only 200 sentences? That's way too little to get anything reasonable. A training dictionary of only 5 word pairs seems too small as well. In our paper we report positive results starting at 25 word pairs.
@artetxem No, the embeddings were trained on 200w (2,000,000) sentences. I have expanded the dictionary to 25 words, but the accuracy is still 0. Maybe my test dictionary is still too small?
Your test dictionary is indeed very small, and it might be that you also need a larger training dictionary for English-Chinese. I would also recommend trying the numeral-based initialization; I would expect it to be more robust, assuming that there are Arabic numerals in the Chinese training corpus. Also, how did you train your embeddings? What is your vocabulary size?
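For concreteness, a numeral-initialized mapping run would look roughly like this, assuming your copy of vecmap's map_embeddings.py exposes the --init_numerals and --self_learning options (check --help on your version); the file names are placeholders:

    # map the embeddings into a shared space, seeding the self-learning
    # loop with numerals that appear in both vocabularies
    python3 map_embeddings.py --init_numerals --self_learning \
        SRC.emb TRG.emb SRC_MAPPED.emb TRG_MAPPED.emb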
@artetxem Yes, I am using the numeral-based initialization, and the vocabulary size for our model is 30,000. I will test it with a bigger test dictionary. Thank you.
@ZhenYangIACAS Hi, have you solved the problem?
@ZhenYangIACAS @JiqiangLiu Example command line (when training unsupervised en2zh, you need to pass the command-line argument --unsupervised_vocab 8000 to get reasonably good results):
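The example command itself was not preserved above; a plausible sketch, assuming vecmap's documented --unsupervised mode together with the --unsupervised_vocab option mentioned in the comment (file names are placeholders):

    # fully unsupervised en-zh mapping, restricting the unsupervised
    # initialization to the 8000 most frequent words of each language
    python3 map_embeddings.py --unsupervised --unsupervised_vocab 8000 \
        EN.emb ZH.emb EN_MAPPED.emb ZH_MAPPED.emb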