
the objective is normal? #8

Open
ZhenYangIACAS opened this issue Nov 7, 2017 · 11 comments

Comments

@ZhenYangIACAS

I ran the code on my dataset, and the objective I got is 32.5354% after 67 iterations. Is this normal? How should I fine-tune the parameters?

@artetxem
Owner

artetxem commented Nov 7, 2017

That depends entirely on your dataset. It seems a bit low compared to what I usually get, but it could be reasonable in your case. The only way to know is to somehow evaluate your embeddings, although manually checking the nearest neighbors of a few words is enough to see that the system is learning something.
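As a concrete illustration of that sanity check, here is a minimal NumPy sketch of a cosine nearest-neighbor lookup over an embedding matrix. This is not code from this repository; the toy vocabulary and vectors are made up for the example:

```python
import numpy as np

def nearest_neighbors(query, words, matrix, k=5):
    """Return the k words whose vectors are closest (by cosine) to `query`'s vector."""
    idx = words.index(query)
    normed = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    sims = normed @ normed[idx]          # cosine similarity to every word
    order = np.argsort(-sims)            # most similar first
    return [words[i] for i in order if i != idx][:k]

# Toy vocabulary: "cat" and "dog" point in similar directions, "car" does not.
words = ["cat", "dog", "car"]
matrix = np.array([[1.0, 0.1], [0.9, 0.2], [-0.1, 1.0]])
print(nearest_neighbors("cat", words, matrix, k=1))  # → ['dog']
```

If the neighbors of common words look like random noise, the embeddings (or the mapping) have not learned anything useful yet.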

The mapping method itself does not have any hyperparameter, so there is nothing to explore there. However, you may want to tune the hyperparameters of the embeddings themselves, try different normalization options, or play with the training corpus and dictionary, which could all make a considerable difference.

@ZhenYangIACAS
Author

I manually built a dictionary containing several word pairs for the translation test. The coverage is 100% but the accuracy is 0. Why is the accuracy 0?

@artetxem
Owner

artetxem commented Nov 7, 2017

I obviously can't tell unless you give more details. What was your training setup (language pair, corpus, embeddings, dictionary...)? What commands did you run to learn the mapping and evaluate it?

@ZhenYangIACAS
Author

The language pair is English to Chinese, and the corpus contains 200w (2,000,000) sentences. The dictionary only contains five word pairs. I ran the command "python3 eval_translation.py train.en.txt.remBlank.tok.bpe.lf.50.mono.vectors.normalized.mapped train.zh.seg.txt.remBlank.bpe.lf.50.mono.vectors.normalized.mapped -d test_dic"

@ZhenYangIACAS
Author

ZhenYangIACAS commented Nov 7, 2017

The test_dict is:
word 词语
I 我
you 他
hello 你好
hi 你好
thanks 谢谢
word 词
I 我们
And the mapped embeddings were obtained following the example in the README.
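For context on how a dictionary like the one above gets scored: each line is a whitespace-separated source/target pair, and repeated source words (e.g. the two lines for "word" and "I") act as alternative gold translations. Below is a simplified sketch of nearest-neighbor translation accuracy in that spirit; it is not the actual eval_translation.py script (which also handles coverage, batching, etc.), and all names and toy data here are illustrative:

```python
import numpy as np
from collections import defaultdict

def translation_accuracy(pairs, src_words, src_emb, trg_words, trg_emb):
    """Fraction of source words whose nearest target-space neighbor (by cosine)
    is among ANY of their gold translations; repeated source entries are merged."""
    gold = defaultdict(set)
    for s, t in pairs:
        gold[s].add(t)
    src_idx = {w: i for i, w in enumerate(src_words)}
    sn = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tn = trg_emb / np.linalg.norm(trg_emb, axis=1, keepdims=True)
    correct = 0
    for s, translations in gold.items():
        sims = tn @ sn[src_idx[s]]                     # cosine vs. every target word
        if trg_words[int(np.argmax(sims))] in translations:
            correct += 1
    return correct / len(gold)

# Toy check: matched 2-D embeddings on both sides, so every source word
# should retrieve its gold pair and accuracy should be 1.0.
pairs = [("hello", "你好"), ("thanks", "谢谢"), ("hi", "你好")]
src = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
trg = np.array([[1.0, 0.0], [0.0, 1.0]])
print(translation_accuracy(pairs, ["hello", "thanks", "hi"], src,
                           ["你好", "谢谢"], trg))  # → 1.0
```

With mapped spaces that are not actually aligned, the nearest neighbor is essentially arbitrary, which is one way a tiny dictionary can score exactly 0.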

@artetxem
Owner

artetxem commented Nov 7, 2017

So the embeddings were trained on only 200 sentences? That's way too little to get anything reasonable. A training dictionary of only 5 word pairs seems too small as well. In our paper we report positive results starting at 25 word pairs.

@ZhenYangIACAS
Author

@artetxem No, the embeddings were trained on 200w (2,000,000) sentences. I have expanded the dictionary to 25 words, and the accuracy is still 0. Maybe my test dictionary is still too small?

@artetxem
Owner

artetxem commented Nov 9, 2017

Your test dictionary is indeed very small, and it might be that you also need a larger training dictionary for English-Chinese. I would also recommend you try the numeral-based initialization; I would expect it to be more robust, assuming that there are Arabic numerals in the Chinese training corpus. Also, how did you train your embeddings? What is your vocabulary size?
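The numeral-based initialization builds its seed dictionary from tokens written identically as Arabic digits in both vocabularies, which is why it needs numerals in the Chinese corpus. A minimal sketch of that idea (a hypothetical helper for illustration, not the repository code):

```python
import re

def numeral_dictionary(src_vocab, trg_vocab):
    """Seed dictionary from tokens made only of Arabic digits that appear
    in both vocabularies (e.g. '1984' pairs with '1984')."""
    is_num = re.compile(r"^[0-9]+$")
    trg_set = set(trg_vocab)
    return [(w, w) for w in src_vocab if is_num.match(w) and w in trg_set]

print(numeral_dictionary(["the", "1984", "25", "cat"],
                         ["的", "1984", "猫", "25"]))
# → [('1984', '1984'), ('25', '25')]
```

If the Chinese side uses only Chinese numerals (一, 二, ...), no such pairs exist and this initialization has nothing to anchor on.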

@ZhenYangIACAS
Author

@artetxem Yes, I am using the numeral-based initialization, and the vocabulary size for our model is 30000. I will test it with a bigger test dictionary. Thank you.

@liujiqiang999

@ZhenYangIACAS Hi, have you solved the problem?

@IT-coach-666

@ZhenYangIACAS @JiqiangLiu Example command line (for unsupervised en2zh training, you need to pass the command-line argument --unsupervised_vocab 8000 to get reasonably good results):
python map_embeddings.py --unsupervised --unsupervised_vocab 8000 ./jy_data/model_en.vec ./jy_data/model_zh_j.vec ./jy_data/model_en_mapped2.vec ./jy_data/model_zh_j_mapped2.vec --cuda
