match word2vec.c defaults (& option names? & command-line switches?) more closely #534
Comments
Thanks! Btw
I'd generally argue in favour of keeping gensim APIs more consistent with each other than with external pieces of software, mostly because there's no guarantee about the stability of parameters/defaults of software you don't control (though I admit I don't know much about word2vec.c, i.e. whether it's updated often, whether it's stable, etc.).

I'd also argue that when using a module, internal consistency is very important. E.g. if gensim uses 'workers' as a parameter in most of its classes and 'threads' in word2vec (to match word2vec.c), that could cause some confusion for those familiar with gensim but new to word2vec. Among the many people I know who use gensim's word2vec, most haven't used the original C implementation, having gone straight to gensim/python for ease of use (though this is just anecdotal, I can't speak for wider usage of course :) ).

An easy starting point might just be to add some additional documentation, e.g. "If you want to replicate the original word2vec.c settings, use gensim parameters X." Another alternative could be to add something like a

I guess one thing to consider might be whether to try and optimise the experience of a) new or existing gensim users or b) those trying to replicate/compare directly with word2vec.c.
These are not mutually exclusive goals: we can optimize the experience of new users and also replicate word2vec.c results.
I created a PR to change the default cbow_mean to 1 instead of 0 (#538). This is consistent with word2vec.c behaviour.
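For readers unsure what the cbow_mean switch changes, here is a minimal sketch in plain Python, not gensim's actual code. It assumes the documented meaning of the option: with cbow_mean=1 the context word vectors are averaged, with cbow_mean=0 they are summed. The function name `cbow_hidden` is hypothetical and used only for illustration.

```python
def cbow_hidden(context_vectors, cbow_mean=1):
    """Combine context word vectors into one CBOW input vector.

    Illustrative only: cbow_mean=1 averages the vectors, cbow_mean=0 sums them.
    """
    dims = len(context_vectors[0])
    # Element-wise sum over all context vectors.
    summed = [sum(vec[i] for vec in context_vectors) for i in range(dims)]
    if cbow_mean:
        # Divide by the number of context words to get the mean.
        return [x / len(context_vectors) for x in summed]
    return summed


context = [[1.0, 2.0], [3.0, 4.0]]
mean_vector = cbow_hidden(context, cbow_mean=1)
sum_vector = cbow_hidden(context, cbow_mean=0)
```

With the sum, the magnitude of the combined vector grows with the number of context words, which is one reason the two settings can give different results at the same learning rate.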
Shouting in from the sidelines, sorry if OT. It would be great if the documentation could go into some depth about how each of these parameters might affect the end results.
Per thread https://groups.google.com/d/msg/gensim/ggCHGncd5rU/Z_pQDD69AAAJ, we may want to reduce confusion for people transitioning to/from the original word2vec.c by...
(1) matching its defaults exactly
...or even further...
(2) changing our option names to match theirs in the couple of cases where they disagree
...or maybe even...
(3) giving word2vec.py a main() that allows it to be invoked exactly like word2vec to trigger the same training/saving.
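Options (1) to (3) could be sketched roughly as follows, using only the standard library. This is an illustration under stated assumptions, not gensim's implementation: `build_parser`, `to_gensim_kwargs`, and `C_TO_GENSIM` are hypothetical names; the gensim keyword names (`size`, `workers`, `iter`, `sg`, `cbow_mean`) are the ones in use around the time of this thread, and newer releases may rename some of them; the defaults are as I recall them from word2vec.c's help text and should be checked against the C source.

```python
import argparse

# Hypothetical table: word2vec.c switch (as an argparse dest) -> gensim keyword.
C_TO_GENSIM = {
    "size": "size",
    "window": "window",
    "sample": "sample",
    "hs": "hs",
    "negative": "negative",
    "threads": "workers",  # the one name that clearly differs
    "iter": "iter",
    "min_count": "min_count",
}


def build_parser():
    """Accept word2vec.c-style single-dash switches (assumed defaults)."""
    p = argparse.ArgumentParser(allow_abbrev=False)
    p.add_argument("-train", dest="train", required=True)
    p.add_argument("-output", dest="output", required=True)
    p.add_argument("-size", dest="size", type=int, default=100)
    p.add_argument("-window", dest="window", type=int, default=5)
    p.add_argument("-sample", dest="sample", type=float, default=1e-3)
    p.add_argument("-hs", dest="hs", type=int, default=0)
    p.add_argument("-negative", dest="negative", type=int, default=5)
    p.add_argument("-threads", dest="threads", type=int, default=12)
    p.add_argument("-iter", dest="iter", type=int, default=5)
    p.add_argument("-min-count", dest="min_count", type=int, default=5)
    p.add_argument("-cbow", dest="cbow", type=int, choices=[0, 1], default=1)
    # None means "pick the default that depends on -cbow".
    p.add_argument("-alpha", dest="alpha", type=float, default=None)
    return p


def to_gensim_kwargs(args):
    """Translate parsed word2vec.c-style options into gensim-style keywords."""
    kwargs = {gensim: getattr(args, c) for c, gensim in C_TO_GENSIM.items()}
    kwargs["sg"] = 1 - args.cbow  # -cbow 1 means CBOW, i.e. sg=0
    kwargs["cbow_mean"] = 1       # averaging, per the cbow_mean discussion in this thread
    if args.alpha is None:
        # Assumed word2vec.c behaviour: 0.05 for CBOW, 0.025 for skip-gram.
        kwargs["alpha"] = 0.05 if args.cbow else 0.025
    else:
        kwargs["alpha"] = args.alpha
    return kwargs


argv = ["-train", "corpus.txt", "-output", "vectors.txt", "-threads", "4", "-cbow", "0"]
kwargs = to_gensim_kwargs(build_parser().parse_args(argv))
```

A real main() would then read the file given by -train, pass `kwargs` to the model constructor, and save to the -output path; those steps are left out here so the sketch runs without gensim installed.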