match word2vec.c defaults (& option names? & command-line switches?) more closely #534

Open
gojomo opened this issue Nov 19, 2015 · 5 comments
Labels
difficulty easy Easy issue: required small fix documentation Current issue related to documentation

Comments

@gojomo
Collaborator

gojomo commented Nov 19, 2015

Per thread https://groups.google.com/d/msg/gensim/ggCHGncd5rU/Z_pQDD69AAAJ, we may want to reduce confusion for people transitioning to/from the original word2vec.c by...

(1) matching its defaults exactly

...or even further...

(2) changing our option names to match theirs in the couple of cases where they disagree

...or maybe even...

(3) giving word2vec.py a main() that allows it to be invoked exactly like word2vec to trigger the same training/saving.
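A rough sketch of what option (3) could look like, assuming an `argparse` front end that accepts word2vec.c-style single-dash switches and translates them into gensim keyword names. The helper names `build_parser` and `gensim_kwargs` are hypothetical, only a subset of switches is shown, and the default values are assumptions that should be checked against the word2vec.c revision being matched.

```python
import argparse


def build_parser():
    """Build a parser for a subset of word2vec.c-style switches (sketch)."""
    p = argparse.ArgumentParser(description="word2vec.c-style front end (sketch)")
    p.add_argument("-train", required=True, help="training text file")
    p.add_argument("-output", required=True, help="file to save the vectors to")
    # Defaults below are assumed, not authoritative: verify against word2vec.c.
    p.add_argument("-size", type=int, default=100)
    p.add_argument("-window", type=int, default=5)
    p.add_argument("-min-count", type=int, default=5)  # stored as args.min_count
    p.add_argument("-threads", type=int, default=12)
    p.add_argument("-cbow", type=int, choices=[0, 1], default=1)
    return p


def gensim_kwargs(args):
    """Translate parsed word2vec.c-style options into gensim-style keywords."""
    return {
        "size": args.size,
        "window": args.window,
        "min_count": args.min_count,
        "workers": args.threads,      # gensim says 'workers', word2vec.c says 'threads'
        "sg": 0 if args.cbow else 1,  # the flag is inverted: -cbow 1 means sg=0
    }
```

A `main()` could then parse `sys.argv[1:]` with `build_parser()`, pass `gensim_kwargs(args)` to `Word2Vec`, and save the result to `args.output`. Keeping the translation in one function leaves the gensim-side names untouched.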

@piskvorky piskvorky added documentation Current issue related to documentation difficulty easy Easy issue: required small fix labels Nov 19, 2015
@piskvorky
Owner

Thanks!

Btw word2vec.py already has a non-public main, used for testing. We can definitely replace that with something more flexible and robust.

@davechallis
Contributor

I'd generally argue in favour of keeping gensim APIs consistent with each other rather than with external pieces of software, mostly because there's no guarantee about the stability of parameters/defaults of software you don't control (though I admit I don't know much about word2vec.c, i.e. whether it's updated often, whether it's stable, etc.).

I'd also argue that when using a module, internal consistency is very important. E.g. if gensim uses 'workers' as a parameter in most of its classes and 'threads' in word2vec (to match word2vec.c), that could cause some confusion to those familiar with gensim, but new to word2vec.

Amongst many people I know that use gensim's word2vec, most haven't used the original C implementation, having gone straight to gensim/python for ease of use (though this is just anecdotal, I can't speak for wider usage of course :) ).

An easy starting point might just be to add some additional documentation, e.g. "If you want to replicate the original word2vec.c settings, use gensim parameters X."

Another alternative could be to add something like a new_with_word2vec_c_settings classmethod returning a Word2Vec instance that uses whatever the current word2vec.c settings are.
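As a minimal sketch of that alternative, written as a standalone factory rather than a real gensim classmethod: the dictionary holds gensim-style keyword names, and every value in it is an assumption to be verified against the word2vec.c source being compared with. The model class is passed in as an argument so the sketch does not depend on gensim itself.

```python
# Assumed values, not authoritative: check each one against word2vec.c.
WORD2VEC_C_DEFAULTS = {
    "size": 100,
    "window": 5,
    "min_count": 5,
    "sample": 1e-3,
    "hs": 0,
    "negative": 5,
    "iter": 5,
    "workers": 12,   # word2vec.c calls this 'threads'
    "sg": 0,         # CBOW
    "cbow_mean": 1,  # the value discussed in this thread
}


def new_with_word2vec_c_settings(model_cls, sentences=None, **overrides):
    """Instantiate model_cls with word2vec.c-like settings (hypothetical helper).

    Explicit keyword overrides win over the assumed defaults.
    """
    params = dict(WORD2VEC_C_DEFAULTS)
    params.update(overrides)
    return model_cls(sentences, **params)
```

With gensim installed, this would be called as `new_with_word2vec_c_settings(Word2Vec, sentences, window=10)`. Newer gensim releases have renamed some of these keywords (for example `size` and `iter`), so the keys may need adjusting for the installed version.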

I guess one thing to consider might be whether to try and optimise the experience of a) new or existing gensim users or b) those trying to replicate/compare directly with word2vec.c.

@Hugo-W

Hugo-W commented Nov 19, 2015

These goals are not mutually exclusive: you can optimize the experience of new users and also replicate word2vec.c results.
You would also optimize the experience of new gensim users by having consistent parameters. Obviously it is better to share names across gensim's API, but the default values are the ones you try first.
And honestly, I found gensim's word2vec after word2vec.c, and I think that generalizes to most non-gensim users. The usual path to it is either reading Mikolov's paper, which explicitly refers to its available code, or googling "word2vec" and ending up on the first link, "Google's code for Word2Vec"...
Then a day later you want a python version of it :)

@akutuzov
Contributor

I created a PR to change the default cbow_mean to 1 instead of 0 (#538). This is consistent with word2vec.c behaviour.

@timClicks
Contributor

Shouting in from the sidelines, sorry if OT. It would be great if the documentation could go into some depth about how each of these parameters might affect the end results.


6 participants