Batch sentences in word2vec #535
Conversation
…n_sentences_sg (FAST_VERSION).
Simplify job loop + merge latest gensim
The improvement on short sentences is nice, but not nearly as great as observed in @gojomo's batching experiments (which I can't find anymore -- can you give the link, Gordon?).
I put my partial work on this (Doc2Vec batching) in PR #536, to be easily found.
Some more experiments: same dataset, same settings, but using CBOW:
10,000-word sentences
10-word sentences
(again, each result comes from only a single run, so there may be considerable variance)
@gojomo this CBOW lift on short documents is more in line with your earlier experiments.
Aha... before batching, each sentence presented to CBOW essentially creates only a few NN-training examples, so fixed per-sentence overhead dominates on short texts. I think my crude early tests (just concatenating sets of 10+ texts without code changes) were likely plain DBOW Doc2Vec (with no word-training and thus no 'window' of any kind) – so the NN-examples were unchanged by concatenation, and the speedup came purely from less per-text overhead. One upshot: if you repeat the SG tests with a smaller window (such as '2'), you may see a bigger relative speedup from batching. (Similarly, with smaller dimensions.)
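A quick back-of-the-envelope check of the window point above: the number of skip-gram training pairs per sentence scales with the window size, while per-sentence overhead is fixed, so a smaller window leaves overhead as a larger fraction of the work. A small counting sketch (assuming the full window at every position; gensim actually samples a reduced window, so real counts are lower):

```python
def skipgram_pairs(sentence_len, window):
    # Count (center, context) training pairs for one sentence,
    # assuming the full window is used at every position.
    return sum(
        min(i, window) + min(sentence_len - 1 - i, window)
        for i in range(sentence_len)
    )

# A 10-word sentence: window=5 gives 70 pairs, window=2 only 34,
# so fixed per-sentence costs weigh roughly twice as heavily
# at window=2 -- and batching relieves exactly those costs.
print(skipgram_pairs(10, 5), skipgram_pairs(10, 2))
```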
Travis fails; it looks like the change in word2vec broke some doc2vec test. @gojomo, does this error ring any immediate bell to you? If not, I'll dig deeper. I thought the API didn't change at all though, so I'm not sure what I missed.
Looks like it's getting a tuple rather than a 'document'-shaped object (something with `words` and `tags` attributes).
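For context, gensim's doc2vec training loop reads attributes off each document, which a bare tuple doesn't expose. A minimal illustration (using a namedtuple shaped like gensim's `TaggedDocument`):

```python
from collections import namedtuple

# Doc2Vec expects 'document'-shaped objects exposing .words and .tags;
# gensim's TaggedDocument is such a namedtuple.
TaggedDocument = namedtuple('TaggedDocument', ['words', 'tags'])

doc = TaggedDocument(words=['batch', 'sentences'], tags=['SENT_0'])
plain = (['batch', 'sentences'], ['SENT_0'])  # a bare tuple

# The training code accesses doc.words; a plain tuple has no such
# attribute, which produces exactly this kind of test failure.
print(hasattr(doc, 'words'), hasattr(plain, 'words'))
```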
@tmylk I'm seeing an unrelated unit test error from "keywords" again -- can you fix it?
This PR changes the way "jobs" are processed in word2vec:
When controlling for the exact same settings (alpha decay etc.), the results are 1:1 bit-for-bit identical, so this is purely an internal refactoring / optimisation. In practice the results can still differ though, because the job batches are a different size than before, causing a slightly different alpha decay.
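The core of the change is that jobs are filled with as many sentences as fit under a word budget, instead of one sentence per job. A minimal sketch of that batching idea (the function name and budget default are illustrative; gensim's real job loop also handles oversized sentences and the worker queue):

```python
def batch_sentences(sentences, batch_words=10000):
    """Group sentences into jobs of at most ~batch_words words each.

    Sketch of the job-batching idea: many short sentences share one
    job, amortizing the per-job overhead (queue hops, Cython call
    setup) that previously hit every sentence.
    """
    job, job_words = [], 0
    for sentence in sentences:
        if job and job_words + len(sentence) > batch_words:
            yield job
            job, job_words = [], 0
        job.append(sentence)
        job_words += len(sentence)
    if job:
        yield job

# 25 ten-word sentences with a 100-word budget -> jobs of 10, 10, 5
# sentences, rather than 25 single-sentence jobs.
jobs = list(batch_sentences([['w'] * 10] * 25, batch_words=100))
print([len(j) for j in jobs])
```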
Benchmarks on the text8 corpus:
256 dim, 1 worker, 10,000-word sentences
256 dim, 4 workers, 10,000-word sentences
256 dim, 1 worker, 10-word sentences
256 dim, 4 workers, 10-word sentences
(each result comes from only a single run, so there may be considerable variance)
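Since text8 is a single long token stream, the 10-word and 10,000-word corpora above can be produced by rechunking it. A sketch of that setup (the `rechunk` helper is illustrative; the gensim call is shown as a comment, using the `size=`/`workers=` parameter names of that era's `Word2Vec` constructor):

```python
def rechunk(tokens, sentence_len):
    """Split one long token stream into fixed-length 'sentences',
    e.g. to build the 10-word vs 10,000-word benchmark corpora
    from text8's single blob of text."""
    return [tokens[i:i + sentence_len]
            for i in range(0, len(tokens), sentence_len)]

# With gensim installed, one benchmark cell would look roughly like:
#   import time
#   from gensim.models import Word2Vec
#   sentences = rechunk(text8_tokens, 10)   # or 10000
#   t0 = time.time()
#   Word2Vec(sentences, size=256, workers=4, sg=1)
#   print('trained in %.1fs' % (time.time() - t0))

print([len(s) for s in rechunk(list('abcdefghij'), 3)])
```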
TODO: