
Vectorize word2vec.predict_output_word for speed #3153

Merged (9 commits) on Jul 19, 2021

Conversation

@M-Demay (Contributor) commented May 18, 2021

Motivation: in a project of mine I was making repeated calls to gensim.models.word2vec.Word2Vec.predict_output_word, and it was far too slow. I dug into the source code and found a call to the built-in function sum; I replaced it with a call to np.sum.
Since I was calling predict_output_word many times, I suspected I was calling wv.get_index redundantly. Consequently, I wanted to give the user (me, in the first place) the option of calling predict_output_word with the context parameter as a list of word indices instead of words, so I implemented that as well. In the process I added several sanity checks for the sake of code robustness and kept the API compatible with the existing behaviour (for example by using logger.warning). I also updated the method's docstring to make the new behaviour explicit.
Note that at line 1842 of gensim/models/word2vec.py (in the version I am pushing) I removed the condition if not word2_indices, because it was already tested earlier.

In the end, replacing sum with np.sum divided the per-call time spent in predict_output_word by a factor of about 4 in an experiment of mine. The second change (context as a list of indices instead of words) saved a further 33% of per-call execution time, for a very appreciable overall gain. I did not investigate these gains further, but the experiments consisted of a single run with several thousand calls to predict_output_word.
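For reference, a minimal sketch of the kind of change involved – this assumes the method sums the context word vectors before the output-layer product, and the names (wv, model, word2_indices) are illustrative stand-ins, not the exact gensim source:

```python
import numpy as np

# before: the built-in sum adds the context rows one at a time in Python
# l1 = sum(wv.vectors[i] for i in word2_indices)

# after: np.sum reduces over the row axis in a single vectorized call
l1 = np.sum(wv.vectors[word2_indices], axis=0)
if model.cbow_mean:
    l1 /= len(word2_indices)  # CBOW-mean averages rather than sums the context
```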

Remark: I ran $ tox -e py36-linux and the last 4 tests in gensim/test/test_api.py failed. Note, however, that they also failed before my changes. They seem to be network-related (I work behind a proxy that causes me daily trouble), and I do not think my changes affected them. No other test failed.

Suggestion: unit tests could be added to check/improve robustness for the case where the user passes context as a list of ints. I have not taken the time to do that so far.

Could you please notify me if and when these changes are released in official versions (especially on pip)? Thank you for reviewing!

Mathis added 2 commits May 18, 2021 10:43
…ed a call to sum to numpy.sum to gain performance.
…sibility for the user to input a list of word indices as parameter 'context' instead of a list of words.
@piskvorky (Owner) commented May 18, 2021

Thanks. When you say "I gained a factor of about 4 in per-call time", what do you mean? What is the total performance gain on some standard corpus (like wikipedia, or text9) – is it noticeable?

The np.sum fix seems innocuous (fine), but I'm a mild -1 on the indices code path. Trading a 33% speed-up for such an increase in code complexity is not worth it IMO. @mpenkov @gojomo WDYT?

@M-Demay (Contributor, Author) commented May 19, 2021

Here are some results from my machine, obtained by profiling my scripts with cProfile.

np.sum: first_tests.zip compares calls to predict_output_word with and without the np.sum change on the 20newsgroups dataset (available through sklearn). In both cases I ran a script on the first 10 samples of the dataset. I did not run it on more data at that stage, because it was slow and the difference was already meaningful. When I say "I gained a factor of about 4 in per-call time", I mean that the number in the last percall column was divided by roughly 4 (from 0.01314 to 0.00295) between test_performance.profile (without numpy) and test_performance_numpy.profile (with numpy). My script basically calls predict_output_word on each word to see whether the model can predict it accurately. The other parameters are left at their default values.
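For reproducibility, a sketch of how such profiles can be produced with cProfile – run_predictions is a hypothetical stand-in for my actual driver script, not its real name:

```python
import cProfile
import pstats

def run_predictions(model, texts):
    # call predict_output_word on each word's surrounding window (simplified)
    for sentence in texts:
        for i, word in enumerate(sentence):
            context = sentence[max(0, i - 5):i] + sentence[i + 1:i + 6]
            model.predict_output_word(context, topn=10)

cProfile.run("run_predictions(model, texts)", "test_performance.profile")
stats = pstats.Stats("test_performance.profile")
stats.sort_stats("cumulative").print_stats(10)  # the 'percall' column is time per call
```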

str vs int input for context: context_str.zip reports experiments with the same behaviour (calls to predict_output_word on every word, same script), comparing context as a list of str (test_performance_20ng_1000samples_str.profile) against a list of int (test_performance_20ng_1000samples_idx.profile). This time I tested on the first 1000 samples of the 20newsgroups training subset, which is about 11000 samples in total. The last percall column shows a decrease from 0.0068 to 0.0040 s/call for predict_output_word, a gain of about 10 min out of 30 min of total script execution. (Some computations that your implementation performed are now done outside predict_output_word, which is why the overall gain is not as large as the time saved in predict_output_word alone.) I call that meaningful for my experiment. On the other hand, I understand your argument: I know it is not a big improvement, but I had to code it to assess the performance gain, so here it is – I submitted it anyway to get your opinion.

context_str.zip
first_tests.zip

@piskvorky (Owner) commented:

Thanks for the details, but can you post the overall numbers from a larger, end-to-end run? Isolated micro-benchmarks like "from 0.01314 to 0.00295" can be misleading. Thanks!

@M-Demay (Contributor, Author) commented May 19, 2021

I ran my script on the first 100 samples of 20newsgroups. Here are the profiles. In short, the total time goes from 485 s (8 min) to 183 s (3 min). If you look at the profiles in detail, it is crystal clear: the 5 minutes are saved on the calls to sum.
The next most time-consuming steps are the list comprehension that my other commit tries to address, then np.dot and np.argsort, which are already highly optimized.

np_gain.zip
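For context, a hedged sketch of the core computation in predict_output_word that these profiles point at – simplified from the gensim source, and assuming the negative-sampling output weights live in model.syn1neg:

```python
import numpy as np

# l1: the (averaged) sum of the context word vectors, as sketched earlier
prob_values = np.exp(np.dot(l1, model.syn1neg.T))  # raw activation for every vocab word
prob_values /= np.sum(prob_values)                 # normalize to probabilities
top_indices = np.argsort(-prob_values)[:topn]      # the argsort visible in the profiles
predictions = [(model.wv.index_to_key[i], prob_values[i]) for i in top_indices]
```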

@piskvorky (Owner) commented:

That's great, thanks. I'm +1 on this – @mpenkov @gojomo please review.

Review thread on gensim/models/word2vec.py:
logger.warning("All input context word indices must be non-negative.")
return None
# take only the ones in the vocabulary
word2_indices = word2_indices[word2_indices < self.wv.vectors.shape[0]]
@piskvorky (Owner):

Suggested change:
- word2_indices = word2_indices[word2_indices < self.wv.vectors.shape[0]]
+ max_index = self.wv.vectors.shape[0]
+ word2_indices = np.array([index for index in context_words_list if 0 <= index < max_index])

@M-Demay (Contributor, Author):

Does this change only aim at making things clearer? Doesn't it slow the code down, with a list operation instead of numpy-only operations?

@gojomo (Collaborator):

Generally, a project like Gensim is interested in, first, correctness; second, clarity; and third, performance – and even then mainly in the sense of avoiding really slow or wasteful approaches, choosing equally clear simple optimizations, or optimizing heavily-used code paths that show up as a problem when profiling real use cases.

A single list-filter iteration via a list comprehension is a very common Python idiom that should more or less be considered "costless", unless and until you have special reason for intense performance concerns about its likely uses. And here, since this method was motivated by simulating a word2vec "context window", and such windows are usually pretty small (the default window=5 means a 10-word input at most), juggling a tiny list is unlikely to be an important bottleneck.

(But again: I suspect the suggestions below may make this moot. And if use of this method became common in ways we didn't expect – say, giant input lists, or a high-volume tight loop that's key to some task's overall performance – it might become a concern. But those problems, if they materialize, might outgrow this method entirely.)

Review thread on gensim/models/word2vec.py:
else:
# then, words were passed. Retrieve their indices
word2_indices = [self.wv.get_index(w) for w in context_words_list if w in self.wv]
if not word2_indices:
@piskvorky (Owner) – May 19, 2021:

How would it affect performance if you put the "integer indices" check after the "string indices" check?

That is, "if strs, convert to ints => continue always with ints, prune 0 <= index < vectors.shape[0], test ints for all-empty, etc". It would unify the two code paths, no need to duplicate code and test twice.

More DRY & readable code.
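A minimal sketch of that unified path, assuming context_words_list may hold either strings or ints (the names are illustrative, not the exact gensim source):

```python
import numpy as np

# convert string keys to int indices, keeping only known words, then continue with ints
word2_indices = np.array([
    w if isinstance(w, (int, np.integer)) else model.wv.key_to_index[w]
    for w in context_words_list
    if isinstance(w, (int, np.integer)) or w in model.wv.key_to_index
])
# prune out-of-range indices; a single emptiness test then covers both input kinds
word2_indices = word2_indices[(word2_indices >= 0) & (word2_indices < model.wv.vectors.shape[0])]
if word2_indices.size == 0:
    ...  # warn and return None, as the existing code does
```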

@piskvorky (Owner) – May 19, 2021:

Or indeed, like @gojomo says, unify even more and accept mixed strings-and-ints inputs, in a single code path.

@M-Demay (Contributor, Author):

I don't know about performance, but less duplication sounds better.

@gojomo (Collaborator) – May 23, 2021:

I do think the most compact and flexible code would accept a list of words or word int-indexes, then convert it to all indexes (by replacing any strings with ints in a single list pass).

If any bad indexes are passed, or words not in the model, whatever errors those unchecked conditions trigger are probably clear enough: an IndexError or KeyError, with a stack trace through the caller's code and this method's code pointing at the bad parameter. So, essentially no overhead of parameter-checking.

I'm especially willing to consider this low-overhead approach because I'd categorize this method as "advanced" and "experimental" – it's not a typical use of Word2Vec. The method only works in one mode (negative sampling), and doesn't (without weighting) truly approximate what a training-time negative-sampling prediction would be. The need to sort the activation values of all potential vocabulary words before returning the top-N makes it an inherently fairly expensive operation. As such, it already has many caveats (which should be reflected in its doc-comment), and the extra expectation that it only be called with sensible arguments is no additional burden.

But, also, that KeyedVectors.get_index() will already take a string-or-int, and return an int index:

https://github.com/RaRe-Technologies/gensim/blob/29fecbfc38a0f0f0783fe80e674f008a2620da30/gensim/models/keyedvectors.py#L383-L396

So I think that choosing to use that .get_index() method would ensure that a single pass over the word list always throws a descriptive KeyError if either a string word isn't present or an int index is out of range.
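For illustration, a hedged sketch of that single pass, assuming the gensim 4.x get_index() behaviour linked above:

```python
# get_index() accepts either a string key or an int index:
model.wv.get_index("computer")  # -> the word's int position; KeyError if absent
model.wv.get_index(42)          # -> 42 if within range; KeyError otherwise

# so one pass normalizes a mixed str/int context to indices, failing descriptively:
word2_indices = [model.wv.get_index(w) for w in context_words_list]
```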

@gojomo (Collaborator) commented May 19, 2021

Using np.sum() for a quick speed-up seems great. Accepting int indexes for advanced users is OK.

But this much parameter-checking is atypical for Gensim, and it lengthens the method quite a bit. My sense is that if someone's advanced enough to pass ints, bounds-checking isn't necessary. (And maybe negative "from-the-end" indexing might even make sense in some obscure case?)

More generally, I tend to think that if the downstream errors caused by unchecked bad parameters are reasonably descriptive, it's OK to let them happen. OTOH, if none of the supplied words are known, that may be anomalous enough that raising an exception is more sensible than a logged warning and a None return value. Or perhaps return an empty set of predicted words rather than None.

I suspect it'd be just as compact/idiomatic/efficient to allow mixed ints and strings. (Use a list comprehension to replace any string keys with ints.)

ASIDE: If this method were to get a deeper refactoring for broader utility:

  • it could weight words by position, to mimic training-time prediction, rather than treating all words in the window as equally influential (which is unlike training predictions)
  • it could predict at least a single word in HS mode
  • it could make the sort optional, for callers who have their own reasons to know the network's prediction at certain indexes rather than needing the top-N (this would match most_similar()'s unsorted behavior)
  • the features requested in "Adding Word-to-Context Prediction in Word2Vec (inverse of predict_output_word())" (#2152, a sort of infer-context/neighbors-from-prediction reverse of this) and "Potential Doc2Vec feature: reverse inference, to synthesize doc/summary words" (#2459, generating a doc from a doc-vec, very analogous to this) might share some calculations/code/internal functions
  • it might even be able to subsume the likely-broken score() functionality, for evaluating how conformant texts (or more granularly, context->word samples) are with the model

@gojomo (Collaborator) commented May 19, 2021

ALSO: on this issue GitHub shows me a button [screenshot omitted], but all 6 build/test tasks already show as run/successful. So it's unclear what "workflow" might run if I pressed that button. Anyone know?

@M-Demay (Contributor, Author) commented May 21, 2021

@gojomo I'm happy if you're OK with mixing ints and strs!

About parameter-checking: I did it thinking "let's not break the code too much", but for my own use all these checks are unnecessary. About negative indexing: indeed it felt weird to write, since if there are negative indices it is probably not by chance; now that you mention it, I'm OK with on-purpose negative indexing. What if a too-large int is passed? There will be an IndexError – do you find that informative enough? I'm all for sparing time spent in useless checks (all the more since gensim is already heavily optimized), but I'm also for robust code.

Raising an exception if no word is known is usually OK with me... However, the previous code used logged warnings, so we should change that part too, right? A warning has the advantage of not interrupting execution, and if the situation only arises rarely, it lets a whole corpus be processed even when a few rare parts behave badly.

I hadn't thought of mixing ints and strings... Why not – but is it really useful in any case? At any rate, if it doesn't make the code too complex and makes it more modular, there's no reason not to do it. The way @piskvorky describes it in a comment above feels natural.

On the broader refactoring, I don't follow all of it, but:

  • what my script does is assess the model's performance by computing how often it predicts the right word among the topn suggestions (just like the usual prediction-accuracy assessment in machine learning; see the sketch after this list). If this is what you mean by fixing the "likely-broken score function", I'm somewhat working on it for my project, and I was thinking I could open a pull request for it if you'd like (most likely a new one, probably in a couple of days or weeks). The difference from the existing behaviour is that it is more directly human-interpretable than score's log-probability.
  • Also, which sort do you suggest making optional? The sort of predicted words by their probability? I like the idea. That said, it seems a pity to compute probabilities for every word in the vocabulary: computing probabilities for thousands of words while only caring about the top 10 feels like making 99% of the effort for (almost) nothing, but I don't see how to avoid it.
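A minimal sketch of the top-N accuracy check described in the first bullet above – window handling is simplified, and the corpus/model variables are assumed to exist:

```python
topn, window = 10, 5
hits = total = 0
for sentence in corpus:  # corpus: an iterable of token lists
    for i, target in enumerate(sentence):
        context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
        predicted = model.predict_output_word(context, topn=topn) or []
        hits += any(word == target for word, _prob in predicted)
        total += 1
print(f"top-{topn} accuracy: {hits / total:.3f}")
```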

Finally, @gojomo, about your screenshot: it seems solved. If I'm guessing right, as this is my first contribution, something had to be approved, and it looks like @piskvorky approved it.

@gojomo (Collaborator) commented May 23, 2021

@M-Demay It seems both @piskvorky and I are OK with words being specified by any mix of string keys and int indexes – it simplifies the code quite a bit – so it'd be good to see a PR update taking that approach. And as noted in my per-line comment, using .get_index() (which has its own checks) may make any other error-checking superfluous.

Regarding the broader-refactoring, those aren't things this PR would need to address – just related thoughts I wanted to capture.

The .score() function was another contributed way of judging a model's congruence with some arbitrary text corpus – perhaps the one it was trained on, perhaps others. In the contributor's paper, it was helpful for a classification model. But that method hasn't been maintained and may not be of general usefulness – at least not compared to other potential metrics that could be offered in a more general way, especially a more direct way to report true raw loss over some old/new texts – e.g. #2617 etc.

The idea of numerically assessing a model by the % of predictions it gets right (or close, in a top-N sense) is interesting and potentially more human-interpretable than a model "loss" number... but note that among many models, the one that gets the most predictions right may not give the best word-vectors. For example, an overfit model can have a tiny loss (or the best possible prediction accuracy) yet create word-vectors that are less useful for any purpose outside that original dataset. Also, full prediction inherently requires evaluating the model's value for every output-vocabulary word – while the essence of "negative sampling" is to do a tiny subset of that work, sufficient to approach the same results.

@mpenkov (Collaborator) left a comment:

Finally had a look at this. The changes make sense to me. @M-Demay, please respond to the individual suggestions from @piskvorky and push your changes; then we can merge this.

…ng to trying to make it more compact and versatile.
@M-Demay force-pushed the vectorize-predict_output_word branch from 97e687e to 84258b4 on May 25, 2021 09:40
@M-Demay (Contributor, Author) commented May 25, 2021

I tried to take your comments into account (I hope I got them all right):

  • I did the single-pass refactoring.
  • However, I did not use .get_index() (although @gojomo suggested it), because using it would probably mean doing some checks twice (for example, checking whether words are in the accepted vocabulary), which might hurt performance for no benefit. I also made this choice because @gojomo suggested accepting negative ints, which .get_index() does not allow by default. Besides, not using .get_index() on words allows the context to contain out-of-vocabulary words, which is the current behaviour of predict_output_word() and the desired one (you can still pass words that were not kept in the vocabulary without raising an error).

Remark: first I committed the comment changes from @piskvorky, but then I realised that was pointless since I had refactored the code – hence the git commit --amend, then the conflict, then the force push.

Feel free to request changes to the format of my comments, or to the code itself, again.

Mathis and others added 2 commits May 26, 2021 10:29
* Retained the suggested `sum` -> `np.sum` replacement, which has been tested to yield significant runtime gains.
* Dropped unnecessary type/value checks that are already performed by the `KeyedVectors.__contains__` dunder method.
* Corrected the docstring to accurately document the supported inputs (which were already compatible prior to the PR this commit is part of).
@gojomo (Collaborator) commented Jun 16, 2021

Current minimalist optimization (just np.sum() & improved doc-comment) looks good to me!

@piskvorky requested a review from mpenkov on Jun 16, 2021 19:08
@piskvorky changed the title from "Vectorize predict output word" to "[MRG] Vectorize word2vec's predict_output_word() for speed" on Jun 16, 2021
@mpenkov (Collaborator) left a comment:

Can you please add some tests for the new functionality? We need to see that specifying words via ints (and ints mixed with strings) gives predictable results.

Thank you!

@mpenkov changed the title from "[MRG] Vectorize word2vec's predict_output_word() for speed" to "Vectorize word2vec.predict_output_word for speed" on Jun 29, 2021
@gojomo (Collaborator) commented Jun 29, 2021

> Can you please add some tests for the new functionality? We need to see that specifying words via ints (and ints mixed with strings) gives predictable results.
>
> Thank you!

If you look at the current diff, there's no new functionality – only a one-line optimization of old functionality that leaves all existing tests working. The only nod to different functionality is that the doc-comment has been changed to reflect more accurately what seems like it should work.

I don't think this straightforward optimization should be held up awaiting new tests for latent functionality that wasn't added in this PR. If making the extra claim in the doc-comment that mixed str-and-int input works, without a test to cover it, is a concern, it'd make more sense to roll back that claim. That mixed str-and-int input might work would then just be a bonus for anyone who tries it or reads the code.

@mpenkov (Collaborator) commented Jun 29, 2021

> if making the extra claim of mixed str-and-int working in the comment, without a test to cover it, is a concern, it'd make more sense to roll back that comment claim.

Yes, that would also be fine.

@piskvorky (Owner) commented Jun 30, 2021

I think this PR turned out to have the highest comments-to-lines-changed ratio in Gensim history :) Thanks for your diligent work @M-Demay!

@M-Demay (Contributor, Author) commented Jul 1, 2021

Well, thank you @piskvorky – I did not think I would become famous so fast.
@mpenkov I added some small tests following your request. They basically check that the behaviour is exactly the same for a given context passed as only str, only int, or mixed types.
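A hedged sketch of what such a test can look like – the toy corpus and exact assertions here are illustrative, not the committed test code:

```python
from gensim.models import word2vec

sentences = [["human", "interface", "computer"]] * 50  # toy corpus, assumed
model = word2vec.Word2Vec(sentences, vector_size=20, min_count=1, seed=42, workers=1)

context_str = ["human", "interface"]
context_int = [model.wv.get_index(w) for w in context_str]
context_mixed = [context_str[0], context_int[1]]

pred_str = model.predict_output_word(context_str, topn=5)
pred_int = model.predict_output_word(context_int, topn=5)
pred_mixed = model.predict_output_word(context_mixed, topn=5)
assert pred_str == pred_int == pred_mixed  # same context, same predictions
```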

@mpenkov merged commit b287fd8 into piskvorky:develop on Jul 19, 2021
@mpenkov (Collaborator) commented Jul 19, 2021

Finally merged this. Thank you @M-Demay for your effort and your patience!
