Optimize word mover distance (WMD) computation #3163
Conversation
Looks like a straightforward & valuable optimization to me! Thanks for the contribution! (I think you meant to write, "7x speedup for short text pairs and 2x speedup for long text pairs".)

As an aside (not at all a blocker for this fix, just dumping some thoughts where other WMD users may find them): I suspect a bunch of other optimizations for common WMD usage patterns are possible. In particular, if doing pairwise WMDs for a batch of documents, preparing the word-to-word distance matrix once for the superset of all their words used, provided that doesn't grow too large, might provide a noticeable speedup. (If so, an API for this might accept a sequence of all desired comparisons, then automatically batch as many as fit within some manageable amount of memory.)

I also suspect there are a bunch of small deviations from the exact WMD algorithm that speed it up a lot while sacrificing only a little precision, perhaps making it practical on much larger docs/sets-of-docs. For example, discarding large ranges of high-frequency or low-frequency words to work with more manageable vocabularies, or coalescing many of the words in large documents into a smaller number of pseudoword 'grains' before comparison. As an example of the pre-comparison compression that might be tested: "while a document is larger than 10 words, merge its two closest words into a single average word." Compare also SpaCy's trick of aliasing a larger number of word keys to a smaller number of vectors. (IIUC, when a rarer word's vector is "very close" to a more frequent word's vector, they may discard the rarer vector and just have that word key point to the same surviving vector.)
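Not part of the PR itself, but a rough sketch of the "prepare the distance matrix once for a batch" idea above; the function name, the KeyedVectors-style lookup, and the slicing shown in the usage comment are assumptions, not an existing gensim API:

```python
import numpy as np
from scipy.spatial.distance import cdist

def shared_distance_matrix(tokenized_docs, word_vectors):
    # Build one Euclidean distance matrix for the union of in-vocabulary words
    # across all documents; each pairwise WMD could then slice its sub-matrix
    # from it instead of recomputing word-to-word distances per pair.
    vocab = sorted({w for doc in tokenized_docs for w in doc if w in word_vectors})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = np.vstack([word_vectors[w] for w in vocab])
    return index, cdist(vectors, vectors)

# Hypothetical usage with a gensim KeyedVectors instance `kv`:
# index, dmat = shared_distance_matrix(docs, kv)
# sub = dmat[np.ix_([index[w] for w in doc1], [index[w] for w in doc2])]
```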
Thanks @gojomo for pointing out my mistake in the PR description and suggesting a couple of strategies for further optimization :)
Is there any additional action required for this change to be merged in the future?
Thanks, I think we're good. Let's wait for @mpenkov's review and merge.
Looks good to me. Thank you for your effort @flowlight0!
Merged. Congrats on your first PR to gensim @flowlight0! 🥇
Thanks to the gensim team for reviewing and merging this PR!
This change makes WMD computation faster by replacing a heavy nested loop for distance matrix construction with scipy's faster implementation.
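A minimal sketch of the kind of replacement described (the variable names and sizes here are illustrative; the actual change is in the diff):

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
vocab_len, dim = 50, 100
embeddings = rng.standard_normal((vocab_len, dim))  # one vector per word in the two documents

# Before (illustrative): pairwise Euclidean distances via a Python-level nested loop.
distance_matrix_loop = np.zeros((vocab_len, vocab_len))
for i in range(vocab_len):
    for j in range(vocab_len):
        distance_matrix_loop[i, j] = np.sqrt(np.sum((embeddings[i] - embeddings[j]) ** 2))

# After (illustrative): the same matrix in a single vectorized scipy call.
distance_matrix_cdist = cdist(embeddings, embeddings, metric='euclidean')

assert np.allclose(distance_matrix_loop, distance_matrix_cdist)
```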
I verified the performance improvement with the following micro-benchmark. It generated two different sets of text pairs (short and long) by sampling texts from 20newsgroups and measured the WMD computation speed of both versions.
Benchmark
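The original benchmark script is not reproduced here; the snippet below only sketches the setup described above. The word2vec file, sample sizes, and document lengths are assumptions, not the values actually used.

```python
import time
from gensim.models import KeyedVectors
from gensim.utils import simple_preprocess
from sklearn.datasets import fetch_20newsgroups

model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
texts = [simple_preprocess(t) for t in fetch_20newsgroups(subset='train').data[:200]]

# Short pairs (first 20 tokens) and long pairs (first 200 tokens) of consecutive documents.
short_pairs = [(texts[i][:20], texts[i + 1][:20]) for i in range(0, 100, 2)]
long_pairs = [(texts[i][:200], texts[i + 1][:200]) for i in range(0, 100, 2)]

for name, pairs in [('short', short_pairs), ('long', long_pairs)]:
    start = time.time()
    for doc1, doc2 in pairs:
        model.wmdistance(doc1, doc2)
    print(f'{name}: {time.time() - start:.2f}s for {len(pairs)} pairs')
```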
Result
The following two tables show how WMD computation performance changed. Computation becomes consistently faster (7x speedup for short text pairs and 2x speedup for long text pairs).
Before
After
Of course, I checked that this change doesn't break the current behavior by comparing outputs from the two versions as follows:
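The original output comparison isn't shown here; continuing the benchmark sketch above, an equivalence check of this kind might look like the following (wmdistance_old is a hypothetical handle on the pre-change implementation):

```python
# Hypothetical: compare distances from the old and new code paths on the same pairs.
for doc1, doc2 in short_pairs + long_pairs:
    d_old = wmdistance_old(doc1, doc2)    # hypothetical reference to the old nested-loop version
    d_new = model.wmdistance(doc1, doc2)  # current implementation
    assert abs(d_old - d_new) < 1e-9      # results should agree to numerical precision
```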
- tox -e flake8 succeeded
- tox -e py36-linux succeeded

(Update 2021-06-07: fixed the explanation of the experimental results based on @gojomo's comment)