-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add multiprocessing support for BM25
#2146
Conversation
gensim/summarization/bm25.py
Outdated
@@ -152,14 +154,36 @@ def get_scores(self, document, average_idf): | |||
return scores | |||
|
|||
|
|||
def get_bm25_weights(corpus): | |||
def _get_scores(bm25, document, average_idf): | |||
"""helper function for retrieving bm25 scores in parallel""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use numpy-style docstrings (here and everywhere)
gensim/summarization/bm25.py
Outdated
elif n_jobs is None: | ||
return 1 | ||
elif n_jobs < 0: | ||
n_jobs = max(cpu_count() + 1 + n_jobs, 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if this for n_jobs < 0
case, probably should be cpu_count() - 1
, wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@menshikh-iv I was trying to stick to sklearn
style here, where n_jobs=-1
means using all cores. In fact, this way of determining the number of effective jobs was borrowed from sklearn
which can be found here. Please let me know if you think this is ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, that's fine (through cpu_count
implementation different, but I don't worry about it)
get_score = partial(_get_scores, bm25, average_idf=average_idf) | ||
pool = Pool(n_processes) | ||
weights = pool.map(get_score, corpus) | ||
pool.close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
strange order, you close
and join
after, why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@menshikh-iv I came across this SO question a while ago and learned that one actually need to call close
before using join
. This can also be found in python's official docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks!
gensim/test/test_BM25.py
Outdated
weights1 = get_bm25_weights(common_texts) | ||
weights2 = get_bm25_weights(common_texts, n_jobs=2) | ||
weights3 = get_bm25_weights(common_texts, n_jobs=-1) | ||
self.assertEqual(weights1, weights2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be assertAlmostEqual
instead (never compare floating point values using "strict" equal)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@menshikh-iv That was my bad. Fixed now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good @Shiki-H, please fix last review comments and I'll merge this PR
gensim/summarization/bm25.py
Outdated
return scores | ||
|
||
|
||
def _effective_n_jobs(n_jobs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you move this function to gensim.utils
and rename it to effective_n_jobs
(looks useful for later usage).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@menshikh-iv thanks. I have fixed them.
gensim/summarization/bm25.py
Outdated
Returns | ||
------- | ||
int | ||
number of effective jobs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
n
-> N
+ .
at the end of sentence
Thanks @Shiki-H, congratz with first contribution 🥇 |
BM25
@menshikh-iv Thanks :) |
I realized that computing BM25 for large corpus can be quite time consuming, so I added multiprocessing support for the original BM25. I did not open an issue as this is a very quick fix and thought that I would probably make a PR directly. I also added a test to verify the result of using multiprocessing is identical to the original approach. Probably not many people use this function, but hopefully it helps :)