Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add multiprocessing support for BM25 #2146

Merged
merged 15 commits into from
Aug 13, 2018

Conversation

Shiki-H
Copy link
Contributor

@Shiki-H Shiki-H commented Aug 6, 2018

I realized that computing BM25 for large corpus can be quite time consuming, so I added multiprocessing support for the original BM25. I did not open an issue as this is a very quick fix and thought that I would probably make a PR directly. I also added a test to verify the result of using multiprocessing is identical to the original approach. Probably not many people use this function, but hopefully it helps :)

@@ -152,14 +154,36 @@ def get_scores(self, document, average_idf):
return scores


def get_bm25_weights(corpus):
def _get_scores(bm25, document, average_idf):
"""helper function for retrieving bm25 scores in parallel"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use numpy-style docstrings (here and everywhere)

elif n_jobs is None:
return 1
elif n_jobs < 0:
n_jobs = max(cpu_count() + 1 + n_jobs, 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this for n_jobs < 0 case, probably should be cpu_count() - 1, wdyt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@menshikh-iv I was trying to stick to sklearn style here, where n_jobs=-1 means using all cores. In fact, this way of determining the number of effective jobs was borrowed from sklearn which can be found here. Please let me know if you think this is ok.

Copy link
Contributor

@menshikh-iv menshikh-iv Aug 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, that's fine (through cpu_count implementation different, but I don't worry about it)

get_score = partial(_get_scores, bm25, average_idf=average_idf)
pool = Pool(n_processes)
weights = pool.map(get_score, corpus)
pool.close()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

strange order, you close and join after, why?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@menshikh-iv I came across this SO question a while ago and learned that one actually need to call close before using join. This can also be found in python's official docs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks!

weights1 = get_bm25_weights(common_texts)
weights2 = get_bm25_weights(common_texts, n_jobs=2)
weights3 = get_bm25_weights(common_texts, n_jobs=-1)
self.assertEqual(weights1, weights2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be assertAlmostEqual instead (never compare floating point values using "strict" equal)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@menshikh-iv That was my bad. Fixed now.

Copy link
Contributor

@menshikh-iv menshikh-iv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good @Shiki-H, please fix last review comments and I'll merge this PR

return scores


def _effective_n_jobs(n_jobs):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move this function to gensim.utils and rename it to effective_n_jobs (looks useful for later usage).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@menshikh-iv thanks. I have fixed them.

Returns
-------
int
number of effective jobs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

n -> N + . at the end of sentence

@menshikh-iv
Copy link
Contributor

Thanks @Shiki-H, congratz with first contribution 🥇

@menshikh-iv menshikh-iv changed the title BM25 with multiprocessing support Add multiprocessing support for BM25 Aug 13, 2018
@menshikh-iv menshikh-iv merged commit 466b32f into piskvorky:develop Aug 13, 2018
@Shiki-H
Copy link
Contributor Author

Shiki-H commented Aug 14, 2018

@menshikh-iv Thanks :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants