Add multiprocessing support for `BM25` #2146

Shiki-H · 2018-08-06T03:02:56Z

I realized that computing BM25 for a large corpus can be quite time-consuming, so I added multiprocessing support to the original BM25. I did not open an issue because this is a very quick fix, so I thought I would make a PR directly. I also added a test to verify that the result of using multiprocessing is identical to that of the original approach. Probably not many people use this function, but hopefully it helps :)

menshikh-iv · 2018-08-08T02:58:17Z

gensim/summarization/bm25.py

@@ -152,14 +154,36 @@ def get_scores(self, document, average_idf):
        return scores


-def get_bm25_weights(corpus):
+def _get_scores(bm25, document, average_idf):
+    """helper function for retrieving bm25 scores in parallel"""


Please use numpy-style docstrings (here and everywhere)

menshikh-iv · 2018-08-08T02:59:36Z

gensim/summarization/bm25.py

+    elif n_jobs is None:
+        return 1
+    elif n_jobs < 0:
+        n_jobs = max(cpu_count() + 1 + n_jobs, 1)


If this is for the n_jobs < 0 case, it should probably be cpu_count() - 1, wdyt?

@menshikh-iv I was trying to stick to sklearn style here, where n_jobs=-1 means using all cores. In fact, this way of determining the number of effective jobs was borrowed from sklearn which can be found here. Please let me know if you think this is ok.

ok, that's fine (though the cpu_count implementation is different, but I don't worry about it)
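
To make the two conventions discussed in this thread concrete, here is a minimal sketch of the sklearn-style rule. The function name `effective_n_jobs_sketch` and the `n_cpus` parameter are hypothetical (added so the behaviour can be checked with a fixed core count), and rejecting `n_jobs == 0` is an assumption, since that branch is not visible in the diff hunk above.

```python
import os


def effective_n_jobs_sketch(n_jobs, n_cpus=None):
    """Return how many worker processes to use (hypothetical illustration)."""
    if n_cpus is None:
        # os.cpu_count() may return None when the count cannot be determined.
        n_cpus = os.cpu_count() or 1
    if n_jobs == 0:
        # Assumption: zero is rejected rather than silently mapped to 1.
        raise ValueError("n_jobs == 0 has no meaning")
    if n_jobs is None:
        return 1
    if n_jobs < 0:
        # -1 -> all cores, -2 -> all but one, and so on; never fewer than one.
        return max(n_cpus + 1 + n_jobs, 1)
    return n_jobs
```

On a four-core machine this gives 4 workers for `n_jobs=-1` and 3 for `n_jobs=-2`, whereas a `cpu_count() - 1` rule would give 3 workers for `n_jobs=-1`.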

menshikh-iv · 2018-08-08T03:00:22Z

gensim/summarization/bm25.py

+    get_score = partial(_get_scores, bm25, average_idf=average_idf)
+    pool = Pool(n_processes)
+    weights = pool.map(get_score, corpus)
+    pool.close()


Strange order: you close first and join after, why?

@menshikh-iv I came across this SO question a while ago and learned that one actually needs to call close before using join. This can also be found in Python's official docs.
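
The ordering in question can be sketched as follows. This is an illustration only: `_score` and `score_corpus` are hypothetical stand-ins rather than gensim functions, and the serial branch for a single process is an assumption added so that the function can also be used without starting workers.

```python
from functools import partial
from multiprocessing import Pool


def _score(scale, document):
    # Hypothetical stand-in for a per-document scoring function.
    return scale * len(document)


def score_corpus(corpus, n_processes):
    """Score every document, optionally in parallel (illustrative sketch)."""
    get_score = partial(_score, 2.0)
    if n_processes == 1:
        # Serial path: no worker processes are started.
        return [get_score(doc) for doc in corpus]
    pool = Pool(n_processes)
    try:
        # map() blocks until all results are ready and preserves input order.
        weights = pool.map(get_score, corpus)
    finally:
        pool.close()  # no more tasks will be submitted to the pool
        pool.join()   # wait for workers to exit; close() or terminate() must come first
    return weights


if __name__ == "__main__":
    corpus = [["a", "b"], ["c"], ["d", "e", "f"]]
    print(score_corpus(corpus, n_processes=2))
```

The worker function is defined at module level so that it can be pickled and sent to the worker processes, and the `__main__` guard keeps the example safe on platforms that start workers by re-importing the module.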

menshikh-iv · 2018-08-08T03:02:09Z

gensim/test/test_BM25.py

+        weights1 = get_bm25_weights(common_texts)
+        weights2 = get_bm25_weights(common_texts, n_jobs=2)
+        weights3 = get_bm25_weights(common_texts, n_jobs=-1)
+        self.assertEqual(weights1, weights2)


Should be assertAlmostEqual instead (never compare floating-point values using "strict" equality).

@menshikh-iv That was my bad. Fixed now.
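
A short sketch of why strict equality is fragile for floats. Because `assertAlmostEqual` works by subtracting its two arguments, nested lists of scores would typically be compared element by element; `nested_almost_equal` below is a hypothetical helper for that, not part of gensim or its test suite, and the tolerances are arbitrary illustrative choices.

```python
import math


def nested_almost_equal(a, b, rel_tol=1e-9, abs_tol=1e-12):
    """Compare two lists of lists of floats element-wise within a tolerance."""
    if len(a) != len(b):
        return False
    for row_a, row_b in zip(a, b):
        if len(row_a) != len(row_b):
            return False
        for x, y in zip(row_a, row_b):
            if not math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol):
                return False
    return True


# Two results that are equal mathematically but differ in the last bits.
serial = [[0.3, 1.5]]
parallel = [[0.1 + 0.2, 1.5]]

strictly_equal = serial == parallel                      # False
approximately_equal = nested_almost_equal(serial, parallel)  # True
```

Here the strict comparison fails even though the two results differ only by rounding, while the tolerance-based check accepts them.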

menshikh-iv

Looks good @Shiki-H, please fix the last review comments and I'll merge this PR.

menshikh-iv · 2018-08-10T10:27:56Z

gensim/summarization/bm25.py

+    return scores
+
+
+def _effective_n_jobs(n_jobs):


Can you move this function to gensim.utils and rename it to effective_n_jobs (looks useful for later usage).

@menshikh-iv thanks. I have fixed them.

menshikh-iv · 2018-08-10T10:28:18Z

gensim/summarization/bm25.py

+    Returns
+    -------
+    int
+        number of effective jobs


n -> N, and add a "." at the end of the sentence

menshikh-iv · 2018-08-13T04:01:50Z

Thanks @Shiki-H, congrats on your first contribution 🥇

Shiki-H · 2018-08-14T04:46:26Z

@menshikh-iv Thanks :)

Shiki-H added 11 commits August 3, 2018 00:12

added multiprocessing support for bm25

f8b6d24

added effective_n_job check

c0346cc

added comment for helper function

a066f9c

fixed minor error

567acf0

deleted unwanted comments

bbc2efe

updated example with new api

c0f03bc

updated example with new api

404c3f5

updated support for multiprocessing

385bcf5

updated docstring

95846ba

fixed typo

8ca83cc

fixed formatting

16df452

menshikh-iv suggested changes Aug 8, 2018

Shiki-H added 3 commits August 7, 2018 23:41

changed assertEqual to assertAlmostEqual

8a74d12

fixed docstrings to numpy-style

a9cf90e

removed space from blank lines

f9db849

menshikh-iv suggested changes Aug 10, 2018

moved effective_n_jobs to utils

446d6b1

menshikh-iv approved these changes Aug 13, 2018

menshikh-iv changed the title ~~BM25 with multiprocessing support~~ Add multiprocessing support for BM25 Aug 13, 2018

menshikh-iv merged commit 466b32f into piskvorky:develop Aug 13, 2018
