[WIP] phrases multicore using joblib threading #1433
@prakhar2b did you talk to @menshikh-iv and @jayantj? Unfortunately this is not what we want.
@piskvorky It's part of the GSoC proposal, label 1.4.
Yes, we want multicore, but joblib is not the right tool. Joblib uses multiprocessing, and as I explained earlier, that is a bad choice of granularity when the operation to be done is as simple as incrementing a counter. The queueing/pickling/inter-process communication overhead will be enormous.
I completely agree that multiprocessing is not a good solution due to the overheads/copying involved. We discussed trying out a multi-threading approach instead (joblib seems to allow this, although the GIL will have to be dealt with). One idea was to use libcuckoo, since it seems to allow for concurrent reads/writes.
I suspect multiprocessing might be a competitive approach in the particular case where each process can open its own reader into a disjoint range of the corpus – and thus the only IPC is tiny summary counts, not bulk ranges of text. So it might only be a viable strategy where the corpus is large, and the user is sophisticated enough to have already structured their corpus as an uncompressed file or a set of many smaller files.
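The disjoint-range idea above can be sketched roughly as follows. This is a minimal illustration, not gensim code: it assumes an uncompressed UTF-8 text file with one whitespace-tokenized sentence per line, and the helper names `split_ranges`, `count_range` and `parallel_bigram_counts` are hypothetical. Each worker opens its own file handle, counts bigrams only on lines that start inside its byte range, and sends back a single `Counter`, so no bulk text crosses process boundaries.

```python
import os
from collections import Counter
from multiprocessing import Pool


def split_ranges(path, parts):
    """Split the file's byte length into up to `parts` contiguous [start, end) ranges."""
    size = os.path.getsize(path)
    step = max(1, -(-size // parts))  # ceiling division
    return [(path, start, min(start + step, size)) for start in range(0, size, step)]


def count_range(args):
    """Count bigrams on every line whose first byte lies in [start, end)."""
    path, start, end = args
    counts = Counter()
    with open(path, "rb") as f:  # binary mode so tell()/seek() are byte offsets
        if start > 0:
            # Step back one byte and discard up to the next newline, so a line
            # that straddles the boundary is left to the previous worker.
            f.seek(start - 1)
            f.readline()
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            tokens = line.decode("utf-8").split()
            counts.update(zip(tokens, tokens[1:]))
    return counts


def parallel_bigram_counts(path, workers=4):
    """Merge the small per-range Counters returned by each worker process."""
    total = Counter()
    with Pool(workers) as pool:
        for partial in pool.imap_unordered(count_range, split_ranges(path, workers)):
            total.update(partial)
    return total
```

A call such as `parallel_bigram_counts("corpus.txt", workers=4)` (hypothetical path) should sit under an `if __name__ == "__main__":` guard on platforms that spawn worker processes. Whether this beats a single-process count depends on corpus size and vocabulary, since the returned counters still have to be pickled and merged.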
Yes, that's the case where we create several counters independently and |
Closing this PR: parallelizing with joblib threading doesn't improve the performance of pure Python code, and the Phrases module has little to cythonize other than static typing, which doesn't yield a worthwhile performance improvement. Also, ref: this comment, this comment above. For a fast counter, there is another PR, #1446 in gensim; hopefully parallelizing will be better suited there.
No description provided.