[WIP] Replace frequency counting using dict by a combination of hyperloglog and CountMinSketch #508
Conversation
```diff
@@ -242,7 +363,7 @@ def __getitem__(self, sentence):
     return [utils.to_unicode(w) for w in new_s]

-if __name__ == '__main__':
+if __name__ == '__main__' and 0:
```
What is this for?
```python
raise ValueError("min_count should be 1")
if min_count > 1:
    logger.warning("min_count should be 1")
```
Why retain `min_count` if it now must be a constant 1?
The only reason I left it is to avoid breaking the API (it retains limited functionality: this count is subtracted from all frequencies before computing the final bigram score).
I would be grateful for an example of how to handle the parameter removal elegantly.
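For illustration, one common way to retire a parameter without silently changing behavior is to keep accepting it for a deprecation cycle while warning loudly. This is a hedged sketch; the `Phrases.__init__` signature and defaults here are hypothetical, not the PR's actual code:

```python
import warnings

class Phrases:
    # Hypothetical signature, not the PR's actual code.
    def __init__(self, sentences=None, min_count=None, threshold=10.0):
        # min_count no longer affects the sketch-based counting: accept it
        # for one deprecation cycle, warn loudly, then drop it entirely.
        if min_count is not None and min_count != 1:
            warnings.warn(
                "min_count is ignored by the sketch-based implementation "
                "and will be removed in a future release",
                DeprecationWarning,
                stacklevel=2,
            )
        self.threshold = threshold
```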
In my opinion, when the underlying behavior changes, it can be safer and more honest to break the API (raise a "no such parameter `min_count`" error) than to maintain a superficial compatibility that no longer has the original effect (silently altering the API), or that fails with a thrown error in most situations where the old values would have been supported.
Perhaps here it's better for both implementations to co-exist, at least for a while, as different classes with slightly different options? That may also make it easier to compare their performance, and allow any project that prefers the old precision, or needs reproducibility of prior runs, to stay with the original implementation unless/until they want the benefits of the new one.
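One hedged sketch of this co-existence idea: keep the public class thin and inject the counting backend, so the exact and approximate implementations live side by side behind one interface. All names here (`DictVocab`, `SketchVocab`, `vocab_factory`) are hypothetical, not gensim API:

```python
class DictVocab:
    """Original behaviour: exact counts in a plain dict."""

    def __init__(self):
        self._counts = {}

    def add(self, token, n=1):
        self._counts[token] = self._counts.get(token, 0) + n

    def count(self, token):
        return self._counts.get(token, 0)

class SketchVocab:
    """New behaviour: approximate counts (hyperloglog + Count-Min Sketch)
    behind the same add()/count() interface."""
    ...  # body omitted; see the Count-Min Sketch toy further down the thread

class Phrases:
    def __init__(self, vocab_factory=DictVocab):
        # Old precision by default; callers opt in to the sketch explicitly.
        self.vocab = vocab_factory()
```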
Agreed with @gojomo -- we don't want to keep dead code around.
And in the case of `Phrases`, we don't need the "old" behaviour either. The `Phrases` functionality is new and not widely used yet, so a clear release note saying things have changed seems enough here.
Starting to look up, good progress! Next steps:
I'm pretty excited about an improved phrases-detection, so here are a bunch of random thoughts/comments:

Re: benchmarks

- Can peak memory usage be captured, as well? (Might any of the up-to-5x slowdowns be caused by swapping?)
- Are the timings just for an initial 'survey' pass, or do they also include one 'convert-to-phrases' pass?
- Would be very useful to see differences in vocab/bigram count, inferred phrases, and speed/memory performance on a real corpus, like Wikipedia. (I could possibly try this sometime next week.)

Re: possible optimizations

- It seems the 'step' (40,000,000 / 20 = 2,000,000) chunks of sentences are used to minimize calls to the (expensive?) […]: (1) caching the list of […]; (2) rather than using 2M-batches for precise counts, and at the end doing […]
- Can CountMinSketch objects of the same shape be added together? If so, that's one plausible path to parallelization; see the toy sketch after this comment. (More generally, here and other parts of gensim may benefit from the idea of a corpus that can be read from many files, or many start-points in a file, by separate processes, to get away from the one-linear-reader-handing-items-to-many-workers pattern that doesn't work well with the Python GIL & cross-process serialization overhead.)

Re: misc

- It might be interesting to store the unigram and bigram counts (and overall unique tallies) in separate structures, for more visibility into what's happening and to give them separate precisions.
- It seems the tunable count parameters (or documentation of same) should include some relationship for 'expected unique inserts'. (Don't the actual error margins depend on how saturated the structure gets, like a Bloom filter, and beyond a certain chosen load factor the errors go beyond target levels?)
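On the mergeability question: yes, two Count-Min Sketches built with the same width, depth, and hash functions can be combined by element-wise addition of their tables, which is what makes per-shard counting followed by a merge plausible. A minimal toy sketch (not the class from this PR):

```python
import zlib

class CountMinSketch:
    """Toy Count-Min Sketch; the real class in this PR may differ."""

    def __init__(self, width=2000, depth=5):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, key, row):
        # Salt the key with the row number to get per-row hash functions.
        return zlib.crc32(("%d:%s" % (row, key)).encode("utf8")) % self.width

    def add(self, key, count=1):
        for row in range(self.depth):
            self.table[row][self._index(key, row)] += count

    def query(self, key):
        # Never underestimates; may overestimate on hash collisions.
        return min(self.table[row][self._index(key, row)]
                   for row in range(self.depth))

    def merge(self, other):
        # Only valid for sketches with identical shape and hash functions.
        assert (self.width, self.depth) == (other.width, other.depth)
        for row in range(self.depth):
            for col in range(self.width):
                self.table[row][col] += other.table[row][col]

# Counts gathered over two corpus shards can be combined after the fact:
a, b = CountMinSketch(), CountMinSketch()
a.add("new_york")
b.add("new_york")
b.add("machine_learning")
a.merge(b)
assert a.query("new_york") >= 2  # == 2 barring collisions
```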
I'll also push other improvements to phrase detection (not related to counting, mostly perf)... probably next weekend. Since this is a major revamp, it's probably best to keep things in one place, to avoid git conflicts.
Hi, nice to see someone continued the work. Sadly, due to time/personal issues I couldn't continue. I just wanted to share that I did profile my original code, and I found that one of the things hurting performance was the calculation of the hash functions. I believe some Cython magic might come in handy. @janrygl is still working on this PR, right? Let me know if I can help somehow.
@janrygl has other duties now, so this PR is not "active" at the moment :) Any help welcome! Btw, there are good, fast hash functions in […]
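Since the library name above was lost in extraction, a generic note on cheap hashing: the Kirsch-Mitzenmacher double-hashing trick derives all `depth` row hashes from just two base hashes, cutting the per-key hashing cost that the profiling above identified. A small standard-library-only sketch (illustrative, not this PR's code):

```python
import hashlib
import zlib

def row_indices(key, depth, width):
    """Derive `depth` table indices from two base hashes (Kirsch-Mitzenmacher)."""
    data = key.encode("utf8")
    h1 = zlib.crc32(data)
    # Force h2 odd so the sequence h1 + i*h2 cycles through distinct values.
    h2 = int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big") | 1
    return [(h1 + i * h2) % width for i in range(depth)]

print(row_indices("new_york", depth=5, width=2000))
```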
Pinging @mfcabrera @janrygl -- do you think this can be part of the gensim January release?
@tmylk It depends on priorities defined by @piskvorky. I hurt my right hand at the beginning of January and I am behind schedule with all my projects.
@janrygl won't have time for this in January for sure; I don't know about @mfcabrera. It's a great little algorithmic project though, very pleasant. I wish I had time to tinker with this myself -- it would make for an exciting blog post / series!
Relevant read: Extension to hyperloglog as used by Google. CC @tmylk
Another interesting article: https://www.linkedin.com/pulse/from-count-min-sketch-tree-23-guillaume-pitel
Any updates on this branch? The Phrases implementation is so slow that it is making me switch to a different library for doc2vec. Any update will be helpful.
@thescopan I don't think so -- feel free to contribute. Also, a re-implementation of the existing phrases in C/Cython would be appreciated too. It's a really small and trivial change, just one loop, but nobody has done it yet. CC @tmylk.
Cythonising Phrases will be done this summer as part of GSoC.
Awesome! This is a much needed functionality. A fast & scalable collocation (phrase) detection is sorely missing -- even in our own non-open-source projects.
More specifically, it's on @prakhar2b's timeline for end of June.
More discussion on hyperloglog as used inside Reddit:
Ping @janrygl, what's the status of this PR? Will you finish it soon?
@menshikh-iv isn't this one of our Google Summer of Code projects this year?
@piskvorky cythonising phrases is a project of the current GSoC, but, as I understand it, that is not exactly the same as the topic of this PR.
Nearly the same -- efficient counting is the biggest challenge there. It's an extremely common task, widely useful, and that's why I'd like this to be an independent library.
Connected with #1446
Re. #400 and related to #406.

Changes (see the end-to-end sketch below):

- `hyperloglog` for counting vocabulary size (default vocabulary size error 1%)
- `CountMinSketch` for frequency counting (if F is the true frequency, the result is in the range [F, F * (1 + e)], default e = 0.01)
- `defaultdict`s for chunk processing (it can be rewritten to parallel chunk building)

Comparison of old and new implementation:
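To make the three bullets in the change list concrete, here is a hedged end-to-end sketch of the counting scheme: a HyperLogLog estimates the vocabulary size, a Count-Min Sketch stores approximate frequencies, and a `defaultdict` buffers exact counts per chunk before flushing. It assumes the `hyperloglog` PyPI package and reuses the toy Count-Min Sketch from earlier in this thread; neither is the PR's actual code:

```python
from collections import defaultdict
import zlib

from hyperloglog import HyperLogLog  # pip install hyperloglog (assumed API)

class CountMinSketch:
    """Toy Count-Min Sketch, as in the merge example earlier in this thread."""

    def __init__(self, width=2000, depth=5):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, key, row):
        return zlib.crc32(("%d:%s" % (row, key)).encode("utf8")) % self.width

    def add(self, key, count=1):
        for row in range(self.depth):
            self.table[row][self._index(key, row)] += count

    def query(self, key):
        return min(self.table[row][self._index(key, row)]
                   for row in range(self.depth))

def count_tokens(sentences, chunk_size=2000000):
    hll = HyperLogLog(0.01)    # ~1% error on the vocabulary size
    cms = CountMinSketch()
    buffer = defaultdict(int)  # exact counts within the current chunk
    buffered = 0
    for sentence in sentences:
        for token in sentence:
            hll.add(token)
            buffer[token] += 1
            buffered += 1
        if buffered >= chunk_size:
            for token, n in buffer.items():  # flush the chunk into the sketch
                cms.add(token, n)
            buffer.clear()
            buffered = 0
    for token, n in buffer.items():          # flush the final partial chunk
        cms.add(token, n)
    return hll, cms

hll, cms = count_tokens([["new", "york", "new", "york", "city"]])
print(len(hll))          # approximate vocabulary size (3 here)
print(cms.query("new"))  # >= true count of "new", which is 2
```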