Fast object counting + Phrases #1446
Conversation
@gojomo @menshikh-iv @jayantj WDYT? This could be a starting point for the new API, to be optimized and then used in Phrases and Dictionary.
Using preshed seems to bring a ~2x speedup over the plain Python version.
""" | ||
for document in documents: | ||
for item in self.doc2items(document): | ||
self.hash2cnt[self.hash(item)] += 1 |
That this implementation ignores hash collisions is an important detail to document.
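A toy sketch of that caveat (the deliberately weak weak_hash below is made up purely to force a collision; the real class uses an ordinary hash function):

from collections import defaultdict

def weak_hash(item):
    # Deliberately tiny hash space (string length), so a collision is guaranteed;
    # this stands in for an unlucky collision of a real 64-bit hash.
    return len(item)

hash2cnt = defaultdict(int)
for item in ["alpha"] * 3 + ["gamma"] * 5:  # distinct items, same weak hash
    hash2cnt[weak_hash(item)] += 1

print(dict(hash2cnt))  # {5: 8} -- the two items' counts are silently merged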
        for idx in range(1, l):
            h2 = chash(document[idx])
            hash2cnt[h2] += 1
            hash2cnt[h1 + h2] += 1
Simple addition to determine the hash of a bigram gives (A, B) and (B, A) the same hash value – problematic for phrase promotion.
Yeah, I had XOR here before, but then A ^ A == 0 == B ^ B, which is also not good. The hashing requires a little more thought; this is too hackish.
We could use h1 ** h2, but this will lead to overflow.
That's not a native operation (instruction), so probably not a good idea.
Google shows me this for boost::hash_combine:
size_t hash_combine( size_t lhs, size_t rhs ) {
    lhs ^= rhs + 0x9e3779b9 + (lhs << 6) + (lhs >> 2);
    return lhs;
}
That looks fancy, but probably anything that breaks the symmetry would be enough, such as 3*lhs + rhs.
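A quick Python sketch of both asymmetric combiners mentioned above (the 64-bit mask and the example values h1, h2 are my own stand-ins, mimicking a size_t):

MASK64 = (1 << 64) - 1

def hash_combine(lhs, rhs):
    # Port of the boost::hash_combine recipe above, truncated to 64 bits.
    lhs ^= (rhs + 0x9e3779b9 + (lhs << 6) + (lhs >> 2)) & MASK64
    return lhs & MASK64

def simple_combine(lhs, rhs):
    # "Anything that breaks the symmetry", e.g. 3*lhs + rhs.
    return (3 * lhs + rhs) & MASK64

h1, h2 = 0x1234567890, 0xABCDEF1234  # stand-ins for two token hashes
assert hash_combine(h1, h2) != hash_combine(h2, h1)
assert simple_combine(h1, h2) != simple_combine(h2, h1)
assert simple_combine(h1, h1) != 0  # unlike XOR, (A, A) no longer collapses to 0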
Here's a repo with a bunch of useful hashing algorithms & their comparison: https://github.com/rurban/smhasher
One more interesting resource on hashing (in the context of Bloom filters / min-sketch counting in #508): https://www.eecs.harvard.edu/~michaelm/postscripts/rsa2008.pdf
CC @isamaru
Certainly worth trying as a drop-in for Phrases and worth testing against the existing code: if/when cythonized or preshed-ized code gives identical output on representative inputs, in less time or memory, replace the existing steps.
More interested in feedback on the API now, not performance. Is the interface complete, anything else we should support? Do you envision any issues when plugging this into Dictionary? Or into other use-cases that need fast & memory-efficient counting? Btw I tried Counter too (see commit history), but it's quite a bit slower than defaultdict.
On second thoughts, how about simply mimicking the API of the stdlib Counter?
About the API, I had the same thing in mind: using the same API as Counter. Also, it looks like you were really on fire! :)
        # XXX: Or use a fixed-size data structure to start with (hyperloglog?)
        while self.max_size and len(self) > self.max_size:
            self.min_reduce += 1
            utils.prune_vocab(self.hash2cnt, self.min_reduce)
This can potentially be very slow, as every utils.prune_vocab call is going to iterate over the entire vocabulary. Could be worthwhile to store a min_count property. Depending on the data structures used, this might not be trivial though.
Anyway, the discussion right now is API/design and not optimization, I suppose.
What would this min_count property do?
I'd want to see profiling that indicates this survey is a major time-cost before spending much time optimizing. If it needed to be optimized, it might be possible to retain a min_reduce/min_count threshold and, during tallying, track when a key's tally exceeds this count, perhaps removing it from a 'most-eligible-for-pruning' list.
With this existing algorithm, where at each prune everything under the count reached last time is automatically discarded, this hypothetical "most eligible" list could be discarded first – without even looking at the existing counts again. Maybe that'd be enough... though if keeping this 'escalating min_reduce' you'd still need to iterate over all the remaining keys. (It seems a lot like GC, with a 'tenured' set that might be able to skip some collections.)
But as I've mentioned in previous discussions around Phrases: I suspect this 'escalating floor prune' is a somewhat undesirable algorithm, injecting more imprecision/bias into the final counts than necessary, and (maybe) losing a lot of keys that would survive a truly unbounded counting method. (TL;DR: because of typical frequency distributions plus the always-escalating floor, every prune may wind up eliminating 90%+ of all existing words, starving some words of the chance to ever survive a prune, or in some cases inefficiently leaving the final dict at a small fraction of the desired max_size.) Maybe it's necessary to prevent too-frequent prunes, but I'd prefer an option to be more precise via less-frequent prunes – especially if I'd hypothetically minimized the full time-cost of Phrases analysis via other process improvements or effective parallelism.
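To make that concrete, here is a tiny simulation of the escalating-floor loop on made-up Zipf-like counts (the prune_vocab helper below mirrors what gensim.utils.prune_vocab does, dropping all entries below the floor; the numbers are purely illustrative):

from collections import defaultdict

def prune_vocab(vocab, min_reduce):
    # Same idea as gensim.utils.prune_vocab: drop every entry below the floor.
    for key in [k for k, v in vocab.items() if v < min_reduce]:
        del vocab[key]

# Made-up Zipf-ish counts: a long tail of count-1 keys, a few frequent ones.
hash2cnt = defaultdict(int, {i: max(1, 1000 // (i + 1)) for i in range(10000)})
max_size, min_reduce = 2000, 1

while max_size and len(hash2cnt) > max_size:
    min_reduce += 1
    before = len(hash2cnt)
    prune_vocab(hash2cnt, min_reduce)
    print("min_reduce=%d: %d -> %d keys" % (min_reduce, before, len(hash2cnt)))
# prints "min_reduce=2: 10000 -> 500 keys" -- one prune overshoots far below max_size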
It looks like we're on to optimizations and choosing algorithms now, so let me tick off the "API design" checkbox.
API conclusion: mimic the Counter API, except add a parameter that allows restricting its maximum size, ideally in bytes. Make it clear that approximations were necessary to achieve this, and what the trade-offs are (Counter methods we cannot support, performance/memory balance implications).
@jayantj @menshikh-iv do you agree?
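For illustration, a rough sketch of what that conclusion might translate to; the class name BoundedCounter and the entry-count (rather than byte) cap are placeholders of mine, not the PR's actual API, and the fields mirror the diff above:

from collections import defaultdict

class BoundedCounter(object):
    """Counter-like API with a bounded footprint.

    Counts are approximate: items are keyed by hash, and low-frequency entries
    are pruned (escalating floor) whenever the table exceeds max_size entries.
    """
    def __init__(self, max_size=None):
        self.hash2cnt = defaultdict(int)
        self.max_size = max_size
        self.min_reduce = 0

    def update(self, iterable):
        """Tally items from an iterable (possibly a large stream), Counter-style."""
        for item in iterable:
            self.hash2cnt[hash(item)] += 1
            while self.max_size and len(self.hash2cnt) > self.max_size:
                self.min_reduce += 1
                self._prune(self.min_reduce)

    def _prune(self, min_reduce):
        # Escalating-floor prune, as in the diff above.
        for key in [k for k, v in self.hash2cnt.items() if v < min_reduce]:
            del self.hash2cnt[key]

    def __len__(self):
        return len(self.hash2cnt)

Used as counter = BoundedCounter(max_size=2000000); counter.update(token_stream), it would behave roughly like collections.Counter, minus the methods that need exact per-item bookkeeping.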
@piskvorky I agree with you.
Also, maybe we need to add a static method for size calculation (for example, if we store X tokens, how many megabytes of RAM do we need?). It can be useful when choosing the max_size parameter.
Yes, though the best place for that is probably the docstring. Something like "max_size == 1GB gives you room for approximately 100,000,000 items, and the number scales linearly with more/less RAM" (or whatever the case may be).
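If a helper method were added on top of the docstring note, a back-of-the-envelope sketch might look like this; the 10-bytes-per-item default is only inferred from the "1GB for approximately 100,000,000 items" figure above, not a measured constant:

def estimate_memory_mb(num_items, bytes_per_item=10):
    # bytes_per_item is an assumed average cost per hash-table entry;
    # measure on real data before relying on it.
    return num_items * bytes_per_item / (1024.0 * 1024.0)

print(estimate_memory_mb(100000000))  # ~953 MB, consistent with the 1GB figure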
            self.min_reduce += 1
            utils.prune_vocab(self.hash2cnt, self.min_reduce)

    def get(self, item, default=None):
__getitem__ would be good too.
@jayantj Without … I think the point of this class is that it can process very large iterables (corpora), so the …
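For illustration, both accessors could delegate to the hashed counts along these lines (a self-contained toy class, not the PR's code; whether unseen items should return 0 or raise is left open):

class HashedCounts(object):
    """Toy holder for the hash -> count mapping used in the PR."""
    def __init__(self):
        self.hash2cnt = {}

    def hash(self, item):
        return hash(item)

    def __getitem__(self, item):
        # Counter-style lookup: unseen items report 0 (subject to hash collisions).
        return self.hash2cnt.get(self.hash(item), 0)

    def get(self, item, default=None):
        # dict-style lookup with an explicit default.
        return self.hash2cnt.get(self.hash(item), default)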
        self.hash2cnt = defaultdict(int)

    def hash(self, item):
        return hash(item)
Maybe use mmh3 for this purpose?
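A minimal sketch of what swapping in mmh3 might look like, assuming the mmh3 package is available; note that it hashes str/bytes rather than arbitrary Python objects, and whether its call overhead beats the builtin hash() would need benchmarking:

import mmh3  # pip install mmh3

def hash_item(item):
    # MurmurHash3; hash64() returns a pair of 64-bit ints, we keep the first.
    return mmh3.hash64(item)[0]

print(hash_item("machine"), hash_item("learning"))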
Additional experiments and timings from @jayantj, building on top of this PR: https://github.com/jayantj/gensim/blob/fast_counter_modifications/docs/notebooks/counter_benchmarks.ipynb
In general, we want to be benchmarking on (much) larger corpora, such as the EN Wikipedia, so that the vocabulary pruning and memory-limiting functionality have a chance to kick in.
Continued by @isamaru in a dedicated repo: https://github.com/RaRe-Technologies/bounded_counter
Create a new class for fast, memory-efficient object counting (bounded RAM footprint, even for super large corpora).
To be used in Phrases, Dictionary, Word2vec etc. Basically everywhere where a vocabulary larger than RAM can be expected (long tail, Zipf's law).
Perhaps we can release this as a separate, stand-alone Python library (widely useful even outside of gensim)?