[WIP] Replace frequency counting using dict by a combination of hyperloglog and CountMinSketch #508
Conversation
```diff
@@ -242,7 +363,7 @@ def __getitem__(self, sentence):
     return [utils.to_unicode(w) for w in new_s]

-if __name__ == '__main__':
+if __name__ == '__main__' and 0:
```
What is this for?
```python
raise ValueError("min_count should be 1")
if min_count > 1:
    logger.warning("min_count should be 1")
```
Why retain `min_count` if it now must be a constant 1?
The only reason I left it is to avoid breaking the API (it retains limited functionality: this count is subtracted from all frequencies before computing the final bigram score).
I would be grateful for an example of how to handle the parameter removal elegantly.
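For illustration, one common way to retire a parameter without silently changing behavior is to keep accepting it for a deprecation cycle while warning loudly. This is a hedged sketch; the `Phrases.__init__` signature and defaults here are hypothetical, not the PR's actual code:

```python
import warnings

class Phrases:
    # Hypothetical signature, not the PR's actual code.
    def __init__(self, sentences=None, min_count=None, threshold=10.0):
        # min_count no longer affects the sketch-based counting: accept it
        # for one deprecation cycle, warn loudly, then drop it entirely.
        if min_count is not None and min_count != 1:
            warnings.warn(
                "min_count is ignored by the sketch-based implementation "
                "and will be removed in a future release",
                DeprecationWarning,
                stacklevel=2,
            )
        self.threshold = threshold
```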
In my opinion, when the underlying behavior changes, it can be safer and more honest to break the API (raise a "no such parameter `min_count`" error) than to maintain a superficial compatibility that no longer has the original effect (silently altering the API), or that fails with a thrown error in most situations where the old values would have been supported.
Perhaps here it's better for both implementations to co-exist, at least for a while, as different classes with slightly different options? That may also make it easier to compare their performance, and allow any project that prefers the old precision, or needs reproducibility of prior runs, to stay with the original implementation unless/until they want the benefits of the new one.
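One hedged sketch of this co-existence idea: keep the public class thin and inject the counting backend, so the exact and approximate implementations live side by side behind one interface. All names here (`DictVocab`, `SketchVocab`, `vocab_factory`) are hypothetical, not gensim API:

```python
class DictVocab:
    """Original behaviour: exact counts in a plain dict."""

    def __init__(self):
        self._counts = {}

    def add(self, token, n=1):
        self._counts[token] = self._counts.get(token, 0) + n

    def count(self, token):
        return self._counts.get(token, 0)

class SketchVocab:
    """New behaviour: approximate counts (hyperloglog + Count-Min Sketch)
    behind the same add()/count() interface."""
    ...  # body omitted; see the Count-Min Sketch toy further down the thread

class Phrases:
    def __init__(self, vocab_factory=DictVocab):
        # Old precision by default; callers opt in to the sketch explicitly.
        self.vocab = vocab_factory()
```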
Agreed with @gojomo -- we don't want to keep dead code around.
And in the case of `Phrases`, we don't need the "old" behaviour either. The `Phrases` functionality is new and not widely used yet, so a clear release note saying things have changed seems enough here.
Starting to look up, good progress! Next steps:
I'm pretty excited about an improved phrases-detection, so here are a bunch of random thoughts/comments:

Re: benchmarks

- Can peak memory usage be captured, as well? (Might any of the up-to-5x slowdowns be caused by swapping?)
- Are the timings just for an initial 'survey' pass, or do they also include one 'convert-to-phrases' pass?
- Would be very useful to see differences in vocab/bigram count, inferred phrases, and speed/memory performance on a real corpus, like Wikipedia. (I could possibly try this sometime next week.)

Re: possible optimizations

- It seems the 'step' (40,000,000 / 20 = 2,000,000) chunks of sentences are used to minimize calls to the (expensive?) […]: (1) caching the list of […]; (2) rather than using 2M-batches for precise counts, and at the end doing […]
- Can CountMinSketch objects of the same shape be added together? If so, that's one plausible path to parallelization; see the toy sketch after this comment. (More generally, here and other parts of gensim may benefit from the idea of a corpus that can be read from many files, or many start-points in a file, by separate processes, to get away from the one-linear-reader-handing-items-to-many-workers pattern that doesn't work well with the Python GIL & cross-process serialization overhead.)

Re: misc

- It might be interesting to store the unigram and bigram counts (and overall unique tallies) in separate structures, for more visibility into what's happening and to give them separate precisions.
- It seems the tunable count parameters (or documentation of same) should include some relationship for 'expected unique inserts'. (Don't the actual error margins depend on how saturated the structure gets, like a Bloom filter, and beyond a certain chosen load factor the errors go beyond target levels?)
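On the mergeability question: yes, two Count-Min Sketches built with the same width, depth, and hash functions can be combined by element-wise addition of their tables, which is what makes per-shard counting followed by a merge plausible. A minimal toy sketch (not the class from this PR):

```python
import zlib

class CountMinSketch:
    """Toy Count-Min Sketch; the real class in this PR may differ."""

    def __init__(self, width=2000, depth=5):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, key, row):
        # Salt the key with the row number to get per-row hash functions.
        return zlib.crc32(("%d:%s" % (row, key)).encode("utf8")) % self.width

    def add(self, key, count=1):
        for row in range(self.depth):
            self.table[row][self._index(key, row)] += count

    def query(self, key):
        # Never underestimates; may overestimate on hash collisions.
        return min(self.table[row][self._index(key, row)]
                   for row in range(self.depth))

    def merge(self, other):
        # Only valid for sketches with identical shape and hash functions.
        assert (self.width, self.depth) == (other.width, other.depth)
        for row in range(self.depth):
            for col in range(self.width):
                self.table[row][col] += other.table[row][col]

# Counts gathered over two corpus shards can be combined after the fact:
a, b = CountMinSketch(), CountMinSketch()
a.add("new_york")
b.add("new_york")
b.add("machine_learning")
a.merge(b)
assert a.query("new_york") >= 2  # == 2 barring collisions
```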
I'll also push other improvements to phrase detection (not related to counting, mostly perf)... probably next weekend. Since this is a major revamp, it's probably best to keep things in one place, to avoid git conflicts.
Hi, nice to see someone continued the work. Sadly, due to time/personal issues I couldn't continue. I just wanted to share that I did profile my original code, and I found that one of the things hurting performance was the calculation of the hash functions. I believe some Cython magic might come in handy. @janrygl is still working on this PR, right? Let me know if I can help somehow.
@janrygl has other duties now, so this PR is not "active" at the moment :) Any help welcome! Btw, there are good, fast hash functions in […]
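Since the library name above was lost in extraction, a generic note on cheap hashing: the Kirsch-Mitzenmacher double-hashing trick derives all `depth` row hashes from just two base hashes, cutting the per-key hashing cost that the profiling above identified. A small standard-library-only sketch (illustrative, not this PR's code):

```python
import hashlib
import zlib

def row_indices(key, depth, width):
    """Derive `depth` table indices from two base hashes (Kirsch-Mitzenmacher)."""
    data = key.encode("utf8")
    h1 = zlib.crc32(data)
    # Force h2 odd so the sequence h1 + i*h2 cycles through distinct values.
    h2 = int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big") | 1
    return [(h1 + i * h2) % width for i in range(depth)]

print(row_indices("new_york", depth=5, width=2000))
```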
Pinging @mfcabrera @janrygl -- do you think this can be part of the gensim January release?
@tmylk It depends on priorities defined by @piskvorky. I hurt my right hand at the beginning of January and I am behind schedule with all my projects.
@janrygl won't have time for this in January for sure; I don't know about @mfcabrera. It's a great little algorithmic project though, very pleasant. I wish I had time to tinker with this myself -- it would make for an exciting blog post / series!
Relevant read: Extension to hyperloglog as used by Google. CC @tmylk
Another interesting article: https://www.linkedin.com/pulse/from-count-min-sketch-tree-23-guillaume-pitel
Any updates on this branch? The Phrases implementation is so slow that it is making me switch to a different library for doc2vec. Any update will be helpful.
@thescopan I don't think so -- feel free to contribute. Also, a re-implementation of the existing phrases in C/Cython would be appreciated too. It's a really small and trivial change, just one loop, but nobody has done it yet. CC @tmylk.
Cythonising Phrases will be done this summer as part of GSoC.
Awesome! This is a much needed functionality. A fast & scalable collocation (phrase) detection is sorely missing -- even in our own non-open-source projects.
More specifically, it's on @prakhar2b's timeline for end of June.
More discussion on hyperloglog as used inside Reddit:
Ping @janrygl, what's the status of this PR? Will you finish it soon?
@menshikh-iv isn't this one of our Google Summer of Code projects this year?
@piskvorky cythonising phrases is a project of the current GSoC, but, as I understand it, that is not exactly the same as the topic of this PR.
Nearly the same -- efficient counting is the biggest challenge there. It's an extremely common task, widely useful, and that's why I'd like this to be an independent library.
Connected with #1446
Re. #400 and related to #406.

Changes (see the end-to-end sketch below):

- `hyperloglog` for counting vocabulary size (default vocabulary size error 1%)
- `CountMinSketch` for frequency counting (if F is the true frequency, the result is in the range [F, F * (1 + e)], default e = 0.01)
- `defaultdict`s for chunk processing (it can be rewritten to parallel chunk building)

Comparison of old and new implementation:
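To make the three bullets in the change list concrete, here is a hedged end-to-end sketch of the counting scheme: a HyperLogLog estimates the vocabulary size, a Count-Min Sketch stores approximate frequencies, and a `defaultdict` buffers exact counts per chunk before flushing. It assumes the `hyperloglog` PyPI package and reuses the toy Count-Min Sketch from earlier in this thread; neither is the PR's actual code:

```python
from collections import defaultdict
import zlib

from hyperloglog import HyperLogLog  # pip install hyperloglog (assumed API)

class CountMinSketch:
    """Toy Count-Min Sketch, as in the merge example earlier in this thread."""

    def __init__(self, width=2000, depth=5):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, key, row):
        return zlib.crc32(("%d:%s" % (row, key)).encode("utf8")) % self.width

    def add(self, key, count=1):
        for row in range(self.depth):
            self.table[row][self._index(key, row)] += count

    def query(self, key):
        return min(self.table[row][self._index(key, row)]
                   for row in range(self.depth))

def count_tokens(sentences, chunk_size=2000000):
    hll = HyperLogLog(0.01)    # ~1% error on the vocabulary size
    cms = CountMinSketch()
    buffer = defaultdict(int)  # exact counts within the current chunk
    buffered = 0
    for sentence in sentences:
        for token in sentence:
            hll.add(token)
            buffer[token] += 1
            buffered += 1
        if buffered >= chunk_size:
            for token, n in buffer.items():  # flush the chunk into the sketch
                cms.add(token, n)
            buffer.clear()
            buffered = 0
    for token, n in buffer.items():          # flush the final partial chunk
        cms.add(token, n)
    return hll, cms

hll, cms = count_tokens([["new", "york", "new", "york", "city"]])
print(len(hll))          # approximate vocabulary size (3 here)
print(cms.query("new"))  # >= true count of "new", which is 2
```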