
Example Cython vocabulary count code (depends on external libraries, may complicate build) #556

Closed

Conversation

@honnibal commented Dec 5, 2015

This PR passes tests but is not ready to merge. It was added as part of the discussion in PR #406 and Issue #400.

This provides some Cython code to do fast vocabulary counting, with fall-back to the previous Python implementation if the code can't compile. Uses libraries developed for spaCy, but I've avoided depending on spaCy itself.
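A minimal sketch of that fallback pattern (the module and function names below are hypothetical, not the ones used in this PR):

```python
# Hypothetical module and function names, for illustration only; the actual
# layout in this PR may differ.
try:
    # compiled Cython path, available only if the extension built successfully
    from gensim.corpora._vocab_count import count_vocab
except ImportError:
    # pure-Python fallback, used when the Cython module couldn't be compiled
    from gensim.corpora.vocab_count_py import count_vocab
```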

While this implementation is passing the tests, and preliminary experiments suggest it's much faster, this code comes with some significant caveats. It depends on my MurmurHash wrapper, which means there's this problem of finding the C header files to compile the code. This is causing us a lot of installation headaches for spaCy. It might be that your fall-back strategy avoids the problem. But, maybe not. Beware :).

…nder-document Cython modules, preshed and MurmurHash
…ds. Currently there's a likely bug in the counts in the case of mixed unicode and string data, as the two will receive different keys.
@honnibal (Author) commented Dec 5, 2015

Ran some benchmarks, using the POS-tagged and chunked Reddit data I'm using for some "sense2vec" experiments. These runs use a min_count of 5, and the data is atypical — the word distribution is far sparser than normal, because the vectors are over entities, chunks, and POS-tagged words.

Sentences   Original   Cython
10^6        6s         3s
10^7        90s        28s
10^8        3540s      301s

The 10^8 dictionary was occupying 50-60% of the system's memory, so about 8-10 GB of data. I don't think the system was swapping. My experience is that past a certain size, the Python dictionary starts to perform really badly. I've never looked at the code to try to figure out why.

The Cython code only remembers the strings for the words that cross the minimum count threshold, so memory requirements are much smaller. The extent of that will depend on the minimum threshold selected, and the data being processed.
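For illustration, a pure-Python sketch of that idea — count by 64-bit hash, and only keep the string once a token crosses the threshold. The real implementation uses the preshed/MurmurHash C libraries; the `mmh3` package here is just a stand-in:

```python
from collections import defaultdict

import mmh3  # stand-in for the MurmurHash wrapper used by the actual Cython code

def count_vocab(token_stream, min_count=5):
    counts = defaultdict(int)  # 64-bit hash -> count
    strings = {}               # hash -> token string, kept only for frequent tokens
    for token in token_stream:
        key = mmh3.hash64(token)[0]  # first half of the 128-bit MurmurHash3 result
        counts[key] += 1
        if counts[key] == min_count:
            strings[key] = token     # remember the string only once it crosses the threshold
    return {strings[k]: n for k, n in counts.items() if n >= min_count}
```

Memory then scales with the number of distinct hashes plus only the surviving strings, which is the saving described above.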

I would normally run something like this with simple multi-processing. I like the little wrapper provided by joblib for this. Example: https://github.com/honnibal/spaCy/blob/master/examples/pos_tag.py

@piskvorky (Owner)

This is great, thanks @honnibal!

Not sure why the large Python dict is so disproportionately slower; I've never observed/noticed that before.

Regarding multiprocessing -- how would you do this? It looks like the counting is way too fast to benefit from coarse-grained message passing between processes. What kind of inter-process communication do you have in mind?

@janrygl Also relevant to our new "counter abstraction API" (hyperloglog, MinCountSketch) in #508.

@piskvorky (Owner)

@cscorley @gojomo We should also decide how to go about our "pure Python" restriction going forward.

It used to be that everything in gensim was pure Python, using NumPy where performance was critical. This is no longer true and we offer alternative, optional C code paths in word2vec and doc2vec (and LDA soon).

Advantages of requiring a compiler:

  • simpler code, drop alternative code paths
  • more freedom to optimize
  • drop compiled dependencies that are only there for performance (looking at scipy in particular; see "Get rid of scipy" #557)

Disadvantages:

  • gensim harder to install for non-power users
  • "for humans" motto eroded
  • slippery slope -- may end up with a pure-C package instead :)

I probably forgot other important points; feel free to add / extend.

@honnibal (Author) commented Dec 6, 2015

Please benchmark this on your own machines and data. As I said, I only ran this on my "nlp2vec" experiments, where the vocab is super sparse. So, your mileage may vary.

My take on the Python vs Cython thing is: if you're depending on numpy, your users need either conda, a wheel install, or a C compiler anyway. Otherwise, how are they using numpy? Most of numpy is written in C and Cython, so you already have tonnes of compiled code in your library :)

Personally I'd also rather fail explicitly if the library is not installed correctly. If Gensim somehow fails to use the compiled path, the word2vec and doc2vec code is actually non-functional for most use-cases.

@honnibal (Author) commented Dec 6, 2015

Re multi-processing: send the data to the workers in large batches, like a batch-size of 100k or 1m or whatever. I use toolz.partition for this. You can see this in the POS tagging example: https://github.com/honnibal/spaCy/blob/master/examples/pos_tag.py
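A rough sketch of that batching approach, assuming pre-tokenized sentences and using `toolz.partition_all` so the final short batch isn't dropped (this is not code from this PR):

```python
from collections import Counter
from multiprocessing import Pool

from toolz import partition_all  # partition_all keeps the final, shorter batch

def count_batch(batch):
    """Count one batch of token lists in a worker process."""
    counts = Counter()
    for tokens in batch:
        counts.update(tokens)
    return counts

def count_in_batches(sentences, batch_size=100_000, processes=4):
    totals = Counter()
    with Pool(processes) as pool:
        batches = partition_all(batch_size, sentences)
        for partial in pool.imap_unordered(count_batch, batches):
            totals.update(partial)  # merge per-batch tallies in the parent
    return totals
```

Each batch has to be pickled to reach the workers, so whether this wins depends on how the serialization overhead compares to the counting itself.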

@piskvorky (Owner)

NumPy is everywhere, that's no issue :)

Re. chunking -- sure, but I suspect the communication overhead (pickling, unpickling) would be greater than the actual work done then (just incrementing counts). Worth checking though.

@cscorley (Contributor) commented Dec 6, 2015

I'm starting to think that the advantages Cython gives are worth dropping the pure-Python paths (maybe this warrants opening a separate issue thread?)

First, as @honnibal already mentioned, installing numpy from scratch requires compilers for both C and Fortran. That's already one more compiler than Cython needs 😉 Both packages are available through Anaconda and the like, and also various Linux distributions. I'm not sure users are losing out much here. A question here is how Anaconda currently handles gensim with/without Cython.

Second, maintaining two code paths that do the same thing is work (shoutout to @gojomo). Related, triaging bug reports between actual bugs and simply not having Cython working correctly is also work.

Third, the "for humans" part should be expressed by flexible APIs around this code. Humans shouldn't see any of these details; whether it is in Cython shouldn't matter past day 0. For instance, something that needs work is the difference between using Similarity for building an index with LSI/LDA vs how most_similar works in Doc2vec. Also, making it clear/obvious why inference can lead to slightly different results is a common pain-point. These are just a couple of (more pressing) examples where "for humans" isn't working well.

Finally, one point against Cython: it will discourage newcomers from contributing bug fixes and features. This is one reason why I started using this package: to be able to read and modify the code as I needed, since it was in Python. Cython will be another step up for a lot of people.

@gojomo (Collaborator) commented Dec 7, 2015

The big slowdown in the 10^8 sentences case is suspicious to me as well; it may indicate something that can be fixed in pure-python.

For a multiprocessing speed win, we could also work with text sources that can provide ranges to be read from each process independently. (Either: split over many files, or a single-file in a format that allows random-seeks to aligned records.) Then each process is assigned files/ranges (rather than fed giant chunks of text from a bottleneck master reading process), does its own open/seek/read, and only reports back its full tallies at the end, for a final merge. That could allow nearly-linear speed improvement with the number of processors (memory permitting).
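A sketch of that layout, assuming the corpus is already split across many files of whitespace-tokenized text (the file format and paths are illustrative):

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor

def count_file(path):
    """Each worker opens and reads its own file; there is no central reader process."""
    counts = Counter()
    with open(path, encoding="utf8") as f:
        for line in f:
            counts.update(line.split())
    return counts

def count_corpus(paths, workers=8):
    totals = Counter()
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # Only the per-file tallies travel between processes, once, at the end.
        for partial in pool.map(count_file, paths):
            totals.update(partial)
    return totals
```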

There are some rare cases where @honnibal's map-of-hashes technique could result in a mis-count – 64-bit hash collisions. (Barring intentionally-crafted collisions, you'd only expect a first collision when going over 5 billion tokens.) In practice, no one might care, but my preference would be to always offer a precise count as a prominent option (and perhaps the default), and make it clear when any approximation/optimization for speed (or compactness) might be used instead.
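For reference, the ~5 billion figure follows from the birthday bound for a uniform 64-bit hash (a back-of-the-envelope check, not code from this PR):

```python
import math

# Birthday bound: with N equally likely hash values, the expected number of
# distinct items hashed before the first collision is roughly sqrt(pi * N / 2).
N = 2 ** 64
expected_before_first_collision = math.sqrt(math.pi * N / 2)
print(f"{expected_before_first_collision:.2e}")  # ~5.4e+09, i.e. on the order of 5 billion
```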

Performance can be very critical for this problem domain, for usual datasets – so I think there's no escaping some uses of closer-to-native code... especially when it offers 10x+ speedups. In all cases keeping APIs as 'pythonic' as possible still serves the main goal. (Maybe at some point the cython could be further minimized, retaining similar performance, with tools like Numba? I've seen nice demos but haven't yet tried it...)

@piskvorky (Owner)

I think this new paradigm, using multiple stream readers rather than a single one, could be useful. That scenario is fairly common in practice.

Question is, how do we wrap that in a sane API? Start offering a corpora constructor parameter, instead of the current corpus (sentences, etc)? For example lda = LdaModel(corpora=[corpus1, corpus2, corpus2], num_topics=100)?

We're getting slightly off-topic, but it seems a major API facelift in gensim is in order :)

@syllog1sm

> There are some rare cases where @honnibal's map-of-hashes technique could result in a mis-count – 64-bit hash collisions. (Barring intentionally-crafted collisions, you'd only expect a first collision when going over 5 billion tokens.) In practice, no one might care, but my preference would be to always offer a precise count as a prominent option (and perhaps the default), and make it clear when any approximation/optimization for speed (or compactness) might be used instead.

Well, everything about the operation of this algorithm is approximate :). I mean, even for these counts, what we're using them for is to create a frequency estimate. It would make sense to smooth the counts, but we don't, because it doesn't matter — we just use the literal ones.

So, I think it would actually be misleading and weird to talk about this source of imprecision, or worse, to surface it to the user as a choice. Why should the user have to think through the logic of the birthday paradox, check your estimate of the hash collision probability, and then see that indeed the hash collision probability totally doesn't matter?

@tmylk (Contributor) commented Jan 10, 2016

@piskvorky Specifically about this PR - is it ready to merge? Are we taking on the new dependencies?

@piskvorky (Owner)

Do not merge yet; I owe Matthew a more thorough API plan and roadmap as discussed. Didn't get to that yet.

@honnibal (Author)

Currently revisiting this.

It's possible to do much better than the Cython code above, but only by accepting a much narrower API. I have a version now that seems to be about 5x faster. It seems to be disk bound at this point.

Specifically if you require the data to be passed in as a byte-string with space/newline delimited tokens, the program can be very fast. But if users can pass in tokens as a sequence of strings, I think performance will always be sort of bad.
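A toy illustration of the two API shapes (not this PR's Cython code; in pure Python the restricted form still allocates per-token objects, but a C/Cython loop can scan the buffer in place):

```python
from collections import Counter

def count_bytes_buffer(buf: bytes) -> Counter:
    # Restricted API: one bytes object, tokens delimited by spaces/newlines.
    # A Cython/C implementation can walk this buffer without creating a
    # Python object per token; plain .split() still does.
    return Counter(buf.split())

def count_token_lists(sentences) -> Counter:
    # Flexible API: a sequence of lists of Python strings. Convenient, but every
    # token is already a separate Python object before counting even starts.
    counts = Counter()
    for tokens in sentences:
        counts.update(tokens)
    return counts
```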

I think it's often hard to add performance back in after you've committed to a certain (Python) API. For the use-cases I'm interested in, performance is super important. I'm often telling people that doing NLP is like computing in a world where Moore's law never happened: job sizes outpace improvements in computational resources.

So, for my own use of these things, I've found myself wanting much more performance-oriented, and much more restrictive, versions of some of the Gensim functionality. The question is whether these should be merged into Gensim, or whether it's more natural to break them out into another project.

@honnibal closed this Jan 26, 2016
@honnibal reopened this Jan 26, 2016
@honnibal (Author)

(btw, closing this was a misclick =/)

@piskvorky (Owner)

No worries :)

Regarding non-Python, constrained APIs: definitely not the path we want to take with gensim, generally speaking.

I mean, we could simulate the "raw string" approach by concatenating all tokens into a single string, then passing that on, rather trivially. But I suspect that would negate any performance gains again, right :)

Or, we could offer both options. Python-friendly and flexible for normal users and use-cases, and this "input comes in very restricted C-like form" for power users.

Re. stand alone word-counting library: I like that idea. We're also working on algorithmic improvements in parallel to such low-level optimizations (approximate counts, parallelization; #508), and I think this particular task is universal enough, and hard enough, to be useful as a standalone lib.

My only constraint there is that the lib is sane to install and maintain across platforms, so we can add it as a dependency of gensim.

@menshikh-iv (Contributor)

Ping @honnibal, what's the status of this PR? Will you finish it soon?

@piskvorky (Owner)

@menshikh-iv this is one of our GSoC projects this year, I believe.

I'm in favour of making this "fast, parallelized, memory efficient counting" functionality a separate package, rather than a subpackage of gensim.

@menshikh-iv (Contributor)

Connected with #1446
