Example Cython vocabulary count code (depends on external libraries, may complicate build) #556
Conversation
…nder-document Cython modules, preshed and MurmurHash
… Compile Cython module in setup.py
…ds. Currently there's a likely bug in the counts in the case of mixed unicode and string data, as the two will receive different keys.
Ran some benchmarks, using the POS-tagged and chunked Reddit data I'm using for some "sense2vec" experiments. These run a
The 10^8 dictionary was occupying 50-60% of the system's memory, so about 8-10 GB of data. I don't think the system was swapping. My experience is that past a certain size, the Python dictionary starts to perform really badly. I've never looked at the code to try to figure out why. The Cython code only remembers the strings for the words that cross the minimum count threshold, so memory requirements are much smaller. The extent of that will depend on the minimum threshold selected, and the data being processed. I would normally run something like this with simple multi-processing. I like the little wrapper provided by
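Roughly, the idea in pure Python (a sketch only; the real code is Cython and uses MurmurHash, with Python's built-in hash() standing in here):

```python
from collections import defaultdict

def count_vocab(token_stream, min_count=5):
    """Count by hash key; keep the surface string only for frequent words."""
    counts = defaultdict(int)   # hash key -> count
    strings = {}                # hash key -> string, stored once count >= min_count
    for token in token_stream:
        key = hash(token)       # stand-in for a 64-bit MurmurHash
        counts[key] += 1
        if counts[key] >= min_count and key not in strings:
            strings[key] = token
    return {strings[k]: n for k, n in counts.items() if k in strings}
```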
This is great, thanks @honnibal! Not sure why the large Python dict is so disproportionately slower; I've never observed/noticed that before. Regarding multiprocessing -- how would you do this? It looks like the counting is way too fast to benefit from coarse-grained message passing between processes. What kind of inter-process communication do you have in mind? @janrygl Also relevant to our new "counter abstraction API" (hyperloglog, MinCountSketch) in #508.
@cscorley @gojomo We should also decide how to handle our "pure Python" restriction going forward. It used to be that everything in gensim was pure Python, using NumPy where performance was critical. This is no longer true: we offer alternative, optional C code paths in word2vec and doc2vec (and LDA soon). Advantages of requiring a mandatory compiler:
Disadvantages:
I probably forgot other important points; feel free to add / extend.
Please benchmark this on your own machines and data. As I said, I only ran this on my "nlp2vec" experiments, where the vocab is super sparse. So, your mileage may vary.

My take on the Python vs Cython thing is: if you're depending on numpy, your users need either conda, a wheel install, or a C compiler anyway. Otherwise how are they using numpy? Most of numpy is written in C and Cython, so you already have tonnes of compiled code in your library :)

Personally, I'd also rather fail explicitly if the library is not installed correctly. If Gensim somehow fails to use the compiled path, the word2vec and doc2vec code is actually non-functional for most use-cases.
Re multi-processing: send the data to the workers in large batches, like a batch-size of 100k or 1m or whatever. I use
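Something along these lines (just a sketch with the standard-library multiprocessing.Pool, not the wrapper I mentioned; assumes one sentence per line in a plain text file):

```python
from collections import Counter
from itertools import islice
from multiprocessing import Pool

def count_batch(lines):
    """Worker: whitespace-tokenize and count one batch of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def iter_batches(fileobj, batch_size=100_000):
    """Yield lists of up to batch_size lines each."""
    while True:
        batch = list(islice(fileobj, batch_size))
        if not batch:
            return
        yield batch

def parallel_count(path, processes=4, batch_size=100_000):
    total = Counter()
    with open(path) as f, Pool(processes) as pool:
        for partial in pool.imap_unordered(count_batch, iter_batches(f, batch_size)):
            total.update(partial)   # merge each worker's tally in the master
    return total
```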
NumPy is everywhere, that's no issue :) Re. chunking -- sure, but I suspect the communication overhead (pickling, unpickling) would be greater than the actual work done then (just incrementing counts). Worth checking though.
I'm starting to think that the advantages Cython gives are worth dropping the pure Python paths (maybe this warrants opening a separate issue thread?).

First, as @honnibal mentioned already, installing numpy from scratch requires compilers for both C and Fortran. That's already one more compiler than Cython needs 😉 Both packages are available through Anaconda and the like, and also various Linux distributions. I'm not sure users are losing out much here. A question here: how is Anaconda currently handling gensim with/without Cython?

Second, maintaining two code paths that do the same thing is work (shoutout to @gojomo). Relatedly, triaging bug reports between actual bugs and simply not having Cython working correctly is also work.

Third, the "for humans" part should be expressed by flexible APIs around this code. Humans shouldn't see any of these details; whether it is in Cython shouldn't matter past day 0. For instance, something that needs work is the difference between using Similarity for building an index with LSI/LDA vs how most_similar works in Doc2vec. Also, making it clear/obvious why inference can lead to slightly different results is a common pain-point. These are just a couple of (more pressing) examples where "for humans" isn't working well.

Finally, one point against Cython: it will discourage newcomers from contributing bug fixes and features. This is one reason I started using this package: to be able to read and modify the code as I needed, since it was in Python. Cython will be another step up for a lot of people.
The big slowdown in the 10^8 sentences case is suspicious to me as well; it may indicate something that can be fixed in pure Python.

For a multiprocessing speed win, we could also work with text sources that can provide ranges to be read from each process independently. (Either: split over many files, or a single file in a format that allows random seeks to aligned records.) Then each process is assigned files/ranges (rather than fed giant chunks of text from a bottleneck master reading process), does its own open/seek/read, and only reports back its full tallies at the end, for a final merge. That could allow nearly-linear speed improvement with the number of processors (memory permitting).

There are some rare cases where @honnibal's map-of-hashes technique could result in a mis-count: 64-bit hash collisions. (Barring intentionally-crafted collisions, you'd only expect a first collision when going over 5 billion tokens.) In practice, no one might care, but my preference would be to always offer a precise count as a prominent option (and perhaps the default), and make it clear when any approximation/optimization for speed (or compactness) might be used instead.

Performance can be very critical for this problem domain, for usual datasets, so I think there's no escaping some uses of closer-to-native code... especially when it offers 10x+ speedups. In all cases keeping APIs as 'pythonic' as possible still serves the main goal. (Maybe at some point the Cython could be further minimized, retaining similar performance, with tools like Numba? I've seen nice demos but haven't yet tried it...)
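Back-of-the-envelope on the collision point (my own arithmetic, not measured): the birthday bound for a 64-bit hash puts the first expected collision at roughly sqrt(pi/2 * 2^64) ≈ 5.4 billion distinct keys, and the probability stays tiny for realistic vocabulary sizes:

```python
import math

SPACE = 2 ** 64   # size of the 64-bit hash space

# Expected number of distinct keys before the first collision (birthday bound).
print(math.sqrt(math.pi / 2 * SPACE))        # ~5.4e9

# Probability of any collision among k distinct keys: ~1 - exp(-k^2 / 2N).
k = 10 ** 7                                  # e.g. a 10-million-type vocabulary
print(1 - math.exp(-k ** 2 / (2 * SPACE)))   # ~2.7e-06
```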
I think this new paradigm, using multiple stream readers rather than a single one, could be useful. That scenario is fairly common in practice. The question is, how do we wrap that in a sane API? Start offering

We're getting slightly off-topic, but it seems a major API facelift in gensim is in order :)
Well, everything about the operation of this algorithm is approximation :). I mean, even for these counts,

So, I think it would actually be misleading and weird to talk about this source of imprecision, or worse, to surface it to the user as a choice. Why should the user have to think through the logic of the birthday paradox, check your estimate of the hash collision probability, and then see that indeed the hash collision probability totally doesn't matter?
@piskvorky Specifically about this PR - is it ready to merge? Are we taking on the new dependencies?
Do not merge yet; I owe Matthew a more thorough API plan and roadmap as discussed. Didn't get to that yet.
Currently revisiting this. It's possible to do much better than the Cython code above, but only by accepting a much narrower API. I have a version now that seems to be about 5x faster; it seems to be disk-bound at this point. Specifically, if you require the data to be passed in as a byte-string with space/newline delimited tokens, the program can be very fast. But if users can pass in tokens as a sequence of strings, I think performance will always be sort of bad. I think it's often hard to add performance back in after you've committed to a certain (Python) API.

For the use-cases I'm interested in, performance is super important. I'm often telling people that doing NLP is like computing in a world where Moore's law never happened: job sizes outpace improvements in computational resources. So, for my own use of these things, I've found myself wanting much more performance-oriented, and much more restrictive, versions of some of the Gensim functionality. The question is whether these should be merged into Gensim, or whether it's more natural to break these out into another project.
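To make the contrast concrete (sketch only; the pure-Python bodies below are equally slow, it's the input contract that matters, since Cython can scan a delimited byte buffer without creating a Python object per token):

```python
from collections import Counter

def count_bytes(data: bytes) -> Counter:
    """Restricted API: one bytes blob, tokens delimited by spaces/newlines."""
    return Counter(data.split())

def count_token_lists(sentences) -> Counter:
    """Flexible API: any iterable of token lists (Python str objects)."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence)
    return counts
```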
(btw, closing this was a misclick =/)
No worries :)

Regarding non-Python, constrained APIs: definitely not the path we want to take with gensim, generally speaking. I mean, we could simulate the "raw string" approach by concatenating all tokens into a single string, then passing that on, rather trivially. But I suspect that would negate any performance gains again, right :) Or, we could offer both options: Python-friendly and flexible for normal users and use-cases, and this "input comes in very restricted C-like form" for power users.

Re. a standalone word-counting library: I like that idea. We're also working on algorithmic improvements in parallel to such low-level optimizations (approximate counts, parallelization; #508), and I think this particular task is universal enough, and hard enough, to be useful as a standalone lib. My only constraint there is that the lib is sane to install and maintain across platforms, so we can add it as a dependency from gensim.
Ping @honnibal, what's the status of this PR? Will you finish it soon?
@menshikh-iv this is one of our GSoC projects this year, I believe. I'm in favour of making this "fast, parallelized, memory efficient counting" functionality a separate package, rather than a subpackage of gensim.
Connected with #1446
This PR passes tests but is not ready to merge. Added as part of discussion in PR #406 and Issue #400
This provides some Cython code to do fast vocabulary counting, with fall-back to the previous Python implementation if the code can't compile. Uses libraries developed for spaCy, but I've avoided depending on spaCy itself.
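The fall-back follows the usual optional-extension pattern, roughly like this (module and function names below are illustrative, not necessarily the ones in the diff):

```python
try:
    # compiled Cython extension (hypothetical module name)
    from gensim.corpora.count_inner import count_vocab
    FAST_COUNTING = True
except ImportError:
    # pure-Python fallback, used when the extension failed to build or import
    from collections import defaultdict

    def count_vocab(sentences, min_count=5):
        counts = defaultdict(int)
        for sentence in sentences:
            for token in sentence:
                counts[token] += 1
        return {word: n for word, n in counts.items() if n >= min_count}

    FAST_COUNTING = False
```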
While this implementation is passing the tests, and preliminary experiments suggest it's much faster, this code comes with some significant caveats. It depends on my MurmurHash wrapper, which means there's this problem of finding the C header files to compile the code. This is causing us a lot of installation headaches for spaCy. It might be that your fall-back strategy avoids the problem. But, maybe not. Beware :).
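The usual workaround for the header problem is to add the wrapper's include directory to the extension's build flags; whether the murmurhash package exposes a numpy-style get_include() helper is an assumption here, so treat this setup.py fragment as a sketch:

```python
# setup.py fragment (sketch only; module names are hypothetical)
from setuptools import Extension

try:
    import murmurhash
    include_dirs = [murmurhash.get_include()]  # assumption: numpy-style helper
except (ImportError, AttributeError):
    include_dirs = []                          # headers not found -> build may fail

extensions = [
    Extension(
        "gensim.corpora.count_inner",
        sources=["gensim/corpora/count_inner.pyx"],
        include_dirs=include_dirs,
    ),
]
# cythonize(extensions) would be passed to setup(); if compilation fails,
# the pure-Python counting path is used instead.
```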