[WIP] Cythonizing phrases module #1385
Conversation
I know it's still early days, but I'm excited to see these developments! Will there be a notebook documenting the gradual progress? (speed of pure Python / naive Cython / optimized Cython / parallelized Cython..., on some substantial dataset, such as the English Wiki)
utils.prune_vocab(vocab, min_reduce)
min_reduce += 1

logger.info("collected %i word types from a corpus of %i words (unigram + bigrams) and %i sentences" %
Code style: hanging indent, not vertical indent (good practice to use hanging indent from the start, to minimize necessary fixes later).
Also, it's slightly better to pass the logger.info arguments as arguments, instead of formatting the string directly with %.
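For illustration, a minimal sketch of the suggested style (the variable names and dummy values here are assumptions, not the PR's actual code):

import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

vocab, total_words, sentence_no = {"to": 2, "be": 2}, 100, 10  # dummy values for the demo

# hanging indent, and arguments passed to logger.info (lazy %-formatting)
# instead of pre-formatting the string with %
logger.info(
    "collected %i word types from a corpus of %i words (unigram + bigrams) and %i sentences",
    len(vocab), total_words, sentence_no,
)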
@@ -0,0 +1,63 @@
#!/usr/bin/env cython
# cython: boundscheck=False
Better to leave this as True during debugging.
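As an aside, a sketch of the per-function alternative (not part of this PR): Cython's pure-Python-mode decorators let you keep the global default at True while disabling checks only for loops that are already verified.

import cython

@cython.boundscheck(False)  # disable bounds checks only here, once this loop is verified
@cython.wraparound(False)
def bincount_ids(ids, counts):
    for i in range(len(ids)):
        counts[ids[i]] += 1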
gensim/models/phrases.py
Outdated
logger.info("Cython file loaded") | ||
except ImportError: | ||
logger.info("Cython file not loaded") | ||
#failed... fall back to plain numpy (20-80x slower training than the above) |
Is the 20-80x figure correct? If not, better remove the stale comments and start the development with a clean slate.
@piskvorky Thanks for making comments here. Yes, I'll document the progress as I go along. So far, I've been working on learn_vocab. Also, I've been researching a lot about optimization, and am currently looking to cythonize and parallelize the code. It would be great to have some advice from you regarding optimization (specifically, for the phrases module), considering your experience with the word2vec optimization. 😄
gensim/models/phrases.py
Outdated
if sentence_no % progress_per == 0:
    logger.info("PROGRESS: at sentence #%i, processed %i words and %i word types" %
        (sentence_no, total_words, len(vocab)))
#sentence = [utils.any2utf8(w) for w in sentence]
@piskvorky was there any particular reason behind creating the vocab for Phrases with utf8-encoded bytestrings, rather than with unicode strings themselves?
Currently, according to profiling done by Prakhar, the utf8 conversion significantly affects performance, due to the overhead of the conversion itself and the fact that it is done for every individual word.
Yes, saving memory.
Up to Python 3.3 (and including all of Python 2.x), unicode strings take up 2-4x as much memory as UTF8 bytestrings, for normal text.
Since memory is more critical than speed here, we went with UTF8.
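A quick way to see the difference (a sketch; exact sizes depend on the Python version, and PEP 393 in Python 3.3+ narrowed the gap for ASCII-only text):

import sys

token = "interesting"                       # 11 ASCII characters
print(sys.getsizeof(token))                 # e.g. 60 on CPython 3.x (str header + 1 byte/char)
print(sys.getsizeof(token.encode("utf8"))) # e.g. 44 -- smaller bytes header, same 1 byte/char

accented = "žluťoučký"                      # non-ASCII text: stored at 2+ bytes/char as str,
print(sys.getsizeof(accented))              # but mostly 1-2 bytes/char once UTF8-encoded
print(sys.getsizeof(accented.encode("utf8")))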
@prakhar2b Sure, I'll be happy to help.

I assume step 1, the most important and most difficult part, is the "item counting" functionality. Building phrases on top of that should be trivial -- just pass in the right input & calculate the right stats from its output.

Before any Cythonization or parallelization, decide on the architecture of the module -- what kind of inputs it will accept (probably any iterable of hashable items), what algorithms, what data structures. This is a critical piece of functionality, so we want something super fast and memory-efficient; see #508, #556. Then implement that in pure Python for reference. Then profile, and rewrite the hotspots in Cython, for performance.

Finally, see if parallelization makes sense -- probably something more coarse-grained: updating a counter is such a tiny operation that there's no point parallelizing it, the overhead would be too big (we cannot afford any data movement or marshalling). Maybe accept multiple iterables on input and update the counter over all of them in parallel? (The use-case being that people will supply readers from multiple files at once, and the algorithm will consume them in parallel.) A rough pure-Python sketch of that counting step follows below.
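As a starting point, one possible pure-Python reference for the counting step (illustrative only -- count_items and its exact pruning policy are assumptions, loosely mirroring the PR's prune_vocab/min_reduce logic):

from collections import defaultdict

def count_items(item_streams, max_vocab_size=40000000):
    """Count hashable items from any number of input iterables, pruning the
    counter Word2Vec-style whenever it outgrows max_vocab_size (lossy counting)."""
    counts = defaultdict(int)
    min_reduce = 1
    for stream in item_streams:  # one stream = one coarse-grained unit of work
        for item in stream:
            counts[item] += 1
            if len(counts) > max_vocab_size:
                # drop all items seen fewer than min_reduce times
                for key in [k for k, v in counts.items() if v < min_reduce]:
                    del counts[key]
                min_reduce += 1
    return counts

# usage: counts = count_items([corpus_a_tokens, corpus_b_tokens])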
Please add the benchmark notebook we discussed to this PR, along with clear labels about which commit (commit hash) was used to create which benchmark.
self.min_count = min_count
self.threshold = threshold
self.max_vocab_size = max_vocab_size
self.vocab = defaultdict(int)  # mapping between utf8 token => its count
self.min_reduce = 1  # ignore any tokens with count smaller than this
self.delimiter = delimiter
self.delimiter = delimiter if recode_to_utf8 else utils.any2unicode(delimiter)
How does this work? Why is delimiter unicode if we're not recoding (i.e., if we're using bytestrings)?
This deserves a comment, at least.
try:
    from gensim.models.phrases_inner import learn_vocab
except ImportError:
    logger.info("failed to load cython")
But we're not loading Cython itself -- this needs a better (and more descriptive) message.
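For illustration, a sketch of a more descriptive fallback (the wording and the logger.warning level are assumptions, not this PR's code):

import logging

logger = logging.getLogger(__name__)

try:
    # compiled extension, built from phrases_inner.pyx
    from gensim.models.phrases_inner import learn_vocab
except ImportError:
    logger.warning(
        "could not import the compiled phrases_inner extension; "
        "falling back to the slower pure-Python learn_vocab"
    )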
I'm closing this PR because this code doesn't give the needed performance (we expected a minimum 10x speedup through Cython, but got only 2.5x through the any2utf8 hack).
Performance improvement in the Phrases module.
For context: link to my GSoC live blog.
Phrases optimization benchmark (for text8).