
[WIP] Cythonizing phrases module #1385

Closed · wants to merge 18 commits

Conversation


@prakhar2b (Contributor) commented Jun 2, 2017

Performance improvements in the Phrases module.
For context: link to my GSoC live blog.

Phrases optimization benchmark (for text8):

Optimization                  Python 2.7    Python 3.6    PR
original                      ~36-38 sec    ~32-35 sec
cython (static typing)        ~30-32 sec
any2utf8 (without cython)     ~20-22 sec    ~23-26 sec    #1413
cython (with any2utf8)        ~15-18 sec    ~19-21 sec    This PR
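A timing sketch of the kind behind the numbers above (assumptions: the text8 file sits in the working directory, and default Phrases parameters are used; the exact invocation is illustrative, not from the PR):

import time
from gensim.models.word2vec import Text8Corpus
from gensim.models.phrases import Phrases

sentences = Text8Corpus('text8')  # path is illustrative
start = time.time()
bigram = Phrases(sentences)  # builds the bigram phrase vocabulary
print("Phrases trained in %.1f sec" % (time.time() - start))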

@piskvorky (Owner) left a comment:

I know it's still early days, but I'm excited to see these developments! Will there be a notebook documenting the gradual progress? (speed of pure Python / naive Cython / optimized Cython / parallelized Cython..., on some substantial dataset, such as the English Wiki)

utils.prune_vocab(vocab, min_reduce)
min_reduce += 1

logger.info("collected %i word types from a corpus of %i words (unigram + bigrams) and %i sentences" %
            (len(vocab), total_words, sentence_no + 1))
@piskvorky (Owner) commented Jun 3, 2017:

Code style: use hanging indent, not vertical indent (it's good practice to use hanging indent from the start, to minimize necessary fixes later).

Also, it's slightly better to pass the values to logger.info as arguments, rather than formatting the string directly with %.
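A minimal sketch of both suggestions applied to the snippet above (variable names taken from the diff):

logger.info(
    "collected %i word types from a corpus of %i words (unigram + bigrams) and %i sentences",
    len(vocab), total_words, sentence_no + 1
)

Passing the values as separate arguments defers the string formatting until the logger actually emits the record, so the work is skipped entirely when the message is filtered out by the log level.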

@@ -0,0 +1,63 @@
#!/usr/bin/env cython
# cython: boundscheck=False
@piskvorky (Owner) commented:

Better to leave it as True during debugging.
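For illustration, the directive as it would read during development (the wraparound line is an additional, commonly paired Cython directive, not part of the diff above):

# cython: boundscheck=True   # keep index bounds checking on while debugging
# cython: wraparound=True    # likewise; flip both to False only after the code is verified and profiled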

    logger.info("Cython file loaded")
except ImportError:
    logger.info("Cython file not loaded")
    # failed... fall back to plain numpy (20-80x slower training than the above)
@piskvorky (Owner) commented Jun 3, 2017:

Is the 20-80x figure correct? If not, better to remove the stale comments and start development with a clean slate.

@prakhar2b (Contributor, Author) commented:

@piskvorky Thanks for the comments here. Yes, I'll document the progress as I go along. So far I've been working on the text8 corpus; I'll test on the English Wikipedia as well for the documentation.

Also, I've been researching optimization a lot, and I'm currently looking into cythonizing and parallelizing the code. It would be great to have some advice from you on the optimization (specifically, for the phrases module), given your experience with the word2vec optimization. 😄

if sentence_no % progress_per == 0:
    logger.info("PROGRESS: at sentence #%i, processed %i words and %i word types" %
                (sentence_no, total_words, len(vocab)))
# sentence = [utils.any2utf8(w) for w in sentence]
A contributor commented:

@piskvorky was there any particular reason for building the vocab for Phrases with utf8-encoded bytestrings, rather than with unicode strings themselves?
Currently, according to profiling done by Prakhar, the utf8 conversion significantly affects performance, due to the overhead of the conversion and the fact that it is performed for every individual word.

@piskvorky (Owner) commented Jun 15, 2017:

Yes, saving memory.

Up to Python 3.3 (and including all Python 2.x), unicode strings take up 2-4x as much memory, compared to UTF8 byte strings, for normal text.

Since memory is more critical than speed here, we went with UTF8.
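For illustration, the size difference is easy to check with sys.getsizeof (a quick sketch, not from the PR; exact numbers vary by platform and Python version):

import sys

word = u'conversation'
print(sys.getsizeof(word))                  # unicode string object
print(sys.getsizeof(word.encode('utf8')))   # UTF8 bytestring
# On Python 2.x and on 3.x up to 3.2, the unicode object is typically
# several times larger for plain ASCII text; Python 3.3+ (PEP 393) stores
# ASCII-only strings compactly, which narrows the gap.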

@piskvorky (Owner) commented Jun 15, 2017:

@prakhar2b sure, I'll be happy to help. I assume Step 1, the most important and most difficult part, is the "item counting" functionality. Building phrases on top of that should be trivial -- just pass in the right input & calculate the right stats from its output.

Before any Cythonization or parallelization, decide on the architecture of the module -- what kind of inputs it will accept (probably any iterable of hashable items), what algorithms, what data structures. This is a critical piece of functionality, so we want something super fast and memory-efficient, see #508, #556.

Then implement that in pure Python for reference.

Then profile and rewrite hotspots in Cython, for performance.

Finally, see if parallelization makes sense -- probably something more coarse-grained: since updating a counter is such a tiny operation, there's no point parallelizing that; the overhead would be too big (we cannot afford any data movement or marshalling). Maybe accept multiple iterables on input and update the counter over all of them in parallel? (The use-case being that people supply readers for multiple files at once, and the algorithm consumes them in parallel.)
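As an illustration of that Step 1, a minimal pure-Python reference for the item-counting idea (the name count_items and its signature are hypothetical, not from this PR):

from collections import defaultdict

def count_items(item_streams, max_vocab_size=40000000):
    # Count hashable items across one or more iterables, pruning rare items
    # whenever the vocabulary grows past max_vocab_size (the same idea as
    # utils.prune_vocab in gensim).
    vocab = defaultdict(int)
    min_reduce = 1
    for stream in item_streams:
        for item in stream:
            vocab[item] += 1
            if len(vocab) > max_vocab_size:
                for key in [k for k, v in vocab.items() if v < min_reduce]:
                    del vocab[key]
                min_reduce += 1
    return vocab

Because it only assumes hashable items, the same loop counts unigrams, bigram tuples, or anything else the caller streams in; accepting multiple iterables also leaves room for the coarse-grained parallel consumption described above.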

@jayantj (Contributor) commented Jun 22, 2017:

Please add the benchmark notebook we discussed to this PR, along with clear labels for which commit (commit hash) was used to produce which benchmark.
It would also be good to have a proper PR description.

self.min_count = min_count
self.threshold = threshold
self.max_vocab_size = max_vocab_size
self.vocab = defaultdict(int) # mapping between utf8 token => its count
self.min_reduce = 1 # ignore any tokens with count smaller than this
-       self.delimiter = delimiter
+       self.delimiter = delimiter if recode_to_utf8 else utils.any2unicode(delimiter)
@piskvorky (Owner) commented Jun 28, 2017:

How does this work? Why is delimiter unicode if we're not recoding (using bytestrings)?

Deserves a comment at least.

try:
    from gensim.models.phrases_inner import learn_vocab
except ImportError:
    logger.info("failed to load cython")
@piskvorky (Owner) commented:

But we're not loading cython itself here -- this needs a better, more descriptive message.
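A sketch of the kind of message being asked for (the wording is illustrative, not from the PR):

try:
    from gensim.models.phrases_inner import learn_vocab
except ImportError:
    logger.warning(
        "could not import the compiled phrases_inner extension; "
        "falling back to the slower pure-Python learn_vocab"
    )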

@menshikh-iv (Contributor) commented:
I'm closing this PR because the code doesn't deliver the needed performance (we expected at least a 10x speedup through Cython, but it achieves only ~2.5x, via the any2utf8 hack).
