[WIP] Cythonizing phrases module #1385
Conversation
I know it's still early days, but I'm excited to see these developments! Will there be a notebook documenting the gradual progress? (speed of pure Python / naive Cython / optimized Cython / parallelized Cython..., on some substantial dataset, such as the English Wiki)
utils.prune_vocab(vocab, min_reduce)
min_reduce += 1

logger.info("collected %i word types from a corpus of %i words (unigram + bigrams) and %i sentences" %
Code style: hanging indent, not vertical indent (good practice to use hanging indent from the start, to minimize necessary fixes later).
Also, it's slightly better to pass the logger.info arguments as arguments, instead of formatting the string directly with %.
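For illustration, a minimal sketch of the suggested style (the variable names and dummy values here are assumptions, not the PR's actual code):

import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

vocab, total_words, sentence_no = {"to": 2, "be": 2}, 100, 10  # dummy values for the demo

# hanging indent, and arguments passed to logger.info (lazy %-formatting)
# instead of pre-formatting the string with %
logger.info(
    "collected %i word types from a corpus of %i words (unigram + bigrams) and %i sentences",
    len(vocab), total_words, sentence_no,
)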
@@ -0,0 +1,63 @@
#!/usr/bin/env cython
# cython: boundscheck=False
Better to leave this as True during debugging.
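As an aside, a sketch of the per-function alternative (not part of this PR): Cython's pure-Python-mode decorators let you keep the global default at True while disabling checks only for loops that are already verified.

import cython

@cython.boundscheck(False)  # disable bounds checks only here, once this loop is verified
@cython.wraparound(False)
def bincount_ids(ids, counts):
    for i in range(len(ids)):
        counts[ids[i]] += 1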
gensim/models/phrases.py
Outdated
logger.info("Cython file loaded") | ||
except ImportError: | ||
logger.info("Cython file not loaded") | ||
#failed... fall back to plain numpy (20-80x slower training than the above) |
Is the 20-80x figure correct? If not, better remove the stale comments and start the development with a clean slate.
@piskvorky Thanks for making comments here. Yes, I'll document the progress as I go along. So far, I've been working on learn_vocab. Also, I've been researching a lot about optimization, and am currently looking to cythonize and parallelize the code. It would be great to have some advice from you regarding optimization (specifically, for the phrases module), considering your experience with the word2vec optimization. 😄
gensim/models/phrases.py
Outdated
if sentence_no % progress_per == 0:
    logger.info("PROGRESS: at sentence #%i, processed %i words and %i word types" %
        (sentence_no, total_words, len(vocab)))
#sentence = [utils.any2utf8(w) for w in sentence]
@piskvorky was there any particular reason behind creating the vocab for Phrases with utf8-encoded bytestrings, rather than with unicode strings themselves?
Currently, according to profiling done by Prakhar, the utf8 conversion significantly affects performance, due to the overhead of the conversion itself and the fact that it is done for every individual word.
Yes, saving memory.
Up to Python 3.3 (and including all of Python 2.x), unicode strings take up 2-4x as much memory as UTF8 bytestrings, for normal text.
Since memory is more critical than speed here, we went with UTF8.
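A quick way to see the difference (a sketch; exact sizes depend on the Python version, and PEP 393 in Python 3.3+ narrowed the gap for ASCII-only text):

import sys

token = "interesting"                       # 11 ASCII characters
print(sys.getsizeof(token))                 # e.g. 60 on CPython 3.x (str header + 1 byte/char)
print(sys.getsizeof(token.encode("utf8"))) # e.g. 44 -- smaller bytes header, same 1 byte/char

accented = "žluťoučký"                      # non-ASCII text: stored at 2+ bytes/char as str,
print(sys.getsizeof(accented))              # but mostly 1-2 bytes/char once UTF8-encoded
print(sys.getsizeof(accented.encode("utf8")))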
@prakhar2b Sure, I'll be happy to help.

I assume step 1, the most important and most difficult part, is the "item counting" functionality. Building phrases on top of that should be trivial -- just pass in the right input & calculate the right stats from its output.

Before any Cythonization or parallelization, decide on the architecture of the module -- what kind of inputs it will accept (probably any iterable of hashable items), what algorithms, what data structures. This is a critical piece of functionality, so we want something super fast and memory-efficient; see #508, #556. Then implement that in pure Python for reference. Then profile, and rewrite the hotspots in Cython, for performance.

Finally, see if parallelization makes sense -- probably something more coarse-grained: updating a counter is such a tiny operation that there's no point parallelizing it, the overhead would be too big (we cannot afford any data movement or marshalling). Maybe accept multiple iterables on input and update the counter over all of them in parallel? (The use-case being that people will supply readers from multiple files at once, and the algorithm will consume them in parallel.) A rough pure-Python sketch of that counting step follows below.
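As a starting point, one possible pure-Python reference for the counting step (illustrative only -- count_items and its exact pruning policy are assumptions, loosely mirroring the PR's prune_vocab/min_reduce logic):

from collections import defaultdict

def count_items(item_streams, max_vocab_size=40000000):
    """Count hashable items from any number of input iterables, pruning the
    counter Word2Vec-style whenever it outgrows max_vocab_size (lossy counting)."""
    counts = defaultdict(int)
    min_reduce = 1
    for stream in item_streams:  # one stream = one coarse-grained unit of work
        for item in stream:
            counts[item] += 1
            if len(counts) > max_vocab_size:
                # drop all items seen fewer than min_reduce times
                for key in [k for k, v in counts.items() if v < min_reduce]:
                    del counts[key]
                min_reduce += 1
    return counts

# usage: counts = count_items([corpus_a_tokens, corpus_b_tokens])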
Please add the benchmark notebook we discussed to this PR, along with clear labels about which commit (commit hash) was used to create which benchmark.
self.min_count = min_count
self.threshold = threshold
self.max_vocab_size = max_vocab_size
self.vocab = defaultdict(int)  # mapping between utf8 token => its count
self.min_reduce = 1  # ignore any tokens with count smaller than this
self.delimiter = delimiter
self.delimiter = delimiter if recode_to_utf8 else utils.any2unicode(delimiter)
How does this work? Why is delimiter unicode if we're not recoding (i.e., if we're using bytestrings)?
This deserves a comment, at least.
try:
    from gensim.models.phrases_inner import learn_vocab
except ImportError:
    logger.info("failed to load cython")
But we're not loading Cython itself -- this needs a better (and more descriptive) message.
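For illustration, a sketch of a more descriptive fallback (the wording and the logger.warning level are assumptions, not this PR's code):

import logging

logger = logging.getLogger(__name__)

try:
    # compiled extension, built from phrases_inner.pyx
    from gensim.models.phrases_inner import learn_vocab
except ImportError:
    logger.warning(
        "could not import the compiled phrases_inner extension; "
        "falling back to the slower pure-Python learn_vocab"
    )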
I'm closing this PR because this code doesn't give the needed performance (we expected a minimum 10x speedup through Cython, but got only 2.5x through the any2utf8 hack).
Performance improvement in the Phrases module.
For context: link to my GSoC live blog.
Phrases optimization benchmark (for text8).