Fix method estimate_memory from gensim.models.FastText & huge performance improvement. Fix #1824 (#1916)

* Cythonize fasttext.ft_hash for 100x performance improvement
* Cythonize fasttext.compute_ngrams for 2x performance improvement
* Reduce fasttext memory usage by computing ngrams on the fly
* Fix compute_ngrams for Python 2
* Store OOV vec in variable for more informative assertion error in testPersistenceForOldVersions
* Revert all changes to fasttext_wrapper
* Fix indentation for multi-line expressions
* Rename utils_any2vec_fast to _utils_any2vec
* fasttext: Cache ngram buckets for words during training

  This removes the expensive calls to `compute_ngrams` and `ft_hash` during training and uses a simple lookup in an int -> int[] mapping instead, resulting in a dramatic increase in training performance.
* Remove last occurrences of wv.ngrams_word and wv.ngrams
* fasttext: use buckets_word cache also for non-Cython training
* fasttext: Add buckets_ngram size to memory estimate
* fasttext: Don't store buckets_word with the model
* fasttext: Use smaller model for test_estimate_memory
* fasttext: Fix pure python training code
* fasttext: Fix asserts for test_estimate_memory
* fasttext: Fix typo and style errors
* fasttext: Simplify code as per @jayantj's review
* Update MANIFEST.in and documentation with utils_any2vec implementations
* last fixes (add option for cython compiler, fix descriptions, etc)
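
The bucket cache described in the commit message above can be illustrated with a minimal pure-Python sketch. This is not the gensim implementation: `char_ngrams`, `fnv1a_32`, and `build_bucket_cache` are hypothetical names, and the hash is a plain 32-bit FNV-1a used only as a stand-in for `ft_hash`. The idea shown is that n-grams are hashed once per vocabulary word and training then only looks up a precomputed list of bucket ids by word index.

```python
# Illustrative sketch only. All names here are hypothetical, not gensim's API.

def char_ngrams(word, min_n, max_n):
    """Return the character n-grams of the word wrapped in '<' and '>'."""
    extended = "<" + word + ">"
    ngrams = []
    for n in range(min_n, min(len(extended), max_n) + 1):
        for start in range(len(extended) - n + 1):
            ngrams.append(extended[start:start + n])
    return ngrams


def fnv1a_32(text):
    """32-bit FNV-1a hash of the UTF-8 bytes of text (assumed stand-in for ft_hash)."""
    h = 2166136261
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF
    return h


def build_bucket_cache(vocab, min_n, max_n, num_buckets):
    """Map each word index to the bucket ids of its n-grams (int -> list of int)."""
    cache = {}
    for index, word in enumerate(vocab):
        cache[index] = [
            fnv1a_32(ngram) % num_buckets
            for ngram in char_ngrams(word, min_n, max_n)
        ]
    return cache


vocab = ["cat", "cats"]
cache = build_bucket_cache(vocab, min_n=3, max_n=4, num_buckets=1000)
# During training, the n-gram rows for word index 0 are found by a lookup only:
buckets_for_cat = cache[0]
```

The cache trades memory for speed: it grows with the vocabulary size times the number of n-grams per word, which would be consistent with the commit adding its size to the memory estimate and not storing it with the saved model.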
piskvorky · Mar 1, 2018 · 9021ea8
1 parent b000b4f

Showing 14 changed files with 4,829 additions and 1,362 deletions.
diff --git a/MANIFEST.in b/MANIFEST.in
@@ -12,6 +12,8 @@ include gensim/models/doc2vec_inner.c
 include gensim/models/doc2vec_inner.pyx
 include gensim/models/fasttext_inner.c
 include gensim/models/fasttext_inner.pyx
+include gensim/models/_utils_any2vec.c
+include gensim/models/_utils_any2vec.pyx
 include gensim/corpora/_mmreader.c
 include gensim/corpora/_mmreader.pyx
 include gensim/_matutils.c

diff --git a/docs/src/apiref.rst b/docs/src/apiref.rst
@@ -51,6 +51,8 @@ Modules:
     models/coherencemodel
     models/basemodel
     models/callbacks
+    models/utils_any2vec
+    models/_utils_any2vec
     models/wrappers/ldamallet
     models/wrappers/dtmmodel
     models/wrappers/ldavowpalwabbit.rst

diff --git a/docs/src/models/_utils_any2vec.rst b/docs/src/models/_utils_any2vec.rst
@@ -0,0 +1,9 @@
+:mod:`models._utils_any2vec` -- Cython utils for any2vec models
+===============================================================
+
+.. automodule:: gensim.models._utils_any2vec
+    :synopsis: Cython utils for any2vec models
+    :members:
+    :inherited-members:
+    :undoc-members:
+    :show-inheritance:
diff --git a/docs/src/models/utils_any2vec.rst b/docs/src/models/utils_any2vec.rst
@@ -0,0 +1,9 @@
+:mod:`models.utils_any2vec` -- Utils for any2vec models
+=======================================================
+
+.. automodule:: gensim.models.utils_any2vec
+    :synopsis: Utils for any2vec models
+    :members:
+    :inherited-members:
+    :undoc-members:
+    :show-inheritance: