Fix method estimate_memory from gensim.models.FastText & huge performance improvement. Fix #1824 (#1916)

* Cythonize fasttext.ft_hash for 100x performance improvement

* Cythonize fasttext.compute_ngrams for 2x performance improvement

* Reduce fasttext memory usage by computing ngrams on the fly

* Fix compute_ngrams for Python 2

* Store OOV vec in variable for more informative assertion error in testPersistenceForOldVersions

* Revert all changes to fasttext_wrapper

* Fix indentation for multi-line expressions

* Rename utils_any2vec_fast to _utils_any2vec

* fasttext: Cache ngram buckets for words during training

This removes the expensive calls to `compute_ngrams` and `ft_hash`
during training and uses a simple lookup in an int -> int[] mapping
instead, resulting in a dramatic increase in training performance.

* Remove last occurrences of wv.ngrams_word and wv.ngrams

* fasttext: use buckets_word cache also for non-Cython training

* fasttext: Add buckets_ngram size to memory estimate

* fasttext: Don't store buckets_word with the model

* fasttext: Use smaller model for test_estimate_memory

* fasttext: Fix pure python training code

* fasttext: Fix asserts for test_estimate_memory

* fasttext: Fix typo and style errors

* fasttext: Simplify code as per @jayantj's review

* Update MANIFEST.in and documentation with utils_any2vec implementations

* last fixes (add option for cython compiler, fix descriptions, etc)
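
The bucket caching described above ("a simple lookup in an int -> int[] mapping") can be illustrated with a minimal pure-Python sketch. This is not the code from this commit: the names `sketch_hash`, `sketch_ngrams` and `build_bucket_cache` are hypothetical, the hash is a plain 32-bit FNV-1a over UTF-8 bytes chosen for illustration, and the `<`/`>` word-boundary markers are an assumption based on the usual fastText convention.

```python
from typing import Dict, List

FNV_OFFSET = 2166136261  # 32-bit FNV-1a offset basis
FNV_PRIME = 16777619     # 32-bit FNV prime


def sketch_hash(ngram: str) -> int:
    """Hypothetical stand-in for an n-gram hash: 32-bit FNV-1a over UTF-8 bytes."""
    h = FNV_OFFSET
    for byte in ngram.encode("utf-8"):
        h = ((h ^ byte) * FNV_PRIME) & 0xFFFFFFFF  # keep the value in 32 bits
    return h


def sketch_ngrams(word: str, min_n: int, max_n: int) -> List[str]:
    """Return all character n-grams of the word wrapped in boundary markers."""
    extended = "<" + word + ">"  # assumed boundary markers
    ngrams = []
    for n in range(min_n, min(len(extended), max_n) + 1):
        for start in range(len(extended) - n + 1):
            ngrams.append(extended[start:start + n])
    return ngrams


def build_bucket_cache(vocab: List[str], min_n: int, max_n: int,
                       num_buckets: int) -> Dict[int, List[int]]:
    """Precompute, once, the bucket ids of every word: word index -> bucket ids."""
    return {
        index: [sketch_hash(g) % num_buckets for g in sketch_ngrams(word, min_n, max_n)]
        for index, word in enumerate(vocab)
    }


# Built once before training; the training loop then only performs lookups.
cache = build_bucket_cache(["night", "ab"], min_n=2, max_n=3, num_buckets=1000)
buckets_for_second_word = cache[1]
```

With such a cache, the inner training loop no longer has to recompute n-grams or hashes per word occurrence; it reads the precomputed bucket ids by word index. The trade-off is extra memory proportional to the total number of n-grams in the vocabulary, which is presumably why the commit adds this size to the memory estimate and does not store the cache with the model.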
jbaiter authored and menshikh-iv committed Mar 1, 2018
1 parent b000b4f commit 9021ea8
Showing 14 changed files with 4,829 additions and 1,362 deletions.
2 changes: 2 additions & 0 deletions MANIFEST.in
@@ -12,6 +12,8 @@ include gensim/models/doc2vec_inner.c
include gensim/models/doc2vec_inner.pyx
include gensim/models/fasttext_inner.c
include gensim/models/fasttext_inner.pyx
include gensim/models/_utils_any2vec.c
include gensim/models/_utils_any2vec.pyx
include gensim/corpora/_mmreader.c
include gensim/corpora/_mmreader.pyx
include gensim/_matutils.c
2 changes: 2 additions & 0 deletions docs/src/apiref.rst
@@ -51,6 +51,8 @@ Modules:
models/coherencemodel
models/basemodel
models/callbacks
models/utils_any2vec
models/_utils_any2vec
models/wrappers/ldamallet
models/wrappers/dtmmodel
models/wrappers/ldavowpalwabbit.rst
9 changes: 9 additions & 0 deletions docs/src/models/_utils_any2vec.rst
@@ -0,0 +1,9 @@
:mod:`models._utils_any2vec` -- Cython utils for any2vec models
===============================================================

.. automodule:: gensim.models._utils_any2vec
    :synopsis: Cython utils for any2vec models
    :members:
    :inherited-members:
    :undoc-members:
    :show-inheritance:
9 changes: 9 additions & 0 deletions docs/src/models/utils_any2vec.rst
@@ -0,0 +1,9 @@
:mod:`models.utils_any2vec` -- Utils for any2vec models
=======================================================

.. automodule:: gensim.models.utils_any2vec
    :synopsis: Utils for any2vec models
    :members:
    :inherited-members:
    :undoc-members:
    :show-inheritance: