4.0.0beta
Pre-release 4.0.0beta, 2020-10-31
Main highlights
- Massively optimized popular algorithms the community has grown to love: fastText, word2vec, doc2vec, phrases:
a. Efficiency
| model | 3.8.3: wall time / peak RAM / throughput | 4.0.0: wall time / peak RAM / throughput |
|---|---|---|
| fastText | 2.9h / 4.11 GB / 822k words/s | 2.3h / 1.26 GB / 914k words/s |
| word2vec | 1.7h / 0.36 GB / 1685k words/s | 1.2h / 0.33 GB / 1762k words/s |

In other words, fastText now needs 3x less RAM (and is faster); word2vec has 2x faster init (and needs less RAM, and is faster); detecting collocation phrases is 2x faster. See the 4.0 benchmarks.
b. Robustness. We fixed a bunch of long-standing bugs by refactoring the internal code structure (see 🔴 Bug fixes below)
c. Simplified OOP model for easier model exports and integration with TensorFlow, PyTorch &co.
These improvements come to you transparently aka "for free", but see Migration guide for some changes that break the old Gensim 3.x API. Update your code accordingly.
- Dropped a bunch of externally contributed modules: summarization, pivoted TFIDF normalization, and wrappers for 3rd party libraries: Mallet, scikit-learn, DTM model, Vowpal Wabbit, wordrank, varembed.
  Why? Code quality was not up to our standards. Also, there was no one to maintain them, answer user questions, or support these modules and wrappers.
  So rather than let them rot, we took the hard decision of removing these contributed modules from Gensim. If anyone's interested in maintaining them, please fork them into your own repo; they can live happily outside of Gensim, linked to as "contributed" from Gensim docs.
- Dropped Python 2. Gensim 4.0 is Py3.6+. Read our Python version support policy.
  - If you still need Python 2 for some reason, stay at Gensim 3.8.3.
- A new Gensim website – finally! 🙃
So, a major clean-up release overall. We're happy with this tighter, leaner and faster Gensim.
This is the direction we'll keep going forward: less kitchen-sink of "latest academic fad", more focus on robust engineering, targeting common NLP & document similarity use-cases.
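The efficiency claims in the highlights can be sanity-checked with a bit of arithmetic. The sketch below only re-uses the numbers from the benchmark table above; the names `BENCHMARKS` and `improvement` are made up for this illustration and are not part of Gensim.

```python
# Figures copied from the benchmark table in these notes:
# (wall time in hours, peak RAM in GB, throughput in thousands of words/s).
BENCHMARKS = {
    "fastText": {"3.8.3": (2.9, 4.11, 822), "4.0.0": (2.3, 1.26, 914)},
    "word2vec": {"3.8.3": (1.7, 0.36, 1685), "4.0.0": (1.2, 0.33, 1762)},
}


def improvement(model):
    """Return (wall-time speedup, peak-RAM reduction, throughput gain) of 4.0.0 over 3.8.3."""
    old_time, old_ram, old_tput = BENCHMARKS[model]["3.8.3"]
    new_time, new_ram, new_tput = BENCHMARKS[model]["4.0.0"]
    return old_time / new_time, old_ram / new_ram, new_tput / old_tput


for model in BENCHMARKS:
    speedup, ram_ratio, tput_gain = improvement(model)
    print(f"{model}: {speedup:.2f}x shorter wall time, "
          f"{ram_ratio:.2f}x less peak RAM, {tput_gain:.2f}x throughput")
```

The fastText peak-RAM ratio comes out a little above 3, which matches the "3x less RAM" claim. The "2x faster init" for word2vec and the phrase-detection speedup are not in the table, so this sketch cannot check them.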
Why a pre-release?
This 4.0.0beta pre-release is for users who want the cutting-edge performance and bug fixes, and for users who want to help out by testing and providing feedback on code, documentation, workflows… Please let us know on the mailing list!
Install the pre-release with:
pip install --pre --upgrade gensim
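After installing, it can be worth confirming which version actually ended up in your environment. A minimal sketch, assuming Python 3.8+ where `importlib.metadata` is in the standard library (on 3.6 and 3.7 you would need the `importlib_metadata` backport instead); the helper name `installed_version` is ours, not Gensim's:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


gensim_version = installed_version("gensim")
if gensim_version is None:
    print("gensim is not installed in this environment")
else:
    # With the pre-release installed, this should be a 4.0.0 pre-release version string.
    print("gensim", gensim_version)
```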
What will change between this pre-release and a "full" 4.0 release?
Check progress here.
👍 Improvements
- #2947: Bump minimum Python version to 3.6, by @gojomo
- #2939 + #2984: Code style & py3 migration clean up, by @piskvorky
- #2300: Use less RAM in LdaMulticore, by @horpto
- #2698: Streamline KeyedVectors & X2Vec API, by @gojomo
- #2864: Speed up random number generation in word2vec, by @zygm0nt
- #2976: Speed up phrase (collocation) detection, by @piskvorky
- #2979: Allow skipping common English words in multi-word phrases, by @piskvorky
- #2867: Expose `max_final_vocab` parameter in fastText constructor, by @mpenkov
- #2931: Clear up job queue parameters in word2vec, by @lunastera
- #2939: X2Vec SaveLoad improvements, by @piskvorky
📚 Tutorials and docs
- #2954: New theme for the Gensim website, by @dvorakvaclav
- #2960: Added Gensim and Compatibility Wiki page, by @piskvorky
- #2960: Reworked & simplified the Developer Wiki page, by @piskvorky
- #2968: Migrate tutorials & how-tos to 4.0.0, by @piskvorky
- #2899: Clean up of language and formatting of docstrings, by @piskvorky
- #2899: Added documentation for NMSLIB indexer, by @piskvorky
- #2832: Clear up LdaModel documentation, by @FyzHsn
- #2871: Clarify that license is LGPL-2.1, by @pombredanne
- #2896: Make docs clearer on `alpha` parameter in LDA model, by @xh2
- #2897: Update Hoffman paper link for Online LDA, by @xh2
- #2910: Refresh docs for run_annoy tutorial, by @piskvorky
- #2935: Fix "generator" language in word2vec docs, by @polm
🔴 Bug fixes
- #2891: Fix fastText word-vectors with ngrams off, by @gojomo
- #2907: Fix doc2vec crash for large sets of doc-vectors, by @gojomo
- #2899: Fix similarity bug in NMSLIB indexer, by @piskvorky
- #2899: Fix deprecation warnings in Annoy integration, by @piskvorky
- #2901: Fix inheritance of WikiCorpus from TextCorpus, by @jenishah
- #2940: Fix deprecations in SoftCosineSimilarity, by @Witiko
- #2944: Fix `save_facebook_model` failure after update-vocab & other initialization streamlining, by @gojomo
- #2846: Fix for Python 3.9/3.10: remove `xml.etree.cElementTree`, by @hugovk
- #2973: phrases.export_phrases() doesn't yield all bigrams
- #2942: Segfault when training doc2vec
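For context on #2846: `xml.etree.cElementTree` was deprecated in Python 3.3 and removed in Python 3.9, so code importing it fails on newer interpreters. If your own code still uses it, the portable replacement is plain `xml.etree.ElementTree`. A minimal sketch (the XML sample is made up for illustration, and this is not the actual Gensim patch):

```python
# Portable import: works on every Python 3 version, including 3.9+,
# where the old `xml.etree.cElementTree` alias no longer exists.
import xml.etree.ElementTree as ET

# Made-up sample document, loosely shaped like a wiki page record.
SAMPLE = "<page><title>Gensim</title><text>topic modelling</text></page>"

root = ET.fromstring(SAMPLE)
title = root.find("title").text
text = root.find("text").text
print(title, "-", text)
```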
⚠️ Removed functionality & deprecations
- #6: No more binary wheels for x32 platforms, by @menshikh-iv
- #2899: Renamed overly broad `similarities.index` to the more appropriate `similarities.annoy`, by @piskvorky
- #2958: Remove gensim.summarization subpackage, docs and test data, by @mpenkov
- #2926: Rename `num_words` to `topn` in dtm_coherence, by @MeganStodel
- #2937: Remove Keras dependency, by @piskvorky
- Removed all code, methods, attributes and functions marked as deprecated in Gensim 3.8.3.
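If your code has to run against both Gensim 3.x and 4.x for a while, a module rename such as `similarities.index` to `similarities.annoy` can be bridged with an import fallback. The sketch below is one possible approach, not something Gensim ships; `import_first` is a hypothetical helper, and the commented usage assumes both Gensim and Annoy are installed.

```python
import importlib


def import_first(*module_names):
    """Import and return the first module in `module_names` that can be imported."""
    last_error = None
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError as err:
            last_error = err  # remember the failure and try the next candidate
    raise ImportError(f"none of {module_names} could be imported") from last_error


# Hypothetical usage: prefer the new 4.x location, fall back to the 3.x one.
# annoy_module = import_first("gensim.similarities.annoy", "gensim.similarities.index")
# AnnoyIndexer = annoy_module.AnnoyIndexer
```

Listing the new module path first means the fallback only kicks in on older installs, and it can be deleted once you drop Gensim 3.x support.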