3.1.0, 2017-11-06
⭐ New features:
- Massive optimizations to LSI model training (@isamaru, #1620 & #1622)
  - The LSI model now allows single precision (float32), consuming 40% less memory while being 40% faster.
  - The LSI model can also accept a CSC matrix as input, for a further memory and speed boost.
  - Overall, if your entire corpus fits in RAM: 3x faster LSI training (SVD) in 4x less memory!

  ```python
  import numpy as np
  import gensim
  from gensim.models import LsiModel

  # just an example; the corpus stream is up to you
  streaming_corpus = gensim.corpora.MmCorpus("my_tfidf_corpus.mm.gz")

  # convert your corpus to a CSC sparse matrix (assumes the entire corpus fits in RAM)
  in_memory_csc_matrix = gensim.matutils.corpus2csc(streaming_corpus, dtype=np.float32)

  # then pass the CSC matrix to LsiModel directly
  model = LsiModel(corpus=in_memory_csc_matrix, num_topics=500, dtype=np.float32)
  ```
  - Even if you continue to use streaming corpora (your training dataset is too large for RAM), you should see significantly faster processing times and a lower memory footprint. In our experiments with a very large LSI model, we saw a drop from 29 GB peak RAM and 38 minutes (before) to 19 GB peak RAM and 26 minutes (now):

  ```python
  model = LsiModel(corpus=streaming_corpus, num_topics=500, dtype=np.float32)
  ```
- Add common terms to Phrases. Fix #1258 (@alexgarel, #1568)
  Phrases now supports "common terms" (such as stop words) inside bigrams. Previously, if you wanted to reveal ngrams like `car_with_driver` and `car_without_driver`, you could either remove stop words before processing, in which case you would only find `car_driver`, or you would find neither form (they consist of three words, and the high frequency of "with" also prevents them from being scored correctly). The new behavior is inspired by Elasticsearch's common grams token filter.

  ```python
  from nltk.corpus import stopwords
  from gensim.models.phrases import Phrases

  # `corpus` is your tokenized training corpus (an iterable of token lists)
  phr_old = Phrases(corpus)
  phr_new = Phrases(corpus, common_terms=stopwords.words('english'))  # NLTK expects 'english', not 'en'

  print(phr_old[["we", "provide", "car", "with", "driver"]])  # ["we", "provide", "car_with", "driver"]
  print(phr_new[["we", "provide", "car", "with", "driver"]])  # ["we", "provide", "car_with_driver"]
  ```
- New segment_wiki.py script (@menshikh-iv, #1483 & #1694)
  A CLI script for processing a raw Wikipedia dump (the xml.bz2 format provided by MediaWiki) to extract its articles in plain text. It extracts each article's title, section names and section contents, and saves them in JSON Lines format:

  ```sh
  python -m gensim.scripts.segment_wiki -f enwiki-latest-pages-articles.xml.bz2 | gzip > enwiki-latest-pages-articles.json.gz
  ```

  Processing the entire English Wikipedia dump (13.5 GB) takes about 2.5 hours (i7-6700HQ, SSD).

  The output format is one article per line, serialized into JSON:

  ```python
  import json
  from smart_open import smart_open

  # read the file we just created
  for line in smart_open('enwiki-latest-pages-articles.json.gz'):
      article = json.loads(line)
      print("Article title: %s" % article['title'])
      for section_title, section_text in zip(article['section_titles'], article['section_texts']):
          print("Section title: %s" % section_title)
          print("Section text: %s" % section_text)
  ```
👍 Improvements:
- Speedup FastText tests (@horpto, #1686)
- Add optimization for `SlicedCorpus.__len__` (@horpto, #1679)
- Make `word_vec` return an immutable vector. Fix #1651 (@CLearERR, #1662)
- Drop Win x32 support & add rolling builds (@menshikh-iv, #1652)
- Fix scoring function in Phrases. Fix #1533, #1635 (@michaelwsherman, #1573)
- Add configuration for flake8 to setup.cfg (@mcobzarenco, #1636)
- Add `build_vocab_from_freq` to Word2Vec, speedup `scan_vocab` (@jodevak, #1599)
- Add `most_similar_to_given` method for KeyedVectors (@TheMathMajor, #1582)
- Add `__getitem__` method to Sparse2Corpus to allow direct queries (@isamaru, #1621); usage of these three additions is sketched after this list
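As promised above, a minimal sketch of the three new APIs. The toy word frequencies, query words, and identity matrix are assumptions made up for this example, not part of the release notes; with untrained vectors, the `most_similar_to_given` answer is arbitrary and only the call shape matters.

```python
import scipy.sparse
from gensim.models import Word2Vec
from gensim.matutils import Sparse2Corpus

# build_vocab_from_freq: initialize the vocabulary straight from a
# {word: frequency} dict instead of scanning a corpus
model = Word2Vec(size=50, min_count=1, seed=1)
model.build_vocab_from_freq({"car": 10, "driver": 8, "road": 5, "we": 20})

# most_similar_to_given: of the listed candidates, return the one most
# similar to the query word (vectors are untrained here, so the result
# is arbitrary; this only demonstrates the call)
print(model.wv.most_similar_to_given("car", ["driver", "road"]))

# Sparse2Corpus.__getitem__: fetch a single document directly by index
corpus = Sparse2Corpus(scipy.sparse.identity(3, format='csc'))
print(corpus[0])  # the first document as (feature_id, value) pairs, e.g. [(0, 1.0)]
```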
🔴 Bug fixes:
- Add single core mode to CoherenceModel (see the sketch after this list). Fix #1683 (@horpto, #1685)
- Fix ResourceWarnings in tests. Partially fix #1519 (@horpto, #1660)
- Fix DeprecationWarnings generated by deprecated assertEquals. Partially fix #1519 (@poornagurram, #1658)
- Fix DeprecationWarnings for regex string literals. Fix #1646 (@franklsf95, #1649)
- Fix pagerank algorithm. Fix #805 (@xelez, #1653)
- Fix FastText inconsistent dtype. Fix #1637 (@mcobzarenco, #1638)
- Fix `test_filename_filtering` test (@nehaljwani, #1647)
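A quick sketch of the single-core CoherenceModel mode mentioned above, assuming it is selected via the `processes` parameter; the toy texts and LDA settings are illustrative assumptions, not from the release notes.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

# toy data, purely illustrative
texts = [["car", "driver", "road"], ["car", "road"], ["driver", "road", "we"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=2)

# processes=1 keeps the coherence estimation on a single core
# (assumption: this is the knob the fix exposes)
cm = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                    coherence='c_v', processes=1)
print(cm.get_coherence())
```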
📚 Tutorial and doc improvements:
- Fix code/docstring style (@menshikh-iv, #1650)
- Update error message for supervised FastText. Fix #1498 (@ElSaico, #1645)
- Add "DOI badge" to README. Fix #1610 (@dphov, #1639)
- Remove duplicate annoy notebook. Fix #1415 (@Karamax, #1640)
- Fix duplication and wrong markup in docs (@horpto, #1633)
- Refactor dendrogram & topic network notebooks (@parulsethi, #1571)
- Fix release badge (@menshikh-iv, #1631)
⚠️ Deprecations:

- Remove:
  - `gensim.examples`
  - `gensim.nosy`
  - `gensim.scripts.word2vec_standalone`
  - `gensim.scripts.make_wiki_lemma`
  - `gensim.scripts.make_wiki_online`
  - `gensim.scripts.make_wiki_online_lemma`
  - `gensim.scripts.make_wiki_online_nodebug`
  - `gensim.scripts.make_wiki`
- Move:
  - `gensim.scripts.make_wikicorpus` ➡ `gensim.scripts.make_wiki.py`
  - `gensim.summarization` ➡ `gensim.models.summarization`
  - `gensim.topic_coherence` ➡ `gensim.models._coherence`
  - `gensim.utils` ➡ `gensim.utils.utils` (old imports will continue to work)
  - `gensim.parsing.*` ➡ `gensim.utils.text_utils`

Also, we'll create an `experimental` subpackage for unstable models. Specific lists will be available in the next major release.