
3.1.0

@menshikh-iv menshikh-iv released this 06 Nov 16:56
· 1119 commits to develop since this release

3.1.0, 2017-11-06

🌟 New features:

  • Massive optimizations to LSI model training (@isamaru, #1620 & #1622)

    • The LSI model now supports single precision (float32), which consumes 40% less memory while being 40% faster.

    • The LSI model can now also accept a CSC matrix as input, for a further memory and speed boost.

    • Overall, if your entire corpus fits in RAM: 3x faster LSI training (SVD) in 4x less memory!

      import numpy as np
      import gensim
      from gensim.models import LsiModel

      # just an example; the corpus stream is up to you
      streaming_corpus = gensim.corpora.MmCorpus("my_tfidf_corpus.mm.gz")

      # convert your corpus to a CSC sparse matrix (assumes the entire corpus fits in RAM)
      in_memory_csc_matrix = gensim.matutils.corpus2csc(streaming_corpus, dtype=np.float32)

      # then pass the CSC to LsiModel directly
      model = LsiModel(corpus=in_memory_csc_matrix, num_topics=500, dtype=np.float32)
    • Even if you continue to use streaming corpora (your training dataset is too large for RAM), you should see significantly faster processing times and a lower memory footprint. In our experiments with a very large LSI model, we saw a drop from 29 GB peak RAM and 38 minutes (before) to 19 GB peak RAM and 26 minutes (now):

      model = LsiModel(corpus=streaming_corpus, num_topics=500, dtype=np.float32)
  • Add common terms to Phrases. Fix #1258 (@alexgarel, #1568)

    • Phrases can now take common terms into account when building bigrams. Previously, if you wanted to detect n-grams such as car_with_driver and car_without_driver, you had two options: remove stop words before processing, in which case you would only find car_driver, or keep them and find none of these forms (because they contain three words, and because the high frequency of "with" prevents them from being scored correctly). The feature is inspired by the Elasticsearch (ES) common grams token filter.

      from gensim.models import Phrases
      from nltk.corpus import stopwords  # assumes NLTK and its stopwords corpus are installed

      phr_old = Phrases(corpus)
      phr_new = Phrases(corpus, common_terms=stopwords.words('english'))

      print(phr_old[["we", "provide", "car", "with", "driver"]])  # ["we", "provide", "car_with", "driver"]
      print(phr_new[["we", "provide", "car", "with", "driver"]])  # ["we", "provide", "car_with_driver"]
  • New segment_wiki.py script (@menshikh-iv, #1483 & #1694)

    • CLI script for processing a raw Wikipedia dump (the xml.bz2 format provided by MediaWiki) and extracting its articles as plain text. It extracts each article's title, section names and section content, and writes them out in JSON Lines format (one JSON object per line):

      python -m gensim.scripts.segment_wiki -f enwiki-latest-pages-articles.xml.bz2 | gzip > enwiki-latest-pages-articles.json.gz

      Processing the entire English Wikipedia dump (13.5 GB) takes about 2.5 hours (i7-6700HQ, SSD).

      The output format is one article per line, serialized into JSON:

      import json
      from smart_open import smart_open

      for line in smart_open('enwiki-latest-pages-articles.json.gz'):  # read the file we just created
          article = json.loads(line)
          print("Article title: %s" % article['title'])
          for section_title, section_text in zip(article['section_titles'], article['section_texts']):
              print("Section title: %s" % section_title)
              print("Section text: %s" % section_text)
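
Because each output line is an independent JSON object, the file can also be processed with the standard library alone. The sketch below writes a tiny stand-in file with the same three fields (title, section_titles, section_texts) and counts the words per article. The sample articles, the file name and the helper names iter_articles and words_per_article are made up for illustration and are not part of gensim.

```python
import gzip
import json
import os
import tempfile


def iter_articles(path):
    """Yield one article dict per non-empty line of a gzipped JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as lines:
        for line in lines:
            if line.strip():
                yield json.loads(line)


def words_per_article(path):
    """Map each article title to the total word count over all its sections."""
    return {
        article["title"]: sum(len(text.split()) for text in article["section_texts"])
        for article in iter_articles(path)
    }


# Hypothetical sample data laid out like the segment_wiki output.
sample_articles = [
    {"title": "Anarchism",
     "section_titles": ["Introduction", "History"],
     "section_texts": ["Anarchism is a political philosophy.", "It has a long history."]},
    {"title": "Autism",
     "section_titles": ["Introduction"],
     "section_texts": ["Autism is a developmental condition."]},
]

# Write the sample as gzipped JSON Lines: one article per line.
sample_path = os.path.join(tempfile.mkdtemp(), "sample-articles.json.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as out:
    for article in sample_articles:
        out.write(json.dumps(article) + "\n")

word_counts = words_per_article(sample_path)
print(word_counts)
```

Reading line by line keeps memory use flat regardless of file size, which matters for a dump as large as the full English Wikipedia.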

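To make the common_terms idea for Phrases from the feature list above more concrete: a candidate phrase may consist of two content words plus any run of common terms between them. The sketch below only generates such candidates. It does not reproduce gensim's scoring or its actual implementation, and the function name candidate_phrases and the tiny common-term set are assumptions made for illustration.

```python
def candidate_phrases(tokens, common_terms=frozenset()):
    """Return candidate phrases: two content words joined by optional common terms.

    Illustrative only; real phrase detection additionally scores each
    candidate by co-occurrence statistics before accepting it.
    """
    candidates = []
    previous_word = None  # last content word seen so far
    bridge = []           # common terms seen since that content word
    for token in tokens:
        if token in common_terms:
            # Common terms never start a phrase; they can only bridge two words.
            if previous_word is not None:
                bridge.append(token)
            continue
        if previous_word is not None:
            candidates.append("_".join([previous_word] + bridge + [token]))
        previous_word = token
        bridge = []
    return candidates


sentence = ["we", "provide", "car", "with", "driver"]
without_common = candidate_phrases(sentence)
with_common = candidate_phrases(sentence, common_terms={"with", "without"})
print(without_common)
print(with_common)
```

Without common terms, "with" is treated like any other word, so the three-word form never appears among the candidates; with them, car_with_driver becomes a single candidate that can then be scored.
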
👍 Improvements:

🔴 Bug fixes:

📚 Tutorial and doc improvements:

⚠️ Deprecation part (will come into force in the next major release)

  • Remove

    • gensim.examples
    • gensim.nosy
    • gensim.scripts.word2vec_standalone
    • gensim.scripts.make_wiki_lemma
    • gensim.scripts.make_wiki_online
    • gensim.scripts.make_wiki_online_lemma
    • gensim.scripts.make_wiki_online_nodebug
    • gensim.scripts.make_wiki
  • Move

    • gensim.scripts.make_wikicorpus ➡ gensim.scripts.make_wiki.py
    • gensim.summarization ➡ gensim.models.summarization
    • gensim.topic_coherence ➡ gensim.models._coherence
    • gensim.utils ➡ gensim.utils.utils (old imports will continue to work)
    • gensim.parsing.* ➡ gensim.utils.text_utils

Also, we'll create an experimental subpackage for unstable models. Specific lists will be published in the next major release.