Reduce memory use of the term similarity matrix constructor, deprecate the positive_definite parameter, and extend normalization capabilities of the inner_product method (#2783)

* Deprecate SparseTermSimilarityMatrix's positive_definite parameter
* Reference paper on efficient implementation of soft cosine similarity
* Add example with Annoy indexer to SparseTermSimilarityMatrix
* Add example of obtaining word embeddings from SparseTermSimilarityMatrix
* Reduce space complexity of SparseTermSimilarityMatrix construction
  Build matrix using arrays and bitfields rather than DOK sparse format
  This work is based on the following blog post by @maciejkula:
  https://maciejkula.github.io/2015/02/22/incremental-construction-of-sparse-matrices/
* Fix a typo in the soft cosine similarity Jupyter notebook
* Add human-readable string representation for TermSimilarityIndex
* Avoid sparse term similarity matrix computation when nonzero_limit <= 0
* Extend normalization in the inner_product method
  Support the `maintain` vector normalization scheme.
  Support separate vector normalization schemes for queries and documents.
* Remove a note in the docstring of SparseTermSimilarityMatrix
* Rerun continuous integration tests
* Use ==/!= to compare constant literals
* Add human-readable string representation for TermSimilarityIndex (cont.)
* Prod flake8 with a coding style violation in a docstring
* Collapse two lambdas into one internal function
* Revert "Prod flake8 with a coding style violation in a docstring"
  This reverts commit 6557b84.
* Avoid str.format()
* Slice SparseTermSimilarityMatrix.inner_product tests by input types
* Remove similarity_type_code local variable
* Remove starting underscore from local function name
* Save indentation level and define populate_buffers function
* Extract SparseTermSimilarityMatrix constructor body to _create_source
* Extract NON_NEGATIVE_NORM_ASSERTION_MESSAGE to a module-level constant
* Extract cell assignment logic to cell_full local function
* Split variable swapping into three separate statements
* Extract normalization from the body of SparseTermSimilarityMatrix.inner_product
* Wrap overlong line
* Add test_inner_product_zerovector_zerovector and test_inner_product_zerovector_vector tests
* Further split test_inner_product into 63 test cases
* Raise ValueError when dictionary is empty
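
The per-side normalization described in the message above (separate schemes for queries and documents, plus `maintain`) can be illustrated with a small pure-Python sketch. This is not gensim's implementation: the helper names `bilinear`, `normalize`, and `inner_product` are hypothetical, vectors and the term similarity matrix `S` are dense lists for simplicity, and the reading of `maintain` as "rescale so that the norm under `S` equals the vector's original Euclidean norm" is an assumption.

```python
from math import sqrt


def bilinear(x, S, y):
    """Return x^T S y for dense lists x, y and a dense square matrix S."""
    return sum(x[i] * S[i][j] * y[j] for i in range(len(x)) for j in range(len(y)))


def normalize(v, S, scheme):
    """Scale v according to scheme: False, True, or 'maintain' (assumed semantics)."""
    if scheme is False:
        return list(v)
    norm_sq = bilinear(v, S, v)  # squared norm of v under S
    if norm_sq < 0.0:
        raise ValueError("S yields a negative squared norm for this vector")
    if scheme == 'maintain' and norm_sq > 0.0:
        # Divide out the Euclidean squared norm so the rescaled vector
        # has the same length under S as v had in the standard basis.
        norm_sq /= sum(a * a for a in v)
    norm = sqrt(norm_sq)
    return [a / norm for a in v] if norm > 0.0 else list(v)


def inner_product(x, y, S, normalized=(False, False)):
    """Compute x^T S y with independent normalization schemes for x and y."""
    scheme_x, scheme_y = normalized
    return bilinear(normalize(x, S, scheme_x), S, normalize(y, S, scheme_y))


# Two terms with similarity 0.5 to each other.
S = [[1.0, 0.5], [0.5, 1.0]]
query, document = [2.0, 0.0], [0.0, 3.0]

raw = inner_product(query, document, S)                              # plain x^T S y
soft_cosine = inner_product(query, document, S, normalized=(True, True))
```

With `normalized=(True, True)` the result is the soft cosine similarity, which corresponds to the old `normalized=True` call that the `docsim.py` hunk replaces.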
piskvorky · Aug 7, 2020 · b308883
1 parent 4b7e372
commit b308883
Showing 4 changed files with 862 additions and 233 deletions.
diff --git a/docs/notebooks/soft_cosine_tutorial.ipynb b/docs/notebooks/soft_cosine_tutorial.ipynb
@@ -225,7 +225,7 @@
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "Number of documents: 3\n",
+      "Number of documents: 2274338\n",
       "CPU times: user 2min 1s, sys: 1.9 s, total: 2min 3s\n",
       "Wall time: 2min 56s\n"
      ]
@@ -259,7 +259,7 @@
     "        [preprocess(relcomment[\"RelCText\"]) for relcomment in thread[\"RelComments\"]])\n",
     "    for thread in api.load(\"semeval-2016-2017-task3-subtaskA-unannotated\")]))\n",
     "\n",
-    "print(\"Number of documents: %d\" % len(documents))"
+    "print(\"Number of documents: %d\" % len(corpus))"
    ]
   },
   {

diff --git a/gensim/similarities/docsim.py b/gensim/similarities/docsim.py
@@ -978,7 +978,7 @@ def get_similarities(self, query):
         is_corpus, query = utils.is_corpus(query)
         if not is_corpus and isinstance(query, numpy.ndarray):
             query = [self.corpus[i] for i in query]  # convert document indexes to actual documents
-        result = self.similarity_matrix.inner_product(query, self.corpus, normalized=True)
+        result = self.similarity_matrix.inner_product(query, self.corpus, normalized=(True, True))
 
         if scipy.sparse.issparse(result):
             return numpy.asarray(result.todense())