Reduce memory use of the term similarity matrix constructor, deprecate the positive_definite parameter, and extend normalization capabilities of the inner_product method (#2783)

* Deprecate SparseTermSimilarityMatrix's positive_definite parameter

* Reference paper on efficient implementation of soft cosine similarity

* Add example with Annoy indexer to SparseTermSimilarityMatrix

* Add example of obtaining word embeddings from SparseTermSimilarityMatrix

* Reduce space complexity of SparseTermSimilarityMatrix construction
Build matrix using arrays and bitfields rather than DOK sparse format

This work is based on the following blog post by @maciejkula:
https://maciejkula.github.io/2015/02/22/incremental-construction-of-sparse-matrices/

* Fix a typo in the soft cosine similarity Jupyter notebook

* Add human-readable string representation for TermSimilarityIndex

* Avoid sparse term similarity matrix computation when nonzero_limit <= 0

* Extend normalization in the inner_product method

Support the `maintain` vector normalization scheme.
Support separate vector normalization schemes for queries and documents.

* Remove a note in the docstring of SparseTermSimilarityMatrix

* Rerun continuous integration tests

* Use ==/!= to compare constant literals

* Add human-readable string representation for TermSimilarityIndex (cont.)

* Prod flake8 with a coding style violation in a docstring

* Collapse two lambdas into one internal function

* Revert "Prod flake8 with a coding style violation in a docstring"

This reverts commit 6557b84.

* Avoid str.format()

* Slice SparseTermSimilarityMatrix.inner_product tests by input types

* Remove similarity_type_code local variable

* Remove starting underscore from local function name

* Save indentation level and define populate_buffers function

* Extract SparseTermSimilarityMatrix constructor body to _create_source

* Extract NON_NEGATIVE_NORM_ASSERTION_MESSAGE to a module-level constant

* Extract cell assignment logic to cell_full local function

* Split variable swapping into three separate statements

* Extract normalization from the body of SparseTermSimilarityMatrix.inner_product

* Wrap overlong line

* Add test_inner_product_zerovector_zerovector and test_inner_product_zerovector_vector tests

* Further split test_inner_product into 63 test cases

* Raise ValueError when dictionary is empty
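
The "Reduce space complexity" item above replaces DOK-style construction with plain arrays. A minimal sketch of that general idea, assuming coordinate (row, column, value) triples appended to typed buffers; the class name `IncrementalCOOBuilder` and its methods are hypothetical and are not taken from the gensim source, and the bitfield part of the change is not shown:

```python
from array import array


class IncrementalCOOBuilder:
    """Hypothetical helper: accumulates (row, col, value) triples in typed buffers."""

    def __init__(self):
        self.rows = array("i")  # row indices stored as C ints
        self.cols = array("i")  # column indices stored as C ints
        self.data = array("f")  # cell values stored as C floats

    def append(self, row, col, value):
        # Appending to a typed array costs a few bytes per entry,
        # whereas a dict keyed by (row, col) stores full Python objects.
        self.rows.append(row)
        self.cols.append(col)
        self.data.append(value)

    def __len__(self):
        return len(self.data)

    def nbytes(self):
        # Payload size of the three buffers, excluding object overhead.
        return sum(buf.itemsize * len(buf) for buf in (self.rows, self.cols, self.data))


builder = IncrementalCOOBuilder()
for i in range(3):
    builder.append(i, i, 1.0)  # diagonal: every term is fully similar to itself
builder.append(0, 1, 0.5)      # an off-diagonal similarity ...
builder.append(1, 0, 0.5)      # ... and its symmetric counterpart
```

Once filled, buffers like these can be handed to a sparse matrix constructor in coordinate format (for example `scipy.sparse.coo_matrix` via `numpy.frombuffer`), which is the approach the linked blog post describes.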
Witiko authored Aug 7, 2020
1 parent 4b7e372 commit b308883
Showing 4 changed files with 862 additions and 233 deletions.
4 changes: 2 additions & 2 deletions docs/notebooks/soft_cosine_tutorial.ipynb
@@ -225,7 +225,7 @@
 "name": "stdout",
 "output_type": "stream",
 "text": [
-"Number of documents: 3\n",
+"Number of documents: 2274338\n",
 "CPU times: user 2min 1s, sys: 1.9 s, total: 2min 3s\n",
 "Wall time: 2min 56s\n"
 ]
@@ -259,7 +259,7 @@
 " [preprocess(relcomment[\"RelCText\"]) for relcomment in thread[\"RelComments\"]])\n",
 " for thread in api.load(\"semeval-2016-2017-task3-subtaskA-unannotated\")]))\n",
 "\n",
-"print(\"Number of documents: %d\" % len(documents))"
+"print(\"Number of documents: %d\" % len(corpus))"
 ]
 },
 {
2 changes: 1 addition & 1 deletion gensim/similarities/docsim.py
@@ -978,7 +978,7 @@ def get_similarities(self, query):
         is_corpus, query = utils.is_corpus(query)
         if not is_corpus and isinstance(query, numpy.ndarray):
             query = [self.corpus[i] for i in query]  # convert document indexes to actual documents
-        result = self.similarity_matrix.inner_product(query, self.corpus, normalized=True)
+        result = self.similarity_matrix.inner_product(query, self.corpus, normalized=(True, True))
 
         if scipy.sparse.issparse(result):
             return numpy.asarray(result.todense())
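
The change from `normalized=True` to `normalized=(True, True)` reflects the commit message's "separate vector normalization schemes for queries and documents". A minimal sketch of what a per-side normalization flag can mean for a term-similarity inner product, using plain Python lists; the functions `bilinear`, `normalize`, and `soft_inner_product` are hypothetical illustrations rather than gensim's implementation, and the `maintain` scheme is deliberately left out:

```python
import math


def bilinear(x, matrix, y):
    """Return x^T * matrix * y for plain Python lists."""
    return sum(
        x[i] * matrix[i][j] * y[j]
        for i in range(len(x))
        for j in range(len(y))
    )


def normalize(vector, matrix, enabled):
    """Scale vector to unit length under the matrix-induced norm, if enabled."""
    if not enabled:
        return list(vector)
    norm = math.sqrt(bilinear(vector, matrix, vector))
    if norm == 0:
        return list(vector)  # a zero vector cannot be rescaled
    return [value / norm for value in vector]


def soft_inner_product(x, y, matrix, normalized=(False, False)):
    """Inner product with independent normalization flags for x and y."""
    normalize_x, normalize_y = normalized
    x = normalize(x, matrix, normalize_x)
    y = normalize(y, matrix, normalize_y)
    return bilinear(x, matrix, y)


# Two terms that are 0.5-similar to each other (assumed toy values).
related = [[1.0, 0.5], [0.5, 1.0]]
query = [1.0, 0.0]
document = [0.0, 2.0]

raw = soft_inner_product(query, document, related)  # 1.0
soft_cosine = soft_inner_product(query, document, related, normalized=(True, True))  # 0.5
```

With both flags set, the result is the soft cosine measure; setting only one flag leaves the other vector's magnitude in the score, which is the kind of asymmetry a pair of flags makes expressible.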