
Implement Okapi BM25 variants in Gensim #3304

Merged
9 commits merged into piskvorky:develop on Sep 8, 2022

Conversation

Witiko
Contributor

@Witiko Witiko commented Mar 4, 2022

This pull request implements the gensim.models.bm25model module, which contains an implementation of the Okapi BM25 model and its modifications (Lucene BM25 and ATIRE BM25) as discussed in #2592 (comment). The module acts as a replacement for the gensim.summarization.bm25model module, which was deprecated and removed in Gensim 4. The module should supersede the gensim.models.tfidfmodel module as the baseline weighting function for information retrieval and related NLP tasks.

Most implementations of BM25, such as the rank-bm25 library, combine indexing with weighting and often forgo dictionary building, gaining speed at indexing time at the cost of a hefty penalty at retrieval time. To give an example, here is how a user would search for documents with rank-bm25:

>>> from rank_bm25 import BM25Okapi
>>>
>>> corpus = [["Hello", "world"], ["bar", "bar"], ["foo", "bar"]]
>>> bm25_model = BM25Okapi(corpus)
>>>
>>> query = ["Hello", "bar"]
>>> similarities = bm25_model.get_scores(query)
>>> similarities

array([0.51082562, 0.09121886, 0.0638532 ])

>>> best_document, = bm25_model.get_top_n(query, corpus, n=1)
>>> best_document

['Hello', 'world']

As you can see, the interface is convenient, but retrieval is slow due to the lack of a dictionary. Furthermore, any advanced operations such as pruning the dictionary, applying semantic matching (e.g. SCM) and query expansion (e.g. RM3), or sharding the index are unavailable.

By contrast, the gensim.models.bm25model module separates the three operations of dictionary building, weighting, and indexing. To give an example, here is how a user would search for documents with the gensim.models.bm25model module:

>>> from gensim.corpora import Dictionary
>>> from gensim.models import TfidfModel, OkapiBM25Model
>>> from gensim.similarities import SparseMatrixSimilarity
>>> import numpy as np
>>>
>>> corpus = [["Hello", "world"], ["bar", "bar"], ["foo", "bar"]]
>>> dictionary = Dictionary(corpus)
>>> bm25_model = OkapiBM25Model(dictionary=dictionary)
>>> bm25_corpus = bm25_model[list(map(dictionary.doc2bow, corpus))]
>>> bm25_index = SparseMatrixSimilarity(bm25_corpus, num_docs=len(corpus), num_terms=len(dictionary),
...                                     normalize_queries=False, normalize_documents=False)
>>>
>>> query = ["Hello", "bar"]
>>> tfidf_model = TfidfModel(dictionary=dictionary, smartirs='bnn')  # Enforce binary weighting of queries
>>> tfidf_query = tfidf_model[dictionary.doc2bow(query)]
>>>
>>> similarities = bm25_index[tfidf_query]
>>> similarities

array([0.51082563, 0.09121886, 0.0638532 ], dtype=float32)

>>> best_document = corpus[np.argmax(similarities)]
>>> best_document

['Hello', 'world']

Tasks:

@piskvorky
Owner

piskvorky commented Mar 4, 2022

Pretty nice! I'll look into this after the 4.2 release.

@Witiko Witiko force-pushed the feature/bm25 branch 2 times, most recently from 63804ce to 34d4281 on March 4, 2022 21:16
@codecov

codecov bot commented Mar 4, 2022

Codecov Report

Merging #3304 (f43806d) into develop (ac3bbcd) will decrease coverage by 1.77%.
The diff coverage is 95.74%.

❗ Current head f43806d differs from pull request most recent head b4843cc. Consider uploading reports for the commit b4843cc to get more accurate results

@@             Coverage Diff             @@
##           develop    #3304      +/-   ##
===========================================
- Coverage    81.43%   79.66%   -1.78%     
===========================================
  Files          122       69      -53     
  Lines        21052    11875    -9177     
===========================================
- Hits         17144     9460    -7684     
+ Misses        3908     2415    -1493     
Impacted Files Coverage Δ
gensim/models/bm25model.py 95.74% <95.74%> (ø)
gensim/scripts/glove2word2vec.py 76.19% <0.00%> (-7.15%) ⬇️
gensim/corpora/wikicorpus.py 93.75% <0.00%> (-1.04%) ⬇️
gensim/matutils.py 77.23% <0.00%> (-0.90%) ⬇️
gensim/similarities/docsim.py 23.95% <0.00%> (-0.76%) ⬇️
gensim/models/rpmodel.py 89.47% <0.00%> (-0.53%) ⬇️
gensim/models/ldamulticore.py 90.58% <0.00%> (-0.33%) ⬇️
gensim/utils.py 71.86% <0.00%> (-0.12%) ⬇️
gensim/corpora/dictionary.py 94.17% <0.00%> (-0.09%) ⬇️
gensim/models/hdpmodel.py 71.27% <0.00%> (-0.08%) ⬇️
... and 91 more


@Witiko
Contributor Author

Witiko commented Mar 5, 2022

@piskvorky Thank you. No need to look just yet: I have run some benchmarks, and the code seems to have issues with both speed and correctness. I will let you know when the PR is ready for review; it is just a draft for the moment.

@Witiko Witiko force-pushed the feature/bm25 branch 8 times, most recently from 4b50675 to 9ab6f52 on March 7, 2022 12:34
@Witiko
Contributor Author

Witiko commented Mar 7, 2022

I have experimentally confirmed the compatibility of 9ab6f52 (Okapi BM25) with the rank-bm25 library.

I have also outlined, on the Gensim mailing list, some issues with the default behavior of the DenseMatrixSimilarity and SparseMatrixSimilarity indexes, which are likely to bite even experienced users and decrease the accuracy of their BM25 results.

@piskvorky
Owner

piskvorky commented Mar 7, 2022

The *MatrixSimilarity stuff is the oldest part of Gensim, along with the LsiModel. It dates back to DML-CZ days, in ancient pre-history :) (definitely pre-github)

To me it makes perfect sense to control the index & query normalization via a parameter. Are you able to add such an option, @Witiko? We have to keep the defaults 100% backward compatible though.

@Witiko
Contributor Author

Witiko commented Mar 7, 2022

Not a problem, I added it to my task list. The SoftCosineSimilarity constructor uses a single parameter, normalized, that takes a two-tuple of booleans, one for queries and one for documents; let's deprecate that and have normalize_queries and normalize_documents across the SimilarityABC subclasses, which seems more readable.

@piskvorky
Owner

OK.

@Witiko Witiko force-pushed the feature/bm25 branch 3 times, most recently from f43806d to 9ab6f52 on March 16, 2022 00:53
@Witiko
Contributor Author

Witiko commented Mar 16, 2022

In f43806d, I half-implemented BM25L, but I realized that it is difficult to implement fully as a sparse vector space model. That is because in BM25L, document vectors have the weight δ (typically δ = 0.5) for all terms that did not occur in the document, which eliminates sparsity. This could be implemented efficiently if scipy.sparse supported a flag that made zero elements represent not zero but a different constant, which I doubt it does. Alternatively, we could have a special-purpose index just for BM25L, but that seems to defeat the purpose of implementing things in Gensim, which is interoperability with other vector space concepts and models. Therefore, I plan to abandon BM25L and focus on BM25+ next.
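To illustrate the sparsity problem, here is a toy sketch with made-up weights (not the actual BM25L formula): assigning δ to every absent term turns a mostly-zero weight matrix fully dense.

```python
import numpy as np
from scipy.sparse import csr_matrix

delta = 0.5  # the BM25L weight for terms absent from a document

# Made-up term weights for a toy two-document, three-term corpus;
# as usual for bag-of-words models, most entries are zero.
sparse_weights = csr_matrix(np.array([
    [1.2, 0.0, 0.0],
    [0.0, 0.8, 0.0],
]))
density_before = sparse_weights.nnz / np.prod(sparse_weights.shape)

# BM25L gives every absent term the weight delta, so the zero entries
# become delta and the matrix turns fully dense.
dense_weights = sparse_weights.toarray()
dense_weights[dense_weights == 0.0] = delta
density_after = np.count_nonzero(dense_weights) / dense_weights.size

print(density_before, density_after)
```

With the δ fill-in, the density jumps from 1/3 to 1.0, which is why a sparse index cannot represent BM25L document vectors directly.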

@Witiko Witiko force-pushed the feature/bm25 branch 6 times, most recently from 53ec11f to fd283a4 on April 1, 2022 00:04
@ramsey-coding

ramsey-coding commented Aug 27, 2022

@Witiko what's the equivalent API call for

  • bm25.get_top_n(tokenized_query, corpus, n=80)

@Witiko
Contributor Author

Witiko commented Aug 27, 2022

I don't think there is an equivalent API call. You can get all similarities, sort them (e.g. with argsort), and take the top 80.
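For example, the top-n documents can be recovered from the similarity array with NumPy (a sketch with hypothetical scores; argsort sorts ascending, so we reverse it):

```python
import numpy as np

# Hypothetical similarity scores, e.g. the output of bm25_index[tfidf_query].
similarities = np.array([0.51082563, 0.09121886, 0.0638532])

n = 2  # n=80 in the question above
top_n_ids = np.argsort(similarities)[::-1][:n]  # document ids, best match first

print(top_n_ids)  # → [0 1]
```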

@dunefox

dunefox commented Aug 27, 2022

This would have been very useful for me during the last few weeks. Sadly, there doesn't seem to be much interest in BM25 here.

@ramsey-coding

ramsey-coding commented Aug 27, 2022

I don't think there is an equivalent API call. Get all similarities, sort them (e.g. with argsort), and take the top 80.

@Witiko I don't follow. You are saying that for a given query, I iterate over all 350K data points in the corpus, get a similarity for each, and then take the top 80?

This would not scale at all. 😭

@ramsey-coding

ramsey-coding commented Aug 27, 2022

@Witiko also the API of Gensim is neither user-friendly nor convenient. It appears it just provides similarity scores and does not even return the original documents. Devs need to maintain an external data structure to retrieve the original document.

It appears to me Gensim was gold back in the day. But now it is an old, stale, outdated library that does not want to move forward.

Probably time to abandon this library; devs should look for a better alternative that provides easier API access and more functionality.

@Witiko
Contributor Author

Witiko commented Aug 27, 2022

What I am saying is that it will be significantly faster than rank-bm25 at retrieval time. (This is a continued discussion from dorianbrown/rank_bm25#27 and dorianbrown/rank_bm25#25.)

@ramsey-coding

What I am saying is that it will be significantly faster than rank-bm25.

got it

@Witiko
Contributor Author

Witiko commented Aug 27, 2022

Gensim will get you similarities in the order of indexing, i.e. if you index documents 1, 2, and 3, and then perform a similarity query, you will get back similarities between the query and documents 1, 2, and 3, respectively.

@ramsey-coding

ramsey-coding commented Aug 27, 2022

@Witiko you are phenomenal, thanks for all the great feedback.

I have one more question:

If I set num_best=80 here:

SparseMatrixSimilarity(bm25_corpus,
                       num_docs=len(corpus),
                       num_terms=len(dictionary),
                       normalize_queries=False,
                       normalize_documents=False,
                       num_best=80)  # set num_best=80 to get the top 80 documents

And then get similarities like the following, would the result be sorted by best-matching document?

    tfidf_model = TfidfModel(dictionary=bm25_dictionary, smartirs='bnn')  # Enforce binary weighting of queries
    tfidf_query = tfidf_model[bm25_dictionary.doc2bow(tokenized_query)]

    similarities = bm25_index[tfidf_query]
    for doc_no, score in similarities:
        print("original document:", test_methods_corpus[doc_no])

So the question is: would the result of bm25_index[tfidf_query] be sorted with the best-matching documents first, or not?

@smith-co

@Witiko wow, awesome work. In the context of this implementation:

similarities = bm25_index[tfidf_query]
  • does a higher score mean the document is more similar to the query,
  • or does a lower score mean the document is more similar?

Sorry for the stupid question.

@nashid

nashid commented Aug 28, 2022

This feature would be very useful for me.

@Witiko
Contributor Author

Witiko commented Aug 28, 2022

@smith-co The similarities are BM25 scores, i.e. the higher the similarity, the more similar the document is to your query.

@Witiko
Contributor Author

Witiko commented Aug 28, 2022

@ramsey-coding @smith-co I added outputs to the example code in the original post. Furthermore, I also added an example showing how you can get back the best document for a query. I hope you will find this useful. 😉

@Witiko
Contributor Author

Witiko commented Aug 29, 2022

So the question is: would the result of bm25_index[tfidf_query] be sorted with the best-matching documents first, or not?

@ramsey-coding Sorry, I wasn't at my computer over the weekend. Yes, your understanding is correct: specifying num_best=80 in SparseMatrixSimilarity(...) will cause bm25_index[tfidf_query] to produce an iterable of 80 (document id, similarity) pairs sorted in descending order of similarity, i.e. best match first.
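The behaviour described above can be sketched without Gensim (hypothetical scores; with num_best set, the index yields (document id, similarity) pairs rather than a raw array):

```python
import numpy as np

# Hypothetical similarity scores for a three-document corpus.
similarities = np.array([0.51082563, 0.09121886, 0.0638532])

num_best = 2  # num_best=80 in the question above
best_ids = np.argsort(similarities)[::-1][:num_best]
result = [(int(doc_id), float(similarities[doc_id])) for doc_id in best_ids]

print(result)  # pairs sorted by descending similarity, best match first
```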

@mgeletka

I would really appreciate merging this functionality as I must now use my own custom implementation of BM25 when working with the Gensim library.

@piskvorky
Owner

Code looks nice and clean, sorry for taking so long to review.

@mpenkov anything else we need before merge?

@Witiko how about post-merge? What can we do to promote this functionality (beyond including it in the Gensim gallery)?

Co-authored-by: Radim Řehůřek <me@radimrehurek.com>
@Witiko
Contributor Author

Witiko commented Aug 30, 2022

@piskvorky Thank you for taking the time. We can mention in the release notes that Gensim can now be used for Lucene-style information retrieval.

@piskvorky
Owner

piskvorky commented Aug 30, 2022

Sure, it will go into the release notes, that goes without saying.

I meant more like some demo, or a practical use-case (who'd use the gensim implementation and why?), or similar. A motivational part, to anchor the technical part.

Maybe @dunefox @mgeletka @smith-co @nashid @ramsey-coding could help?

@piskvorky
Owner

@mpenkov anything missing here?

Let's aim to release soon after merging, to get this feature out. Thanks.

@mpenkov
Collaborator

mpenkov commented Sep 8, 2022

Sorry for the delay guys, merging.

Thank you for your efforts and your patience, @Witiko.

@mpenkov mpenkov merged commit 5dbfb1e into piskvorky:develop Sep 8, 2022
@piskvorky
Owner

piskvorky commented Sep 8, 2022

Thanks Misha!

@dunefox @mgeletka @smith-co @nashid @ramsey-coding could you write a few sentences about how you use Okapi BM25, or intend to use it?

Your story, your use-case, your motivation to participate in this PR.

@ditengm

ditengm commented Dec 9, 2022

Hello @Witiko!
Can you please tell me how to get the corpus embedding, that is, the corpus weight matrix?
Thanks!

@Witiko
Contributor Author

Witiko commented Dec 9, 2022

Hello @Witiko!
Can you please tell me how to get the corpus embedding, that is, the corpus weight matrix?
Thanks!

Hello, @ditengm. You can get the BM25 weight matrix of your corpus from bm25_index.index, where bm25_index is the SparseMatrixSimilarity index from the second example in the original post. The type of bm25_index.index is scipy.sparse.csr_matrix.
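For instance, the weight matrix can be inspected like this (a sketch with a made-up matrix standing in for bm25_index.index):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Stand-in for bm25_index.index: one row per document, one column per term,
# holding the BM25 weight of each term in each document.
weights = csr_matrix(np.array([
    [1.05, 0.0, 0.0, 0.93],
    [0.0, 0.2, 0.0, 0.0],
    [0.0, 0.15, 0.4, 0.0],
]))

dense = weights.toarray()         # full (num_docs, num_terms) weight matrix
doc0_weights = weights.getrow(0)  # sparse weight vector of document 0

print(dense.shape)  # → (3, 4)
```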
