Add support for BM25Retriever in InMemoryDocumentStore #3447

tholor · 2022-10-21T09:53:30Z

Is your feature request related to a problem? Please describe.
Many of our tutorials use the ElasticsearchDocumentStore. While this is a good choice for a production system, it can be quite cumbersome to set up and run during your "first minutes with Haystack". It would be awesome to use the InMemoryDocumentStore instead for users' first steps (Tutorial 1, Quick Start ...). The only thing I see holding us back here: InMemoryDocumentStore doesn't support our BM25Retriever, which is a fast and easy retriever to get started with (and therefore a good choice in tutorials). Switching to another retriever might complicate the tutorials, slow them down, or reduce the quality of the answers users get.

Describe the solution you'd like
Supporting the usage of the BM25Retriever in combination with the InMemoryDocumentStore

Describe alternatives you've considered
Using TFIDFRetriever but I am concerned about the quality of results and leading our users into a wrong direction here.

Priority
I don't think this feature is urgent but it might be a helpful step when we want to improve the early user experience


ZanSara · 2022-10-21T09:58:29Z

Some additional context (same idea from @bglearning): deepset-ai/haystack-tutorials#44 (comment)

bogdankostic · 2022-10-21T10:22:55Z

BM25 was recently added to gensim: piskvorky/gensim#3304, we might use this.

vtharmalingam · 2022-10-26T10:51:22Z

@ZanSara: please advise me on how I can contribute to Haystack in general, or to this issue in particular. Thanks :)

ZanSara · 2022-10-26T11:01:07Z

Hello @vtharmalingam! Start from here: https://github.com/deepset-ai/haystack/blob/main/CONTRIBUTING.md

julian-risch · 2022-11-03T14:19:35Z

Hi @vtharmalingam, great to hear that you would like to contribute to this issue! Feel free to open a draft pull request early on so that we can help with feedback. Please let me know if you need any other help.

anakin87 · 2022-11-03T17:26:15Z

I am also interested in contributing to this feature!
Anyway, @vtharmalingam takes precedence.

After the discussions linked above, I want to share some ideas.

I tend to think that the retriever should only query the document store. On the other hand, the ds is responsible for storing documents and their representations (such as BM25), unlike in the TF-IDF retriever implementation.
To do that, InMemoryDocumentStore should be a subclass of KeywordDocumentStore (instead of BaseDocumentStore) and implement the methods query and query_batch.
I like the library rank_bm25: it is simple and lightweight. It must be said that it is less performant than Gensim at retrieval time, but maybe we could accept that, as this document store is not meant for production use cases.
About the practical implementation, I imagine that for every index, we could have an instance of the BM25 class:
self.indexes[index].bm25 = BM25Okapi(tokenized_corpus)
A doubt: should the BM25 representation be generated by default when working with the InMemoryDocumentStore?
It may not be essential if we are using the IMDS for dense retrieval or for TF-IDF retrieval; in those cases, BM25 computation would make the document store unnecessarily slow.

WDYT?
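
To make the "one BM25 instance per index" idea concrete, here is a rough sketch. It is only an illustration under assumptions: the proposal above is to use rank_bm25's BM25Okapi, but to keep the example dependency-free it swaps in a small hand-written Okapi BM25 scorer (`SimpleBM25`, a hypothetical name). The whitespace tokenizer, the `k1` and `b` values, and the `docs_by_index` / `bm25_by_index` dictionaries are also illustrative and not part of Haystack.

```python
import math
from collections import Counter


class SimpleBM25:
    """Minimal Okapi BM25 scorer; a hypothetical stand-in for rank_bm25.BM25Okapi."""

    def __init__(self, tokenized_corpus, k1=1.5, b=0.75):
        self.k1 = k1
        self.b = b
        # Term frequencies and length of every document in the corpus.
        self.doc_freqs = [Counter(doc) for doc in tokenized_corpus]
        self.doc_lens = [len(doc) for doc in tokenized_corpus]
        self.avgdl = sum(self.doc_lens) / len(self.doc_lens)
        n_docs = len(tokenized_corpus)
        # Number of documents that contain each term.
        containing = Counter(term for freqs in self.doc_freqs for term in freqs)
        # The "1 +" inside the log keeps every IDF value non-negative.
        self.idf = {
            term: math.log(1 + (n_docs - n + 0.5) / (n + 0.5))
            for term, n in containing.items()
        }

    def get_scores(self, tokenized_query):
        """Return one BM25 score per document, in corpus order."""
        scores = []
        for freqs, doc_len in zip(self.doc_freqs, self.doc_lens):
            # Length normalisation term shared by all query terms.
            norm = self.k1 * (1 - self.b + self.b * doc_len / self.avgdl)
            score = 0.0
            for term in tokenized_query:
                tf = freqs.get(term, 0)
                if tf:
                    score += self.idf[term] * tf * (self.k1 + 1) / (tf + norm)
            scores.append(score)
        return scores


# One scorer per index, mirroring self.indexes[index].bm25 from the comment above.
docs_by_index = {
    "document": [
        "Haystack is a framework for building search systems",
        "BM25 is a ranking function used by search engines",
        "Elasticsearch can be cumbersome to set up",
    ]
}
bm25_by_index = {
    index: SimpleBM25([text.lower().split() for text in texts])
    for index, texts in docs_by_index.items()
}

scores = bm25_by_index["document"].get_scores("bm25 ranking".split())
best = max(range(len(scores)), key=scores.__getitem__)
print(docs_by_index["document"][best])
```

Note that the whole representation is rebuilt from the tokenized corpus, which is why computing it by default could slow down users who only need dense or TF-IDF retrieval.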

ZanSara · 2022-11-04T09:08:32Z

Hey @anakin87! Sounds good 🙂

> I tend to think that the retriever should only query the document store. On the other hand, the ds is responsible for storing documents and their representations (such as BM25), unlike in the TF-IDF retriever implementation.
>
> To do that, InMemoryDocumentStore should be a subclass of KeywordDocumentStore (instead of BaseDocumentStore) and implement the methods query and query_batch.

Fully agree!

> I like the library rank_bm25: it is simple and lightweight. It must be said that it is less performant than Gensim at retrieval time, but maybe we could accept that, as this document store is not meant for production use cases.

I also agree on this. It brings in no dependencies, which is a relief 😄 For now at least I imagine it would be a nice compromise. We should make sure that for workloads like the one in the tutorials it doesn't become too slow, however. Let's see after how many documents the 1s threshold for retrieval is reached.

> About the practical implementation, I imagine that for every index, we could have an instance of the BM25 class:
> self.indexes[index].bm25 = BM25Okapi(tokenized_corpus)

I trust you on this for now; I haven't read the rank_bm25 docs yet. In principle sounds good to me 👍

> A doubt: should the BM25 representation be generated by default when working with the InMemoryDocumentStore?
> It may not be essential if we are using the IMDS for dense retrieval or for TF-IDF retrieval; in those cases, BM25 computation would make the document store unnecessarily slow.

This issue really has no good answer in the current architecture, but the same goes for the amazing retriever.document_store.update_embeddings(retriever=retriever) incantation, right? 🥲 My opinion is to not make it automatic. But we can do better than the update_embeddings approach above: let's make an __init__ param for this!

Something like docstore = InMemoryDocumentStore(bm25_ranks=True). Let's also leave the update_bm25_ranks() method public, so anyone who wants to update the ranks manually can still do so. But I'd like it to be an init param so one can set it and forget it.

Then we can do the same to the other docstores for update_embeddings... but in another PR I guess 😄
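
A rough sketch of how the opt-in flag described above might behave. Everything here is hypothetical: `ToyInMemoryStore` is not the Haystack class, only the names `bm25_ranks` and `update_bm25_ranks()` come from this comment, and plain term counts stand in for the real BM25 state so the example stays short.

```python
from collections import Counter


class ToyInMemoryStore:
    """Hypothetical stand-in for InMemoryDocumentStore; not the Haystack class."""

    def __init__(self, bm25_ranks=False):
        # When True, the keyword representation is refreshed on every write.
        self.bm25_ranks = bm25_ranks
        self.docs = []
        # Per-document term counts; a placeholder for real BM25 statistics.
        self.term_counts = []

    def write_documents(self, texts):
        self.docs.extend(texts)
        if self.bm25_ranks:
            self.update_bm25_ranks()

    def update_bm25_ranks(self):
        """Public on purpose, so the refresh can also be triggered manually."""
        self.term_counts = [Counter(text.lower().split()) for text in self.docs]

    def query(self, query, top_k=1):
        if len(self.term_counts) != len(self.docs):
            raise RuntimeError("ranks are stale; call update_bm25_ranks() first")
        terms = query.lower().split()
        # Rank documents by how often they contain the query terms.
        ranked = sorted(
            zip(self.docs, self.term_counts),
            key=lambda pair: sum(pair[1][term] for term in terms),
            reverse=True,
        )
        return [doc for doc, _ in ranked[:top_k]]


# "Set it and forget it": ranks are rebuilt automatically on write.
auto = ToyInMemoryStore(bm25_ranks=True)
auto.write_documents(["BM25 ranks documents", "dense retrieval uses embeddings"])
print(auto.query("bm25"))

# Default behaviour: no keyword work is done until it is requested explicitly.
manual = ToyInMemoryStore()
manual.write_documents(["BM25 ranks documents", "dense retrieval uses embeddings"])
manual.update_bm25_ranks()
print(manual.query("embeddings"))
```

With the flag off, writes stay cheap for dense-only use, at the cost of an explicit refresh before the first keyword query.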

anakin87 · 2022-11-10T21:03:34Z

I'm starting to work on this... 🛠️

tholor added type:feature New feature or request topic:document_store topic:retriever labels Oct 21, 2022

ZanSara added the Contributions wanted! Looking for external contributions label Oct 21, 2022

anakin87 mentioned this issue Oct 27, 2022

Feature Request: Add index parameter to TFiDF retriever #1634

Closed

ZanSara mentioned this issue Oct 31, 2022

Extend BM25Retriever to work with non-Elasticsearch based DocumentStores #3509

Closed

masci assigned julian-risch Nov 2, 2022

anakin87 mentioned this issue Nov 12, 2022

feat: add support for BM25Retriever in InMemoryDocumentStore #3561

Merged

6 tasks

ZanSara closed this as completed in #3561 Nov 22, 2022
