
Extend BM25Retriever to work with non-Elasticsearch based DocumentStores #3509

Closed
brandenchan opened this issue Oct 31, 2022 · 2 comments

@brandenchan
Contributor

Is your feature request related to a problem? Please describe.

  • In the coming rework of Tutorials, the first tutorial uses an InMemoryDocumentStore and a TFIDFRetriever. It would be preferable to use the BM25Retriever instead, since it is an improvement over TF-IDF, but it is currently incompatible with the InMemoryDocumentStore.

Describe the solution you'd like

  • Implement BM25 retrieval that doesn't rely on the Elasticsearch implementation. This might involve using another BM25 Python library such as rank-bm25.
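To make the proposal concrete, here is a minimal sketch of what Elasticsearch-free BM25 retrieval over an in-memory corpus could look like. This is purely illustrative: the class name `InMemoryBM25` and the whitespace tokenizer are assumptions, and a real implementation would likely wrap a library such as rank-bm25 rather than hand-roll the Okapi BM25 formula as done here.

```python
# Illustrative sketch only: a pure-Python Okapi BM25 scorer over an
# in-memory corpus. InMemoryBM25 and the naive tokenizer are hypothetical.
import math
from collections import Counter


class InMemoryBM25:
    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        # Naive whitespace tokenization; a real retriever would use a proper analyzer.
        self.docs = [doc.lower().split() for doc in docs]
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        self.tfs = [Counter(d) for d in self.docs]
        # Document frequency of each term across the whole corpus.
        df = Counter()
        for tf in self.tfs:
            df.update(tf.keys())
        n = len(self.docs)
        # Okapi BM25 IDF with the +1 smoothing used by e.g. Lucene.
        self.idf = {t: math.log((n - f + 0.5) / (f + 0.5) + 1) for t, f in df.items()}

    def score(self, query, i):
        tf, dl = self.tfs[i], len(self.docs[i])
        s = 0.0
        for term in query.lower().split():
            f = tf.get(term, 0)
            if f == 0:
                continue
            s += self.idf[term] * f * (self.k1 + 1) / (
                f + self.k1 * (1 - self.b + self.b * dl / self.avgdl))
        return s

    def retrieve(self, query, top_k=3):
        # Return indices of the top_k highest-scoring documents.
        ranked = sorted(range(len(self.docs)),
                        key=lambda i: self.score(query, i), reverse=True)
        return ranked[:top_k]
```

Usage would mirror the existing retriever API: build the index over the store's documents once, then call `retrieve(query)` to get the best-matching documents back.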
@ZanSara
Contributor

ZanSara commented Oct 31, 2022

I think this is a duplicate of #3447. Let's keep that one only?

In practice, many other document stores can hardly support BM25, and I'm not sure it's worth stretching the support that far right now. Implementing BM25 with the current architecture in a docstore like, say, Pinecone is going to look quite ugly and, above all, be very slow. This is because, to add support for BM25, either:

  • the index would need to be rebuilt every time the docstore is instantiated (slower proportionally to the docstore size), or
  • every document would need to be updated every time a new document is added, making adding documents slower and slower as the document store size increases (i.e. adding the Nth document is going to be N times slower than adding the 1st).

Or at least that's my understanding of it. Open to being corrected!
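The reason both options above are costly can be sketched in a few lines: BM25's IDF and average document length are corpus-wide statistics, so every cached value depends on every document. The `NaiveBM25Index` class below is hypothetical and exists only to show why the straightforward approach recomputes those statistics over the whole corpus on each insert, making the Nth add roughly N times more expensive than the first.

```python
# Hypothetical sketch of why naive incremental BM25 indexing is O(N) per add:
# IDF and avgdl are corpus-wide statistics, so the simplest correct approach
# recomputes both over ALL documents whenever one document is added.
import math
from collections import Counter


class NaiveBM25Index:
    def __init__(self):
        self.docs = []
        self.idf = {}
        self.avgdl = 0.0

    def add(self, doc):
        self.docs.append(doc.lower().split())
        # This full pass over the corpus is what makes the Nth insert
        # roughly N times costlier than the 1st.
        n = len(self.docs)
        self.avgdl = sum(len(d) for d in self.docs) / n
        df = Counter()
        for d in self.docs:
            df.update(set(d))
        self.idf = {t: math.log((n - f + 0.5) / (f + 0.5) + 1)
                    for t, f in df.items()}
```

Note how adding a single new document changes the IDF of terms in documents that were indexed long ago, which is exactly why the per-document state cannot simply be left untouched.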

Note that this is all true for InMemory as well, but since it is a "toy" docstore, I'm not concerned: it will never be used in production with big datasets, so such overheads are manageable.

Another perspective on the same core issue: #1634 (comment)

@brandenchan
Contributor Author

Ah, I didn't see the other issues! Yes, let's keep the discussion there.
