
Extend BM25Retriever to work with non-Elasticsearch based DocumentStores #3509

Closed
brandenchan opened this issue Oct 31, 2022 · 2 comments

@brandenchan
Contributor

Is your feature request related to a problem? Please describe.

  • In the coming rework of Tutorials, the first tutorial uses an InMemoryDocumentStore and a TFIDFRetriever. It would be preferable to use the BM25Retriever instead, since it is an improvement over TF-IDF, but it is currently incompatible with the InMemoryDocumentStore.

Describe the solution you'd like

  • Implement BM25 retrieval that doesn't rely on the Elasticsearch implementation. This might involve using another BM25 Python library such as rank-bm25.
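To make the proposal concrete, here is a minimal sketch of what Elasticsearch-free BM25 retrieval over an in-memory corpus could look like. This is purely illustrative: the class name `InMemoryBM25` and the whitespace tokenizer are assumptions, and a real implementation would likely wrap a library such as rank-bm25 rather than hand-roll the Okapi BM25 formula as done here.

```python
# Illustrative sketch only: a pure-Python Okapi BM25 scorer over an
# in-memory corpus. InMemoryBM25 and the naive tokenizer are hypothetical.
import math
from collections import Counter


class InMemoryBM25:
    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        # Naive whitespace tokenization; a real retriever would use a proper analyzer.
        self.docs = [doc.lower().split() for doc in docs]
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        self.tfs = [Counter(d) for d in self.docs]
        # Document frequency of each term across the whole corpus.
        df = Counter()
        for tf in self.tfs:
            df.update(tf.keys())
        n = len(self.docs)
        # Okapi BM25 IDF with the +1 smoothing used by e.g. Lucene.
        self.idf = {t: math.log((n - f + 0.5) / (f + 0.5) + 1) for t, f in df.items()}

    def score(self, query, i):
        tf, dl = self.tfs[i], len(self.docs[i])
        s = 0.0
        for term in query.lower().split():
            f = tf.get(term, 0)
            if f == 0:
                continue
            s += self.idf[term] * f * (self.k1 + 1) / (
                f + self.k1 * (1 - self.b + self.b * dl / self.avgdl))
        return s

    def retrieve(self, query, top_k=3):
        # Return indices of the top_k highest-scoring documents.
        ranked = sorted(range(len(self.docs)),
                        key=lambda i: self.score(query, i), reverse=True)
        return ranked[:top_k]
```

Usage would mirror the existing retriever API: build the index over the store's documents once, then call `retrieve(query)` to get the best-matching documents back.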
@ZanSara
Contributor

ZanSara commented Oct 31, 2022

I think this is a duplicate of #3447. Let's keep that one only?

In practice, many other document stores can hardly support BM25, and I'm not sure it's worth stretching the support that far right now. Implementing BM25 with the current architecture in a docstore like, say, Pinecone is going to look quite ugly and, above all, be very slow. This is because, to add support for BM25, either:

  • the index would need to be rebuilt every time the docstore is instantiated (slower proportionally to the docstore size), or
  • every document would need to be updated every time a new document is added, making adding documents slower and slower as the document store size increases (i.e. adding the Nth document is going to be N times slower than adding the 1st).

Or at least that's my understanding of it. Open to being corrected!
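The reason both options above are costly can be sketched in a few lines: BM25's IDF and average document length are corpus-wide statistics, so every cached value depends on every document. The `NaiveBM25Index` class below is hypothetical and exists only to show why the straightforward approach recomputes those statistics over the whole corpus on each insert, making the Nth add roughly N times more expensive than the first.

```python
# Hypothetical sketch of why naive incremental BM25 indexing is O(N) per add:
# IDF and avgdl are corpus-wide statistics, so the simplest correct approach
# recomputes both over ALL documents whenever one document is added.
import math
from collections import Counter


class NaiveBM25Index:
    def __init__(self):
        self.docs = []
        self.idf = {}
        self.avgdl = 0.0

    def add(self, doc):
        self.docs.append(doc.lower().split())
        # This full pass over the corpus is what makes the Nth insert
        # roughly N times costlier than the 1st.
        n = len(self.docs)
        self.avgdl = sum(len(d) for d in self.docs) / n
        df = Counter()
        for d in self.docs:
            df.update(set(d))
        self.idf = {t: math.log((n - f + 0.5) / (f + 0.5) + 1)
                    for t, f in df.items()}
```

Note how adding a single new document changes the IDF of terms in documents that were indexed long ago, which is exactly why the per-document state cannot simply be left untouched.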

Note that this is all true for InMemory as well, but since it is a "toy" docstore, I'm not concerned: it will never be used in production with big datasets, so such overheads are manageable.

Another perspective on the same core issue: #1634 (comment)

@brandenchan
Contributor Author

Ah, I didn't see the other issues! Yes, let's keep the discussion there.
