feat: add support for `BM25Retriever` in `InMemoryDocumentStore` #3561

anakin87 · 2022-11-12T12:24:37Z

Related Issues

fixes Add support for BM25Retriever in InMemoryDocumentStore #3447

Only a first draft...

Checklist

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added tests that demonstrate the correct behavior of the change
I've used the conventional commit convention for my PR title
I documented my code
I ran pre-commit hooks and fixed any issue

anakin87 · 2022-11-13T18:47:23Z

Proposed Changes:

As discussed in #3447, I made the InMemoryDocumentStore optionally store a BM25 sparse representation for each index; this representation is built with the lightweight rank_bm25 library.

To make the InMemoryDocumentStore accept queries from the BM25Retriever, I changed the document store: it is now a subclass of KeywordDocumentStore (instead of BaseDocumentStore) and implements the methods query and query_batch.
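To illustrate the idea described above, here is a minimal, self-contained sketch of an in-memory store that keeps tokenized documents per index and answers `query` and `query_batch` with BM25 scores. It is not the code in this PR: the class `TinyBM25Store`, the `tokenize` helper, and the default `k1`/`b` values are hypothetical, and the scoring is a hand-written Okapi BM25 with a non-negative IDF variant rather than a call into rank_bm25.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-word characters (a stand-in tokenizer)."""
    return re.findall(r"\w+", text.lower())


class TinyBM25Store:
    """Hypothetical in-memory store keeping a BM25-ready corpus per index."""

    def __init__(self, k1: float = 1.5, b: float = 0.75) -> None:
        self.k1 = k1  # term-frequency saturation (assumed default)
        self.b = b    # document-length normalization (assumed default)
        self.docs: Dict[str, List[str]] = {}          # index -> raw texts
        self.tokens: Dict[str, List[List[str]]] = {}  # index -> tokenized texts

    def write_documents(self, texts: List[str], index: str = "document") -> None:
        # Raw texts and their token lists are stored in the same order.
        self.docs.setdefault(index, []).extend(texts)
        self.tokens.setdefault(index, []).extend(tokenize(t) for t in texts)

    def query(self, query: str, top_k: int = 10, index: str = "document") -> List[Tuple[str, float]]:
        corpus = self.tokens.get(index, [])
        if not corpus:
            return []
        n_docs = len(corpus)
        avg_len = sum(len(doc) for doc in corpus) / n_docs
        # Document frequency: in how many documents each term occurs.
        df = Counter(term for doc in corpus for term in set(doc))
        query_terms = tokenize(query)
        scores = []
        for doc in corpus:
            tf = Counter(doc)
            score = 0.0
            for term in query_terms:
                if term not in tf:
                    continue
                idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
                norm = tf[term] + self.k1 * (1 - self.b + self.b * len(doc) / avg_len)
                score += idf * tf[term] * (self.k1 + 1) / norm
            scores.append(score)
        ranked = sorted(zip(self.docs[index], scores), key=lambda pair: pair[1], reverse=True)
        return ranked[:top_k]

    def query_batch(self, queries: List[str], top_k: int = 10, index: str = "document") -> List[List[Tuple[str, float]]]:
        # One ranked result list per query, in the order the queries were given.
        return [self.query(q, top_k=top_k, index=index) for q in queries]
```

Recomputing document frequencies on every query keeps the sketch short; a real store would more likely rebuild the BM25 statistics only when documents are written.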

How did you test it?

As you can see from this notebook, this implementation works fine for Haystack Tutorial 1.

For more thorough tests, I need help:

- It is probably possible to test this new behavior of the document store with some slight changes in test_retriever.py, but I'm struggling to understand how.
- Any ideas on unit tests to add?

@ZanSara feel free to jump in! 🙂

haystack/document_stores/memory.py

ZanSara

Fantastic! Thank you for the great work! 🚀

Regarding tests, I believe they should come from two angles:

First we need unit tests as part of the document store tests. I'm sure there are no specific InMemoryDocumentStore tests now, but please add a few in the main docstore tests test_document_store.py, and parametrize them to run on memory only. @masci will take care of them shortly after we merge this PR
Then we can test them from BM25Retriever side. I think it's sufficient to add (elasticsearch, memory) to this fixture: https://github.com/deepset-ai/haystack/blob/main/test/nodes/test_retriever.py#L33-L53
- Incidentally this would be a good time to fix the parametrization of these tests and replace elasticsearch with bm25 to represent BM25Retriever in tests. The switch needs to be done:
  - Here, in haystack/test/conftest.py (line 723 in 4dfddf0): `@pytest.fixture(params=["es_filter_only", "elasticsearch", "dpr", "embedding", "tfidf", "table_text_retriever"])`
  - Here: https://github.com/deepset-ai/haystack/blob/main/test/conftest.py#L789
  - And in every test that uses the retriever and retriever_with_docs fixture 🥲 Please don't feel obliged to do this, I'll take care of it if it's too much work 👍

haystack/document_stores/memory.py

anakin87 · 2022-11-14T18:20:41Z

> First we need unit tests as part of the document store tests. I'm sure there are no specific InMemoryDocumentStore tests now, but please add a few in the main docstore tests test_document_store.py, and parametrize them to run on memory only. @masci will take care of them shortly after we merge this PR

I added some document store tests. As usual, there is room for improvement... @ZanSara 😃

> Then we can test them from BM25Retriever side. I think it's sufficient to add (elasticsearch, memory) to this fixture: https://github.com/deepset-ai/haystack/blob/main/test/nodes/test_retriever.py#L33-L53
>
> Incidentally this would be a good time to fix the parametrization of these tests and replace elasticsearch with bm25 to represent BM25Retriever in tests.

I tried to replace elasticsearch with bm25 for BM25Retriever. Please check if everything is OK.
I added (bm25, memory_bm25) to the fixture (and in conftest) to test sparse and dense retrieval separately for InMemoryDocumentStore. It seems to work...

ZanSara

Very nice! Thank you for taking care of the fixture renaming. There's one last thing that we should probably fix and then it can be merged 😊

test/nodes/test_retriever.py

test/conftest.py

test/document_stores/test_document_store.py

anakin87 · 2022-11-17T16:36:13Z

@ZanSara the CI is failing in a strange way:
AttributeError: module 'faiss' has no attribute 'swigfaiss'.

Any ideas?

julian-risch · 2022-11-18T08:03:05Z

> @ZanSara the CI is failing in a strange way: AttributeError: module 'faiss' has no attribute 'swigfaiss'.
>
> Any ideas?

Hi @anakin87 we will pin faiss-cpu with the following PR: #3603
There was a new faiss-cpu release that seems to be causing the problems.

ZanSara

Sorry to hold this one again but indeed you found a super valid corner case by setting use_bm25=True as default in tests. Good to see the test suite being really useful for once 😄
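The review thread behind this comment is not shown here, but the two commits pushed afterwards are titled "handle non-textual docs" and "query only textual documents". Below is a minimal sketch of what such filtering could look like, assuming each document carries a `content_type` field. The `Doc` dataclass and both helper functions are hypothetical stand-ins, not the code merged in this PR.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class Doc:
    """Hypothetical stand-in for a stored document."""
    content: Any
    content_type: str = "text"  # for example "text" or "table"


def textual_only(docs: List[Doc]) -> List[Doc]:
    """Keep only documents whose content is a string that BM25 can tokenize."""
    return [d for d in docs if d.content_type == "text" and isinstance(d.content, str)]


def tokenized_corpus(docs: List[Doc]) -> List[List[str]]:
    """Build the BM25 corpus from textual documents only (whitespace tokenizer)."""
    return [d.content.lower().split() for d in textual_only(docs)]
```

Filtering before tokenization means a table or other non-string content never reaches the tokenizer, which is one way a store with BM25 enabled by default could avoid failing on mixed-content indexes.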

test/nodes/test_retriever.py

ZanSara

Great! Thank you so much! 🚀

ZanSara · 2022-11-22T08:27:40Z

@deepset-ai/documentation Let's update the docs to reflect this change

* Fix docstrings for DocumentStores
* Fix docstrings for AnswerGenerator
* Fix docstrings for Connector
* Fix docstrings for DocumentClassifier
* Fix docstrings for LabelGenerator
* Fix docstrings for QueryClassifier
* Fix docstrings for Ranker
* Fix docstrings for Retriever and Summarizer
* Fix docstrings for Translator
* Fix docstrings for Pipelines
* Fix docstrings for Primitives
* Fix Python code block spacing
* Add line break before code block
* Fix code blocks
* fix: discard metadata fields if not set in Weaviate (#3578)
* fix weaviate bug in returning embeddings and setting empty meta fields
* review comment
* Update unstable version and openapi schema (#3584)
  Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
* fix: Flatten `DocumentClassifier` output in `SQLDocumentStore`; remove `_sql_session_rollback` hack in tests (#3273)
* first draft
* fix
* fix
* move test to test_sql
* test: add test to check id_hash_keys is not ignored (#3577)
* refactor: Generate JSON schema when missing (#3533)
* removed unused script
* print info logs when generating openapi schema
* create json schema only when needed
* fix tests
* Remove leftover
  Co-authored-by: ZanSara <sarazanzo94@gmail.com>
* move milvus tests to their own module (#3596)
* feat: store metadata using JSON in SQLDocumentStore (#3547)
* add warnings
* make the field cachable
* review comment
* Pin faiss-cpu as 1.7.3 seems to have problems (#3603)
* Update Haystack imports (#3599)
* Update Python version (#3602)
* fix: `ParsrConverter` fails on pages without text (#3605)
* try to fix bug
* remove print
* leftover
* refactor: update Squad data (#3513)
* refractor the to_squad data class
* fix the validation label
* refractor the to_squad data class
* fix the validation label
* add the test for the to_label object function
* fix the tests for to_label_objects
* move all the test related to squad data to one file
* remove unused imports
* revert tiny_augmented.json
  Co-authored-by: ZanSara <sarazanzo94@gmail.com>
* Url fixes (#3592)
* add 2 example scripts
* fixing faq script
* fixing some urls
* removing example scripts
* black reformatting
* add labeler to the repo (#3609)
* convert eval metrics to python float (#3612)
* feat: add support for `BM25Retriever` in `InMemoryDocumentStore` (#3561)
* very first draft
* implement query and query_batch
* add more bm25 parameters
* add rank_bm25 dependency
* fix mypy
* remove tokenizer callable parameter
* remove unused import
* only json serializable attributes
* try to fix: pylint too-many-public-methods / R0904
* bm25 attribute always present
* convert errors into warnings to make the tutorial 1 work
* add docstrings; tests
* try to make tests run
* better docstrings; revert not running tests
* some suggestions from review
* rename elasticsearch retriever as bm25 in tests; try to test memory_bm25
* exclude tests with filters
* change elasticsearch to bm25 retriever in test_summarizer
* add tests
* try to improve tests
* better type hint
* adapt test_table_text_retriever_embedding
* handle non-textual docs
* query only textual documents
* Incorporate Reviewer feedback
* refactor: replace `torch.no_grad` with `torch.inference_mode` (where possible) (#3601)
* try to replace torch.no_grad
* revert erroneous change
* revert other module breaking
* revert training/base
* Fix docstrings for DocumentStores
* Fix docstrings for AnswerGenerator
* Fix docstrings for Connector
* Fix docstrings for DocumentClassifier
* Fix docstrings for LabelGenerator
* Fix docstrings for QueryClassifier
* Fix docstrings for Ranker
* Fix docstrings for Retriever and Summarizer
* Fix docstrings for Translator
* Fix docstrings for Pipelines
* Fix docstrings for Primitives
* Fix Python code block spacing
* Add line break before code block
* Fix code blocks
* Incorporate Reviewer feedback

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
Co-authored-by: Julian Risch <julian.risch@deepset.ai>
Co-authored-by: ZanSara <sarazanzo94@gmail.com>
Co-authored-by: Espoir Murhabazi <espoir.mur@gmail.com>
Co-authored-by: Tuana Celik <tuana.celik@deepset.ai>
Co-authored-by: tstadel <60758086+tstadel@users.noreply.github.com>

anakin87 added 15 commits November 10, 2022 21:12

Merge remote-tracking branch 'origin/main' into imds_support_for_bm25

cf16c25

very first draft

ee89a34

implement query and query_batch

3d7e8aa

add more bm25 parameters

5890742

add rank_bm25 dependency

d94433f

fix mypy

ce5efae

remove tokenizer callable parameter

432eff7

remove unused import

91d40ff

only json serializable attributes

2e39f05

try to fix: pylint too-many-public-methods / R0904

343f1d4

bm25 attribute always present

514c248

convert errors into warnings to make the tutorial 1 work

03a35a2

add docstrings; tests

25d6d42

try to make tests run

707a81b

better docstrings; revert not running tests

ac67603

anakin87 commented Nov 13, 2022

haystack/document_stores/memory.py (resolved)

anakin87 commented Nov 13, 2022

haystack/document_stores/memory.py (resolved)

anakin87 marked this pull request as ready for review November 13, 2022 18:56

anakin87 requested a review from a team as a code owner November 13, 2022 18:56

anakin87 requested review from bogdankostic and removed request for a team November 13, 2022 18:56

anakin87 marked this pull request as draft November 13, 2022 19:05

anakin87 marked this pull request as ready for review November 13, 2022 19:05

ZanSara reviewed Nov 14, 2022

anakin87 added 3 commits November 14, 2022 18:53

some suggestions from review

2381901

Merge remote-tracking branch 'upstream/main' into imds_support_for_bm25

2d830c9

rename elasticsearch retriever as bm25 in tests; try to test memory_bm25

34fedfc

exclude tests with filters

bbd9faa

ZanSara suggested changes Nov 16, 2022

test/nodes/test_retriever.py (outdated, resolved)

test/conftest.py (outdated, resolved)

test/document_stores/test_document_store.py (outdated, resolved)

anakin87 mentioned this pull request Nov 16, 2022

Tutorial restructure draft deepset-ai/haystack-tutorials#44

Closed

try to improve tests

2e06683

anakin87 marked this pull request as draft November 17, 2022 16:23

anakin87 marked this pull request as ready for review November 17, 2022 16:24

Merge branch 'main' into imds_support_for_bm25

1ee1544

better type hint

ba89540

anakin87 mentioned this pull request Nov 17, 2022

Error when importing FAISSDocumentStore #3600

Closed

1 task

anakin87 added 2 commits November 18, 2022 12:25

Merge branch 'main' into imds_support_for_bm25

832ef82

adapt test_table_text_retriever_embedding

f64016e

anakin87 requested a review from ZanSara November 18, 2022 14:28

ZanSara suggested changes Nov 21, 2022

test/nodes/test_retriever.py (resolved)

anakin87 added 2 commits November 21, 2022 19:14

handle non-textual docs

99429f6

query only textual documents

aad1970

anakin87 requested a review from ZanSara November 21, 2022 19:23

ZanSara approved these changes Nov 22, 2022

ZanSara merged commit 3040e59 into deepset-ai:main Nov 22, 2022

ZanSara added type:feature New feature or request topic:document_store journey:first steps topic:retriever labels Nov 22, 2022

ZanSara requested a review from a team November 22, 2022 08:27

ZanSara added the action:needs documentation label Nov 22, 2022

anakin87 deleted the imds_support_for_bm25 branch November 22, 2022 08:34

anakin87 mentioned this pull request Dec 13, 2022

feat: add index parameter to TfidfRetriever #3666

Merged

6 tasks

anakin87 mentioned this pull request Feb 27, 2023

Explore setting up Elasticsearch document store ugm2/neural-search-demo#11

Open

5 tasks
