id_hash_keys is ignored #3236
Comments
A bit more digging reveals this code in the PreProcessor split function (line 431 in preprocessor.py):

    documents = []
    for i, txt in enumerate(text_splits):
        # I think this is the problem
        doc = Document(content=txt, meta=deepcopy(document.meta) or {}, id_hash_keys=id_hash_keys)
        doc.meta["_split_id"] = i
        if self.add_page_number:
            doc.meta["page"] = splits_pages[i]
        documents.append(doc)
    return documents

The |
I ran some experiments with Haystack 1.8.0 on Ubuntu 22.04 and Colab (=Ubuntu 18.04). Test on document instantiation:
Ingestion test:
@thobson My tests show different results than yours. |
@anakin87 I've checked the Haystack version and I'm definitely on 1.8.0. I also tried the same code on Colab and got the same results as you. I'm thinking this may be related to the Python version. My broken example used 3.9.9 on macOS. I'll try some other versions and report back ... |
This is very strange. The only place this is working for me is Colab. I just fired up an EC2 instance with Python 3.10.4 and farm-haystack==1.8.0.

    # bug.py
    from haystack.schema import Document
    doc1 = Document(content="hello world", meta={'doc_id': '1'}, id_hash_keys=['meta'])
    print(doc1.id)
    doc2 = Document(content="hello world", meta={'doc_id': '2'}, id_hash_keys=['meta'])
    print(doc2.id)

    $ python3 --version
    Python 3.10.4
    $ pip freeze | grep haystack
    farm-haystack==1.8.0
    $ python3 bug.py
    ab97467d60eb63b1533f6046eb7f610e
    ab97467d60eb63b1533f6046eb7f610e

I'll keep investigating |
Confirmed on my side
|
I tried to reproduce this issue with different Haystack versions and Python versions. The problem did occur with Haystack v1.8.0 but not with any version after that.
For that reason, I only added a test case in PR #3577 to check for this issue in the future, and I suggest closing this issue once that PR is merged. Please let me know if you have this issue with Haystack >v1.8.0. Thank you! |
Confirmed this is working in 1.10.0. Thanks everyone! |
Describe the bug
Passing id_hash_keys to an ingestion pipeline or individual nodes has no effect. For example, given id_hash_keys = ["meta"] and two documents with different metadata, both will share the same id if the content is the same.
Error message
No error
Expected behavior
Generated document ids should be formed using a hash of the fields passed through the id_hash_keys parameter.
Additional context
During debugging I've noticed that the Document constructor ignores the id_hash_keys parameter that is passed by the node, e.g. a TextConverter node. This is possibly due to the issue with mutable default parameters in Python.
For example try this code:
However, if we explicitly pass all constructor arguments:
Different ids are generated
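For reference, the mutable-default pitfall suspected above can be shown in isolation. This is a generic Python sketch, not Haystack code: the function names add_key_shared and add_key_fresh are made up for illustration, and it does not establish that this is what actually happens inside the Document constructor.

```python
# Generic illustration of the mutable-default pitfall; not Haystack code.
# Both function names are hypothetical.

def add_key_shared(key, keys=[]):
    # The default list is created once, when the function is defined,
    # so every call that omits `keys` mutates the same list object.
    keys.append(key)
    return keys


def add_key_fresh(key, keys=None):
    # Using None as a sentinel gives each call its own new list.
    if keys is None:
        keys = []
    keys.append(key)
    return keys


first = add_key_shared("content")
second = add_key_shared("meta")
print(first is second)  # True: both names point at the shared default
print(second)           # ['content', 'meta']

third = add_key_fresh("content")
fourth = add_key_fresh("meta")
print(third is fourth)  # False
print(fourth)           # ['meta']
```

If a constructor had a default like this, state from one call could leak into the next, which would fit the observation that passing every argument explicitly changes the result.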
To Reproduce
See above or try this code
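As a rough illustration of the behavior described under Expected behavior, here is a minimal self-contained sketch. It assumes a hypothetical SketchDocument class that stands in for Haystack's Document, and the hashing scheme (SHA-256 over the repr of the selected fields) is an assumption chosen for the example, not the library's actual algorithm.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class SketchDocument:
    """Hypothetical stand-in for illustration; not Haystack's Document."""

    content: str
    meta: Dict[str, Any] = field(default_factory=dict)
    id_hash_keys: Optional[List[str]] = None
    id: str = field(init=False, default="")

    def __post_init__(self) -> None:
        # Assumption: fall back to hashing only the content when no keys are given.
        keys = self.id_hash_keys or ["content"]
        # Serialize the selected fields in the order given and hash the result.
        payload = ";".join(f"{key}={getattr(self, key)!r}" for key in keys)
        self.id = hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Same content, different meta, id derived from meta: ids should differ.
doc1 = SketchDocument("hello world", {"doc_id": "1"}, ["meta"])
doc2 = SketchDocument("hello world", {"doc_id": "2"}, ["meta"])
print(doc1.id != doc2.id)  # True

# Same content, different meta, id derived from content only: ids match.
doc3 = SketchDocument("hello world", {"doc_id": "1"})
doc4 = SketchDocument("hello world", {"doc_id": "2"})
print(doc3.id == doc4.id)  # True
```

The bug report corresponds to the first pair behaving like the second: identical ids even though id_hash_keys names a field that differs.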
FAQ Check
System: