Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

id_hash_keys is ignored #3236

Closed
1 task done
thobson opened this issue Sep 17, 2022 · 7 comments · Fixed by #3577
Closed
1 task done

id_hash_keys is ignored #3236

thobson opened this issue Sep 17, 2022 · 7 comments · Fixed by #3577
Assignees
Labels
type:bug Something isn't working

Comments

@thobson
Copy link

thobson commented Sep 17, 2022

Describe the bug
Passing id_hash_keys to an ingestion pipeline or individual nodes has no effect. For example given id_hash_keys = ["meta"] and two documents with different meta data, both will share the same id if the content is the same.

Error message
No error

Expected behavior
Generated document ids should be formed using a hash of the fields passed through the id_hash_keys parameter

Additional context
During debugging I've noticed that the Document constructor ignores the id_hash_keys parameter that is passed by the node e.g. a TextConverter node. This is possibly due to the issue with mutable default parameters in Python.

For example try this code:

# Different metadata

doc1 = Document(content="hello world", meta={'doc_id': '1'}, id_hash_keys=['meta'])
print(doc1.id)

doc2 = Document(content="hello world", meta={'doc_id': '2'}, id_hash_keys=['meta'])
print(doc2.id)
Output:
# Same generated ids
ab97467d60eb63b1533f6046eb7f610e
ab97467d60eb63b1533f6046eb7f610e

However if we explicitly pass all constructor arguments:

doc3 = Document("hello world", "text", None, None, {'doc_id': '3'}, None, ['meta'])
print(doc3.id)

doc4 = Document("hello world", "text", None, None, {'doc_id': '4'}, None, ['meta'])
print(doc4.id)
Output:
# Different generated ids
ee6af6c5a7999fae9b46a6c257fd957e
324d51b0738f74b683f7d69cbf2a58ef

Different ids are generated

To Reproduce
See above or try this code

def ingest(meta_id):
    doc_store = InMemoryDocumentStore()
    
    # Pass id_hash_keys to the TextConverter
    text_converter = TextConverter(valid_languages=['en'], progress_bar=False, id_hash_keys=["meta", "content"])

    # Pass id_hash_keys to the PreProcessor
    pre_processor = PreProcessor(progress_bar=False, id_hash_keys=["meta", "content"])
    
    p = Pipeline()
    p.add_node(component=text_converter, name="TextConverter", inputs=["File"])
    p.add_node(component=pre_processor, name="PreProcessor", inputs=["TextConverter"])
    p.add_node(component=doc_store, name="DocumentStore", inputs=["PreProcessor"])
    
    # Pass id_hash_keys to the pipeline
    params = {
        "debug": True,
        "id_hash_keys": ["meta", "content"]
    }
    
    meta = {
        "meta_id": meta_id
    }
    
    result = p.run(file_paths=["./test_doc.txt"], meta=meta, params=params)
    # Final generated id
    doc = result['documents'][0]
    # TextConverter generated id
    debug = result["_debug"]['TextConverter']
    text_converter_id = debug['output']['documents'][0].id

    print("Meta id".ljust(20), meta_id)
    print("TextConverter id".ljust(20), text_converter_id)
    print("Generated id".ljust(20), doc.id)

if __name__ == "__main__":
    ingest("1")
    print("")
    ingest("2")
Output:

Meta id                     1
TextConverter id     3eddb872a053055e226cb754159e098e
Generated id           3eddb872a053055e226cb754159e098e

Meta id                    2
TextConverter id     3eddb872a053055e226cb754159e098e
Generated id           3eddb872a053055e226cb754159e098e

FAQ Check

System:

  • OS: macOS Monterey
  • GPU/CPU: CPU
  • Haystack version (commit or version number): 1.8.0
  • DocumentStore: N/A
  • Reader: N/A
  • Retriever: N/A
@thobson
Copy link
Author

thobson commented Sep 17, 2022

A bit more digging reveals this code in the PreProcessor split function (line 431 in preprocessor.py):

documents = []
for i, txt in enumerate(text_splits):
    # I think this is the problem
    doc = Document(content=txt, meta=deepcopy(document.meta) or {}, id_hash_keys=id_hash_keys)
    doc.meta["_split_id"] = I
    if self.add_page_number:
        doc.meta["page"] = splits_pages[I]
    documents.append(doc)

return documents

The id_hash_keys argument is lost

@anakin87
Copy link
Member

anakin87 commented Sep 18, 2022

I made some experiments with Haystack 1.8.0 on Ubuntu 22.04 and Colab (=Ubuntu 18.04).
Python version: 3.7
You can find my notebook here.

Test on document instantiation:

doc1 = Document(content="hello world", meta={'doc_id': '1'}, id_hash_keys=['meta'])
print(doc1.id)
# 401d5da16fd3fc8a37752172a883f2dc

doc2 = Document(content="hello world", meta={'doc_id': '2'}, id_hash_keys=['meta'])
print(doc2.id)
# 347767162caba0d0869dcb013bcb572a

Ingestion test:

Meta id              1
TextConverter id     70f712e8a8c3d3017f11ccc5db02a8ea
Generated id         729fa6a177edd9f3ba9a55589bbda110

Meta id              2
TextConverter id     c2de7093742566ddff0398d8b113ac25
Generated id         866818724f2411a7f7c5680ad44870ec

@thobson My tests show different results than yours.
Are we sure we are using the same version? Any other ideas?

@thobson
Copy link
Author

thobson commented Sep 19, 2022

@anakin87 I've checked the haystack version and I'm definitely on 1.8.0. I also tried the same code on Colab and got the same results as you. I'm thinking this may be related to the Python version. My broken example used 3.9.9 on MacOS. I'll try some other versions and report back ...

@thobson
Copy link
Author

thobson commented Sep 19, 2022

This is very strange. The only place this is working for me is collab. I just fired up an ec2 instance with Python 3.10.4 and farm-haystack==1.8.0.

# bug.py
from haystack.schema import Document

doc1 = Document(content="hello world", meta={'doc_id': '1'}, id_hash_keys=['meta'])
print(doc1.id)

doc2 = Document(content="hello world", meta={'doc_id': '2'}, id_hash_keys=['meta'])
print(doc2.id)
$ python3 --version
Python 3.10.4

$ pip freeze | grep haystack
farm-haystack==1.8.0

$ python3 bug.py
ab97467d60eb63b1533f6046eb7f610e
ab97467d60eb63b1533f6046eb7f610e

I'll keep investigating

@masci
Copy link
Contributor

masci commented Sep 20, 2022

Confirmed on my side

(tmp-4d563e993a01222) ~/.virtualenvs/tmp-4d563e993a01222 sw_vers 
ProductName:	macOS
ProductVersion:	12.5.1
BuildVersion:	21G83
(tmp-4d563e993a01222) ~/.virtualenvs/tmp-4d563e993a01222 python -V
Python 3.10.6
(tmp-4d563e993a01222) ~/.virtualenvs/tmp-4d563e993a01222 pip freeze | grep haystack
farm-haystack==1.8.0
(tmp-4d563e993a01222) ~/.virtualenvs/tmp-4d563e993a01222 python bug.py 
ab97467d60eb63b1533f6046eb7f610e
ab97467d60eb63b1533f6046eb7f610e

@julian-risch
Copy link
Member

I tried to reproduce this issue with different Haystack versions and python version. The problem did occur with Haystack v1.8.0 but not with any version after that.

Python/Haystack Version 1.8.0 1.9.0 1.10.0
3.10
3.9
3.8

For that reason, I only added a test case to check for this issue in future in this PR #3577 and suggest to close this issue once that PR is merged. Please let me know if you have this issue with Haystack >v1.8.0. Thank you!

@thobson
Copy link
Author

thobson commented Nov 22, 2022

Confirmed this is working in 1.10.0. Thanks everyone!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type:bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants