[Bug] pdf2parquet must calculate hash and size on the file #605

Open
1 of 2 tasks
sujee opened this issue Sep 20, 2024 · 7 comments
Labels: bug (Something isn't working), fixed (Marks an issue as fixed in the dev branch), high priority

Comments

@sujee
Contributor

sujee commented Sep 20, 2024

Search before asking

  • I searched the issues and found no similar issues.

Component

Tools/ingest2parquet

What happened + What you expected to happen

I had duplicate documents (see attached).
I expected exact duplicate files to have the same size and hash.
But it seems the hash is being calculated on 'contents', which is the actual content plus metadata (like the file name).

I think the hash and size should be calculated on the actual file/document, not on the metadata.

image

Expected Behaviour

  • hash should be identical for identical files
  • size should be physical file size in bytes
  • to avoid confusion, these columns can be renamed (or new columns can be created) with names like file_hash and file_size
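
A minimal sketch of the expected behaviour, assuming SHA-256 as the hash (the issue does not say which algorithm pdf2parquet uses). The helper name `file_hash_and_size` and the placeholder bytes are hypothetical; the point is only that both values come from the raw file bytes, so a copy under a different name yields the same result.

```python
import hashlib
import os
import shutil
import tempfile

# Placeholder bytes standing in for a real PDF (assumption for illustration).
PLACEHOLDER = b"%PDF-1.4 placeholder bytes, not a real PDF"


def file_hash_and_size(path, chunk_size=1 << 20):
    """Return (hex SHA-256 digest, size in bytes) computed from the raw file.

    SHA-256 is an assumption; the file name never enters the digest.
    """
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as f:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return digest.hexdigest(), size


with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "earth.pdf")
    duplicate = os.path.join(tmp, "earth-copy.pdf")
    with open(original, "wb") as f:
        f.write(PLACEHOLDER)
    shutil.copyfile(original, duplicate)

    original_info = file_hash_and_size(original)
    duplicate_info = file_hash_and_size(duplicate)

# Identical bytes give an identical (hash, size) pair regardless of file name.
print(original_info == duplicate_info)
```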

Reproduction script

earth.pdf

Create a copy of the above file.

Execute the pdf2parquet section here: https://github.com/sujee/data-prep-kit-examples/blob/main/dpk-intro/dpk_intro_1_python.ipynb

Anything else

No response

OS

Ubuntu

Python

3.11.x

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
sujee added the bug label Sep 20, 2024
@dolfim-ibm
Member

dolfim-ibm commented Sep 20, 2024

At the moment the hash column contains the hash of the actual contents column. This is the JSON representation of the output, which has the property file-info.filename, so different filenames will have different content.

Internally, the JSON has a property file-info.document-hash which is the actual hash of the binary input file.
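
To illustrate the difference, here is a rough sketch, not the transform's actual code. It assumes SHA-256 and a simplified JSON layout: only the `file-info.filename` and `file-info.document-hash` properties come from the description above, while the `text` field and the helper names are hypothetical.

```python
import hashlib
import json


def contents_hash(contents):
    """Hash the serialized JSON output, as the current hash column is described."""
    serialized = json.dumps(contents, sort_keys=True).encode("utf-8")
    return hashlib.sha256(serialized).hexdigest()


def make_contents(filename, raw_bytes):
    """Build a simplified stand-in for the JSON output of one document."""
    return {
        "file-info": {
            "filename": filename,
            # Hash of the binary input only; independent of the file name.
            "document-hash": hashlib.sha256(raw_bytes).hexdigest(),
        },
        "text": "identical extracted text",  # hypothetical field
    }


raw = b"same bytes in both files"
first = make_contents("earth.pdf", raw)
second = make_contents("earth-copy.pdf", raw)

# The JSON differs by file name, so the hash of the contents differs ...
print(contents_hash(first) == contents_hash(second))
# ... while the hash of the binary input is the same for both copies.
print(first["file-info"]["document-hash"] == second["file-info"]["document-hash"])
```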

It could indeed make sense to expose that one as well. Where? Should it be the document_id? Another field? Happy for an open discussion here.

@sujee
Contributor Author

sujee commented Sep 20, 2024

I do see document_hash in the contents.

I would like to see this propagated up as a top-level column in the output parquet, along with the actual file size.

image
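
The request above could be sketched roughly as follows. This is only an illustration with plain dictionaries standing in for parquet rows; the helper `promote_file_columns`, the column names `document_hash` and `file_size`, and the sample values are assumptions, not necessarily what the project implements.

```python
import json


def promote_file_columns(row, raw_size):
    """Copy the nested document hash to a top-level column and add the file size.

    `row` is a dict standing in for one output row whose `contents` column
    holds the JSON output as a string. `raw_size` is the size of the input
    file in bytes, which the caller measures on the file itself.
    """
    contents = json.loads(row["contents"])
    promoted = dict(row)  # leave the input row unchanged
    promoted["document_hash"] = contents["file-info"]["document-hash"]
    promoted["file_size"] = raw_size
    return promoted


# Hypothetical row; the hash value is a placeholder, not a real digest.
row = {
    "filename": "earth.pdf",
    "contents": json.dumps(
        {"file-info": {"filename": "earth.pdf", "document-hash": "abc123"}}
    ),
}

promoted = promote_file_columns(row, raw_size=2048)
print(sorted(promoted))
```

With both values at the top level, duplicates can be found by comparing `document_hash` directly, without parsing `contents`.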

@sujee
Contributor Author

sujee commented Oct 29, 2024

@dolfim-ibm with the new Docling integration, will this be addressed as well?

@dolfim-ibm
Member

Rereading the thread above, there were some open questions about which fields to expose and under which names. Exposing both is certainly a good idea, since they serve different purposes.

dolfim-ibm added the fixed label Nov 1, 2024
@dolfim-ibm
Member

Should be fixed in #756.

@Bytes-Explorer
Collaborator

@sujee Can you test and see if this can be closed?

@sujee
Contributor Author

sujee commented Nov 7, 2024

pdf2pq is now blocked on #767.
