[Bug] pdf2parquet must calculate hash and size on the file #605

sujee · 2024-09-20T08:23:25Z

Search before asking

I searched the issues and found no similar issues.

Component

Tools/ingest2parquet

What happened + What you expected to happen

I had duplicate documents (see attached).
I expected exact duplicates of a file to have the same size and hash.
However, the hash appears to be calculated on 'contents', which is the actual content plus metadata (such as the file name).

I think the hash and size should be calculated on the actual file/document, not on the metadata.

Expected Behaviour

hash should be identical for identical files
size should be the physical file size in bytes
to avoid confusion, these columns can be renamed (or new columns can be created) with names like file_hash and file_size
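
To make the expected behaviour concrete, here is a minimal sketch of hashing the raw file bytes versus hashing a JSON blob that embeds the file name. Everything in it is an assumption for illustration: the helper names, the use of SHA-256 (the thread does not say which algorithm the transform uses), the stand-in bytes, and the simplified JSON layout.

```python
import hashlib
import json


def file_hash_and_size(raw: bytes) -> tuple[str, int]:
    """Hash and size computed on the raw file bytes only (hypothetical helper)."""
    return hashlib.sha256(raw).hexdigest(), len(raw)


def contents_hash(raw: bytes, filename: str) -> str:
    """Hash over a JSON blob that also embeds the file name.

    The JSON layout here is a simplified assumption, not the real
    pdf2parquet output.
    """
    contents = json.dumps(
        {"file-info": {"filename": filename}, "text": raw.decode("latin-1")}
    )
    return hashlib.sha256(contents.encode("utf-8")).hexdigest()


# Stand-in bytes; in practice these would be read from earth.pdf and its copy.
raw = b"%PDF-1.4 stand-in bytes for earth.pdf"

original = file_hash_and_size(raw)
duplicate = file_hash_and_size(raw)

# Identical bytes give identical file hash and size, regardless of file name.
print(original == duplicate)

# Hashing the JSON blob gives different values once the file name differs.
print(contents_hash(raw, "earth.pdf") == contents_hash(raw, "earth-copy.pdf"))
```

With values computed this way, two copies of the same PDF would share the same `file_hash` and `file_size` even though their `contents` hashes differ.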

Reproduction script

earth.pdf

Create a copy of the above file.

Execute the pdf2parquet section here: https://github.com/sujee/data-prep-kit-examples/blob/main/dpk-intro/dpk_intro_1_python.ipynb

Anything else

No response

OS

Ubuntu

Python

3.11.x

Are you willing to submit a PR?

Yes I am willing to submit a PR!

dolfim-ibm · 2024-09-20T14:11:07Z

At the moment, the hash column contains the hash of the contents column. That column holds the JSON representation of the output, which includes the property file-info.filename, so files with different names produce different contents and therefore different hashes.

Internally, the JSON has a property file-info.document-hash which is the actual hash of the binary input file.

It could indeed make sense to expose that one as well. Where? Should it be the document_id? Another field? Happy for an open discussion here.

sujee · 2024-09-20T17:26:41Z

I do see document_hash in the contents.

I would like to see this propagated up as a top-level column in the output parquet, along with the actual file size.
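
A rough sketch of what that propagation could look like, using plain dictionaries in place of parquet rows. The `promote_document_hash` helper, the `document_hash` column name, the sample hash values, and the exact JSON layout are assumptions for illustration; only the `file-info.filename` and `file-info.document-hash` property names come from this thread.

```python
import json


def promote_document_hash(row: dict) -> dict:
    """Return a copy of the row with the embedded document hash as a top-level field.

    Hypothetical helper: assumes row["contents"] is a JSON string carrying
    a "file-info" object with a "document-hash" property.
    """
    file_info = json.loads(row["contents"]).get("file-info", {})
    return {**row, "document_hash": file_info.get("document-hash")}


# Two rows for the same binary file stored under different names.
# The hash values below are made-up placeholders.
rows = [
    {
        "filename": "earth.pdf",
        "contents": json.dumps(
            {"file-info": {"filename": "earth.pdf", "document-hash": "abc123"}}
        ),
    },
    {
        "filename": "earth-copy.pdf",
        "contents": json.dumps(
            {"file-info": {"filename": "earth-copy.pdf", "document-hash": "abc123"}}
        ),
    },
]

promoted = [promote_document_hash(row) for row in rows]

# The contents differ because of the file name, but the promoted hash matches.
print(promoted[0]["contents"] == promoted[1]["contents"])
print(promoted[0]["document_hash"] == promoted[1]["document_hash"])
```

A physical file size column could be added the same way if the size is recorded at read time.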

sujee · 2024-10-29T04:31:27Z

@dolfim-ibm with the new Docling integration, will this be addressed as well?

dolfim-ibm · 2024-10-29T07:28:43Z

Reading the thread again, there were some open questions about which fields to expose and under which names. Exposing both is certainly a good idea, since they serve different purposes.

dolfim-ibm · 2024-11-01T07:28:05Z

Should be fixed in #756.

Bytes-Explorer · 2024-11-06T11:59:54Z

@sujee Can you test and see if this can be closed?

sujee · 2024-11-07T06:26:08Z

pdf2pq is now blocked on #767

sujee added the bug Something isn't working label Sep 20, 2024

daw3rd assigned dolfim-ibm Sep 20, 2024

Bytes-Explorer added the high priority label Oct 29, 2024

dolfim-ibm mentioned this issue Oct 30, 2024

Update pdf2parquet to Docling v2 #756

Merged

5 tasks

dolfim-ibm added the fixed Marks an issues as fixed in the dev branch label Nov 1, 2024

This was referenced Nov 14, 2024

Rename the "Intro" notebooks to call out specific functionality it supports (PDF to Embedings) #782

Open

[Bug] pdf2parquet: identical PDF files have different contents #812

Open
