
Problem with converting pdf files in the intro example when using release 0.2.2.dev2 #767

Closed · 5 comments · Fixed by #793

shahrokhDaijavad (Member) opened this issue Nov 4, 2024
Search before asking

  • I searched the issues and found no similar issues.

Component

Other

What happened + What you expected to happen

In the notebook example here: https://github.com/IBM/data-prep-kit/blob/dev/examples/notebooks/intro/dpk_intro_1_python.ipynb, we convert two PDF files, earth.pdf and mars.pdf, from the input/solar-system directory. We had no problems converting these two files to parquet with the older version of the Docling library included in the 0.2.2.dev1 release, but with the latest Docling in 0.2.2.dev2 we get the following reproducible error:

11:09:57 ERROR - Fatal error with file file_name='/Users/shahrokhdaijavad/Documents/GitHub/data-prep-kit-testing/examples/notebooks/intro/input/solar-system/earth.pdf'. No results produced.
11:09:57 WARNING - Exception processing file /Users/shahrokhdaijavad/Documents/GitHub/data-prep-kit-testing/examples/notebooks/intro/input/solar-system/earth.pdf: Traceback (most recent call last):
File "/opt/anaconda3/envs/data-prep-kit/lib/python3.11/site-packages/data_processing/runtime/transform_file_processor.py", line 79, in process_file
out_files, stats = self.transform.transform_binary(file_name=f_name, byte_array=filedata)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/data-prep-kit/lib/python3.11/site-packages/pdf2parquet_transform.py", line 371, in transform_binary
table = pa.Table.from_pylist(data)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow/table.pxi", line 1984, in pyarrow.lib._Tabular.from_pylist
File "pyarrow/table.pxi", line 6044, in pyarrow.lib._from_pylist
File "pyarrow/table.pxi", line 4625, in pyarrow.lib.Table.from_arrays
File "pyarrow/table.pxi", line 1547, in pyarrow.lib._sanitize_arrays
File "pyarrow/table.pxi", line 1528, in pyarrow.lib._schema_from_arrays
File "pyarrow/array.pxi", line 355, in pyarrow.lib.array
File "pyarrow/array.pxi", line 42, in pyarrow.lib._sequence_to_array
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 88, in pyarrow.lib.check_status
OverflowError: Python int too large to convert to C long

11:09:57 INFO - Completed 1 files (50.0%) in 0.013 min
11:09:57 ERROR - Fatal error with file file_name='/Users/shahrokhdaijavad/Documents/GitHub/data-prep-kit-testing/examples/notebooks/intro/input/solar-system/mars.pdf'. No results produced.
11:09:57 WARNING - Exception processing file /Users/shahrokhdaijavad/Documents/GitHub/data-prep-kit-testing/examples/notebooks/intro/input/solar-system/mars.pdf: Traceback (most recent call last):
File "/opt/anaconda3/envs/data-prep-kit/lib/python3.11/site-packages/data_processing/runtime/transform_file_processor.py", line 79, in process_file
out_files, stats = self.transform.transform_binary(file_name=f_name, byte_array=filedata)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/data-prep-kit/lib/python3.11/site-packages/pdf2parquet_transform.py", line 371, in transform_binary
table = pa.Table.from_pylist(data)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow/table.pxi", line 1984, in pyarrow.lib._Tabular.from_pylist
File "pyarrow/table.pxi", line 6044, in pyarrow.lib._from_pylist
File "pyarrow/table.pxi", line 4625, in pyarrow.lib.Table.from_arrays
File "pyarrow/table.pxi", line 1547, in pyarrow.lib._sanitize_arrays
File "pyarrow/table.pxi", line 1528, in pyarrow.lib._schema_from_arrays
File "pyarrow/array.pxi", line 355, in pyarrow.lib.array
File "pyarrow/array.pxi", line 42, in pyarrow.lib._sequence_to_array
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 88, in pyarrow.lib.check_status
OverflowError: Python int too large to convert to C long

11:09:57 INFO - Completed 2 files (100.0%) in 0.021 min
11:09:57 INFO - Done processing 2 files, waiting for flush() completion.
11:09:57 INFO - done flushing in 0.0 sec
Traceback (most recent call last):
File "/opt/anaconda3/envs/data-prep-kit/lib/python3.11/site-packages/data_processing/runtime/pure_python/transform_orchestrator.py", line 131, in orchestrate
stats["processing_time"] = round(stats["processing_time"], 3)
~~~~~^^^^^^^^^^^^^^^^^^^
KeyError: 'processing_time'
11:09:57 ERROR - Exception during execution 'processing_time': None
11:09:57 INFO - Completed execution in 0.085 min, execution result 1


Exception Traceback (most recent call last)
File :40

Exception: ❌ Job failed

Reproduction script

Run https://github.com/IBM/data-prep-kit/blob/dev/examples/notebooks/intro/dpk_intro_1_python.ipynb after pip installing release 0.2.2.dev2.

Anything else

No response

OS

MacOS (limited support)

Python

3.11.x

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
shahrokhDaijavad added the "bug" (Something isn't working) label on Nov 4, 2024
@santoshborse (Collaborator)

I also experienced this issue with 0.2.2.dev2. It works fine with 0.2.2.dev1, but model loading seemed slow (it took around 3 minutes).

@sujee (Contributor) commented Nov 5, 2024

Confirming with both py311 and py312 envs.

After installing 0.2.2.dev2 I had the following packages:

data_prep_toolkit            0.2.2.dev2
data_prep_toolkit_transforms 0.2.2.dev2

docling                      2.3.1
docling-core                 2.3.0
docling-ibm-models           2.0.3
docling-parse                2.0.2

@dolfim-ibm (Member)

Confirmed, I can reproduce it locally, using a fresh venv with the following packages:

python3.11 -m venv venv
source venv/bin/activate

pip install \
    'data-prep-toolkit[ray]==0.2.2.dev2'  \
    'data-prep-toolkit-transforms[ray,pdf2parquet,doc_id,doc_chunk,ededup,text_encoder]==0.2.2.dev2'

pip install jupyterlab   ipykernel  ipywidgets

@dolfim-ibm (Member)

The issue seems to be related to pyarrow, which is not able to cast a large uint64 integer to the right type: with no explicit schema, pyarrow infers int64 for Python ints, and hash values above 2**63 - 1 overflow it. The new transform uses an efficient uint64 hash of the binary content, which triggers this issue.

This is a minimal example that fails with the same OverflowError:

import pyarrow as pa
pa.Table.from_pylist([{"binary_hash": 17915699055171962696}])
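
For reference, a minimal sketch of one possible workaround (this may or may not be what #793 actually does): passing an explicit schema so the column is built as uint64 rather than the inferred int64 avoids the overflow.

import pyarrow as pa

# Assumes the column name "binary_hash" from the example above; with an explicit
# uint64 schema the value (< 2**64) fits and no int64 inference takes place.
schema = pa.schema([("binary_hash", pa.uint64())])
table = pa.Table.from_pylist([{"binary_hash": 17915699055171962696}], schema=schema)
print(table)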

@dolfim-ibm (Member)

PR with fix: #793
