
Problem with converting pdf files in the intro example when using release 0.2.2.dev2 #767

Closed · 5 comments · Fixed by #793

shahrokhDaijavad (Member) opened this issue Nov 4, 2024
Search before asking

  • I searched the issues and found no similar issues.

Component

Other

What happened + What you expected to happen

In the notebook example here: https://github.com/IBM/data-prep-kit/blob/dev/examples/notebooks/intro/dpk_intro_1_python.ipynb, we convert two PDF files, earth.pdf and mars.pdf, from the input/solar-system directory. We had no problems converting these two files to parquet with the older version of the Docling library included in the 0.2.2.dev1 release, but with the latest Docling in 0.2.2.dev2 we get the following reproducible error:

11:09:57 ERROR - Fatal error with file file_name='/Users/shahrokhdaijavad/Documents/GitHub/data-prep-kit-testing/examples/notebooks/intro/input/solar-system/earth.pdf'. No results produced.
11:09:57 WARNING - Exception processing file /Users/shahrokhdaijavad/Documents/GitHub/data-prep-kit-testing/examples/notebooks/intro/input/solar-system/earth.pdf: Traceback (most recent call last):
File "/opt/anaconda3/envs/data-prep-kit/lib/python3.11/site-packages/data_processing/runtime/transform_file_processor.py", line 79, in process_file
out_files, stats = self.transform.transform_binary(file_name=f_name, byte_array=filedata)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/data-prep-kit/lib/python3.11/site-packages/pdf2parquet_transform.py", line 371, in transform_binary
table = pa.Table.from_pylist(data)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow/table.pxi", line 1984, in pyarrow.lib._Tabular.from_pylist
File "pyarrow/table.pxi", line 6044, in pyarrow.lib._from_pylist
File "pyarrow/table.pxi", line 4625, in pyarrow.lib.Table.from_arrays
File "pyarrow/table.pxi", line 1547, in pyarrow.lib._sanitize_arrays
File "pyarrow/table.pxi", line 1528, in pyarrow.lib._schema_from_arrays
File "pyarrow/array.pxi", line 355, in pyarrow.lib.array
File "pyarrow/array.pxi", line 42, in pyarrow.lib._sequence_to_array
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 88, in pyarrow.lib.check_status
OverflowError: Python int too large to convert to C long

11:09:57 INFO - Completed 1 files (50.0%) in 0.013 min
11:09:57 ERROR - Fatal error with file file_name='/Users/shahrokhdaijavad/Documents/GitHub/data-prep-kit-testing/examples/notebooks/intro/input/solar-system/mars.pdf'. No results produced.
11:09:57 WARNING - Exception processing file /Users/shahrokhdaijavad/Documents/GitHub/data-prep-kit-testing/examples/notebooks/intro/input/solar-system/mars.pdf: Traceback (most recent call last):
File "/opt/anaconda3/envs/data-prep-kit/lib/python3.11/site-packages/data_processing/runtime/transform_file_processor.py", line 79, in process_file
out_files, stats = self.transform.transform_binary(file_name=f_name, byte_array=filedata)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/data-prep-kit/lib/python3.11/site-packages/pdf2parquet_transform.py", line 371, in transform_binary
table = pa.Table.from_pylist(data)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow/table.pxi", line 1984, in pyarrow.lib._Tabular.from_pylist
File "pyarrow/table.pxi", line 6044, in pyarrow.lib._from_pylist
File "pyarrow/table.pxi", line 4625, in pyarrow.lib.Table.from_arrays
File "pyarrow/table.pxi", line 1547, in pyarrow.lib._sanitize_arrays
File "pyarrow/table.pxi", line 1528, in pyarrow.lib._schema_from_arrays
File "pyarrow/array.pxi", line 355, in pyarrow.lib.array
File "pyarrow/array.pxi", line 42, in pyarrow.lib._sequence_to_array
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 88, in pyarrow.lib.check_status
OverflowError: Python int too large to convert to C long

11:09:57 INFO - Completed 2 files (100.0%) in 0.021 min
11:09:57 INFO - Done processing 2 files, waiting for flush() completion.
11:09:57 INFO - done flushing in 0.0 sec
Traceback (most recent call last):
File "/opt/anaconda3/envs/data-prep-kit/lib/python3.11/site-packages/data_processing/runtime/pure_python/transform_orchestrator.py", line 131, in orchestrate
stats["processing_time"] = round(stats["processing_time"], 3)
~~~~~^^^^^^^^^^^^^^^^^^^
KeyError: 'processing_time'
11:09:57 ERROR - Exception during execution 'processing_time': None
11:09:57 INFO - Completed execution in 0.085 min, execution result 1


Exception Traceback (most recent call last)
File :40

Exception: ❌ Job failed

Reproduction script

Run https://github.com/IBM/data-prep-kit/blob/dev/examples/notebooks/intro/dpk_intro_1_python.ipynb after pip installing release 0.2.2.dev2.

Anything else

No response

OS

MacOS (limited support)

Python

3.11.x

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
shahrokhDaijavad added the "bug" (Something isn't working) label on Nov 4, 2024
@santoshborse (Collaborator)

I also experienced this issue with 0.2.2.dev2. It works fine with 0.2.2.dev1, but model loading seemed slow (it took around 3 minutes).

@sujee (Contributor) commented Nov 5, 2024

Confirming with both py311 and py312 envs.

After installing 0.2.2.dev2 I had the following packages:

data_prep_toolkit            0.2.2.dev2
data_prep_toolkit_transforms 0.2.2.dev2

docling                      2.3.1
docling-core                 2.3.0
docling-ibm-models           2.0.3
docling-parse                2.0.2

@dolfim-ibm (Member)

Confirmed, I can reproduce it locally, using a fresh venv with the following packages:

python3.11 -m venv venv
source venv/bin/activate

pip install \
    'data-prep-toolkit[ray]==0.2.2.dev2'  \
    'data-prep-toolkit-transforms[ray,pdf2parquet,doc_id,doc_chunk,ededup,text_encoder]==0.2.2.dev2'

pip install jupyterlab   ipykernel  ipywidgets

@dolfim-ibm (Member)

The issue seems to be related to pyarrow, which is not able to cast a large uint64 integer to the right type: with no explicit schema, pyarrow infers int64 for Python ints, and hash values above 2**63 - 1 overflow it. The new transform uses an efficient uint64 hash of the binary content, which triggers this issue.

This is a minimal example that fails with the same OverflowError:

import pyarrow as pa
pa.Table.from_pylist([{"binary_hash": 17915699055171962696}])
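
For reference, a minimal sketch of one possible workaround (this may or may not be what #793 actually does): passing an explicit schema so the column is built as uint64 rather than the inferred int64 avoids the overflow.

import pyarrow as pa

# Assumes the column name "binary_hash" from the example above; with an explicit
# uint64 schema the value (< 2**64) fits and no int64 inference takes place.
schema = pa.schema([("binary_hash", pa.uint64())])
table = pa.Table.from_pylist([{"binary_hash": 17915699055171962696}], schema=schema)
print(table)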

@dolfim-ibm (Member)

PR with fix: #793
