Problem with converting pdf files in the intro example when using release 0.2.2.dev2 #767
Comments
I also experienced this issue with |
confirming with both py311 and py312 envs. After installing 0.2.2.dev2 I had the following package
|
Confirmed, I can reproduce it locally using a fresh venv with the following packages:
python3.11 -m venv venv
source venv/bin/activate
pip install \
'data-prep-toolkit[ray]==0.2.2.dev2' \
'data-prep-toolkit-transforms[ray,pdf2parquet,doc_id,doc_chunk,ededup,text_encoder]==0.2.2.dev2'
pip install jupyterlab ipykernel ipywidgets
The issue seems to be related to pyarrow, which infers a signed 64-bit integer type for Python ints and so cannot convert a value that only fits in uint64. This is a minimal example:
import pyarrow as pa
pa.Table.from_pylist([{"binary_hash": 17915699055171962696}])
PR with fix: #793
Search before asking
Component
Other
What happened + What you expected to happen
In the notebook example at https://github.com/IBM/data-prep-kit/blob/dev/examples/notebooks/intro/dpk_intro_1_python.ipynb, we convert two PDF files, earth.pdf and mars.pdf, from the input/solar-system directory. Converting these two files to parquet worked without problems with the older version of the Docling library shipped in the 0.2.2.dev1 release, but with the latest Docling in 0.2.2.dev2 we get the following reproducible error:
11:09:57 ERROR - Fatal error with file file_name='/Users/shahrokhdaijavad/Documents/GitHub/data-prep-kit-testing/examples/notebooks/intro/input/solar-system/earth.pdf'. No results produced.
11:09:57 WARNING - Exception processing file /Users/shahrokhdaijavad/Documents/GitHub/data-prep-kit-testing/examples/notebooks/intro/input/solar-system/earth.pdf: Traceback (most recent call last):
File "/opt/anaconda3/envs/data-prep-kit/lib/python3.11/site-packages/data_processing/runtime/transform_file_processor.py", line 79, in process_file
out_files, stats = self.transform.transform_binary(file_name=f_name, byte_array=filedata)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/data-prep-kit/lib/python3.11/site-packages/pdf2parquet_transform.py", line 371, in transform_binary
table = pa.Table.from_pylist(data)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow/table.pxi", line 1984, in pyarrow.lib._Tabular.from_pylist
File "pyarrow/table.pxi", line 6044, in pyarrow.lib._from_pylist
File "pyarrow/table.pxi", line 4625, in pyarrow.lib.Table.from_arrays
File "pyarrow/table.pxi", line 1547, in pyarrow.lib._sanitize_arrays
File "pyarrow/table.pxi", line 1528, in pyarrow.lib._schema_from_arrays
File "pyarrow/array.pxi", line 355, in pyarrow.lib.array
File "pyarrow/array.pxi", line 42, in pyarrow.lib._sequence_to_array
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 88, in pyarrow.lib.check_status
OverflowError: Python int too large to convert to C long
11:09:57 INFO - Completed 1 files (50.0%) in 0.013 min
11:09:57 ERROR - Fatal error with file file_name='/Users/shahrokhdaijavad/Documents/GitHub/data-prep-kit-testing/examples/notebooks/intro/input/solar-system/mars.pdf'. No results produced.
11:09:57 WARNING - Exception processing file /Users/shahrokhdaijavad/Documents/GitHub/data-prep-kit-testing/examples/notebooks/intro/input/solar-system/mars.pdf: Traceback (most recent call last):
File "/opt/anaconda3/envs/data-prep-kit/lib/python3.11/site-packages/data_processing/runtime/transform_file_processor.py", line 79, in process_file
out_files, stats = self.transform.transform_binary(file_name=f_name, byte_array=filedata)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/data-prep-kit/lib/python3.11/site-packages/pdf2parquet_transform.py", line 371, in transform_binary
table = pa.Table.from_pylist(data)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow/table.pxi", line 1984, in pyarrow.lib._Tabular.from_pylist
File "pyarrow/table.pxi", line 6044, in pyarrow.lib._from_pylist
File "pyarrow/table.pxi", line 4625, in pyarrow.lib.Table.from_arrays
File "pyarrow/table.pxi", line 1547, in pyarrow.lib._sanitize_arrays
File "pyarrow/table.pxi", line 1528, in pyarrow.lib._schema_from_arrays
File "pyarrow/array.pxi", line 355, in pyarrow.lib.array
File "pyarrow/array.pxi", line 42, in pyarrow.lib._sequence_to_array
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 88, in pyarrow.lib.check_status
OverflowError: Python int too large to convert to C long
11:09:57 INFO - Completed 2 files (100.0%) in 0.021 min
11:09:57 INFO - Done processing 2 files, waiting for flush() completion.
11:09:57 INFO - done flushing in 0.0 sec
Traceback (most recent call last):
File "/opt/anaconda3/envs/data-prep-kit/lib/python3.11/site-packages/data_processing/runtime/pure_python/transform_orchestrator.py", line 131, in orchestrate
stats["processing_time"] = round(stats["processing_time"], 3)
~~~~~^^^^^^^^^^^^^^^^^^^
KeyError: 'processing_time'
11:09:57 ERROR - Exception during execution 'processing_time': None
11:09:57 INFO - Completed execution in 0.085 min, execution result 1
Exception Traceback (most recent call last)
File :40
Exception: ❌ Job failed
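The final `KeyError: 'processing_time'` in the log above looks like a secondary failure: both files failed, so the stats dictionary apparently never received a `processing_time` entry before the orchestrator tried to round it. A minimal sketch of a more defensive rounding step follows. The helper name `finalize_stats` and the default of `0.0` are hypothetical and are not taken from the data-prep-kit code base.

```python
from typing import Any


def finalize_stats(stats: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of stats with processing_time rounded to 3 decimals.

    Hypothetical helper: falls back to 0.0 when no file produced any stats,
    instead of raising KeyError as in the traceback above.
    """
    result = dict(stats)
    result["processing_time"] = round(result.get("processing_time", 0.0), 3)
    return result


# No file was processed successfully: the key is missing.
print(finalize_stats({}))

# Normal case: the key is present and gets rounded.
print(finalize_stats({"processing_time": 1.23456, "source_files": 2}))
```

Guarding the lookup this way would keep the original OverflowError as the only reported failure, which makes the log easier to read.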
Reproduction script
Run https://github.com/IBM/data-prep-kit/blob/dev/examples/notebooks/intro/dpk_intro_1_python.ipynb after pip-installing release 0.2.2.dev2.
Anything else
No response
OS
MacOS (limited support)
Python
3.11.x
Are you willing to submit a PR?