[Bug] Error while running doc_chunk transform #794

Closed
touma-I opened this issue Nov 11, 2024 · 2 comments

Labels: bug, merged, simplify-DPK

Comments

touma-I (Collaborator) commented Nov 11, 2024

Search before asking

  • I searched the issues and found no similar issues.

Component

Other

What happened + What you expected to happen

Error while using doc_chunk transform:

17:03:33 INFO - doc_chunk parameters are : {'chunking_type': <chunking_types.DL_JSON: 'dl_json'>, 'content_column_name': 'contents', 'doc_id_column_name': 'document_id', 'output_chunk_column_name': 'contents', 'output_source_doc_id_column_name': 'source_document_id', 'output_jsonpath_column_name': 'doc_jsonpath', 'output_pageno_column_name': 'page_number', 'output_bbox_column_name': 'bbox', 'chunk_size_tokens': 128, 'chunk_overlap_tokens': 30, 'dl_min_chunk_len': None}
17:03:33 INFO - pipeline id pipeline_id
17:03:33 INFO - code location None
17:03:33 INFO - number of workers 2 worker options {'num_cpus': 0.8, 'max_restarts': -1}
17:03:33 INFO - actor creation delay 0
17:03:33 INFO - job details {'job category': 'preprocessing', 'job name': 'doc_chunk', 'job type': 'ray', 'job id': 'job_id'}
17:03:33 INFO - data factory data_ is using local data access: input_folder - output/01_parquet_out output_folder - output/02_chunk_out
17:03:33 INFO - data factory data_ max_files -1, n_sample -1
17:03:33 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
17:03:33 INFO - Running locally
2024-11-11 17:03:34,248	INFO worker.py:1777 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265 
(orchestrate pid=85221) 17:03:36 INFO - orchestrator started at 2024-11-11 17:03:36
(orchestrate pid=85221) 17:03:36 INFO - Number of files is 2, source profile {'max_file_size': 0.006812095642089844, 'min_file_size': 0.006754875183105469, 'total_file_size': 0.013566970825195312}
(orchestrate pid=85221) 17:03:36 INFO - Cluster resources: {'cpus': 12, 'gpus': 0, 'memory': 15.994012451730669, 'object_store': 2.0}
(orchestrate pid=85221) 17:03:36 INFO - Number of workers - 2 with {'num_cpus': 0.8, 'max_restarts': -1} each
(orchestrate pid=85221) 17:03:39 INFO - Completed 0 files (0.0%)  in 0.0 min. Waiting for completion
(orchestrate pid=85221) 17:03:39 INFO - Completed processing 2 files in 0.0 min
(orchestrate pid=85221) 17:03:39 INFO - done flushing in 0.001 sec
(RayTransformFileProcessor pid=85226) 17:03:39 WARNING - Exception processing file /Users/touma/data-prep-kit-0.2.2.dev2/examples/notebooks/intro/output/01_parquet_out/mars.parquet: Traceback (most recent call last):
(RayTransformFileProcessor pid=85226)   File "/Users/touma/data-prep-kit-0.2.2.dev2/examples/notebooks/rag/venv/lib/python3.11/site-packages/data_processing/runtime/transform_file_processor.py", line 79, in process_file
(RayTransformFileProcessor pid=85226)     out_files, stats = self.transform.transform_binary(file_name=f_name, byte_array=filedata)
(RayTransformFileProcessor pid=85226)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayTransformFileProcessor pid=85226)   File "/Users/touma/data-prep-kit-0.2.2.dev2/examples/notebooks/rag/venv/lib/python3.11/site-packages/data_processing/transform/table_transform.py", line 59, in transform_binary
(RayTransformFileProcessor pid=85226)     out_tables, stats = self.transform(table=table, file_name=file_name)
(RayTransformFileProcessor pid=85226)                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayTransformFileProcessor pid=85226)   File "/Users/touma/data-prep-kit-0.2.2.dev2/examples/notebooks/rag/venv/lib/python3.11/site-packages/doc_chunk_transform.py", line 154, in transform
(RayTransformFileProcessor pid=85226)     table = pa.Table.from_pylist(data)
(RayTransformFileProcessor pid=85226)             ^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayTransformFileProcessor pid=85226)   File "pyarrow/table.pxi", line 1984, in pyarrow.lib._Tabular.from_pylist
(RayTransformFileProcessor pid=85226)   File "pyarrow/table.pxi", line 6044, in pyarrow.lib._from_pylist
(RayTransformFileProcessor pid=85226)   File "pyarrow/table.pxi", line 4625, in pyarrow.lib.Table.from_arrays
(RayTransformFileProcessor pid=85226)   File "pyarrow/table.pxi", line 1547, in pyarrow.lib._sanitize_arrays
(RayTransformFileProcessor pid=85226)   File "pyarrow/table.pxi", line 1528, in pyarrow.lib._schema_from_arrays
(RayTransformFileProcessor pid=85226)   File "pyarrow/array.pxi", line 355, in pyarrow.lib.array
(RayTransformFileProcessor pid=85226)   File "pyarrow/array.pxi", line 42, in pyarrow.lib._sequence_to_array
(RayTransformFileProcessor pid=85226)   File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
(RayTransformFileProcessor pid=85226)   File "pyarrow/error.pxi", line 88, in pyarrow.lib.check_status
(RayTransformFileProcessor pid=85226) OverflowError: Python int too large to convert to C long
(RayTransformFileProcessor pid=85226) 
(RayTransformFileProcessor pid=85227) 
(raylet) [2024-11-11 17:03:44,270 E 85209 25533550] (raylet) file_system_monitor.cc:111: /tmp/ray/session_2024-11-11_17-03-33_299724_85071 is over 95% full, available space: 10144796672; capacity: 249999998976. Object creation will fail if spilling is required.
(RayTransformFileProcessor pid=85227) 17:03:39 WARNING - Exception processing file /Users/touma/data-prep-kit-0.2.2.dev2/examples/notebooks/intro/output/01_parquet_out/earth.parquet: Traceback (most recent call last):
(RayTransformFileProcessor pid=85227)   File "/Users/touma/data-prep-kit-0.2.2.dev2/examples/notebooks/rag/venv/lib/python3.11/site-packages/data_processing/runtime/transform_file_processor.py", line 79, in process_file
(RayTransformFileProcessor pid=85227)     out_files, stats = self.transform.transform_binary(file_name=f_name, byte_array=filedata)
(RayTransformFileProcessor pid=85227)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayTransformFileProcessor pid=85227)   File "/Users/touma/data-prep-kit-0.2.2.dev2/examples/notebooks/rag/venv/lib/python3.11/site-packages/data_processing/transform/table_transform.py", line 59, in transform_binary
(RayTransformFileProcessor pid=85227)     out_tables, stats = self.transform(table=table, file_name=file_name)
(RayTransformFileProcessor pid=85227)                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayTransformFileProcessor pid=85227)   File "/Users/touma/data-prep-kit-0.2.2.dev2/examples/notebooks/rag/venv/lib/python3.11/site-packages/doc_chunk_transform.py", line 154, in transform
(RayTransformFileProcessor pid=85227)     table = pa.Table.from_pylist(data)
(RayTransformFileProcessor pid=85227)             ^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayTransformFileProcessor pid=85227)   File "pyarrow/table.pxi", line 1984, in pyarrow.lib._Tabular.from_pylist
(RayTransformFileProcessor pid=85227)   File "pyarrow/table.pxi", line 6044, in pyarrow.lib._from_pylist
(RayTransformFileProcessor pid=85227)   File "pyarrow/table.pxi", line 4625, in pyarrow.lib.Table.from_arrays
(RayTransformFileProcessor pid=85227)   File "pyarrow/table.pxi", line 1547, in pyarrow.lib._sanitize_arrays
(RayTransformFileProcessor pid=85227)   File "pyarrow/table.pxi", line 1528, in pyarrow.lib._schema_from_arrays
(RayTransformFileProcessor pid=85227)   File "pyarrow/array.pxi", line 355, in pyarrow.lib.array
(RayTransformFileProcessor pid=85227)   File "pyarrow/array.pxi", line 42, in pyarrow.lib._sequence_to_array
(RayTransformFileProcessor pid=85227)   File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
(RayTransformFileProcessor pid=85227)   File "pyarrow/error.pxi", line 88, in pyarrow.lib.check_status
(RayTransformFileProcessor pid=85227) OverflowError: Python int too large to convert to C long
17:03:49 INFO - Completed execution in 0.265 min, execution result 0
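
For reference, this kind of OverflowError can apparently be reproduced directly with pyarrow when a row value falls outside the 64-bit integer range during type inference in pa.Table.from_pylist. The snippet below is a hypothetical minimal sketch; the column name and the oversized value are assumptions for illustration, not values taken from the failing parquet files:

import pyarrow as pa

# Hypothetical rows: pyarrow infers a 64-bit integer column for "document_id",
# and the second value does not fit, so building the column fails.
rows = [
    {"document_id": 1, "contents": "fits in int64"},
    {"document_id": 2**70, "contents": "does not fit"},  # assumed oversized value
]

pa.Table.from_pylist(rows)  # expected to raise an OverflowError like the one in the traceback above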

Reproduction script

Code fragments to reproduce:

import sys

from data_processing.utils import ParamsUtils
from data_processing_ray.runtime.ray import RayTransformLauncher
from doc_chunk_transform_ray import DocChunkRayTransformConfiguration

# input_folder, output_folder, MY_CONFIG, and STAGE are defined earlier in the notebook.

# Prepare the command-line params
local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}
worker_options = {"num_cpus": MY_CONFIG.RAY_NUM_CPUS}
params = {
    # where to run
    "run_locally": True,
    # Data access. Only required parameters are specified
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    # orchestrator
    "runtime_worker_options": ParamsUtils.convert_to_ast(worker_options),
    "runtime_num_workers": MY_CONFIG.RAY_RUNTIME_WORKERS,
    # doc_chunk arguments
    # ...
}

# Pass the command-line params
sys.argv = ParamsUtils.dict_to_req(d=params)

# Create the launcher
launcher = RayTransformLauncher(DocChunkRayTransformConfiguration())
# Launch the job
return_code = launcher.launch()

if return_code == 0:
    print(f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception("❌ Ray job failed")

Anything else

Input folder containing the data files that cause the error:

01_parquet_out.tar.gz

OS

MacOS (limited support)

Python

3.11.x

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
touma-I (Collaborator, Author) commented Nov 13, 2024

@sujee If you are interested in testing the latest dev release that has this fix, please pip install 0.2.2.dev2. cc: @shahrokhDaijavad

touma-I (Collaborator, Author) commented Nov 13, 2024

Ran the test successfully using the RAG notebook. This issue can now be closed. cc @shahrokhDaijavad @dolfim-ibm

touma-I closed this as completed Nov 13, 2024
touma-I added the merged label Nov 16, 2024