Before Reporting
I have pulled the latest code from the main branch and rerun it, and the bug still exists.
I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend asking a question using the Question template.)
Search before reporting
I have searched the Data-Juicer issues and found no similar bugs.
OS
Ubuntu
Installation Method
pip
Data-Juicer Version
latest v0.1.3
Python Version
3.10.16
Describe the bug
When I run ray_bts_minhash_deduplicator on a cluster, the machine reports this failure:
2025-01-14 19:03:07 | ERROR | data_juicer.core.ray_data:198 - An error occurred during Op [ray_bts_minhash_deduplicator].
ray.data.exceptions.SystemException
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/data/data2/datajuicer/data-juicer-main-1.0.3/data_juicer/core/ray_data.py", line 192, in _run_single_op
    self.data = op.run(self.data)
  File "/data/data2/datajuicer/data-juicer-main-1.0.3/data_juicer/ops/deduplicator/ray_bts_minhash_deduplicator.py", line 555, in run
    ).write_parquet(tmp_dir)
  File "/root/anaconda3/envs/juicer/lib/python3.10/site-packages/ray/data/dataset.py", line 2720, in write_parquet
    self.write_datasink(
  File "/root/anaconda3/envs/juicer/lib/python3.10/site-packages/ray/data/dataset.py", line 3544, in write_datasink
    self._write_ds = Dataset(plan, logical_plan).materialize()
  File "/root/anaconda3/envs/juicer/lib/python3.10/site-packages/ray/data/dataset.py", line 4502, in materialize
    copy._plan.execute()
  File "/root/anaconda3/envs/juicer/lib/python3.10/site-packages/ray/data/exceptions.py", line 86, in handle_trace
    raise e.with_traceback(None) from SystemException()
ray.exceptions.RayTaskError(FileNotFoundError): ray::MapBatches(minhash_with_uid)->Write() (pid=3889009, ip=51.38.76.183)
    for b_out in map_transformer.apply_transform(iter(blocks), ctx):
  File "/root/anaconda3/envs/juicer/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 253, in call
    yield from self._block_fn(input, ctx)
  File "/root/anaconda3/envs/juicer/lib/python3.10/site-packages/ray/data/_internal/planner/plan_write_op.py", line 26, in fn
    write_result = datasink_or_legacy_datasource.write(blocks, ctx)
  File "/root/anaconda3/envs/juicer/lib/python3.10/site-packages/ray/data/datasource/parquet_datasink.py", line 78, in write
    call_with_retry(
  File "/root/anaconda3/envs/juicer/lib/python3.10/site-packages/ray/data/_internal/util.py", line 986, in call_with_retry
    raise e from None
  File "/root/anaconda3/envs/juicer/lib/python3.10/site-packages/ray/data/_internal/util.py", line 973, in call_with_retry
    return f()
  File "/root/anaconda3/envs/juicer/lib/python3.10/site-packages/ray/data/datasource/parquet_datasink.py", line 70, in write_blocks_to_path
    with self.open_output_stream(write_path) as file:
  File "/root/anaconda3/envs/juicer/lib/python3.10/site-packages/ray/data/datasource/file_datasink.py", line 79, in open_output_stream
    return self.filesystem.open_output_stream(path, **self.open_stream_args)
  File "pyarrow/_fs.pyx", line 887, in pyarrow._fs.FileSystem.open_output_stream
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
FileNotFoundError: [Errno 2] Failed to open local file '/data/data2/datajuicer/data-juicer-main-1.0.3/outputs/demo-dedup/.tmp/01000000/1_000001_000000.parquet'. Detail: [errno 2] No such file or directory
2025-01-14 19:03:07 | INFO | data_juicer.core.ray_executor:30 - Removing tmp dir /data/data2/datajuicer/data-juicer-main-1.0.3/outputs/demo-dedup/.tmp/01000000 ...
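The log above does not establish the root cause. One possibility, stated here only as an assumption, is that the `.tmp/01000000` directory exists on the node that created it but not on the worker that ran the `Write()` task, which could happen if the output path is not on storage shared by all nodes. The sketch below uses only the Python standard library to check this on a given machine and to show the same class of error. `check_dir_on_this_node` and `write_marker` are hypothetical helper names, not Data-Juicer or Ray APIs.

```python
import os
import tempfile


def check_dir_on_this_node(path: str) -> dict:
    """Report whether `path` exists and is writable on the machine running this code."""
    exists = os.path.isdir(path)
    return {
        "path": path,
        "exists": exists,
        "writable": exists and os.access(path, os.W_OK),
    }


def write_marker(dir_path: str, name: str = "probe.bin") -> str:
    """Create `dir_path` if it is missing, then write a small file into it."""
    os.makedirs(dir_path, exist_ok=True)  # no-op when the directory already exists
    target = os.path.join(dir_path, name)
    with open(target, "wb") as f:
        f.write(b"ok")
    return target


if __name__ == "__main__":
    # Stand-in for the tmp dir from the log; a throwaway location is used here.
    base = tempfile.mkdtemp()
    missing = os.path.join(base, ".tmp", "01000000")

    print(check_dir_on_this_node(missing))

    # Opening a file under a directory that does not exist fails with errno 2.
    try:
        open(os.path.join(missing, "1_000001_000000.parquet"), "wb")
    except FileNotFoundError as exc:
        print("FileNotFoundError, errno:", exc.errno)

    # Creating the directory first lets the same write succeed.
    print(write_marker(missing))
```

Running `check_dir_on_this_node` with the real tmp path on each machine in the cluster, while the job is running, would show whether the directory is visible everywhere. If it is missing on some workers, pointing the export path at storage mounted on all nodes is one thing to try.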
Configs
# Process config example for dataset

# global parameters
project_name: 'demo-dedup'
dataset_path: './demos/process_on_ray/data/'
export_path: './outputs/demo-dedup/demo-ray-bts-dedup-processed'

executor_type: 'ray'
ray_address: 'auto'

# process schedule
# a list of several process operators with their arguments
process:
  - ray_bts_minhash_deduplicator:
      tokenization: 'character'
Logs
No response
Screenshots
No response
Additional
No response
To Reproduce
python tools/process_data.py --config demos/process_on_ray/configs/dedup.yaml