```yaml
# Process config example for dataset

# global parameters
project_name: 'ray-demo'
executor_type: 'ray'
dataset_path: './demos/process_video_on_ray/data/demo-dataset.jsonl'  # path to your dataset directory or file
ray_address: '<head_node_ip>:<port>'  # default: 'auto'; change to your Ray cluster address, e.g., ray://<hostname>:<port>
export_path: './outputs/demo/demo-processed-ray-videos'

# process schedule: a list of process operators with their arguments
process:
  # Filter ops
  - video_duration_filter:
      min_duration: 20
      max_duration: 100
  - video_resolution_filter:              # filter samples according to the resolution of their videos
      min_width: 200                      # min horizontal resolution of the filter range (unit: p)
      max_width: 4096                     # max horizontal resolution of the filter range (unit: p)
      min_height: 200                     # min vertical resolution of the filter range (unit: p)
      max_height: 4096                    # max vertical resolution of the filter range (unit: p)
      any_or_all: any                     # keep a sample if any (vs. all) of its videos pass the filter
  # Mapper ops
  - video_split_by_duration_mapper:       # mapper to split videos by duration
      split_duration: 10                  # duration of each video split, in seconds
      min_last_split_duration: 0          # minimum allowable duration (seconds) of the last split; shorter last splits are discarded
      keep_original_sample: true
  - video_resize_aspect_ratio_mapper:
      min_ratio: 1
      max_ratio: 1.1
      strategy: increase
  - video_split_by_key_frame_mapper:      # mapper to split videos by key frame
      keep_original_sample: true          # whether to keep the original sample; if false, only the cut samples remain in the final dataset (default: true)
  # Deduplicator ops
  - ray_video_deduplicator:               # simple multi-node video deduplicator using exact MD5-hash matching
      redis_host: '<head node ip>'        # host of the Redis instance
      redis_port: <port-1>                # port of the Redis instance; Redis defaults to 6379, which is also Ray's default port, so configure Redis to listen on another port
```
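As the last comment notes, Redis's default port 6379 collides with Ray's default port on the head node, so the dedup Redis instance must listen elsewhere. A minimal connectivity check before launching the pipeline, as a sketch assuming the redis-py package and a hypothetical port 6380 (substitute whatever port you actually configured):

```python
# Sketch: verify the dedup Redis instance is reachable before running the
# pipeline. redis-py and the port number 6380 are assumptions for illustration.
import redis

client = redis.Redis(host="<head node ip>", port=6380)  # not 6379, which Ray uses
print(client.ping())  # prints True if the instance answers
```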
Before Asking
I have read the README carefully.
I have pulled the latest code of the main branch and run it again; the problem still exists.
Search before asking
Question
Configuration:
Installation Method: from source
Data-Juicer Version: v1.0.2
Python Version: 3.9.15
Ray Version: 2.31.0
Ray Cluster Info:
Both machines have cloned the data-juicer repository and installed it (pip install -v -e .[dist]).
The error occurs when the Ray task is scheduled onto a non-head node; no error occurs when the task runs on the head node.
Command:
python tools/process_data.py --config ./demos/process_video_on_ray/configs/demo.yaml
demo.yaml (see the config at the top of this issue)
Error log:
2024-12-26 10:31:55 | WARNING | data_juicer.utils.resource_utils:28 - Command nvidia-smi is not found. There might be no GPUs on this machine.
2024-12-26 10:31:55 | INFO | data_juicer.core.ray_executor:42 - Initing Ray ...
2024-12-26 10:31:55,767 INFO worker.py:1586 -- Connecting to existing Ray cluster at address: 192.168.201.69:6379...
2024-12-26 10:31:55,772 INFO worker.py:1762 -- Connected to Ray cluster. View the dashboard at 192.168.201.69:8265
2024-12-26 10:31:55 | INFO | data_juicer.core.ray_executor:53 - Loading dataset with Ray...
2024-12-26 10:31:56,945 INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2024-12-25_15-41-56_306491_3210611/logs/ray-data
2024-12-26 10:31:56,945 INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadJSON]
2024-12-26 10:31:57 | INFO | data_juicer.core.ray_executor:69 - Preparing process operators...
2024-12-26 10:31:57 | INFO | data_juicer.core.ray_executor:83 - Processing data...
2024-12-26 10:31:57 | INFO | data_juicer.core.ray_executor:87 - All Ops are done in 0.005s.
2024-12-26 10:31:57 | INFO | data_juicer.core.ray_executor:90 - Exporting dataset to disk...
2024-12-26 10:31:57,897 INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2024-12-25_15-41-56_306491_3210611/logs/ray-data
2024-12-26 10:31:57,897 INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadJSON] -> TaskPoolMapOperator[MapBatches(partial)->MapBatches(process_batch_arrow)->MapBatches(compute_stats_single)->Filter(process_single)->MapBatches(compute_stats_single)->Filter(process_single)->MapBatches(process_batched)->MapBatches(process_single)->MapBatches(process_batched)->MapBatches(compute_stats_single)->Filter(process_single)->Write]
Running 0:   0%|          | 0/1 [00:00<?, ?it/s]
2024-12-26 10:32:36,127 ERROR streaming_executor_state.py:455 -- An exception was raised from a task of operator "MapBatches(partial)->MapBatches(process_batch_arrow)->MapBatches(compute_stats_single)->Filter(process_single)->MapBatches(compute_stats_single)->Filter(process_single)->MapBatches(process_batched)->MapBatches(process_single)->MapBatches(process_batched)->MapBatches(compute_stats_single)->Filter(process_single)->Write". Dataset execution will now abort. To ignore this exception and continue, set DataContext.max_errored_blocks.
2024-12-26 10:32:36,131 ERROR exceptions.py:73 -- Exception occurred in Ray Data or Ray Core internal code. If you continue to see this error, please open an issue on the Ray project GitHub page with the full stack trace below: https://github.com/ray-project/ray/issues/new/choose
2024-12-26 10:32:36,137 INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2024-12-25_15-41-56_306491_3210611/logs/ray-data
2024-12-26 10:32:36,137 INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadJSON]
2024-12-26 10:32:36,157 INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2024-12-25_15-41-56_306491_3210611/logs/ray-data
2024-12-26 10:32:36,157 INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadJSON]
2024-12-26 10:32:36,176 INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2024-12-25_15-41-56_306491_3210611/logs/ray-data
2024-12-26 10:32:36,176 INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadJSON]
2024-12-26 10:32:36,195 INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2024-12-25_15-41-56_306491_3210611/logs/ray-data
2024-12-26 10:32:36,195 INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadJSON]
--- Logging error in Loguru Handler #2 ---
Record was: {'elapsed': datetime.timedelta(seconds=50, microseconds=957976), 'exception': (type=<class 'ray.exceptions.RayTaskError(FileNotFoundError)'>, value=RayTaskError(FileNotFoundError)(FileNotFoundError(2, "Failed to open local file '/root/data-juicer/outputs/demo/demo-processed-ray-videos/560_000000_000000.json'. Detail: [errno 2] No such file or directory")), traceback=<traceback object at 0x7f2e4ca9ef40>), 'extra': {}, 'file': (name='process_data.py', path='/root/data-juicer/tools/process_data.py'), 'function': '<module>', 'level': (name='ERROR', no=40, icon='❌'), 'line': 19, 'message': "An error has been caught in function '<module>', process 'MainProcess' (3261981), thread 'MainThread' (139837157271360):", 'module': 'process_data', 'name': '__main__', 'process': (id=3261981, name='MainProcess'), 'thread': (id=139837157271360, name='MainThread'), 'time': datetime(2024, 12, 26, 10, 32, 36, 133482, tzinfo=datetime.timezone(datetime.timedelta(seconds=28800), 'CST'))}
ray.data.exceptions.SystemException
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/root/data-juicer/djenv/lib/python3.9/site-packages/loguru/_logger.py", line 1297, in catch_wrapper
return function(*args, **kwargs)
File "/root/data-juicer/tools/process_data.py", line 15, in main
executor.run()
File "/root/data-juicer/data_juicer/core/ray_executor.py", line 91, in run
dataset.data.write_json(self.cfg.export_path, force_ascii=False)
File "/root/data-juicer/djenv/lib/python3.9/site-packages/ray/data/dataset.py", line 2828, in write_json
self.write_datasink(
File "/root/data-juicer/djenv/lib/python3.9/site-packages/ray/data/dataset.py", line 3544, in write_datasink
self._write_ds = Dataset(plan, logical_plan).materialize()
File "/root/data-juicer/djenv/lib/python3.9/site-packages/ray/data/dataset.py", line 4502, in materialize
copy._plan.execute()
File "/root/data-juicer/djenv/lib/python3.9/site-packages/ray/data/exceptions.py", line 86, in handle_trace
raise e.with_traceback(None) from SystemException()
ray.exceptions.RayTaskError(FileNotFoundError): ray::MapBatches(partial)->MapBatches(process_batch_arrow)->MapBatches(compute_stats_single)->Filter(process_single)->MapBatches(compute_stats_single)->Filter(process_single)->MapBatches(process_batched)->MapBatches(process_single)->MapBatches(process_batched)->MapBatches(compute_stats_single)->Filter(process_single)->Write() (pid=1620848, ip=192.168.201.68)
for b_out in map_transformer.apply_transform(iter(blocks), ctx):
File "/root/djenv/lib/python3.9/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 253, in call
yield from self._block_fn(input, ctx)
File "/root/data-juicer/djenv/lib/python3.9/site-packages/ray/data/_internal/planner/plan_write_op.py", line 26, in fn
File "/root/djenv/lib/python3.9/site-packages/ray/data/datasource/file_datasink.py", line 128, in write
self.write_block(block_accessor, 0, ctx)
File "/root/djenv/lib/python3.9/site-packages/ray/data/datasource/file_datasink.py", line 254, in write_block
call_with_retry(
File "/root/djenv/lib/python3.9/site-packages/ray/data/_internal/util.py", line 986, in call_with_retry
raise e from None
File "/root/djenv/lib/python3.9/site-packages/ray/data/_internal/util.py", line 973, in call_with_retry
return f()
File "/root/djenv/lib/python3.9/site-packages/ray/data/datasource/file_datasink.py", line 250, in write_block_to_path
with self.open_output_stream(write_path) as file:
File "/root/djenv/lib/python3.9/site-packages/ray/data/datasource/file_datasink.py", line 79, in open_output_stream
return self.filesystem.open_output_stream(path, **self.open_stream_args)
File "pyarrow/_fs.pyx", line 887, in pyarrow._fs.FileSystem.open_output_stream
File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
FileNotFoundError: [Errno 2] Failed to open local file '/root/data-juicer/outputs/demo/demo-processed-ray-videos/560_000000_000000.json'. Detail: [errno 2] No such file or directory
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/data-juicer/djenv/lib/python3.9/site-packages/loguru/_handler.py", line 204, in emit
self._queue.put(str_record)
File "/usr/local/lib/python3.9/multiprocessing/queues.py", line 371, in put
obj = _ForkingPickler.dumps(obj)
File "/usr/local/lib/python3.9/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
_pickle.PicklingError: Can't pickle <class 'ray.exceptions.RayTaskError(FileNotFoundError)'>: attribute lookup RayTaskError(FileNotFoundError) on ray.exceptions failed
--- End of logging error ---
2024-12-26 10:32:36,214 INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2024-12-25_15-41-56_306491_3210611/logs/ray-data
2024-12-26 10:32:36,214 INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadJSON]
2024-12-26 10:32:36,233 INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2024-12-25_15-41-56_306491_3210611/logs/ray-data
2024-12-26 10:32:36,233 INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadJSON]
2024-12-26 10:32:36,253 INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2024-12-25_15-41-56_306491_3210611/logs/ray-data
2024-12-26 10:32:36,253 INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadJSON]
2024-12-26 10:32:36,272 INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2024-12-25_15-41-56_306491_3210611/logs/ray-data
2024-12-26 10:32:36,272 INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadJSON]
ray.data.exceptions.SystemException
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/root/data-juicer/tools/process_data.py", line 19, in
main()
File "/root/data-juicer/djenv/lib/python3.9/site-packages/loguru/_logger.py", line 1297, in catch_wrapper
return function(*args, **kwargs)
File "/root/data-juicer/tools/process_data.py", line 15, in main
executor.run()
File "/root/data-juicer/data_juicer/core/ray_executor.py", line 91, in run
dataset.data.write_json(self.cfg.export_path, force_ascii=False)
File "/root/data-juicer/djenv/lib/python3.9/site-packages/ray/data/dataset.py", line 2828, in write_json
self.write_datasink(
File "/root/data-juicer/djenv/lib/python3.9/site-packages/ray/data/dataset.py", line 3544, in write_datasink
self._write_ds = Dataset(plan, logical_plan).materialize()
File "/root/data-juicer/djenv/lib/python3.9/site-packages/ray/data/dataset.py", line 4502, in materialize
copy._plan.execute()
File "/root/data-juicer/djenv/lib/python3.9/site-packages/ray/data/exceptions.py", line 86, in handle_trace
raise e.with_traceback(None) from SystemException()
ray.exceptions.RayTaskError(FileNotFoundError): ray::MapBatches(partial)->MapBatches(process_batch_arrow)->MapBatches(compute_stats_single)->Filter(process_single)->MapBatches(compute_stats_single)->Filter(process_single)->MapBatches(process_batched)->MapBatches(process_single)->MapBatches(process_batched)->MapBatches(compute_stats_single)->Filter(process_single)->Write() (pid=1620848, ip=192.168.201.68)
for b_out in map_transformer.apply_transform(iter(blocks), ctx):
File "/root/djenv/lib/python3.9/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 253, in call
yield from self._block_fn(input, ctx)
File "/root/data-juicer/djenv/lib/python3.9/site-packages/ray/data/_internal/planner/plan_write_op.py", line 26, in fn
File "/root/djenv/lib/python3.9/site-packages/ray/data/datasource/file_datasink.py", line 128, in write
self.write_block(block_accessor, 0, ctx)
File "/root/djenv/lib/python3.9/site-packages/ray/data/datasource/file_datasink.py", line 254, in write_block
call_with_retry(
File "/root/djenv/lib/python3.9/site-packages/ray/data/_internal/util.py", line 986, in call_with_retry
raise e from None
File "/root/djenv/lib/python3.9/site-packages/ray/data/_internal/util.py", line 973, in call_with_retry
return f()
File "/root/djenv/lib/python3.9/site-packages/ray/data/datasource/file_datasink.py", line 250, in write_block_to_path
with self.open_output_stream(write_path) as file:
File "/root/djenv/lib/python3.9/site-packages/ray/data/datasource/file_datasink.py", line 79, in open_output_stream
return self.filesystem.open_output_stream(path, **self.open_stream_args)
File "pyarrow/_fs.pyx", line 887, in pyarrow._fs.FileSystem.open_output_stream
File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
FileNotFoundError: [Errno 2] Failed to open local file '/root/data-juicer/outputs/demo/demo-processed-ray-videos/560_000000_000000.json'. Detail: [errno 2] No such file or directory
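For context, the FileNotFoundError above is raised by pyarrow's local filesystem on the worker node (ip=192.168.201.68), which suggests the export directory does not exist on that node: Ray Data's write_json opens the output path on whichever node runs the Write task, so '/root/data-juicer/outputs/demo/demo-processed-ray-videos' would need to exist (or be a shared mount such as NFS) on every node, not just the head. A rough diagnostic sketch under that assumption, spreading tasks so at least one lands on each node (the task count and path are illustrative):

```python
# Sketch: check (and create) the export directory across the cluster's nodes.
import os
import socket

import ray

ray.init(address="auto")

@ray.remote(num_cpus=0.1)
def ensure_export_dir(path):
    os.makedirs(path, exist_ok=True)  # create the directory locally if missing
    return socket.gethostname(), os.path.abspath(path)

export_path = "/root/data-juicer/outputs/demo/demo-processed-ray-videos"
# spread several small tasks so at least one runs on each node
refs = [
    ensure_export_dir.options(scheduling_strategy="SPREAD").remote(export_path)
    for _ in range(8)
]
for host, path in set(ray.get(refs)):
    print(f"{host}: {path} ready")
```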
Additional information
Previously, when data-juicer was installed on the Ray head node but not on the non-head nodes, running a distributed Ray job failed on the non-head nodes with: No module named data-juicer.
So for distributed jobs, does data-juicer need to be installed on every node?