[Bug]: RAY error #107

Closed
3 tasks done
simplew2011 opened this issue Nov 30, 2023 · 6 comments · Fixed by #100
Labels: bug (Something isn't working)

@simplew2011

Before Reporting

  • I have pulled the latest code from the main branch and run it again, and the bug still exists.

  • I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend asking a question using the Question template.)

Search before reporting

  • I have searched the Data-Juicer issues and found no similar bugs.

OS

ubuntu

Installation Method

pip

Data-Juicer Version

v0.1.2

Python Version

3.8

Describe the bug

An error is raised when the language_id_score_filter operator is run with Ray.

To Reproduce

# ok
python tools/process_data.py --config configs/demo/process.yaml

# error
python tools/process_data.py --config configs/demo/process.yaml --executor_type ray

# ok after changing the op to - alphanumeric_filter:
python tools/process_data.py --config configs/demo/process.yaml --executor_type ray

Configs

# Process config example for dataset

# global parameters
project_name: 'demo-process'
dataset_path: 'demos/data/demo-dataset.jsonl'  # path to your dataset directory or file
np: 4  # number of subprocess to process your dataset
use_cache: false
export_path: './outputs/demo-process/demo-processed.jsonl'
save_stats_in_one_file: true
# process schedule
# a list of several process operators with their arguments
process:
  - language_id_score_filter:
      lang: 'zh'
  # - alphanumeric_filter:

Logs

(python3.8) wzp@vastai-NF5468M6:~/code/LLMData/open_source/data-juicer$ python tools/process_data.py --config configs/demo/process.yaml --executor_type ray
<class 'list'>
<class 'list'>
2023-11-30 15:32:52 | WARNING  | data_juicer.config.config:329 - Cache management of datasets is disabled.
2023-11-30 15:32:52 | WARNING  | data_juicer.config.config:340 - Set temp directory to store temp files to [None].
2023-11-30 15:32:52 | INFO     | data_juicer.config.config:442 - Back up the input config file [/home/wzp/code/LLMData/open_source/data-juicer/configs/demo/process.yaml] into the work_dir [./outputs/demo-process]
2023-11-30 15:32:52 | INFO     | data_juicer.config.config:463 - Configuration table: 
╒════════════════════════╤══════════════════════════════════════════════════════════════════════════════════════════╕
│ key                    │ values                                                                                   │
╞════════════════════════╪══════════════════════════════════════════════════════════════════════════════════════════╡
│ config                 │ [Path_fr(configs/demo/process.yaml, cwd=/home/wzp/code/LLMData/open_source/data-juicer)] │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ hpo_config             │ None                                                                                     │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ project_name           │ 'demo-process'                                                                           │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ executor_type          │ 'ray'                                                                                    │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ dataset_path           │ 'demos/data/demo-dataset.jsonl'                                                          │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ export_path            │ './outputs/demo-process/demo-processed.jsonl'                                            │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ export_shard_size      │ 0                                                                                        │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ export_in_parallel     │ False                                                                                    │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ np                     │ 4                                                                                        │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ text_keys              │ 'text'                                                                                   │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ image_key              │ 'images'                                                                                 │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ image_special_token    │ '<__dj__image>'                                                                          │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ eoc_special_token      │ '<|__dj__eoc|>'                                                                          │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ suffixes               │ []                                                                                       │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ use_cache              │ False                                                                                    │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ ds_cache_dir           │ PosixPath('/home/wzp/.cache/huggingface/datasets')                                       │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ cache_compress         │ None                                                                                     │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ use_checkpoint         │ False                                                                                    │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ temp_dir               │ None                                                                                     │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ open_tracer            │ False                                                                                    │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ op_list_to_trace       │ []                                                                                       │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ trace_num              │ 10                                                                                       │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ op_fusion              │ False                                                                                    │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ process                │ [{'language_id_score_filter': {'image_key': 'images',                                    │
│                        │                                'lang': 'zh',                                             │
│                        │                                'min_score': 0.8,                                         │
│                        │                                'text_key': 'text'}}]                                     │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ save_stats_in_one_file │ True                                                                                     │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ ray_address            │ 'auto'                                                                                   │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ work_dir               │ './outputs/demo-process'                                                                 │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ timestamp              │ '20231130153252'                                                                         │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ dataset_dir            │ '/home/wzp/code/LLMData/open_source/data-juicer/demos/data'                              │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ add_suffix             │ False                                                                                    │
╘════════════════════════╧══════════════════════════════════════════════════════════════════════════════════════════╛
2023-11-30 15:32:53 | INFO     | data_juicer.core.ray_executor:35 - Initing Ray ...
2023-11-30 15:32:53,326 INFO worker.py:1458 -- Connecting to existing Ray cluster at address: 10.23.4.252:6379...
2023-11-30 15:32:53,333 INFO worker.py:1642 -- Connected to Ray cluster.
2023-11-30 15:32:53 | INFO     | data_juicer.core.ray_executor:47 - Loading dataset with Ray...
2023-11-30 15:32:54,324 INFO read_api.py:406 -- To satisfy the requested parallelism of 192, each read task output is split into 192 smaller blocks.
2023-11-30 15:32:54 | INFO     | data_juicer.core.ray_executor:51 - Preparing process operators...
2023-11-30 15:32:54 | INFO     | data_juicer.utils.model_utils:87 - Loading fasttext language identification model...
2023-11-30 15:32:54 | INFO     | data_juicer.core.ray_executor:59 - columns ['text', 'meta']
2023-11-30 15:32:54 | INFO     | data_juicer.core.ray_executor:62 - Processing data...
2023-11-30 15:32:54,702 INFO streaming_executor.py:93 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[ReadJSON->SplitBlocks(192)] -> TaskPoolMapOperator[MapBatches(process_batch)->Map(compute_stats)->Filter(process)]
2023-11-30 15:32:54,702 INFO streaming_executor.py:94 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2023-11-30 15:32:54,702 INFO streaming_executor.py:96 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`
(MapBatches(process_batch)->Map(compute_stats)->Filter(process) pid=48497) 2023-11-30 15:33:00.312 | ERROR    | data_juicer.ops.filter.language_id_score_filter:compute_stats:52 - Model not loaded. Please retry later.                                                                              
(MapBatches(process_batch)->Map(compute_stats)->Filter(process) pid=48497) 2023-11-30 15:33:00.363 | ERROR    | data_juicer.ops.filter.language_id_score_filter:compute_stats:52 - Model not loaded. Please retry later.                                                                              
(MapBatches(process_batch)->Map(compute_stats)->Filter(process) pid=48497) 2023-11-30 15:33:01.362 | ERROR    | data_juicer.ops.filter.language_id_score_filter:compute_stats:52 - Model not loaded. Please retry later.                                                                              
--- Logging error in Loguru Handler #1 ---                                                                                                                                                                                                                                                            
Record was: {'elapsed': datetime.timedelta(seconds=13, microseconds=527897), 'exception': (type=<class 'ray.exceptions.RayTaskError(ValueError)'>, value=RayTaskError(ValueError)(ValueError('Model not loaded. Please retry later.')), traceback=<traceback object at 0x7f298c1324c0>), 'extra': {}, 'file': (name='process_data.py', path='tools/process_data.py'), 'function': '<module>', 'level': (name='ERROR', no=40, icon='❌'), 'line': 19, 'message': "An error has been caught in function '<module>', process 'MainProcess' (48135), thread 'MainThread' (139830410995520):", 'module': 'process_data', 'name': '__main__', 'process': (id=48135, name='MainProcess'), 'thread': (id=139830410995520, name='MainThread'), 'time': datetime(2023, 11, 30, 15, 33, 2, 531776, tzinfo=datetime.timezone(datetime.timedelta(seconds=28800), 'CST'))}
Traceback (most recent call last):
  File "python/ray/_raylet.pyx", line 345, in ray._raylet.StreamingObjectRefGenerator._next_sync
  File "python/ray/_raylet.pyx", line 4533, in ray._raylet.CoreWorker.try_read_next_object_ref_stream
  File "python/ray/_raylet.pyx", line 443, in ray._raylet.check_status
ray.exceptions.ObjectRefStreamEndOfStreamError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/interfaces/physical_operator.py", line 80, in on_waitable_ready
    meta = ray.get(next(self._streaming_gen))
  File "python/ray/_raylet.pyx", line 300, in ray._raylet.StreamingObjectRefGenerator.__next__
  File "python/ray/_raylet.pyx", line 351, in ray._raylet.StreamingObjectRefGenerator._next_sync
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/loguru/_logger.py", line 1277, in catch_wrapper
    return function(*args, **kwargs)
  File "tools/process_data.py", line 15, in main
    executor.run()
  File "/home/wzp/code/LLMData/open_source/data-juicer/data_juicer/core/ray_executor.py", line 83, in run
    logger.info(f'Op [{op_name}] Done. Left '
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/dataset.py", line 2498, in count
    [get_num_rows.remote(block) for block in self.get_internal_block_refs()]
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/dataset.py", line 4799, in get_internal_block_refs
    blocks = self._plan.execute().get_blocks()
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/plan.py", line 591, in execute
    blocks = execute_to_legacy_block_list(
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/legacy_compat.py", line 119, in execute_to_legacy_block_list
    block_list = _bundles_to_block_list(bundles)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/legacy_compat.py", line 357, in _bundles_to_block_list
    for ref_bundle in bundles:
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/interfaces/executor.py", line 37, in __next__
    return self.get_next()
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/streaming_executor.py", line 129, in get_next
    raise item
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/streaming_executor.py", line 187, in run
    while self._scheduling_loop_step(self._topology) and not self._shutdown:
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/streaming_executor.py", line 235, in _scheduling_loop_step
    process_completed_tasks(topology)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 333, in process_completed_tasks
    active_tasks[ref].on_waitable_ready()
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/interfaces/physical_operator.py", line 88, in on_waitable_ready
    ex = ray.get(block_ref)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/_private/worker.py", line 2547, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::MapBatches(process_batch)->Map(compute_stats)->Filter(process)() (pid=48715, ip=10.23.4.252)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/operators/map_operator.py", line 405, in _map_task
    for b_out in map_transformer.apply_transform(iter(blocks), ctx):
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 345, in __call__
    for data in iter:
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 171, in __call__
    yield from self._row_fn(input, ctx)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/planner/plan_udf_map_op.py", line 256, in transform_fn
    for row in rows:
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 223, in __call__
    for block in blocks:
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 345, in __call__
    for data in iter:
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 171, in __call__
    yield from self._row_fn(input, ctx)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/planner/plan_udf_map_op.py", line 233, in transform_fn
    out_row = fn(row)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/planner/plan_udf_map_op.py", line 119, in fn
    return op_fn(item, *fn_args, **fn_kwargs)
  File "/home/wzp/code/LLMData/open_source/data-juicer/data_juicer/core/data.py", line 47, in wrapped_f
    return f(*args, **kargs)
  File "/home/wzp/code/LLMData/open_source/data-juicer/data_juicer/ops/filter/language_id_score_filter.py", line 53, in compute_stats
    raise ValueError(err_msg)
ValueError: Model not loaded. Please retry later.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/loguru/_handler.py", line 204, in emit
    self._queue.put(str_record)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/multiprocessing/queues.py", line 362, in put
    obj = _ForkingPickler.dumps(obj)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
_pickle.PicklingError: Can't pickle <class 'ray.exceptions.RayTaskError(ValueError)'>: attribute lookup RayTaskError(ValueError) on ray.exceptions failed
--- End of logging error ---
(MapBatches(process_batch)->Map(compute_stats)->Filter(process) pid=48713) 2023-11-30 15:33:02.374 | ERROR    | data_juicer.ops.filter.language_id_score_filter:compute_stats:52 - Model not loaded. Please retry later. [repeated 9x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
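
A note on the "Logging error in Loguru Handler" block in the log above: the traceback shows the log record being pickled for a multiprocessing queue, and pickle failing on the class `RayTaskError(ValueError)` with "attribute lookup ... failed". The standard pickle module serializes a class by its module and name, so a class created at runtime under a name that cannot be looked up in its module fails in this way. The following is a minimal sketch of that behavior using only the standard library; `make_wrapped_error` and `can_pickle` are hypothetical helpers and do not reproduce Ray's actual implementation.

```python
import pickle


def make_wrapped_error(cause_cls):
    """Create an exception class at runtime, named after the wrapped cause."""
    name = f"WrappedError({cause_cls.__name__})"
    # The class is never bound to a module-level attribute called `name`,
    # so pickle cannot find it again by module and name.
    return type(name, (cause_cls,), {})


def can_pickle(obj) -> bool:
    """Return True if the standard pickle module can serialize `obj`."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError):
        return False


err = make_wrapped_error(ValueError)("Model not loaded. Please retry later.")
print(can_pickle(ValueError("plain")))  # a regular, importable exception class
print(can_pickle(err))                  # the runtime-created class
```

This logging failure is a side effect of the original `ValueError`, not a separate bug in the operator.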

Screenshots

No response

Additional

No response

@simplew2011 simplew2011 added the bug Something isn't working label Nov 30, 2023
@HYLcool
Collaborator

HYLcool commented Nov 30, 2023

@simplew2011

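Could you first check whether the model required by the language_id_score_filter operator was downloaded successfully, and verify its integrity and correctness? Models are stored in ~/.cache/data_juicer/models by default. The model this operator needs should be the lid.176.bin file in that directory; its size is 131,266,198 bytes and its md5 is 01810bc59c6a3d2b79c79e6336612f65

If you find a problem with the model, you can delete the problematic file and run dj again; it will be downloaded automatically (this may take some time).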
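
The check described above can be scripted. The following is a minimal sketch, assuming the default model directory and the size and MD5 quoted in this comment; the function name `check_model_file` is hypothetical and not part of Data-Juicer.

```python
import hashlib
import os

# Values quoted in the comment above for lid.176.bin.
EXPECTED_SIZE = 131_266_198
EXPECTED_MD5 = "01810bc59c6a3d2b79c79e6336612f65"
# Default model directory mentioned in the comment above.
DEFAULT_MODEL_PATH = os.path.expanduser("~/.cache/data_juicer/models/lid.176.bin")


def check_model_file(path, expected_size, expected_md5, chunk_size=1 << 20):
    """Return (ok, reason) after comparing the file's size and MD5 digest."""
    if not os.path.isfile(path):
        return False, "file not found"
    size = os.path.getsize(path)
    if size != expected_size:
        return False, f"size mismatch: {size} != {expected_size}"
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so the whole model is never held in memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    if digest.hexdigest() != expected_md5:
        return False, f"md5 mismatch: {digest.hexdigest()}"
    return True, "ok"


if __name__ == "__main__":
    print(check_model_file(DEFAULT_MODEL_PATH, EXPECTED_SIZE, EXPECTED_MD5))
```

If the result is not `(True, "ok")`, deleting the file and rerunning, as suggested above, triggers a fresh download.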

@simplew2011
Author

image

I deleted the file manually, let it download again automatically, and reran:

ok

python tools/process_data.py --config configs/demo/process.yaml

error

python tools/process_data.py --config configs/demo/process.yaml --executor_type ray

@HYLcool HYLcool self-assigned this Nov 30, 2023
@HYLcool HYLcool moved this from Todo to In Progress in data-juicer Nov 30, 2023
@simplew2011
Author

  • perplexity_filter: this does not seem to work in Ray mode either

AttributeError: 'NoneType' object has no attribute 'score'

@HYLcool
Collaborator

HYLcool commented Dec 5, 2023

@simplew2011

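We are fixing the problem of these model-dependent OPs being unavailable in Ray mode in #100. It will be resolved once that PR passes review and is merged into the main branch; we will let you know when that happens.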
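
For context, both errors reported in this thread ("Model not loaded" and "'NoneType' object has no attribute 'score'") are raised inside Ray worker processes, which is what one would see if the model object were prepared in the driver process but never became available in the workers. That reading is an assumption, and the sketch below is not the change made in #100. It only illustrates one common pattern for this class of problem: loading the model lazily in whichever process first needs it. `get_model`, `compute_stats`, `_MODEL_CACHE`, and `load_fn` are hypothetical names, not Data-Juicer APIs.

```python
import os
from typing import Any, Callable, Dict, Tuple

# Per-process cache: a worker that did not inherit or receive a loaded model
# loads it on first use instead of finding None.
_MODEL_CACHE: Dict[Tuple[int, str], Any] = {}


def get_model(name: str, load_fn: Callable[[str], Any]) -> Any:
    """Return the model for `name`, loading it on first use in this process."""
    key = (os.getpid(), name)  # pid guards against state inherited via fork
    if key not in _MODEL_CACHE:
        _MODEL_CACHE[key] = load_fn(name)
    return _MODEL_CACHE[key]


def compute_stats(sample: dict, load_fn: Callable[[str], Any]) -> dict:
    """Hypothetical per-sample function that resolves the model at call time."""
    model = get_model("lang_id", load_fn)
    sample["lang_score"] = model(sample["text"])
    return sample
```

The design choice here is that the function shipped to workers carries only a model name and a loader, not a loaded model object, so nothing depends on state that exists only in the driver.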

@HYLcool HYLcool linked a pull request Dec 5, 2023 that will close this issue
@github-project-automation github-project-automation bot moved this from In Progress to Done in data-juicer Dec 6, 2023
@simplew2011
Author

simplew2011 commented Dec 12, 2023

It still doesn't seem to work. Could you verify this, @HYLcool? The config is the default one:
python tools/process_data.py --config demos/process_on_ray/configs/demo.yaml

outputs.zip

2023-12-12 11:07:12.314 | INFO     | data_juicer.core.ray_executor:run:62 - Processing data...
2023-12-12 11:07:20.569 | INFO     | data_juicer.core.ray_executor:run:83 - Op [alphanumeric_filter] Done. Left 11 samples.
2023-12-12 11:07:20.915 | INFO     | data_juicer.core.ray_executor:run:83 - Op [average_line_length_filter] Done. Left 10 samples.
2023-12-12 11:07:21.632 | INFO     | data_juicer.core.ray_executor:run:83 - Op [character_repetition_filter] Done. Left 10 samples.
2023-12-12 11:07:22.428 | INFO     | data_juicer.core.ray_executor:run:83 - Op [flagged_words_filter] Done. Left 10 samples.
2023-12-12 11:07:23.321 | INFO     | data_juicer.core.ray_executor:run:83 - Op [language_id_score_filter] Done. Left 3 samples.
2023-12-12 11:07:24.115 | INFO     | data_juicer.core.ray_executor:run:83 - Op [maximum_line_length_filter] Done. Left 3 samples.
2023-12-12 11:07:24.898 | INFO     | data_juicer.core.ray_executor:run:83 - Op [perplexity_filter] Done. Left 3 samples.
2023-12-12 11:07:25.818 | INFO     | data_juicer.core.ray_executor:run:83 - Op [special_characters_filter] Done. Left 3 samples.
2023-12-12 11:07:26.631 | INFO     | data_juicer.core.ray_executor:run:83 - Op [stopwords_filter] Done. Left 3 samples.
2023-12-12 11:07:27.464 | INFO     | data_juicer.core.ray_executor:run:83 - Op [text_length_filter] Done. Left 3 samples.
2023-12-12 11:07:28.243 | INFO     | data_juicer.core.ray_executor:run:83 - Op [words_num_filter] Done. Left 1 samples.
2023-12-12 11:07:29.052 | INFO     | data_juicer.core.ray_executor:run:83 - Op [word_repetition_filter] Done. Left 1 samples.
2023-12-12 11:07:29.053 | INFO     | data_juicer.core.ray_executor:run:87 - Exporting dataset to disk...
2023-12-12 11:07:31.917 | ERROR    | __main__:<module>:19 - An error has been caught in function '<module>', process 'MainProcess' (41651), thread 'MainThread' (140511941588800):
Traceback (most recent call last):

  File "python/ray/_raylet.pyx", line 347, in ray._raylet.StreamingObjectRefGenerator._next_sync
  File "python/ray/_raylet.pyx", line 4643, in ray._raylet.CoreWorker.try_read_next_object_ref_stream
  File "python/ray/_raylet.pyx", line 447, in ray._raylet.check_status

ray.exceptions.ObjectRefStreamEndOfStreamError


During handling of the above exception, another exception occurred:


Traceback (most recent call last):

  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/interfaces/physical_operator.py", line 80, in on_data_ready
    meta = ray.get(next(self._streaming_gen))
           │   │        │    └ <ray._raylet.StreamingObjectRefGenerator object at 0x7fc8746a4ca0>
           │   │        └ <ray.data._internal.execution.interfaces.physical_operator.DataOpTask object at 0x7fc8746a4340>
           │   └ <function get at 0x7fc88c7e7820>
           └ <module 'ray' from '/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/__init__.py'>
  File "python/ray/_raylet.pyx", line 302, in ray._raylet.StreamingObjectRefGenerator.__next__
  File "python/ray/_raylet.pyx", line 353, in ray._raylet.StreamingObjectRefGenerator._next_sync

StopIteration


During handling of the above exception, another exception occurred:


Traceback (most recent call last):

> File "tools/process_data.py", line 19, in <module>
    main()
    └ <function main at 0x7fcb7b0f14c0>

  File "tools/process_data.py", line 15, in main
    executor.run()
    │        └ <function RayExecutor.run at 0x7fc88c7e7dc0>
    └ <data_juicer.core.ray_executor.RayExecutor object at 0x7fc88e9177f0>

  File "/home/wzp/code/LLMData/open_source/data-juicer/data_juicer/core/ray_executor.py", line 88, in run
    dataset.write_json(self.cfg.export_path, force_ascii=False)
    │       │          │    │   └ './outputs/demo/demo-processed'
    │       │          │    └ Namespace(add_suffix=False, alphanumeric_filter=Namespace(image_key=None, max_ratio=9223372036854775807, min_ratio=0.25, text...
    │       │          └ <data_juicer.core.ray_executor.RayExecutor object at 0x7fc88e9177f0>
    │       └ <function Dataset.write_json at 0x7fc88c4b61f0>
    └ Dataset(
         num_blocks=192,
         num_rows=1,
         schema={
            text: string,
            __dj...: struct<alnum_ratio: double, avg_lin...

  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/dataset.py", line 2821, in write_json
    self.write_datasource(
    │    └ <function Dataset.write_datasource at 0x7fc88c4b6940>
    └ Dataset(
         num_blocks=192,
         num_rows=1,
         schema={
            text: string,
            __dj...: struct<alnum_ratio: double, avg_lin...
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/dataset.py", line 3457, in write_datasource
    self._write_ds = Dataset(plan, logical_plan).materialize()
    │    │           │       │     └ <ray.data._internal.logical.interfaces.logical_plan.LogicalPlan object at 0x7fc8747ade20>
    │    │           │       └ ExecutionPlan(dataset_uuid=ae6c733618d049d495d01de1f9bd255e, run_by_consumer=False, in_blocks=LazyBlockList(owned_by_consumer...
    │    │           └ <class 'ray.data.dataset.Dataset'>
    │    └ None
    └ Dataset(
         num_blocks=192,
         num_rows=1,
         schema={
            text: string,
            __dj...: struct<alnum_ratio: double, avg_lin...
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/dataset.py", line 4502, in materialize
    copy._plan.execute(force_read=True)
    │    │     └ <function ExecutionPlan.execute at 0x7fc88c5485e0>
    │    └ ExecutionPlan(dataset_uuid=9cc683010ce54ac1b990d2aacc2f72af, run_by_consumer=False, in_blocks=LazyBlockList(owned_by_consumer...
    └ Write
      +- MaterializedDataset(
            num_blocks=192,
            num_rows=1,
            schema={
               text: string,
               _...: st...
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/plan.py", line 599, in execute
    blocks = execute_to_legacy_block_list(
             └ <function execute_to_legacy_block_list at 0x7fc88c55a3a0>
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/legacy_compat.py", line 119, in execute_to_legacy_block_list
    block_list = _bundles_to_block_list(bundles)
                 │                      └ <ray.data._internal.execution.streaming_executor.StreamingExecutor.execute.<locals>.StreamIterator object at 0x7fc8746f1910>
                 └ <function _bundles_to_block_list at 0x7fc88c55a700>
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/legacy_compat.py", line 356, in _bundles_to_block_list
    for ref_bundle in bundles:
        │             └ <ray.data._internal.execution.streaming_executor.StreamingExecutor.execute.<locals>.StreamIterator object at 0x7fc8746f1910>
        └ RefBundle(blocks=((ObjectRef(f1e4ccbdc9f0fac3ffffffffffffffffffffffff0500000002000000), BlockMetadata(num_rows=1, size_bytes=...
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/interfaces/executor.py", line 37, in __next__
    return self.get_next()
           │    └ <function StreamingExecutor.execute.<locals>.StreamIterator.get_next at 0x7fc87475f700>
           └ <ray.data._internal.execution.streaming_executor.StreamingExecutor.execute.<locals>.StreamIterator object at 0x7fc8746f1910>
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/streaming_executor.py", line 141, in get_next
    raise item
          └ RayTaskError(FileNotFoundError)(FileNotFoundError(2, "Failed to open local file './outputs/demo/demo-processed/c4e26ca89c5d4a...
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/streaming_executor.py", line 201, in run
    while self._scheduling_loop_step(self._topology) and not self._shutdown:
          │    │                     │    │                  │    └ True
          │    │                     │    │                  └ <StreamingExecutor(StreamingExecutor-4858e152dee04093abb26996a58fd5d0, stopped daemon 140498538833664)>
          │    │                     │    └ {InputDataBuffer[Input]: <ray.data._internal.execution.streaming_executor_state.OpState object at 0x7fc874686e20>, InputDataB...
          │    │                     └ <StreamingExecutor(StreamingExecutor-4858e152dee04093abb26996a58fd5d0, stopped daemon 140498538833664)>
          │    └ <function StreamingExecutor._scheduling_loop_step at 0x7fc88c519790>
          └ <StreamingExecutor(StreamingExecutor-4858e152dee04093abb26996a58fd5d0, stopped daemon 140498538833664)>
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/streaming_executor.py", line 252, in _scheduling_loop_step
    process_completed_tasks(topology, self._backpressure_policies)
    │                       │         │    └ []
    │                       │         └ <StreamingExecutor(StreamingExecutor-4858e152dee04093abb26996a58fd5d0, stopped daemon 140498538833664)>
    │                       └ {InputDataBuffer[Input]: <ray.data._internal.execution.streaming_executor_state.OpState object at 0x7fc874686e20>, InputDataB...
    └ <function process_completed_tasks at 0x7fc88c519160>
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 365, in process_completed_tasks
    num_blocks_read = task.on_data_ready(
                      │    └ <function DataOpTask.on_data_ready at 0x7fc88c7050d0>
                      └ <ray.data._internal.execution.interfaces.physical_operator.DataOpTask object at 0x7fc8746a4340>
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/interfaces/physical_operator.py", line 88, in on_data_ready
    ex = ray.get(block_ref)
         │   │   └ ObjectRef(cf74d1b865704d22ffffffffffffffffffffffff0500000001000000)
         │   └ <function get at 0x7fc88c7e7820>
         └ <module 'ray' from '/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/__init__.py'>
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
           │   │       └ {}
           │   └ (ObjectRef(cf74d1b865704d22ffffffffffffffffffffffff0500000001000000),)
           └ <function get at 0x7fc88c856790>
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
           │     │       └ {}
           │     └ (ObjectRef(cf74d1b865704d22ffffffffffffffffffffffff0500000001000000),)
           └ <function get at 0x7fc88c856700>
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/_private/worker.py", line 2563, in get
    raise value.as_instanceof_cause()
          │     └ <function RayTaskError.as_instanceof_cause at 0x7fc88cec3700>
          └ RayTaskError('ray.data._internal.execution.operators.map_operator._map_task', 'Traceback (most recent call last):\n  File "py...

ray.exceptions.RayTaskError(FileNotFoundError): ray::MapBatches(process_batch)->Map(compute_stats)->Filter(process)->Map(compute_stats)->Filter(process)->Map(compute_stats)->Filter(process)->Map(compute_stats)->Filter(process)->Map(compute_stats)->Filter(process)->Map(compute_stats)->Filter(process)->Map(compute_stats)->Filter(process)->Map(compute_stats)->Filter(process)->Map(compute_stats)->Filter(process)->Map(compute_stats)->Filter(process)->Map(compute_stats)->Filter(process)->Map(compute_stats)->Filter(process)->Write() (pid=42507, ip=10.23.4.252)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/operators/map_operator.py", line 416, in _map_task
    for b_out in map_transformer.apply_transform(iter(blocks), ctx):
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 232, in __call__
    yield from self._block_fn(input, ctx)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/planner/plan_write_op.py", line 27, in fn
    {"write_result": [datasource.write(blocks, ctx, **write_args)]}
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/datasource/file_based_datasource.py", line 289, in write
    with _open_file_with_retry(
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/datasource/file_based_datasource.py", line 881, in _open_file_with_retry
    raise e from None
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/datasource/file_based_datasource.py", line 863, in _open_file_with_retry
    return open_file()
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/datasource/file_based_datasource.py", line 291, in <lambda>
    lambda: fs.open_output_stream(write_path, **open_stream_args),
  File "pyarrow/_fs.pyx", line 868, in pyarrow._fs.FileSystem.open_output_stream
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 113, in pyarrow.lib.check_status
FileNotFoundError: [Errno 2] Failed to open local file './outputs/demo/demo-processed/c4e26ca89c5d4af1a1c63c57d3dc2875_000000_000000.json'. Detail: [errno 2] No such file or directory
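The last frame shows pyarrow failing to open an output file under the relative directory `./outputs/demo/demo-processed/`. One possible cause, not confirmed anywhere in this thread, is that the directory does not exist (or the relative path resolves against a different working directory) in the Ray worker process that performs the write. A minimal sketch of a workaround under that assumption is to resolve the export directory to an absolute path and create it before the write stage starts. The helper name `ensure_output_dir` is hypothetical and is not part of Data-Juicer or Ray:

```python
import os
import tempfile


def ensure_output_dir(path: str) -> str:
    """Resolve `path` to an absolute directory and create it if missing.

    Hypothetical helper for illustration only; it assumes the
    FileNotFoundError comes from a missing or relative output directory.
    """
    abs_dir = os.path.abspath(path)
    # exist_ok=True makes the call safe when the directory already exists.
    os.makedirs(abs_dir, exist_ok=True)
    return abs_dir


if __name__ == "__main__":
    # Demonstrate on a throwaway directory instead of the real export path.
    with tempfile.TemporaryDirectory() as tmp:
        out_dir = ensure_output_dir(os.path.join(tmp, "outputs", "demo", "demo-processed"))
        print(os.path.isabs(out_dir), os.path.isdir(out_dir))
```

If this is the cause, passing the returned absolute path as the export location would avoid depending on each worker's working directory. On a multi-node cluster the directory would also need to exist on (or be shared with) every node that runs a write task.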

@simplew2011 simplew2011 mentioned this issue Dec 12, 2023