Deduplicating ~50 GB of Chinese corpus hangs #30

YaboSun · 2023-10-08T07:03:39Z

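I am using this project to run deduplication (MinHash) and other operators on a Chinese corpus of about 40 million samples, and the run hangs at the deduplication stage.
The full configuration is as follows: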

project_name: 'CC100-zh'
dataset_path: xxx.jsonl
export_path: xxx-processed.jsonl

np: 50
open_tracer: true
text_keys: 'text'
process:
  - perplexity_filter:
      lang: zh
      max_ppl: 2500
  - document_minhash_deduplicator:                          
      tokenization: character                                    
      window_size: 5                                        
      num_permutations: 256                                   
      jaccard_threshold: 0.7                                 
      num_bands: null                                        
      num_rows_per_band: null                                 
      lowercase: true                                         
      ignore_pattern: null                                    
  - text_length_filter:
      min_len: 200
      max_len: 65589
  - character_repetition_filter:
      rep_len: 10
      max_ratio: 0.3
  - word_repetition_filter:
      lang: zh
      tokenization: true
      rep_len: 10
      max_ratio: 0.279
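
For readers unfamiliar with the deduplicator settings above, the following is a minimal sketch of what `tokenization: character`, `window_size: 5`, `lowercase: true`, and `jaccard_threshold: 0.7` mean conceptually. It computes the exact Jaccard similarity over character windows; it is not Data-Juicer's implementation, which estimates this similarity with MinHash signatures. The helper names `char_shingles` and `jaccard` and the two sample sentences are made up for illustration.

```python
def char_shingles(text: str, window_size: int = 5, lowercase: bool = True) -> set:
    """Return the set of overlapping character windows of length `window_size`."""
    if lowercase:
        text = text.lower()
    if len(text) <= window_size:
        # Text shorter than one window: treat the whole text as a single shingle.
        return {text}
    return {text[i:i + window_size] for i in range(len(text) - window_size + 1)}


def jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity |A & B| / |A | B| of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Two hypothetical documents that differ in a single character.
doc_a = "今天天气很好，我们去公园散步吧。"
doc_b = "今天天气很好，我们去公园跑步吧。"

sim = jaccard(char_shingles(doc_a), char_shingles(doc_b))
# With a threshold of 0.7, the pair counts as near-duplicate only if sim >= 0.7.
is_near_duplicate = sim >= 0.7
```

Because one changed character alters up to five windows, short texts fall below the threshold quickly, while long documents with small differences stay above it.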

Also, do you have detailed timing statistics for processing a corpus of similar size?

HYLcool · 2023-10-09T07:21:08Z

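Hello, and thank you for using Data-Juicer!

Deduplication is inherently slow on large datasets, so the job may still be working through an intermediate step rather than being hung. If possible, could you describe the "hang" in more detail? For example, the progress bar stopped for xx minutes, or no new log output appeared for xx minutes. Partial log output would also help us a lot in diagnosing this.

As for your second question, we have mainly processed large English corpora so far. As one example, we processed a dataset of about 900 GB containing about 360 million samples, with the data and cache files stored on a NAS. The pipeline had 15 operators (including one SimHash deduplication operator) and ran on 80 cores. The whole run took about 2 days and 19 hours, of which deduplication took about 11 hours. We hope this is a useful reference. We have processed only a limited amount of Chinese data, so we do not yet have a good example for Chinese corpora.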
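
To put the figures quoted above into comparable units, here is a small back-of-the-envelope calculation. It uses only the numbers from this comment; the variable names are illustrative, and the derived rates describe that one English run (15 operators, SimHash, data on a NAS), so they should not be read as a prediction for a Chinese MinHash run.

```python
# Figures quoted in the comment above (all approximate).
total_samples = 360_000_000      # about 360 million samples
total_hours = 2 * 24 + 19        # 2 days and 19 hours
dedup_hours = 11                 # time spent in the SimHash deduplicator
cores = 80

# Derived rates for the whole 15-operator pipeline.
samples_per_hour = total_samples / total_hours
samples_per_core_hour = samples_per_hour / cores
dedup_share = dedup_hours / total_hours

print(f"total wall time: {total_hours} h")
print(f"throughput: {samples_per_hour:,.0f} samples/h "
      f"({samples_per_core_hour:,.0f} per core-hour)")
print(f"deduplication share of wall time: {dedup_share:.1%}")
```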

YaboSun · 2023-10-11T06:20:37Z

Thanks for the reply. Here are the details of several error logs I encountered during processing:

Traceback (most recent call last):
Exception ignored in: <function Dataset.__del__ at 0x7f33f4729280>
File "/kas_pyenv/user/data-juicer/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1274, in __del__
Traceback (most recent call last):
File "/kas_pyenv/user/data-juicer/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1274, in __del__
del self._data
del self._data
File "/kas_pyenv/user/data-juicer/lib/python3.8/site-packages/ray/_private/worker.py", line 1744, in sigterm_handler
File "/kas_pyenv/user/data-juicer/lib/python3.8/site-packages/ray/_private/worker.py", line 1744, in sigterm_handler
sys.exit(signum)
SystemExit: 15
sys.exit(signum)
SystemExit: 15

Traceback (most recent call last):
File "/kas_pyenv/user/data-juicer/lib/python3.8/weakref.py", line 103, in remove
def remove(wr, selfref=ref(self), _atomic_removal=_remove_dead_weakref):
File "/kas_pyenv/user/data-juicer/lib/python3.8/site-packages/ray/_private/worker.py", line 1744, in sigterm_handler
sys.exit(signum)
SystemExit: 15

chenhesen · 2023-10-16T03:14:02Z

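Hello, this problem is most likely caused by a bug in ray itself; a related issue has been reported elsewhere: https://github.com/ray-project/ray/issues/17745.
We have submitted a PR (#35) so that ray is imported only when it is needed.
With this change, our test on a 200 GB corpus runs normally. Please update your code and try the processing again.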
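
The idea of importing ray only when it is needed can be sketched as a deferred (function-level) import. This is a generic illustration of the pattern, not the actual change in the PR; the function name `select_executor` and the `executor_type` argument are hypothetical.

```python
import sys


def select_executor(executor_type: str = "default") -> str:
    """Pick an execution backend, importing ray only on the ray path."""
    if executor_type == "ray":
        # Deferred import: the module is loaded only when this branch runs,
        # so single-machine runs never load ray at all.
        import ray  # noqa: F401
        return "ray"
    return "default"


backend = select_executor("default")
# On the default path, this function has not loaded ray.
ray_loaded = "ray" in sys.modules
```

With a top-level `import ray`, every run would load the module regardless of the chosen backend; moving the import into the branch keeps it out of the single-machine path.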

YaboSun · 2023-10-17T08:58:26Z

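Thanks for the reply. After pulling the latest code, the earlier problem is resolved!

Since ray came up, I would also like to ask how to configure Data-Juicer to use ray for data processing. Is there a runnable example?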

pan-x-c · 2023-10-17T10:47:41Z

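Hello, the following link provides a simple demo of running data processing on ray:

https://github.com/alibaba/data-juicer/tree/main/demos/process_on_ray

Please note, however, that Data-Juicer on Ray is still experimental in the current version and lacks many features of the single-machine Data-Juicer (for example, it can only read JSON files and does not support Deduplicator operators).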

YaboSun closed this as completed Oct 17, 2023

HYLcool added the bug (Something isn't working) and question (Further information is requested) labels on Oct 26, 2023
