Before Reporting
I have pulled the latest code of the main branch and run it again, and the bug still exists.
I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend asking a question using the Question template.)
Search before reporting
I have searched the Data-Juicer issues and found no similar bugs.
# Process config example for dataset

# global parameters
project_name: 'demo-process'
dataset_path: 'temp/WuDaoCorpus2.0_base_sample'  # path to your dataset directory or file
np: 1  # number of subprocesses to process your dataset
text_keys: 'content'

export_path: './outputs/demo-process/demo-processed.jsonl'

# process schedule
# a list of several process operators with their arguments
process:
  - language_id_score_filter:
      lang: 'zh'
      min_score: 0.8
  - document_simhash_deduplicator:  # deduplicate text samples using the SimHash-LSH method
      tokenization: character  # tokenization method for text. One of [space, punctuation, character]
      window_size: 6  # window size of shingling
      num_blocks: 10  # number of blocks in SimHash computing
      hamming_distance: 8  # the max hamming distance to regard 2 samples as a similar enough pair. Should always be less than num_blocks
  - nlpcda_zh_mapper:  # simply augment texts in Chinese based on the nlpcda library
      sequential: false  # whether to combine all augmentation methods into a sequence. If it's True, a sample will be augmented by all opened augmentation methods sequentially. If it's False, each opened augmentation method generates its augmented samples independently.
      aug_num: 1  # number of augmented samples to be generated. If `sequential` is True, a total of aug_num augmented samples are generated. If it's False, (aug_num * #opened_aug_method) augmented samples are generated.
      swap_random_char: true
Logs
  File "/home/wzp/code/LLMData/open_source/data-juicer/data_juicer/core/executor.py", line 120, in run
    tmp = dataset.map(function=op.process,
  File "/home/wzp/code/LLMData/open_source/data-juicer/data_juicer/core/data.py", line 180, in map
    new_ds = NestedDataset(super().map(*args, **kargs))
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 563, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 528, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3004, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3397, in _map_single
    writer.write_batch(batch)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/arrow_writer.py", line 551, in write_batch
    arrays.append(pa.array(typed_sequence))
  File "pyarrow/array.pxi", line 243, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 110, in pyarrow.lib._handle_arrow_array_protocol
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/arrow_writer.py", line 189, in __arrow_array__
    out = pa.array(cast_to_python_objects(data, only_1d_for_numpy=True))
  File "pyarrow/array.pxi", line 327, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 39, in pyarrow.lib._sequence_to_array
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
OverflowError: Python int too large to convert to C long
OS
Ubuntu
Installation Method
pip
Data-Juicer Version
0.1.2
Python Version
3.8
Describe the bug
Dataset: https://atp-modelzoo.oss-cn-hangzhou.aliyuncs.com/release/datasets/WuDaoCorpus2.0_base_sample.tgz
An error is raised when the document_simhash_deduplicator and nlpcda_zh_mapper operators both appear in the same process list.
To Reproduce
dj-process --config configs/demo/process.yaml
Configs
Logs
Screenshots
No response
Additional
This is probably related to how the simhash value is computed and to Arrow.
pip list
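The suspicion above can be illustrated without Data-Juicer or Arrow. The sketch below is only a guess at the mechanism: it assumes the simhash fingerprint is an unsigned 64-bit integer and that the column ends up being written as a signed 64-bit type, neither of which is confirmed by the traceback. The helper names and the sample fingerprint are hypothetical.

```python
import struct

INT64_MAX = 2**63 - 1
UINT64_MAX = 2**64 - 1


def fits_signed_64(value: int) -> bool:
    """Return True if value can be packed as a signed 64-bit integer."""
    try:
        struct.pack("<q", value)
        return True
    except struct.error:
        return False


def to_signed_64(value: int) -> int:
    """Reinterpret an unsigned 64-bit value as signed (two's complement)."""
    if not 0 <= value <= UINT64_MAX:
        raise ValueError("expected an unsigned 64-bit value")
    return value - 2**64 if value > INT64_MAX else value


def from_signed_64(value: int) -> int:
    """Inverse of to_signed_64: recover the unsigned 64-bit value."""
    return value + 2**64 if value < 0 else value


# Hypothetical fingerprint whose top bit is set, so it exceeds INT64_MAX.
fingerprint = 0xF000_0000_0000_0001

print(fits_signed_64(fingerprint))                # does not fit as-is
print(fits_signed_64(to_signed_64(fingerprint)))  # fits after reinterpretation
print(from_signed_64(to_signed_64(fingerprint)) == fingerprint)
```

If this is indeed the cause, reinterpreting the fingerprint as shown, or storing it as a string such as `str(fingerprint)`, would keep the value representable when the samples are rewritten by a later operator.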