[Bug]: OverflowError: Python int too large to convert to C long #168

simplew2011 · 2024-01-03T12:12:48Z

Before Reporting

I have pulled the latest code from the main branch and run it again, and the bug still exists.
I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend that you ask a question using the Question template.)

Search before reporting

I have searched the Data-Juicer issues and found no similar bugs.

OS

ubuntu

Installation Method

pip

Data-Juicer Version

0.1.2

Python Version

3.8

Describe the bug

Dataset: https://atp-modelzoo.oss-cn-hangzhou.aliyuncs.com/release/datasets/WuDaoCorpus2.0_base_sample.tgz

An error is raised when the document_simhash_deduplicator and nlpcda_zh_mapper operators are both present in the process list.

To Reproduce

dj-process --config configs/demo/process.yaml

Configs

# Process config example for dataset

# global parameters
project_name: 'demo-process'
dataset_path: 'temp/WuDaoCorpus2.0_base_sample'  # path to your dataset directory or file
np: 1  # number of subprocesses used to process your dataset
text_keys: 'content'
export_path: './outputs/demo-process/demo-processed.jsonl'

# process schedule
# a list of several process operators with their arguments
process:
  - language_id_score_filter:
      lang: 'zh'
      min_score: 0.8

  - document_simhash_deduplicator:                          # deduplicate text samples using SimHash-LSH method
      tokenization: character                                     # tokenization method for text. One of [space, punctuation, character]
      window_size: 6                                          # window size of shingling
      num_blocks: 10                                           # number of blocks in SimHash computing
      hamming_distance: 8                                     # the max hamming distance to regard 2 samples as similar enough pair. Should be less than num_blocks always

  - nlpcda_zh_mapper:                                       # simply augment texts in Chinese based on the nlpcda library
      sequential: false                                       # whether to combine all augmentation methods into a sequence. If it's True, a sample will be augmented by all opened augmentation methods sequentially. If it's False, each opened augmentation method generates its augmented samples independently.
      aug_num: 1                                              # number of augmented samples to be generated. If `sequential` is True, a total of aug_num augmented samples are generated. If it's False, (aug_num * #opened_aug_method) augmented samples are generated.
      swap_random_char: true
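
The two rules stated in the config comments above can be checked before a run. The following is a minimal sketch, not part of Data-Juicer: `check_simhash_args` and `expected_aug_samples` are hypothetical helper names, and the rules they encode come only from the comments in this config (`hamming_distance` should be less than `num_blocks`; the augmented sample count is `aug_num` when `sequential` is true, otherwise `aug_num` times the number of opened augmentation methods). It assumes `swap_random_char` is the only opened method here.

```python
def check_simhash_args(num_blocks: int, hamming_distance: int) -> None:
    """Raise ValueError if hamming_distance is not less than num_blocks."""
    if hamming_distance >= num_blocks:
        raise ValueError(
            f"hamming_distance ({hamming_distance}) should be less than "
            f"num_blocks ({num_blocks})"
        )


def expected_aug_samples(aug_num: int, sequential: bool,
                         num_opened_methods: int) -> int:
    """Augmented sample count per input sample, per the config comment."""
    if sequential:
        # All opened methods are chained, so each pass yields one sample.
        return aug_num
    # Each opened method produces its own aug_num samples independently.
    return aug_num * num_opened_methods


# Values taken from the config in this report.
check_simhash_args(num_blocks=10, hamming_distance=8)
print(expected_aug_samples(aug_num=1, sequential=False, num_opened_methods=1))
```

With these values the check passes, so the reported error does not come from an invalid `hamming_distance` / `num_blocks` combination.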

Logs

  File "/home/wzp/code/LLMData/open_source/data-juicer/data_juicer/core/executor.py", line 120, in run
    tmp = dataset.map(function=op.process,
  File "/home/wzp/code/LLMData/open_source/data-juicer/data_juicer/core/data.py", line 180, in map
    new_ds = NestedDataset(super().map(*args, **kargs))
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 563, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 528, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3004, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3397, in _map_single
    writer.write_batch(batch)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/arrow_writer.py", line 551, in write_batch
    arrays.append(pa.array(typed_sequence))
  File "pyarrow/array.pxi", line 243, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 110, in pyarrow.lib._handle_arrow_array_protocol
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/arrow_writer.py", line 189, in __arrow_array__
    out = pa.array(cast_to_python_objects(data, only_1d_for_numpy=True))
  File "pyarrow/array.pxi", line 327, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 39, in pyarrow.lib._sequence_to_array
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
OverflowError: Python int too large to convert to C long

Screenshots

No response

Additional

This is probably related to how the simhash value is computed and how Arrow handles it.
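
To illustrate the suspicion: the traceback shows the failure inside `pa.array(...)` called without an explicit type during `dataset.map`, and an unsigned 64-bit fingerprint can exceed the signed 64-bit range. The sketch below uses only the standard library. `fingerprint64` is a hypothetical stand-in (a truncated SHA-256), not Data-Juicer's SimHash, and the assumption that the simhash value is an unsigned 64-bit integer is not confirmed by the report itself.

```python
import hashlib

INT64_MAX = 2**63 - 1   # upper bound of a signed 64-bit integer
UINT64_MAX = 2**64 - 1  # upper bound of an unsigned 64-bit integer


def fingerprint64(text: str) -> int:
    """Hypothetical 64-bit fingerprint; a stand-in, NOT the real SimHash."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def fits_int64(value: int) -> bool:
    """True if value can be stored in a signed 64-bit integer."""
    return -(2**63) <= value <= INT64_MAX


for text in ["sample text 1", "sample text 2", "sample text 3"]:
    value = fingerprint64(text)
    status = "fits int64" if fits_int64(value) else "exceeds int64"
    print(f"{value:>20d}  {status}")
```

Half of all unsigned 64-bit values lie above `INT64_MAX`, so under this assumption whether a given batch triggers the error would depend on the data.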

pip list

about-time                    4.2.1
accelerate                    0.25.0
ago                           0.0.95
aiofiles                      23.2.1
aiohttp                       3.8.6
aiosignal                     1.3.1
alabaster                     0.7.13
albumentations                1.3.1
alive-progress                3.1.4
altair                        5.1.2
antlr4-python3-runtime        4.9.3
anyio                         3.7.1
appdirs                       1.4.4
APScheduler                   3.9.1
argcomplete                   1.10.3
argos-translate-files         1.1.4
argostranslate                1.9.1
arxiv                         2.0.0
arxiv-dl                      1.1.5
arXiv-download                0.1
astunparse                    1.6.3
async-timeout                 4.0.3
attrs                         23.1.0
Automat                       22.10.0
Babel                         2.13.1
backports.zoneinfo            0.2.1
ballpark                      1.4.0
beautifulsoup4                4.9.3
bert4torch                    0.4.0
blinker                       1.6.3
blis                          0.7.11
boto3                         1.28.73
botocore                      1.31.73
bs4                           0.0.1
cachelib                      0.10.2
cachetools                    5.3.2
catalogue                     2.0.10
certifi                       2022.12.7
cffi                          1.16.0
cfgv                          3.4.0
chardet                       3.0.4
charset-normalizer            2.1.1
click                         8.1.7
cloudpickle                   3.0.0
cmake                         3.25.0
colorama                      0.4.6
coloredlogs                   15.0.1
colorlog                      6.7.0
commonmark                    0.9.1
compressed-rtf                1.0.6
confection                    0.1.3
constantly                    23.10.4
contourpy                     1.1.1
courlan                       0.9.4
cryptography                  41.0.5
cssselect                     1.2.0
ctranslate2                   3.20.0
cycler                        0.12.1
cymem                         2.0.8
Cython                        3.0.6
dashscope                     1.10.0
datasets                      2.11.0
datasketch                    1.6.4
dateparser                    1.1.8
deep-translator               1.11.4
defusedxml                    0.7.1
Deprecated                    1.2.14
dill                          0.3.4
distlib                       0.3.7
distro                        1.8.0
dl-translate                  0.3.0
docker-pycreds                0.4.0
docopt                        0.6.2
docstring-parser              0.15
docutils                      0.18.1
docx2txt                      0.8
dotmap                        1.3.30
ebcdic                        1.1.1
elastic-transport             8.10.0
elasticsearch                 8.10.1
emoji                         2.2.0
environs                      9.5.0
et-xmlfile                    1.1.0
exceptiongroup                1.1.3
expiringdict                  1.2.2
extract-msg                   0.28.7
fake-useragent                1.3.0
fastapi                       0.105.0
fasttext-wheel                0.9.2
faust-cchardet                2.1.19
feedfinder2                   0.0.4
feedparser                    6.0.10
ffmpy                         0.3.1
filelock                      3.12.4
fire                          0.5.0
flagdata                      1.0.0
Flask                         2.2.2
flask-babel                   3.1.0
Flask-Limiter                 2.6.3
Flask-Session                 0.4.0
flask-swagger                 0.2.14
flask-swagger-ui              4.11.1
flatbuffers                   23.5.26
fonttools                     4.43.1
frozenlist                    1.4.0
fsspec                        2023.3.0
ftfy                          6.1.1
gdown                         4.7.1
gevent                        23.9.1
ghp-import                    2.1.0
gitdb                         4.0.10
GitPython                     3.1.40
gne                           0.3.0
google-trans-new              1.1.9
googletrans                   4.0.0rc1
GPUtil                        1.4.0
gradio                        3.50.2
gradio_client                 0.6.1
grapheme                      0.6.0
greenlet                      3.0.1
grpcio                        1.59.2
h11                           0.9.0
h2                            3.2.0
h5py                          3.10.0
hanziconv                     0.3.2
hjson                         3.1.0
hpack                         3.0.0
hstspreload                   2023.1.1
html5tagger                   1.3.0
htmldate                      1.5.2
httpcore                      0.9.1
httptools                     0.6.1
httpx                         0.13.3
huggingface-hub               0.17.3
humanfriendly                 10.0
hurry.filesize                0.9
hydra-core                    1.3.2
hyperframe                    5.2.0
hyperlink                     21.0.0
identify                      2.5.30
idna                          2.10
imagededup                    0.3.2
imageio                       2.31.6
imagesize                     1.4.1
IMAPClient                    2.1.0
importlib-metadata            6.8.0
importlib-resources           6.1.0
incremental                   22.10.0
install                       1.3.5
itemadapter                   0.8.0
itemloaders                   1.1.0
itsdangerous                  2.1.2
jieba                         0.42.1
jieba3k                       0.35.1
Jinja2                        3.1.2
jiojio                        1.2.5
jionlp                        1.5.4
jmespath                      1.0.1
joblib                        1.3.2
jsonargparse                  4.27.1
jsonlines                     4.0.0
jsonschema                    4.19.1
jsonschema-specifications     2023.7.1
jusText                       3.0.0
kenlm                         0.2.0
Keras                         2.3.1
Keras-Applications            1.0.8
Keras-Preprocessing           1.1.2
kiwisolver                    1.4.5
langcodes                     3.3.0
langdetect                    1.0.9
langid                        1.1.6
lazy_loader                   0.3
Levenshtein                   0.23.0
LexiLang                      1.0.1
libretranslate                1.5.2
libretranslatepy              2.1.3
lightning                     2.1.0
lightning-utilities           0.9.0
limits                        3.7.0
lingua-language-detector      2.0.0
linkify-it-py                 2.0.2
lit                           15.0.7
livereload                    2.6.3
llvmlite                      0.41.1
loguru                        0.7.2
lxml                          4.9.3
lz4                           4.3.2
Markdown                      3.5.1
markdown-it-py                3.0.0
markdown2                     2.4.11
MarkupSafe                    2.1.2
marshmallow                   3.20.1
matplotlib                    3.7.3
mdit-py-plugins               0.4.0
mdurl                         0.1.2
memray                        1.11.0
mergedeep                     1.3.4
mkdocs                        1.5.3
mkdocs-material-extensions    1.3.1
mlscraper                     0.1.2
more-itertools                10.1.0
Morfessor                     2.0.6
mpmath                        1.3.0
msgpack                       1.0.7
multidict                     6.0.4
multiprocess                  0.70.12
munch                         4.0.0
murmurhash                    1.0.10
networkx                      3.0
news-please                   1.5.35
newspaper3k                   0.2.8
newspaper3kli                 0.1.0
nh3                           0.2.15
nicefid                       2.1.1
nlpaug                        1.1.11
nlpcda                        2.5.8
nltk                          3.8.1
nodeenv                       1.8.0
Nuitka                        2.0rc6
numba                         0.58.1
numpy                         1.24.1
nvidia-cublas-cu11            11.10.3.66
nvidia-cuda-cupti-cu11        11.7.101
nvidia-cuda-nvrtc-cu11        11.7.99
nvidia-cuda-runtime-cu11      11.7.99
nvidia-cudnn-cu11             8.5.0.96
nvidia-cufft-cu11             10.9.0.58
nvidia-curand-cu11            10.2.10.91
nvidia-cusolver-cu11          11.4.0.1
nvidia-cusparse-cu11          11.7.4.91
nvidia-nccl-cu11              2.14.3
nvidia-nvtx-cu11              11.7.91
olefile                       0.46
omegaconf                     2.3.0
onnx                          1.15.0
onnxruntime                   1.16.3
onnxsim                       0.4.35
openai                        0.28.0
OpenCC                        1.1.6
opencv-python-headless        4.8.1.78
openpyxl                      3.1.2
ordered-set                   4.1.0
orjson                        3.9.10
outcome                       1.3.0.post0
packaging                     23.1
paginate                      0.5.6
pandas                        2.0.0
parse                         1.19.1
parsel                        1.8.1
pathos                        0.3.1
pathspec                      0.12.1
pathtools                     0.1.2
pathy                         0.10.3
patsy                         0.5.3
pdfmajor                      1.3.13
pdfminer.six                  20221105
pdfplumber                    0.10.2
pdfsyntax                     0.0.7
pdfx                          1.4.1
peft                          0.7.0
Pillow                        9.3.0
pip                           23.3.2
pkgutil_resolve_name          1.3.10
plac                          1.4.1
platformdirs                  3.11.0
plotly                        5.18.0
polib                         1.1.1
polyglot                      16.7.4
pox                           0.3.3
ppft                          1.7.6.7
pre-commit                    3.5.0
preshed                       3.0.9
prettytable                   3.9.0
prometheus-client             0.15.0
prompt-toolkit                3.0.41
Protego                       0.3.0
protobuf                      4.24.4
psutil                        5.9.6
psycopg2-binary               2.9.9
py-data-juicer                0.1.2        /home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages
py-spy                        0.3.14
py4j                          0.10.9.7
pyarrow                       12.0.0
pyasn1                        0.5.0
pyasn1-modules                0.3.0
pybind11                      2.11.1
pycld2                        0.41
pycorrector                   1.0.0
pycparser                     2.21
pycryptodome                  3.8.2
pydantic                      1.10.13
pydeck                        0.8.1b0
PyDispatcher                  2.0.7
pydub                         0.25.1
pyee                          8.2.2
PyExecJS                      1.5.1
pyfreeproxy                   0.1.4
Pygments                      2.16.1
pygoogletranslation           2.0.6
pyhostman                     0.1.3
PyICU                         2.12
pymdown-extensions            10.5
PyMySQL                       1.1.0
pynvml                        11.4.1
pyOpenSSL                     23.3.0
pypandoc                      1.12
pyparsing                     3.1.1
pypdf                         3.17.0
PyPDF2                        3.0.1
pypdfium2                     4.22.0
pyphen                        0.14.0
pypinyin                      0.49.0
pyppeteer                     1.0.2
pyquery                       2.0.0
PySocks                       1.7.1
pyspark                       3.5.0
python-dateutil               2.8.2
python-docx                   1.0.1
python-dotenv                 1.0.0
python-hosts                  1.0.5
python-Levenshtein            0.23.0
python-multipart              0.0.6
python-pptx                   0.6.22
pytorch-lightning             2.0.6
pytz                          2023.3.post1
PyWavelets                    1.4.1
PyYAML                        6.0.1
pyyaml_env_tag                0.1
qudida                        0.0.4
queuelib                      1.6.2
rapidfuzz                     3.4.0
ray                           2.9.0
readability                   0.3.1
readability-lxml              0.8.1
recommonmark                  0.7.1
redis                         4.3.4
referencing                   0.30.2
regex                         2023.10.3
requests                      2.28.1
requests-file                 1.5.1
requests-html                 0.10.0
resize-right                  0.0.2
responses                     0.18.0
rfc3986                       1.5.0
rich                          12.6.0
rjieba                        0.1.11
roformer                      0.4.3
rpds-py                       0.10.6
ruamel.yaml                   0.18.3
ruamel.yaml.clib              0.2.8
s3transfer                    0.7.0
sacremoses                    0.0.53
safetensors                   0.4.0
sanic                         23.6.0
sanic-routing                 23.6.0
scalene                       1.5.31.1
schedule                      1.2.1
scikit-image                  0.21.0
scikit-learn                  1.3.2
scipdf                        0.1.dev0
scipy                         1.10.1
sconf                         0.2.5
scrapeasy                     0.12
Scrapy                        2.11.0
selectolax                    0.3.17
selenium                      4.14.0
semantic-version              2.10.0
sentencepiece                 0.1.99
sentry-sdk                    1.32.0
service-identity              23.1.0
setproctitle                  1.3.3
setuptools                    68.0.0
sgmllib3k                     1.0.0
shortuuid                     1.0.11
simhash-py                    0.4.0
six                           1.12.0
smart-open                    6.4.0
smmap                         5.0.1
sniffio                       1.3.0
snowballstemmer               2.2.0
sortedcontainers              2.4.0
soupsieve                     2.5
spacy                         3.5.0
spacy-legacy                  3.0.12
spacy-loggers                 1.0.5
spacy-pkuseg                  0.0.33
SpeechRecognition             3.8.1
Sphinx                        7.1.2
sphinx-autobuild              2021.3.14
sphinx-rtd-theme              1.3.0
sphinxcontrib-applehelp       1.0.4
sphinxcontrib-devhelp         1.0.2
sphinxcontrib-htmlhelp        2.0.1
sphinxcontrib-jquery          4.1
sphinxcontrib-jsmath          1.0.1
sphinxcontrib-qthelp          1.0.3
sphinxcontrib-serializinghtml 1.1.5
SQLAlchemy                    2.0.23
srsly                         2.4.8
stanza                        1.1.1
starlette                     0.27.0
statsmodels                   0.14.0
streamlit                     1.27.2
svgwrite                      1.4.3
sympy                         1.12
tabulate                      0.9.0
tblib                         3.0.0
tenacity                      8.2.3
termcolor                     2.3.0
textstat                      0.7.3
textual                       0.46.0
thinc                         8.1.12
threadpoolctl                 3.2.0
tifffile                      2023.7.10
tiktoken                      0.5.1
timm                          0.5.4
tinysegmenter                 0.3
tld                           0.13
tldextract                    5.0.1
tokenizers                    0.15.0
toml                          0.10.2
toolz                         0.12.0
torch                         2.0.1
torch-ema                     0.3
torch4keras                   0.1.5
torchaudio                    2.0.1+cu118
torchmetrics                  1.2.0
torchvision                   0.15.1+cu118
tornado                       6.3.3
tqdm                          4.66.1
tracerite                     1.1.0
trafilatura                   1.6.2
transformers                  4.35.2
translatehtml                 1.5.2
translators                   5.8.9
trio                          0.22.2
trio-websocket                0.11.1
triton                        2.0.0
Twisted                       22.10.0
typer                         0.7.0
typeshed-client               2.4.0
typing_extensions             4.8.0
tzdata                        2023.3
tzlocal                       5.2
uc-micro-py                   1.0.2
ujson                         5.8.0
Unidecode                     1.3.7
uritools                      4.0.2
urlextract                    1.8.0
urllib3                       1.26.18
uvicorn                       0.24.0.post1
uvloop                        0.19.0
validators                    0.22.0
virtualenv                    20.24.6
w3lib                         2.1.2
waitress                      2.1.2
wandb                         0.15.12
warcio                        1.7.4
wasabi                        1.1.2
watchdog                      3.0.0
wavedrom                      2.0.3.post3
wcwidth                       0.2.8
websockets                    10.4
Werkzeug                      2.2.2
wget                          3.2
wheel                         0.41.2
wrapt                         1.16.0
wsproto                       1.2.0
xlrd                          1.2.0
XlsxWriter                    3.1.9
xorbits                       0.7.1
xoscar                        0.1.4
xxhash                        3.4.1
yarl                          1.9.2
zipfile36                     0.1.3
zipp                          3.17.0
zope.event                    5.0
zope.interface                6.1
zstandard                     0.22.0


zhijianma · 2024-01-04T06:43:48Z

Yes, we will change the data type of the simhash value to string, because pyarrow is currently incompatible with uint64.
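
The idea of storing the fingerprint as a string can be sketched as follows. This is an illustration only, not the code from the linked pull request: `encode_simhash`, `decode_simhash`, and `hamming_distance` are hypothetical names, and the decimal string encoding is an assumption.

```python
def encode_simhash(value: int) -> str:
    """Store an unsigned 64-bit fingerprint as a decimal string."""
    if not 0 <= value < 2**64:
        raise ValueError("value is outside the unsigned 64-bit range")
    return str(value)


def decode_simhash(text: str) -> int:
    """Recover the integer fingerprint from its string form."""
    return int(text)


def hamming_distance(a: str, b: str) -> int:
    """Number of differing bits between two string-encoded fingerprints."""
    return bin(decode_simhash(a) ^ decode_simhash(b)).count("1")


big = 2**64 - 1                  # does not fit in a signed 64-bit integer
stored = encode_simhash(big)     # a plain string, safe for any column type
assert decode_simhash(stored) == big
print(hamming_distance(stored, encode_simhash(big - 1)))  # prints 1
```

Python integers are arbitrary precision, so the round trip through `str` and `int` loses nothing; the cost is that comparisons need a decode step.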

simplew2011 added the bug Something isn't working label Jan 3, 2024

github-project-automation bot added this to data-juicer Jan 3, 2024

github-project-automation bot moved this to Todo in data-juicer Jan 3, 2024

HYLcool moved this from Todo to In Progress in data-juicer Jan 4, 2024

HYLcool assigned zhijianma Jan 4, 2024

zhijianma linked a pull request Jan 4, 2024 that will close this issue: fix: change datatype of simhash to string, because pyarrow is incompatible with uint64 #170 (Merged)

zhijianma closed this as completed in #170 Jan 4, 2024

github-project-automation bot moved this from In Progress to Done in data-juicer Jan 4, 2024
