
Got pyarrow error when loading a dataset while adding special tokens into the tokenizer #2206

Closed
yana-xuyan opened this issue Apr 11, 2021 · 7 comments · Fixed by #3234
Labels
bug Something isn't working

Comments

@yana-xuyan

I added five more special tokens to the GPT2 tokenizer. But after that, when I try to pre-process the data using my previous code, I get the error shown below:

Traceback (most recent call last):
File "/home/xuyan/anaconda3/envs/convqa/lib/python3.7/site-packages/datasets/arrow_dataset.py", line 1687, in _map_single
writer.write(example)
File "/home/xuyan/anaconda3/envs/convqa/lib/python3.7/site-packages/datasets/arrow_writer.py", line 296, in write
self.write_on_file()
File "/home/xuyan/anaconda3/envs/convqa/lib/python3.7/site-packages/datasets/arrow_writer.py", line 270, in write_on_file
pa_array = pa.array(typed_sequence)
File "pyarrow/array.pxi", line 222, in pyarrow.lib.array
File "pyarrow/array.pxi", line 110, in pyarrow.lib.handle_arrow_array_protocol
File "/home/xuyan/anaconda3/envs/convqa/lib/python3.7/site-packages/datasets/arrow_writer.py", line 108, in arrow_array
out = out.cast(pa.list_(self.optimized_int_type))
File "pyarrow/array.pxi", line 810, in pyarrow.lib.Array.cast
File "/home/xuyan/anaconda3/envs/convqa/lib/python3.7/site-packages/pyarrow/compute.py", line 281, in cast
return call_function("cast", [arr], options)
File "pyarrow/_compute.pyx", line 465, in pyarrow._compute.call_function
File "pyarrow/_compute.pyx", line 294, in pyarrow._compute.Function.call
File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Integer value 50259 not in range: -128 to 127

Do you have any idea what might be causing this?

@mariosasko
Collaborator

Hi,

the output of the tokenizers is treated specially in the lib to optimize the dataset size (see the code here). It looks like one of the values in a dictionary returned by the tokenizer is out of the assumed range.
Can you please provide a minimal reproducible example for more help?
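
For context, a rough sketch of the optimization being referred to (an assumption pieced together from this thread and the traceback above, not the exact datasets code): tokenizer output columns are downcast by name to save cache space, roughly like this:

# Assumed column-name -> narrow integer type mapping used when writing tokenizer output
OPTIMIZED_INT_TYPE_BY_COL = {
    "attention_mask": "int8",        # expected to hold only 0/1
    "special_tokens_mask": "int8",   # expected to hold only 0/1
    "token_type_ids": "int8",        # expected to hold small segment ids
    "input_ids": "int32",            # token ids, typically well below 2**31
}

A token id such as 50259 landing in one of the int8 columns then fails the cast.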

@albertvillanova
Member

albertvillanova commented Apr 14, 2021

Hi @yana-xuyan, thanks for reporting.

As @mariosasko clearly explained, datasets performs some optimizations to reduce the size of the dataset cache files. One of them is storing the field special_tokens_mask as int8, which means that this field can only contain integers between -128 and 127. As your error message states, one of the values of this field is 50259, and therefore it cannot be stored as an int8.

Maybe we could implement a way to disable this optimization and allow any integer value, although the cache files would then be much larger.
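
The failing cast can be reproduced directly with pyarrow (a minimal sketch, independent of datasets):

import pyarrow as pa

# A list<int64> array holding a large token id, as a tokenizer output column might.
arr = pa.array([[0, 0, 1, 50259]])

# int8 only holds -128..127, so downcasting the column fails.
arr.cast(pa.list_(pa.int8()))
# pyarrow.lib.ArrowInvalid: Integer value 50259 not in range: -128 to 127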

@albertvillanova added the bug label Apr 14, 2021
@thomas-happify

I'm facing the same issue, @mariosasko @albertvillanova:

ArrowInvalid: Integer value 50260 not in range: -128 to 127

To reproduce:

from transformers import AutoTokenizer

SPECIAL_TOKENS = ['<bos>', '<eos>', '<speaker1>', '<speaker2>', '<pad>']
ATTR_TO_SPECIAL_TOKEN = {
    'bos_token': '<bos>',
    'eos_token': '<eos>',
    'pad_token': '<pad>',
    'additional_special_tokens': ['<speaker1>', '<speaker2>']
    }

tokenizer = AutoTokenizer.from_pretrained("gpt2", use_fast=False)
num_added_tokens = tokenizer.add_special_tokens(ATTR_TO_SPECIAL_TOKEN)
vocab_size = len(tokenizer.encoder) + num_added_tokens
vocab = tokenizer.get_vocab()

pad_index = tokenizer.pad_token_id
eos_index = tokenizer.eos_token_id
bos_index = tokenizer.bos_token_id
speaker1_index = vocab["<speaker1>"]
speaker2_index = vocab["<speaker2>"]

tokenizer.decode([50260])  # '<speaker1>'
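
The snippet stops before the actual map call; a hedged guess at the missing step (the dataset, column names, and process function below are illustrative, not from the original report) is that the speaker token ids end up in token_type_ids, a column datasets downcasts to int8:

from datasets import Dataset

ds = Dataset.from_dict({"text": ["<speaker1> hello there <speaker2> hi"]})

def process(example):
    input_ids = tokenizer(example["text"])["input_ids"]
    # TransferTransfo-style preprocessing fills token_type_ids with the speaker
    # token id (50260 here), which does not fit into the int8 used for that column.
    return {"input_ids": input_ids, "token_type_ids": [speaker1_index] * len(input_ids)}

ds.map(process)
# pyarrow.lib.ArrowInvalid: Integer value 50260 not in range: -128 to 127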

@gregg-ADP

@mariosasko
I am hitting this bug with the BERT tokenizer too. I see that @albertvillanova labeled this as a bug back in April. Has there been a fix released yet?
For now, I have just disabled the optimization in the HF library. @yana-xuyan and @thomas-happify, is that what you did, and did it work for you?

@mariosasko
Collaborator

Hi @gregg-ADP,

This is still a bug.

As @albertvillanova has suggested, maybe it's indeed worth adding a variable to config.py to have a way to disable this behavior.

In the meantime, this forced optimization can be disabled by specifying features (of the returned examples) in the map call:

from datasets import Features, Sequence, Value
...  # dataset init
ds = ds.map(process_example, features=Features({"special_tokens_mask": Sequence(Value("int32")), ...}))  # plus the rest of the features
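
A slightly fuller, hedged version of that workaround, with the usual tokenizer columns spelled out (the column names are assumptions; adjust them to whatever process_example actually returns). The original columns are dropped so that features only needs to describe the returned columns:

from datasets import Features, Sequence, Value

features = Features({
    "input_ids": Sequence(Value("int32")),
    "attention_mask": Sequence(Value("int8")),
    "token_type_ids": Sequence(Value("int32")),       # widened so large token ids fit
    "special_tokens_mask": Sequence(Value("int32")),  # widened as well
})
ds = ds.map(process_example, features=features, remove_columns=ds.column_names)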

cc @lhoestq so he is also aware of this issue

@gwc4github

Thanks for the quick reply @mariosasko. What I did was change the optimization to use int32 instead of int8.
What you're suggesting specifies the type for each feature explicitly without changing the HF code, which is definitely a better option. However, we are now hitting a new error later on:

  File "/Users/ccccc/PycharmProjects/aaaa-ml/venv-source/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
TypeError: forward() got an unexpected keyword argument 'pos'

Here 'pos' is the name of a new feature we added. Do you agree that your way of fixing the optimization issue will not fix this new issue? If not, I will continue with the optimization fix until we resolve the other issue.
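
As an aside on that TypeError (a hedged suggestion, not something from the maintainers): it usually means the extra 'pos' column is being forwarded to the model, so keeping only the columns forward() accepts, or removing the column, sidesteps it:

# Column names are illustrative; keep only what the model's forward() accepts.
ds.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])
# or drop the extra column entirely:
# ds = ds.remove_columns(["pos"])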

@mariosasko
Collaborator

Hi @gwc4github,

the fix was merged a few minutes ago, and it doesn't require any changes on the user side (e.g. no need for specifying features). If you find time, feel free to install datasets from master with:

pip install git+https://github.com/huggingface/datasets.git

and let us know if it works for your use case!
