
Got pyarrow error when loading a dataset while adding special tokens into the tokenizer #2206

Closed
yana-xuyan opened this issue Apr 11, 2021 · 7 comments · Fixed by #3234
Labels
bug Something isn't working

Comments

@yana-xuyan

I added five more special tokens to the GPT2 tokenizer. But after that, when I try to pre-process the data using my previous code, I get the error shown below:

Traceback (most recent call last):
File "/home/xuyan/anaconda3/envs/convqa/lib/python3.7/site-packages/datasets/arrow_dataset.py", line 1687, in _map_single
writer.write(example)
File "/home/xuyan/anaconda3/envs/convqa/lib/python3.7/site-packages/datasets/arrow_writer.py", line 296, in write
self.write_on_file()
File "/home/xuyan/anaconda3/envs/convqa/lib/python3.7/site-packages/datasets/arrow_writer.py", line 270, in write_on_file
pa_array = pa.array(typed_sequence)
File "pyarrow/array.pxi", line 222, in pyarrow.lib.array
File "pyarrow/array.pxi", line 110, in pyarrow.lib.handle_arrow_array_protocol
File "/home/xuyan/anaconda3/envs/convqa/lib/python3.7/site-packages/datasets/arrow_writer.py", line 108, in arrow_array
out = out.cast(pa.list_(self.optimized_int_type))
File "pyarrow/array.pxi", line 810, in pyarrow.lib.Array.cast
File "/home/xuyan/anaconda3/envs/convqa/lib/python3.7/site-packages/pyarrow/compute.py", line 281, in cast
return call_function("cast", [arr], options)
File "pyarrow/_compute.pyx", line 465, in pyarrow._compute.call_function
File "pyarrow/_compute.pyx", line 294, in pyarrow._compute.Function.call
File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Integer value 50259 not in range: -128 to 127

Do you have any idea what might be causing this?

@mariosasko
Collaborator

Hi,

the output of the tokenizers is treated specially in the lib to optimize the dataset size (see the code here). It looks like one of the values in a dictionary returned by the tokenizer is out of the assumed range.
Can you please provide a minimal reproducible example for more help?
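
For context, a rough sketch of the optimization being referred to (an assumption pieced together from this thread and the traceback above, not the exact datasets code): tokenizer output columns are downcast by name to save cache space, roughly like this:

# Assumed column-name -> narrow integer type mapping used when writing tokenizer output
OPTIMIZED_INT_TYPE_BY_COL = {
    "attention_mask": "int8",        # expected to hold only 0/1
    "special_tokens_mask": "int8",   # expected to hold only 0/1
    "token_type_ids": "int8",        # expected to hold small segment ids
    "input_ids": "int32",            # token ids, typically well below 2**31
}

A token id such as 50259 landing in one of the int8 columns then fails the cast.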

@albertvillanova
Member

albertvillanova commented Apr 14, 2021

Hi @yana-xuyan, thanks for reporting.

As @mariosasko clearly explained, datasets performs some optimizations to reduce the size of the dataset cache files. One of them is storing the field special_tokens_mask as int8, which means that this field can only contain integers between -128 and 127. As your error message states, one of the values of this field is 50259, and therefore it cannot be stored as an int8.

Maybe we could implement a way to disable this optimization and allow any integer value, although the cache files would then be much larger.
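
The failing cast can be reproduced directly with pyarrow (a minimal sketch, independent of datasets):

import pyarrow as pa

# A list<int64> array holding a large token id, as a tokenizer output column might.
arr = pa.array([[0, 0, 1, 50259]])

# int8 only holds -128..127, so downcasting the column fails.
arr.cast(pa.list_(pa.int8()))
# pyarrow.lib.ArrowInvalid: Integer value 50259 not in range: -128 to 127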

@albertvillanova added the bug label Apr 14, 2021
@thomas-happify

I'm facing the same issue, @mariosasko @albertvillanova:

ArrowInvalid: Integer value 50260 not in range: -128 to 127

To reproduce:

from transformers import AutoTokenizer

SPECIAL_TOKENS = ['<bos>', '<eos>', '<speaker1>', '<speaker2>', '<pad>']
ATTR_TO_SPECIAL_TOKEN = {
    'bos_token': '<bos>',
    'eos_token': '<eos>',
    'pad_token': '<pad>',
    'additional_special_tokens': ['<speaker1>', '<speaker2>']
    }

tokenizer = AutoTokenizer.from_pretrained("gpt2", use_fast=False)
num_added_tokens = tokenizer.add_special_tokens(ATTR_TO_SPECIAL_TOKEN)
vocab_size = len(tokenizer.encoder) + num_added_tokens
vocab = tokenizer.get_vocab()

pad_index = tokenizer.pad_token_id
eos_index = tokenizer.eos_token_id
bos_index = tokenizer.bos_token_id
speaker1_index = vocab["<speaker1>"]
speaker2_index = vocab["<speaker2>"]

tokenizer.decode([50260])  # '<speaker1>'
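
The snippet stops before the actual map call; a hedged guess at the missing step (the dataset, column names, and process function below are illustrative, not from the original report) is that the speaker token ids end up in token_type_ids, a column datasets downcasts to int8:

from datasets import Dataset

ds = Dataset.from_dict({"text": ["<speaker1> hello there <speaker2> hi"]})

def process(example):
    input_ids = tokenizer(example["text"])["input_ids"]
    # TransferTransfo-style preprocessing fills token_type_ids with the speaker
    # token id (50260 here), which does not fit into the int8 used for that column.
    return {"input_ids": input_ids, "token_type_ids": [speaker1_index] * len(input_ids)}

ds.map(process)
# pyarrow.lib.ArrowInvalid: Integer value 50260 not in range: -128 to 127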

@gregg-ADP

@mariosasko
I am hitting this bug with the BERT tokenizer too. I see that @albertvillanova labeled this as a bug back in April. Has there been a fix released yet?
For now, I have just disabled the optimization in the HF library. @yana-xuyan and @thomas-happify, is that what you did, and did it work for you?

@mariosasko
Collaborator

Hi @gregg-ADP,

This is still a bug.

As @albertvillanova has suggested, maybe it's indeed worth adding a variable to config.py to have a way to disable this behavior.

In the meantime, this forced optimization can be disabled by specifying features (of the returned examples) in the map call:

from datasets import Features, Sequence, Value
...  # dataset init
ds = ds.map(process_example, features=Features({"special_tokens_mask": Sequence(Value("int32")), ...}))  # plus the rest of the features
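
A slightly fuller, hedged version of that workaround, with the usual tokenizer columns spelled out (the column names are assumptions; adjust them to whatever process_example actually returns). The original columns are dropped so that features only needs to describe the returned columns:

from datasets import Features, Sequence, Value

features = Features({
    "input_ids": Sequence(Value("int32")),
    "attention_mask": Sequence(Value("int8")),
    "token_type_ids": Sequence(Value("int32")),       # widened so large token ids fit
    "special_tokens_mask": Sequence(Value("int32")),  # widened as well
})
ds = ds.map(process_example, features=features, remove_columns=ds.column_names)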

cc @lhoestq so he is also aware of this issue

@gwc4github

Thanks for the quick reply @mariosasko. What I did was change the optimization to use int32 instead of int8.
What you're suggesting specifies the type for each feature explicitly without changing the HF code, which is definitely a better option. However, we are now hitting a new error later on:

  File "/Users/ccccc/PycharmProjects/aaaa-ml/venv-source/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
TypeError: forward() got an unexpected keyword argument 'pos'

Here 'pos' is the name of a new feature we added. Do you agree that your way of fixing the optimization issue will not fix this new issue? If not, I will continue with the optimization fix until we resolve the other issue.
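
As an aside on that TypeError (a hedged suggestion, not something from the maintainers): it usually means the extra 'pos' column is being forwarded to the model, so keeping only the columns forward() accepts, or removing the column, sidesteps it:

# Column names are illustrative; keep only what the model's forward() accepts.
ds.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])
# or drop the extra column entirely:
# ds = ds.remove_columns(["pos"])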

@mariosasko
Collaborator

Hi @gwc4github,

the fix was merged a few minutes ago, and it doesn't require any changes on the user side (e.g. no need for specifying features). If you find time, feel free to install datasets from master with:

pip install git+https://github.com/huggingface/datasets.git

and let us know if it works for your use case!
