Got pyarrow error when loading a dataset while adding special tokens into the tokenizer #2206
Comments
Hi, the output of the tokenizers is treated specially in the lib to optimize the dataset size (see the code here). It looks like one of the values in the dictionary returned by the tokenizer is out of the assumed range.
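For illustration, here is a minimal sketch, independent of `datasets`, of how a pyarrow cast to a too-narrow integer type fails; the value and the int8 target mirror the traceback reported below:

```python
import pyarrow as pa

# Token ids produced after growing the vocabulary beyond the int8 range.
ids = pa.array([0, 1, 50259])

ids.cast(pa.int32())  # fine: int32 holds values up to ~2.1 billion
ids.cast(pa.int8())   # raises pyarrow.lib.ArrowInvalid:
                      # Integer value 50259 not in range: -128 to 127
```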
Hi @yana-xuyan, thanks for reporting. As @mariosasko clearly explained, this comes from that optimization. Maybe we could implement a way to disable it and allow using any integer value, although the size of the cache files would then be much larger.
I'm facing the same issue @mariosasko @albertvillanova.
To reproduce:

```python
from transformers import AutoTokenizer

SPECIAL_TOKENS = ['<bos>', '<eos>', '<speaker1>', '<speaker2>', '<pad>']
ATTR_TO_SPECIAL_TOKEN = {
    'bos_token': '<bos>',
    'eos_token': '<eos>',
    'pad_token': '<pad>',
    'additional_special_tokens': ['<speaker1>', '<speaker2>'],
}

tokenizer = AutoTokenizer.from_pretrained("gpt2", use_fast=False)
num_added_tokens = tokenizer.add_special_tokens(ATTR_TO_SPECIAL_TOKEN)
vocab_size = len(tokenizer.encoder) + num_added_tokens
vocab = tokenizer.get_vocab()

pad_index = tokenizer.pad_token_id
eos_index = tokenizer.eos_token_id
bos_index = tokenizer.bos_token_id
speaker1_index = vocab["<speaker1>"]
speaker2_index = vocab["<speaker2>"]
```

Decoding one of the added ids shows the token was registered:

```python
tokenizer.decode([50260])
# '<speaker1>'
```
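The thread doesn't show the preprocessing function that fails, but a hedged sketch of the pattern that can trigger the error might look like this (continuing from the snippet above; the toy dataset and the use of `token_type_ids` to carry speaker ids are assumptions borrowed from common TransferTransfo-style dialogue code, not details stated in this thread):

```python
from datasets import Dataset

ds = Dataset.from_dict({"text": ["<bos> hello there <eos>"]})

def process_example(example):
    ids = tokenizer(example["text"])["input_ids"]
    return {
        "input_ids": ids,                               # optimized to int32: large ids still fit
        "token_type_ids": [speaker1_index] * len(ids),  # optimized to int8: 50260 does not fit
    }

# Without the fix discussed below, this raises
# pyarrow.lib.ArrowInvalid: Integer value 50260 not in range: -128 to 127
ds = ds.map(process_example)
```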
@mariosasko
Hi @gregg-ADP, this is still a bug. As @albertvillanova has suggested, maybe it's indeed worth adding a variable to disable this optimization. In the meantime, the forced optimization can be disabled by specifying the `features` of the returned examples in the `map` call:

```python
from datasets import *

...  # dataset init

ds.map(process_example, features=Features({"special_tokens_mask": Sequence(Value("int32")), ...}))  # plus the rest of the features
```

cc @lhoestq so he is also aware of this issue.
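Put together, a fuller sketch of this workaround (continuing with the illustrative names from the sketches above; note that `features` must describe every column of the resulting dataset, not only the problematic one):

```python
from datasets import Dataset, Features, Sequence, Value

ds = Dataset.from_dict({"text": ["<bos> hello there <eos>"]})

def process_example(example):
    ids = tokenizer(example["text"])["input_ids"]
    return {"input_ids": ids, "token_type_ids": [speaker1_index] * len(ids)}

# Spelling out the schema bypasses the per-column integer downcasting,
# so ids like 50260 are stored as int32 instead of being forced into int8.
features = Features({
    "text": Value("string"),
    "input_ids": Sequence(Value("int32")),
    "token_type_ids": Sequence(Value("int32")),
})
ds = ds.map(process_example, features=features)  # no ArrowInvalid
```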
Thanks for the quick reply @mariosasko. What I did was change the optimizer to use int32 instead of int8. Where 'pos' is the name of a new feature we added. Do you agree that your way of fixing the optimizer issue will not fix our new issue? If not, I will continue with this optimizer fix until we resolve the other issue.
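For context, the optimization in question boils down to picking a narrow Arrow integer type per column. A toy re-implementation of the idea (not the actual `datasets` code) shows why widening int8 to int32, as described above, makes the error disappear:

```python
import pyarrow as pa

# Toy version of the idea: try a narrow type first, fall back to a wider one.
def write_column(values, narrow=pa.int8(), wide=pa.int32()):
    arr = pa.array(values)
    try:
        return arr.cast(narrow)  # the old behavior hard-failed here on out-of-range values
    except pa.lib.ArrowInvalid:
        return arr.cast(wide)    # widening the type avoids the failure

print(write_column([0, 1, 50260]).type)  # int32
```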
Hi @gwc4github, the fix was merged a few minutes ago, and it doesn't require any changes on the user side (e.g. no need for specifying `features`). Feel free to install `datasets` from the master branch and let us know if it works for your use case!
Original issue description (from @yana-xuyan):

I added five more special tokens to the GPT2 tokenizer. But after that, when I try to pre-process the data using my previous code, I get the error shown below:
```
Traceback (most recent call last):
  File "/home/xuyan/anaconda3/envs/convqa/lib/python3.7/site-packages/datasets/arrow_dataset.py", line 1687, in _map_single
    writer.write(example)
  File "/home/xuyan/anaconda3/envs/convqa/lib/python3.7/site-packages/datasets/arrow_writer.py", line 296, in write
    self.write_on_file()
  File "/home/xuyan/anaconda3/envs/convqa/lib/python3.7/site-packages/datasets/arrow_writer.py", line 270, in write_on_file
    pa_array = pa.array(typed_sequence)
  File "pyarrow/array.pxi", line 222, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 110, in pyarrow.lib.handle_arrow_array_protocol
  File "/home/xuyan/anaconda3/envs/convqa/lib/python3.7/site-packages/datasets/arrow_writer.py", line 108, in arrow_array
    out = out.cast(pa.list_(self.optimized_int_type))
  File "pyarrow/array.pxi", line 810, in pyarrow.lib.Array.cast
  File "/home/xuyan/anaconda3/envs/convqa/lib/python3.7/site-packages/pyarrow/compute.py", line 281, in cast
    return call_function("cast", [arr], options)
  File "pyarrow/_compute.pyx", line 465, in pyarrow._compute.call_function
  File "pyarrow/_compute.pyx", line 294, in pyarrow._compute.Function.call
  File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Integer value 50259 not in range: -128 to 127
```
Do you have any idea about it?