This colab notebook implements a token classification input pipeline extending the logic from this Hugging Face example. The pipeline works fine with most instances in different languages, but unfortunately the Japanese Kana ligature (a form of abbreviation? I don't know Japanese well) breaks the alignment of `return_offsets_mapping`.
Without the try/except block, it raises `ValueError: NumPy boolean array indexing assignment cannot assign 88 input values to the 87 output values where the mask is true`; an example is shown here (another colab notebook).

It is clear that the normalizer is the step that breaks the alignment, as `tokenizer._tokenizer.normalizer.normalize_str('ヿ')` returns `'コト'`.
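For reference, a minimal standalone snippet (not taken from the notebooks) that reproduces this; only the `normalize_str` behaviour is the observation reported here, the offset inspection is added for illustration:

```python
from transformers import XLMRobertaTokenizerFast

tokenizer = XLMRobertaTokenizerFast.from_pretrained("xlm-roberta-large")

# The backend normalizer expands the single Kana ligature into two characters,
# so the normalized text ends up longer than the original input.
print(tokenizer._tokenizer.normalizer.normalize_str('ヿ'))  # -> 'コト'

# Inspect how the reported offsets line up with the single original character;
# this expansion is what throws off the word/label alignment in the pipeline above.
encoding = tokenizer('ヿ', return_offsets_mapping=True, add_special_tokens=False)
print(encoding["offset_mapping"])
```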
One workaround is to apply `tokenizer._tokenizer.normalizer.normalize_str` before the tokenizer preprocessing pipeline; it is provided in the first colab notebook under the name `udposTestDatasetWorkaround`.

I guess similar logic should be included inside the tokenizer and the `offsets_mapping` generation process so that users don't need to include it in their own code, but I don't understand the tokenizer code well enough to do this myself.
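A rough sketch of what that pre-normalization step could look like; the `tokens` column name and the batched `map` usage are assumptions and may differ from the actual `udposTestDatasetWorkaround` in the notebook:

```python
def udpos_test_dataset_workaround(examples):
    # Normalize every word with the tokenizer's own normalizer up front, so the
    # text the tokenizer later sees (and computes offsets against) can no longer
    # be expanded in a way that shifts the offset/label alignment.
    normalize = tokenizer._tokenizer.normalizer.normalize_str
    examples["tokens"] = [
        [normalize(word) for word in words] for words in examples["tokens"]
    ]
    return examples

# applied before tokenization, e.g.:
# dataset = dataset.map(udpos_test_dataset_workaround, batched=True)
```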
P.S. I am using my own dataset building script in the provided example, but the script should be equivalent to the changes made by this update: `get_dataset` is just a simple wrapper around `load_dataset`, and the `tokenizer` is just `XLMRobertaTokenizerFast.from_pretrained("xlm-roberta-large")`.
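For clarity, roughly what those two names amount to (a hypothetical sketch, not the actual script):

```python
from datasets import load_dataset
from transformers import XLMRobertaTokenizerFast

def get_dataset(*args, **kwargs):
    # Thin wrapper around datasets.load_dataset, as described above.
    return load_dataset(*args, **kwargs)

tokenizer = XLMRobertaTokenizerFast.from_pretrained("xlm-roberta-large")
```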