
Tokenizer's normalization preprocessor causes misalignment in return_offsets_mapping for token classification task #2532

Closed
cosmeowpawlitan opened this issue Jun 22, 2021 · 2 comments

cosmeowpawlitan (Contributor) commented on Jun 22, 2021:

This Colab notebook implements a token classification input pipeline extending the logic from this Hugging Face example.

The pipeline works fine with most instances in different languages, but unfortunately the Japanese kana ligature 'ヿ' (a form of abbreviation? I don't know Japanese well) breaks the alignment of return_offsets_mapping:
[screenshot: notebook output showing the misaligned offset mapping]

Without the try/except block, it raises ValueError: NumPy boolean array indexing assignment cannot assign 88 input values to the 87 output values where the mask is true; an example is shown here (another Colab notebook).
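
For context, here is a minimal sketch of the kind of label-alignment step that example uses (reconstructed from the transformers token classification guide; the function name encode_tags and the exact masking logic are assumptions, the author's actual code is in the notebook):

```python
import numpy as np

def encode_tags(tags, encodings):
    """Align word-level tags to subword tokens via the offset mapping."""
    encoded_labels = []
    for doc_labels, doc_offset in zip(tags, encodings.offset_mapping):
        # every subword position starts as the ignore index
        doc_enc_labels = np.ones(len(doc_offset), dtype=int) * -100
        arr_offset = np.array(doc_offset)
        # tokens whose offsets start at 0 and end past 0 are treated as the
        # first subword of a word and receive that word's label
        mask = (arr_offset[:, 0] == 0) & (arr_offset[:, 1] != 0)
        # when normalization changes the character count (e.g. 'ヿ' -> 'コト'),
        # mask.sum() can disagree with len(doc_labels) and this assignment
        # raises the ValueError quoted above
        doc_enc_labels[mask] = doc_labels
        encoded_labels.append(doc_enc_labels.tolist())
    return encoded_labels
```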

It is clear that the normalizer is the step that breaks the alignment: tokenizer._tokenizer.normalizer.normalize_str('ヿ') returns 'コト'.
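
The observation can be reproduced in a few lines (a quick sketch; only the tokenizer name comes from the report, the example input is illustrative):

```python
from transformers import XLMRobertaTokenizerFast

tokenizer = XLMRobertaTokenizerFast.from_pretrained("xlm-roberta-large")

# the backend normalizer expands the kana ligature into two characters
print(tokenizer._tokenizer.normalizer.normalize_str("ヿ"))  # 'コト'

# inspect how the returned offsets behave for the expanded character; the
# expansion is what throws off the downstream label alignment
enc = tokenizer("ヿ", return_offsets_mapping=True, add_special_tokens=False)
print(enc["offset_mapping"])
```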

One workaround is to apply tokenizer._tokenizer.normalizer.normalize_str to the input before the tokenizer preprocessing pipeline; this is provided in the first Colab notebook under the name udposTestDatasetWorkaround, and a sketch of the idea is shown below.
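
A rough sketch of that workaround, assuming pre-split input words (the example sentence and helper code here are illustrative; the actual implementation is the udposTestDatasetWorkaround function in the notebook):

```python
from transformers import XLMRobertaTokenizerFast

tokenizer = XLMRobertaTokenizerFast.from_pretrained("xlm-roberta-large")
normalizer = tokenizer._tokenizer.normalizer

# normalize each word up front so that the offsets the tokenizer returns are
# computed against text whose length will not change during tokenization
words = ["これ", "は", "ヿ", "です"]  # example sentence, pre-split into words
normalized_words = [normalizer.normalize_str(w) for w in words]

encoding = tokenizer(
    normalized_words,
    is_split_into_words=True,
    return_offsets_mapping=True,
    truncation=True,
)
```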

I guess similar logic should be included inside the tokenizer and the offsets_mapping generation process, so that users don't need to handle it in their own code. But I don't understand the tokenizer code well enough to do this myself.

P.S.
I am using my own dataset-building script in the provided example, but the script should be equivalent to the changes made by this update;
get_dataset is just a thin wrapper around load_dataset,
and the tokenizer is just XLMRobertaTokenizerFast.from_pretrained("xlm-roberta-large").

cosmeowpawlitan added the bug label on Jun 22, 2021
albertvillanova (Member) commented:

Hi @JerryIsHere, thanks for reporting the issue. But are you sure this is a bug in HuggingFace Datasets?

cosmeowpawlitan (Contributor, Author) commented:

> Hi @JerryIsHere, thanks for reporting the issue. But are you sure this is a bug in HuggingFace Datasets?

Oh, I am sorry. I will reopen this issue on huggingface/transformers.
