
Tokenizer's normalization preprocessor causes misalignment in return_offsets_mapping for token classification task #2532

Closed
cosmeowpawlitan opened this issue Jun 22, 2021 · 2 comments

cosmeowpawlitan (Contributor) commented on Jun 22, 2021:

This Colab notebook implements a token classification input pipeline extending the logic from this Hugging Face example.

The pipeline works fine with most instances in different languages, but unfortunately the Japanese kana ligature 'ヿ' (a form of abbreviation? I don't know Japanese well) breaks the alignment of return_offsets_mapping:
[screenshot: notebook output showing the misaligned offset mapping]

Without the try/except block, it raises ValueError: NumPy boolean array indexing assignment cannot assign 88 input values to the 87 output values where the mask is true; an example is shown here (another Colab notebook).
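
For context, here is a minimal sketch of the kind of label-alignment step that example uses (reconstructed from the transformers token classification guide; the function name encode_tags and the exact masking logic are assumptions, the author's actual code is in the notebook):

```python
import numpy as np

def encode_tags(tags, encodings):
    """Align word-level tags to subword tokens via the offset mapping."""
    encoded_labels = []
    for doc_labels, doc_offset in zip(tags, encodings.offset_mapping):
        # every subword position starts as the ignore index
        doc_enc_labels = np.ones(len(doc_offset), dtype=int) * -100
        arr_offset = np.array(doc_offset)
        # tokens whose offsets start at 0 and end past 0 are treated as the
        # first subword of a word and receive that word's label
        mask = (arr_offset[:, 0] == 0) & (arr_offset[:, 1] != 0)
        # when normalization changes the character count (e.g. 'ヿ' -> 'コト'),
        # mask.sum() can disagree with len(doc_labels) and this assignment
        # raises the ValueError quoted above
        doc_enc_labels[mask] = doc_labels
        encoded_labels.append(doc_enc_labels.tolist())
    return encoded_labels
```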

It is clear that the normalizer is the step that breaks the alignment: tokenizer._tokenizer.normalizer.normalize_str('ヿ') returns 'コト'.
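
The observation can be reproduced in a few lines (a quick sketch; only the tokenizer name comes from the report, the example input is illustrative):

```python
from transformers import XLMRobertaTokenizerFast

tokenizer = XLMRobertaTokenizerFast.from_pretrained("xlm-roberta-large")

# the backend normalizer expands the kana ligature into two characters
print(tokenizer._tokenizer.normalizer.normalize_str("ヿ"))  # 'コト'

# inspect how the returned offsets behave for the expanded character; the
# expansion is what throws off the downstream label alignment
enc = tokenizer("ヿ", return_offsets_mapping=True, add_special_tokens=False)
print(enc["offset_mapping"])
```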

One workaround is to apply tokenizer._tokenizer.normalizer.normalize_str to the input before the tokenizer preprocessing pipeline; this is provided in the first Colab notebook under the name udposTestDatasetWorkaround, and a sketch of the idea is shown below.
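
A rough sketch of that workaround, assuming pre-split input words (the example sentence and helper code here are illustrative; the actual implementation is the udposTestDatasetWorkaround function in the notebook):

```python
from transformers import XLMRobertaTokenizerFast

tokenizer = XLMRobertaTokenizerFast.from_pretrained("xlm-roberta-large")
normalizer = tokenizer._tokenizer.normalizer

# normalize each word up front so that the offsets the tokenizer returns are
# computed against text whose length will not change during tokenization
words = ["これ", "は", "ヿ", "です"]  # example sentence, pre-split into words
normalized_words = [normalizer.normalize_str(w) for w in words]

encoding = tokenizer(
    normalized_words,
    is_split_into_words=True,
    return_offsets_mapping=True,
    truncation=True,
)
```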

I guess similar logic should be included inside the tokenizer and the offsets_mapping generation process, so that users don't need to handle it in their own code. But I don't understand the tokenizer code well enough to do this myself.

P.S.
I am using my own dataset-building script in the provided example, but the script should be equivalent to the changes made by this update;
get_dataset is just a thin wrapper around load_dataset,
and the tokenizer is just XLMRobertaTokenizerFast.from_pretrained("xlm-roberta-large").

cosmeowpawlitan added the bug label on Jun 22, 2021
albertvillanova (Member) commented:

Hi @JerryIsHere, thanks for reporting the issue. But are you sure this is a bug in HuggingFace Datasets?

cosmeowpawlitan (Contributor, Author) commented:

> Hi @JerryIsHere, thanks for reporting the issue. But are you sure this is a bug in HuggingFace Datasets?

Oh, I am sorry. I will reopen this issue on huggingface/transformers.
