[BREAKING CHANGE] Ignore added_tokens (both special and not) in the decoder #1513

Merged · 8 commits merged into main from break_decoder_added_tokens on May 6, 2024

Conversation

Narsil (Collaborator) commented Apr 24, 2024

This causes issues with `ByteLevel` messing up some `AddedTokens` when they contain characters in the UTF-8 range used by the byte-level mapping.

This commit tests the extent of the damage of ignoring the decoder for those tokens.

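For a concrete picture of the failure mode, here is a minimal sketch using the Python bindings; the empty `BPE` model and the token string are illustrative assumptions, not taken from this PR:

```python
from tokenizers import Tokenizer, AddedToken
from tokenizers.models import BPE
from tokenizers.decoders import ByteLevel

# Empty model: the input below is matched entirely by the added token.
tok = Tokenizer(BPE())
tok.decoder = ByteLevel()

# "Ċ" (U+010A) is the byte-level stand-in for "\n". Any added token whose
# characters overlap the byte-level alphabet gets remapped on decode.
tok.add_tokens([AddedToken("<|endĊ|>")])

ids = tok.encode("<|endĊ|>").ids
# Before this change, the ByteLevel decoder was also applied to the added
# token, mapping "Ċ" back to a raw newline and corrupting its surface form.
print(tok.decode(ids))
```

With this PR, the added token's stored content is emitted as-is instead of being run through the decoder.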
Narsil requested a review from ArthurZucker on April 24, 2024 at 15:51

ArthurZucker (Collaborator) left a comment

LGTM, let's add a test for the decoder to make sure the behavior we are looking for is enabled.

Comment on lines 863 to 865
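// Separate from previously decoded text with a single space before
// appending the added token's content.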
if !result.is_empty() {
    result.push(' ');
}
ArthurZucker (Collaborator) commented

why are we adding an extra space before the added token? IMO we should only do this if there is no decoder (default join on " ")

Narsil (Collaborator Author) commented

Because this breaks existing tests otherwise (in Node and Python, there are tests that just add a regular token and expect the decoding to happen).

I've updated to make the tests pass without modification; we could envision changing those tests.

ArthurZucker (Collaborator) commented

These tests are done when there is no decoder, no?
I mean the previous behaviour does not add spaces between added tokens if there is a decoder.

Narsil (Collaborator Author) commented

Yes it did (because the test wasn't using a decoder, and by default no decoder means space-separated).
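For reference, a minimal sketch of that fallback (assumed setup, not lifted from the test suite): when no decoder is set, `decode` joins the tokens with a single space.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE

tok = Tokenizer(BPE())              # note: no decoder is set
tok.add_tokens(["hello", "world"])  # plain, non-special added tokens

ids = tok.encode("helloworld").ids
# With tok.decoder left as None, decoding falls back to " ".join(tokens).
print(tok.decode(ids))  # "hello world"
```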

thusinh1969 commented

Gents,
When will this fix be ready, please?

Thanks,
Steve

Narsil (Collaborator Author) commented Apr 29, 2024

FWIW I ran the tokenizers tests against transformers and didn't find any related error in the test suite.

RUN_SLOW=1 pytest -sv tests/ -k tokenizers
FAILED tests/tokenization/test_tokenization_utils.py::TokenizerUtilsTest::test_pretrained_tokenizers - AttributeError: type object 'GPT2Tokenizer' has no attribute 'max_model_input_sizes'
==== 1 failed, 562 passed, 36 skipped, 121 warnings in 72.36s (0:01:12) ====

I figure that failure has nothing to do with this PR.

Narsil (Collaborator Author) commented Apr 29, 2024

More tests, this time with actual failures:

pytest -sv tests/ -k tokenizer
FAILED tests/models/bartpho/test_tokenization_bartpho.py::BartphoTokenizerTest::test_clean_up_tokenization_spaces - assert "[CLS]this sh...' ll go.[SEP]" == "[CLS] thi...
FAILED tests/models/blenderbot_small/test_tokenization_blenderbot_small.py::BlenderbotSmallTokenizerTest::test_clean_up_tokenization_spaces - assert "[CLS]this sh...' ll go.[SEP]" == "[CLS] thi...
FAILED tests/models/idefics/test_processor_idefics.py::IdeficsProcessorTest::test_tokenizer_left_padding - AssertionError: '<unk[20 chars]><unk><unk><unk><unk...
FAILED tests/models/idefics/test_processor_idefics.py::IdeficsProcessorTest::test_tokenizer_padding - AssertionError: '<s>Describe this image.\nAssistant...
FAILED tests/models/luke/test_tokenization_luke.py::LukeTokenizerTest::test_clean_up_tokenization_spaces - assert "[CLS]this sh...' ll go.[SEP]" == "[CLS] thi...
FAILED tests/models/mluke/test_tokenization_mluke.py::MLukeTokenizerTest::test_clean_up_tokenization_spaces - assert "[CLS]this sh...' ll go.[SEP]" == "[CLS] thi...
FAILED tests/models/mpnet/test_tokenization_mpnet.py::MPNetTokenizerTest::test_clean_up_tokenization_spaces - assert "[CLS]this sh...' ll go.[SEP]" == "[CLS] thi...
FAILED tests/models/speech_to_text/test_tokenization_speech_to_text.py::SpeechToTextTokenizerTest::test_clean_up_tokenization_spaces - assert "[CLS]this sh...' ll go.[SEP]" == "[CLS] thi...
FAILED tests/models/speech_to_text_2/test_tokenization_speech_to_text_2.py::SpeechToTextTokenizerTest::test_clean_up_tokenization_spaces - assert "[CLS]this sh...' ll go.[SEP]" == "[CLS] thi...
FAILED tests/models/speecht5/test_tokenization_speecht5.py::SpeechT5TokenizerTest::test_clean_up_tokenization_spaces - assert "[CLS]this sh...' ll go.[SEP]" == "[CLS] thi...
FAILED tests/models/vits/test_tokenization_vits.py::VitsTokenizerTest::test_clean_up_tokenization_spaces - assert "[CLS]this sh...' ll go.[SEP]" == "[CLS] thi...
FAILED tests/models/wav2vec2/test_tokenization_wav2vec2.py::Wav2Vec2CTCTokenizerTest::test_clean_up_tokenization_spaces - assert "[CLS]this sh...' ll go.[SEP]" == "[CLS] thi...
FAILED tests/models/whisper/test_tokenization_whisper.py::WhisperTokenizerTest::test_clean_up_tokenization_spaces - assert "[CLS]this sh...' ll go.[SEP]" == "[CLS] thi...

Narsil requested a review from ArthurZucker on April 29, 2024 at 16:28
Narsil merged commit 25aee8b into main on May 6, 2024
12 checks passed
Narsil deleted the break_decoder_added_tokens branch on May 6, 2024 at 09:49
ArthurZucker (Collaborator) commented
The failing tests are basically the same one run on bert.

ArthurZucker (Collaborator) left a comment

Actually this kind of breaks the way we decode: because we no longer decode everything at once, Metaspace or anything that relies on the index of the token is broken.
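To make the index dependence concrete, here is a small sketch (assumed inputs, not from the PR): the `Metaspace` decoder strips the leading space only for the token at index 0, so decoding tokens in separate chunks around skipped added tokens changes which token counts as first.

```python
from tokenizers.decoders import Metaspace

dec = Metaspace()  # replaces "▁" with " " and strips it on the first token

print(dec.decode(["▁Hello", "▁world"]))  # "Hello world"
# If "▁world" is decoded in its own chunk (e.g. after an added token that the
# decoder never sees), it is treated as index 0 and loses its leading space:
print(dec.decode(["▁world"]))            # "world" instead of " world"
```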

ArthurZucker (Collaborator) commented
I'll fix it

ArthurZucker added a commit that referenced this pull request Jul 12, 2024

Revert "[BREAKING CHANGE] Ignore added_tokens (both special and not) in the decoder (#1513)" (#1569)

* Revert "[BREAKING CHANGE] Ignore added_tokens (both special and not) in the decoder (#1513)"

This reverts commit 25aee8b.

* don't remove audit

* deprecate id_to_token

* use simple id to token

* don't break id_to_token since we are deprecating anyways?