[BREAKING CHANGE] Ignore added_tokens (both special and not) in the decoder #1513

Merged · 8 commits merged into main from break_decoder_added_tokens on May 6, 2024

Conversation

Narsil (Collaborator) commented Apr 24, 2024

This causes issues with `ByteLevel` messing up some `AddedTokens` when they contain characters in the UTF-8 range used by the byte-level mapping.

This commit tests the extent of the damage of ignoring the decoder for those tokens.

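For a concrete picture of the failure mode, here is a minimal sketch using the Python bindings; the empty `BPE` model and the token string are illustrative assumptions, not taken from this PR:

```python
from tokenizers import Tokenizer, AddedToken
from tokenizers.models import BPE
from tokenizers.decoders import ByteLevel

# Empty model: the input below is matched entirely by the added token.
tok = Tokenizer(BPE())
tok.decoder = ByteLevel()

# "Ċ" (U+010A) is the byte-level stand-in for "\n". Any added token whose
# characters overlap the byte-level alphabet gets remapped on decode.
tok.add_tokens([AddedToken("<|endĊ|>")])

ids = tok.encode("<|endĊ|>").ids
# Before this change, the ByteLevel decoder was also applied to the added
# token, mapping "Ċ" back to a raw newline and corrupting its surface form.
print(tok.decode(ids))
```

With this PR, the added token's stored content is emitted as-is instead of being run through the decoder.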
Narsil requested a review from ArthurZucker on April 24, 2024 at 15:51

ArthurZucker (Collaborator) left a comment

LGTM, let's add a test for the decoder to make sure the behavior we are looking for is enabled.

Comment on lines 863 to 865
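// Separate from previously decoded text with a single space before
// appending the added token's content.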
if !result.is_empty() {
    result.push(' ');
}
ArthurZucker (Collaborator) commented

why are we adding an extra space before the added token? IMO we should only do this if there is no decoder (default join on " ")

Narsil (Collaborator Author) commented

Because this breaks existing tests otherwise (in Node and Python, there are tests that just add a regular token and expect the decoding to happen).

I've updated to make the tests pass without modification; we could envision changing those tests.

ArthurZucker (Collaborator) commented

These tests are done when there is no decoder, no?
I mean the previous behaviour does not add spaces between added tokens if there is a decoder.

Narsil (Collaborator Author) commented

Yes it did (because the test wasn't using a decoder, and by default no decoder means space-separated).
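For reference, a minimal sketch of that fallback (assumed setup, not lifted from the test suite): when no decoder is set, `decode` joins the tokens with a single space.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE

tok = Tokenizer(BPE())              # note: no decoder is set
tok.add_tokens(["hello", "world"])  # plain, non-special added tokens

ids = tok.encode("helloworld").ids
# With tok.decoder left as None, decoding falls back to " ".join(tokens).
print(tok.decode(ids))  # "hello world"
```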

thusinh1969 commented

Gents,
When will this fix be ready, please?

Thanks,
Steve

Narsil (Collaborator Author) commented Apr 29, 2024

FWIW I ran the tokenizers tests against transformers and didn't find any related error in the test suite.

RUN_SLOW=1 pytest -sv tests/ -k tokenizers
FAILED tests/tokenization/test_tokenization_utils.py::TokenizerUtilsTest::test_pretrained_tokenizers - AttributeError: type object 'GPT2Tokenizer' has no attribute 'max_model_input_sizes'
==== 1 failed, 562 passed, 36 skipped, 121 warnings in 72.36s (0:01:12) ====

I figure that failure has nothing to do with this PR.

Narsil (Collaborator Author) commented Apr 29, 2024

More tests, this time with actual failures:

pytest -sv tests/ -k tokenizer
FAILED tests/models/bartpho/test_tokenization_bartpho.py::BartphoTokenizerTest::test_clean_up_tokenization_spaces - assert "[CLS]this sh...' ll go.[SEP]" == "[CLS] thi...
FAILED tests/models/blenderbot_small/test_tokenization_blenderbot_small.py::BlenderbotSmallTokenizerTest::test_clean_up_tokenization_spaces - assert "[CLS]this sh...' ll go.[SEP]" == "[CLS] thi...
FAILED tests/models/idefics/test_processor_idefics.py::IdeficsProcessorTest::test_tokenizer_left_padding - AssertionError: '<unk[20 chars]><unk><unk><unk><unk...
FAILED tests/models/idefics/test_processor_idefics.py::IdeficsProcessorTest::test_tokenizer_padding - AssertionError: '<s>Describe this image.\nAssistant...
FAILED tests/models/luke/test_tokenization_luke.py::LukeTokenizerTest::test_clean_up_tokenization_spaces - assert "[CLS]this sh...' ll go.[SEP]" == "[CLS] thi...
FAILED tests/models/mluke/test_tokenization_mluke.py::MLukeTokenizerTest::test_clean_up_tokenization_spaces - assert "[CLS]this sh...' ll go.[SEP]" == "[CLS] thi...
FAILED tests/models/mpnet/test_tokenization_mpnet.py::MPNetTokenizerTest::test_clean_up_tokenization_spaces - assert "[CLS]this sh...' ll go.[SEP]" == "[CLS] thi...
FAILED tests/models/speech_to_text/test_tokenization_speech_to_text.py::SpeechToTextTokenizerTest::test_clean_up_tokenization_spaces - assert "[CLS]this sh...' ll go.[SEP]" == "[CLS] thi...
FAILED tests/models/speech_to_text_2/test_tokenization_speech_to_text_2.py::SpeechToTextTokenizerTest::test_clean_up_tokenization_spaces - assert "[CLS]this sh...' ll go.[SEP]" == "[CLS] thi...
FAILED tests/models/speecht5/test_tokenization_speecht5.py::SpeechT5TokenizerTest::test_clean_up_tokenization_spaces - assert "[CLS]this sh...' ll go.[SEP]" == "[CLS] thi...
FAILED tests/models/vits/test_tokenization_vits.py::VitsTokenizerTest::test_clean_up_tokenization_spaces - assert "[CLS]this sh...' ll go.[SEP]" == "[CLS] thi...
FAILED tests/models/wav2vec2/test_tokenization_wav2vec2.py::Wav2Vec2CTCTokenizerTest::test_clean_up_tokenization_spaces - assert "[CLS]this sh...' ll go.[SEP]" == "[CLS] thi...
FAILED tests/models/whisper/test_tokenization_whisper.py::WhisperTokenizerTest::test_clean_up_tokenization_spaces - assert "[CLS]this sh...' ll go.[SEP]" == "[CLS] thi...

Narsil requested a review from ArthurZucker on April 29, 2024 at 16:28
Narsil merged commit 25aee8b into main on May 6, 2024
12 checks passed
Narsil deleted the break_decoder_added_tokens branch on May 6, 2024 at 09:49
ArthurZucker (Collaborator) commented
The failing tests are basically the same one run on bert.

ArthurZucker (Collaborator) left a comment

Actually this kind of breaks the way we decode: because we no longer decode everything at once, Metaspace or anything that relies on the index of the token is broken.
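To make the index dependence concrete, here is a small sketch (assumed inputs, not from the PR): the `Metaspace` decoder strips the leading space only for the token at index 0, so decoding tokens in separate chunks around skipped added tokens changes which token counts as first.

```python
from tokenizers.decoders import Metaspace

dec = Metaspace()  # replaces "▁" with " " and strips it on the first token

print(dec.decode(["▁Hello", "▁world"]))  # "Hello world"
# If "▁world" is decoded in its own chunk (e.g. after an added token that the
# decoder never sees), it is treated as index 0 and loses its leading space:
print(dec.decode(["▁world"]))            # "world" instead of " world"
```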

ArthurZucker (Collaborator) commented
I'll fix it

ArthurZucker added a commit that referenced this pull request Jul 12, 2024

Revert "[BREAKING CHANGE] Ignore added_tokens (both special and not) in the decoder (#1513)" (#1569)

* Revert "[BREAKING CHANGE] Ignore added_tokens (both special and not) in the decoder (#1513)"

This reverts commit 25aee8b.

* don't remove audit

* deprecate id_to_token

* use simple id to token

* don't break id_to_token since we are deprecating anyways?