Add bytelevel normalizer to fix decode when adding tokens to BPE #1555

ArthurZucker · 2024-06-18T06:50:23Z

This reverts the previous breaking change.

It also adds a new ByteLevel normalizer, which replaces the ByteLevel pre_tokenizer.
I checked that we can add Chinese / Cyrillic tokens, which are properly encoded and decoded.

Fixes #1392
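
For readers unfamiliar with byte-level handling, here is a minimal sketch of the idea, assuming the GPT-2 style byte-to-character table. The function names `bytes_char_table`, `byte_level_encode` and `byte_level_decode` are hypothetical and are not taken from this PR; the sketch only illustrates why non-ASCII tokens need to go through the same mapping on the way in and on the way out.

```rust
use std::collections::HashMap;

/// GPT-2 style table (assumption): printable bytes map to themselves,
/// every other byte maps to a code point starting at U+0100.
fn bytes_char_table() -> HashMap<u8, char> {
    let mut bs: Vec<u8> = Vec::new();
    bs.extend(b'!'..=b'~');
    bs.extend(0xA1..=0xAC);
    bs.extend(0xAE..=0xFF);
    let mut cs: Vec<u32> = bs.iter().map(|b| *b as u32).collect();

    let mut n = 0;
    for b in 0..=255u8 {
        if !bs.contains(&b) {
            bs.push(b);
            cs.push(256 + n);
            n += 1;
        }
    }

    bs.into_iter()
        .zip(cs)
        .map(|(b, c)| (b, char::from_u32(c).unwrap()))
        .collect()
}

/// Map every UTF-8 byte of `s` to its table character.
fn byte_level_encode(s: &str, table: &HashMap<u8, char>) -> String {
    s.bytes().map(|b| table[&b]).collect()
}

/// Inverse mapping; returns None if a character is not in the table
/// or the recovered bytes are not valid UTF-8.
fn byte_level_decode(s: &str, table: &HashMap<u8, char>) -> Option<String> {
    let rev: HashMap<char, u8> = table.iter().map(|(b, c)| (*c, *b)).collect();
    let bytes: Option<Vec<u8>> = s.chars().map(|c| rev.get(&c).copied()).collect();
    String::from_utf8(bytes?).ok()
}
```

If an added token such as `嗎` is stored without this mapping while the decoder still applies the inverse mapping, decoding cannot recover the original string, which is the kind of mismatch this PR is concerned with.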

HuggingFaceDocBuilderDev · 2024-06-18T12:16:03Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

ArthurZucker · 2024-06-28T09:39:22Z

tokenizers/src/tokenizer/mod.rs

+        let tokens = ids
+            .iter()
+            .filter_map(|id| {
+                self.added_vocabulary
+                    .id_to_token(*id, &self.model)
+                    .filter(|token| {
+                        !skip_special_tokens || !self.added_vocabulary.is_special_token(token)
+                    })
+            })
+            .collect::<Vec<_>>();
+
+        if let Some(decoder) = &self.decoder {
+            decoder.decode(tokens)


Reverted to what we originally had.
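
The filtering step in the excerpt above can be sketched in isolation as follows. `ToyVocab` and its fields are hypothetical stand-ins for `added_vocabulary` and the model, not the library's types; only the `filter_map` / `filter` shape mirrors the diff.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical stand-in for the vocabulary lookups used in the diff.
struct ToyVocab {
    id_to_token: HashMap<u32, String>,
    special: HashSet<String>,
}

impl ToyVocab {
    /// Turn ids into tokens, dropping unknown ids and, when requested,
    /// special tokens.
    fn decode_tokens(&self, ids: &[u32], skip_special_tokens: bool) -> Vec<String> {
        ids.iter()
            .filter_map(|id| {
                self.id_to_token
                    .get(id)
                    // Keep the token unless we are skipping specials and it is one.
                    .filter(|token| !skip_special_tokens || !self.special.contains(*token))
                    .cloned()
            })
            .collect()
    }
}
```

The resulting token list is what gets handed to the decoder when one is configured.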

ArthurZucker · 2024-07-11T06:55:22Z

The test passes locally!

tokenizers/src/pre_tokenizers/byte_level.rs

tokenizers/src/tokenizer/added_vocabulary.rs

tokenizers/src/tokenizer/pre_tokenizer.rs

McPatate · 2024-07-11T08:35:13Z

tokenizers/src/normalizers/byte_level.rs

+    /// Strip the normalized string inplace
+    fn normalize(&self, normalized: &mut NormalizedString) -> Result<()> {


Why doesn't this live in NormalizedString like so:

    impl NormalizedString {
        fn normalize_byte_level(&mut self) -> Result<()> {
            // ...
        }
    }

?
Feels a bit weird to have an empty stateless struct for just a function, but may be due to the structure of tokenizers / python.
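
One possible reason for the stateless struct, offered as an assumption rather than as the maintainers' answer: a unit struct that implements a shared trait can be stored and composed next to other normalizers, which a plain method cannot. The sketch below uses a simplified, hypothetical `Normalizer` trait over `String` and toy normalizers; none of these definitions are the library's.

```rust
/// Simplified, hypothetical trait; the real one operates on NormalizedString.
trait Normalizer {
    fn normalize(&self, text: &mut String);
}

/// Stateless unit structs: no fields, but each is a distinct type
/// that can be boxed behind the trait.
struct Lowercase;
struct TrimSpaces;

impl Normalizer for Lowercase {
    fn normalize(&self, text: &mut String) {
        *text = text.to_lowercase();
    }
}

impl Normalizer for TrimSpaces {
    fn normalize(&self, text: &mut String) {
        *text = text.trim().to_string();
    }
}

/// A chain of normalizers applied in order.
struct Sequence(Vec<Box<dyn Normalizer>>);

impl Normalizer for Sequence {
    fn normalize(&self, text: &mut String) {
        for n in &self.0 {
            n.normalize(text);
        }
    }
}
```

With this shape, a pipeline can be assembled at runtime from any list of normalizers, including ones with no state.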

tokenizers/src/models/bpe/trainer.rs

add it fix

Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>

ArthurZucker mentioned this pull request Jun 19, 2024

Breaking changes in v0.19.1 for tiktoken/llama3 #1512

Closed

ArthurZucker force-pushed the fix-decode branch from dd6e99b to f53e514 Compare June 28, 2024 07:36

ArthurZucker marked this pull request as ready for review June 28, 2024 08:30

ArthurZucker requested a review from Narsil June 28, 2024 09:38

ArthurZucker commented Jun 28, 2024

ArthurZucker requested a review from McPatate July 11, 2024 06:55

McPatate reviewed Jul 11, 2024

ArthurZucker force-pushed the fix-decode branch from bf8f664 to f0c40e5 Compare July 12, 2024 05:34

ArthurZucker and others added 7 commits July 12, 2024 07:38

feature dependent test

3c4779b

nit about 嗎

9f0954c

update

6712fe3

actuallyfix it

00d032d

update the test

22613cc

add it fix

stub

dbbf905

Update tokenizers/src/pre_tokenizers/byte_level.rs

8213ad8

Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>

ArthurZucker force-pushed the fix-decode branch from f0c40e5 to 8213ad8 Compare July 12, 2024 05:39

skip failing test

7032172

ArthurZucker changed the title ~~Fix decode~~ Add bytelevel normalizer to fix decode when adding tokens to BPE Jul 15, 2024

add normalizer to init

ebf3a8e

ArthurZucker merged commit 4ea2f23 into main Jul 15, 2024
13 checks passed

ArthurZucker deleted the fix-decode branch July 15, 2024 10:12

ArthurZucker mentioned this pull request Oct 22, 2024

Is the BOS token id of 128000 **hardcoded** into the llama 3.2 tokenizer? huggingface/transformers#33998

Closed

4 tasks
