Add display capabilities to tokenizers objects #1542

ArthurZucker · 2024-06-03T13:22:21Z

>>> from tokenizers import Tokenizer
>>> Tokenizer.from_pretrained("ArthurZ/new-t5-base")
Tokenizer(normalizer=normalizers.Sequence([normalizers.Precompiled(), normalizers.Strip(strip_left=false, strip_right=true), normalizers.Replace(pattern=Regex(" {2,}"), content="▁", regex=SysRegex { regex: Regex { raw: 0x1069ca350 } }]), pre_tokenizer=PreTokenizer(pretok=Metaspace(replacement='▁', prepend_scheme="first", split=true)), model=Unigram(vocab={'<pad>': 0, '</s>': 0, '<unk>': 0, '▁': -2.012292861938477, 'X': -2.486478805541992, ...}, unk_id=2, bos_id=32101, eos_id=32102), post_processor=TemplateProcessing(single=Template([Sequence { id: A, type_id: 0 }, SpecialToken { id: "</s>", type_id: 0 }]), pair=Template([Sequence { id: A, type_id: 0 }, SpecialToken { id: "</s>", type_id: 0 }, Sequence { id: B, type_id: 0 }, SpecialToken { id: "</s>", type_id: 0 }])), decoder=Metaspace(replacement='▁', prepend_scheme="first", split=true), added_vocab=AddedVocabulary(added_tokens_map_r={
        0: AddedToken(content="<pad>", single_word=false, lstrip=false, rstrip=false, normalized=false, special=true), 
        1: AddedToken(content="</s>", single_word=false, lstrip=false, rstrip=false, normalized=false, special=true), 
        2: AddedToken(content="<unk>", single_word=false, lstrip=false, rstrip=false, normalized=false, special=true), ...}, encode_special_tokens=false), truncation=None, padding=None)
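
The output above elides long collections with `...` (the vocab and the added-tokens map). A minimal sketch of that kind of truncated display follows, written by hand against the standard library only. `Vocab` and `MAX_SHOWN` are hypothetical names for illustration and are not taken from the PR, and the cut-off of five entries is an arbitrary choice.

```rust
use std::fmt;

/// Hypothetical stand-in for a model vocabulary: (token, score) pairs.
struct Vocab(Vec<(String, f64)>);

/// Number of entries printed before the rest is elided (arbitrary for this sketch).
const MAX_SHOWN: usize = 5;

impl fmt::Display for Vocab {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{{")?;
        // Print at most MAX_SHOWN entries, comma-separated.
        for (i, (token, score)) in self.0.iter().take(MAX_SHOWN).enumerate() {
            if i > 0 {
                write!(f, ", ")?;
            }
            write!(f, "'{}': {}", token, score)?;
        }
        // Signal that entries were left out.
        if self.0.len() > MAX_SHOWN {
            write!(f, ", ...")?;
        }
        write!(f, "}}")
    }
}
```

Calling `to_string()` on a `Vocab` with more than five entries prints the first five followed by `, ...`, which keeps the repr of a large vocabulary readable.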

HuggingFaceDocBuilderDev · 2024-06-03T13:24:55Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.


EricLBuehler · 2024-07-20T00:33:21Z

Trying to build this, we need to use numpy with PyO3 version 0.22. I think our choices are to either:

- Wait for "Update pyo3 to 0.22.0 (without "py-clone")" (PyO3/rust-numpy#435), or
- Fork pyo3-special-method-derive into something like pyo3-special-method-derive-0.21. Currently it uses 0.22.

It's probably easier to fork pyo3-special-method-derive, but the issue is of course the code going out of sync.

Edit: see #1574, which implements this using version 0.21 of pyo3 smd.

ArthurZucker

LGTM

ArthurZucker · 2024-07-28T09:15:33Z

tokenizers/Cargo.toml

@@ -63,6 +63,7 @@ fancy-regex = { version = "0.13", optional = true}
 getrandom = { version = "0.2.10" }
 esaxx-rs = { version = "0.1.10", default-features = false, features=[]}
 monostate = "0.1.12"
+pyo3_special_method_derive_0_21 = {path = "../../pyo3-special-method-derive/pyo3_special_method_derive_0_21"}


Do not forget to remove

ArthurZucker · 2024-07-28T09:17:13Z

tokenizers/src/pre_tokenizers/mod.rs

    BertPreTokenizer(BertPreTokenizer),
    ByteLevel(ByteLevel),
    Delimiter(CharDelimiterSplit),
    Metaspace(Metaspace),
    Whitespace(Whitespace),
+    #[format(fmt = "{}")]
    Sequence(Sequence),
    Split(Split),
    Punctuation(Punctuation),


Might want to add #[format(fmt = "{}")] for all of them
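
As a rough illustration of what forwarding a variant's display to its payload looks like, here is a hand-written sketch using only `std::fmt`. It assumes that `fmt = "{}"` means "print the inner value and omit the variant name". The macro's actual expansion is not shown in this thread, and `Wrapper`, `Whitespace` and `Metaspace` below are simplified, hypothetical stand-ins.

```rust
use std::fmt;

// Hypothetical, simplified stand-ins for two pre-tokenizer types.
struct Whitespace;
struct Metaspace {
    replacement: char,
}

impl fmt::Display for Whitespace {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "Whitespace()")
    }
}

impl fmt::Display for Metaspace {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "Metaspace(replacement='{}')", self.replacement)
    }
}

// Hypothetical wrapper enum; every variant forwards to its payload.
enum Wrapper {
    Whitespace(Whitespace),
    Metaspace(Metaspace),
}

impl fmt::Display for Wrapper {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Only the inner value is printed, so the variant name is not repeated.
        match self {
            Wrapper::Whitespace(inner) => write!(f, "{}", inner),
            Wrapper::Metaspace(inner) => write!(f, "{}", inner),
        }
    }
}
```

With this shape, `Wrapper::Metaspace(..)` prints as `Metaspace(replacement='▁')` rather than `Metaspace(Metaspace(...))`, which is presumably the reason for applying the attribute to every variant.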

ArthurZucker · 2024-08-02T11:06:58Z

bindings/python/src/decoders.rs

+    #[format]
    pub(crate) decoder: PyDecoderWrapper,


The pub(crate) visibility here forces us to add #[format]

ArthurZucker · 2024-08-02T11:07:11Z

bindings/python/src/decoders.rs

+    #[format(skip)]
+    pub inner: PyObject,


Not implemented yet so skipping for now
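
For comparison, skipping a field can also be done without a derive macro by writing the `Debug` impl by hand. The sketch below uses only the standard library and assumes an opaque field that cannot be printed. `OpaqueDecoder` and its fields are hypothetical and do not reflect the PR's code.

```rust
use std::fmt;

/// Hypothetical decoder holding an opaque callback that has no Debug impl.
struct OpaqueDecoder {
    name: String,
    inner: Box<dyn Fn(&str) -> String>,
}

impl OpaqueDecoder {
    /// Apply the opaque callback to the input.
    fn decode(&self, input: &str) -> String {
        (self.inner)(input)
    }
}

impl fmt::Debug for OpaqueDecoder {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Print the fields we can, and mark the struct as having hidden fields.
        f.debug_struct("OpaqueDecoder")
            .field("name", &self.name)
            .finish_non_exhaustive()
    }
}
```

`finish_non_exhaustive` appends `..` to the output, so a reader can tell that a field was left out on purpose.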

ArthurZucker · 2024-08-02T11:07:31Z

bindings/python/src/decoders.rs

+#[format(fmt = "{}")]
 pub(crate) enum PyDecoderWrapper {
    Custom(Arc<RwLock<CustomDecoder>>),
    Wrapped(Arc<RwLock<DecoderWrapper>>),


this will directly display Arc<RwLock<CustomDecoder>>
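
In the standard library, `Arc<T>` forwards `Display` to `T`, but `RwLock<T>` does not implement `Display`, so displaying the payload requires taking a read lock somewhere. The sketch below shows one hand-written way to do that. How the derive macro handles it is not shown in this thread, and `Decoder` and `Shared` are hypothetical names.

```rust
use std::fmt;
use std::sync::{Arc, RwLock};

/// Hypothetical decoder type with a simple Display impl.
struct Decoder {
    name: String,
}

impl fmt::Display for Decoder {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}()", self.name)
    }
}

/// Hypothetical wrapper around a shared, lock-protected decoder.
struct Shared(Arc<RwLock<Decoder>>);

impl fmt::Display for Shared {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Take a read lock for the duration of the formatting call.
        match self.0.read() {
            Ok(guard) => write!(f, "{}", *guard),
            // A poisoned lock should not make formatting panic.
            Err(_) => write!(f, "<poisoned lock>"),
        }
    }
}
```

Handling the poisoned case explicitly is a design choice in this sketch; a `__repr__` that panics would be unpleasant to debug from Python.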

ArthurZucker · 2024-08-02T11:08:04Z

bindings/python/src/tokenizer.rs

 pub struct PyTokenizer {
-    tokenizer: Tokenizer,
+    pub tokenizer: Tokenizer,


Not a requirement

Narsil

I'm quite worried that we have to litter the Rust crate with Python-specific code.

Rust code is Rust code; it should not care about Python bindings.
Isn't there a way to use Debug formatting or something similar?
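
A minimal sketch of the Debug-based alternative raised here, assuming the binding layer simply formats a value with `{:?}`. `Strip`, `Normalizer` and `repr` are simplified, hypothetical stand-ins, and this is not the approach the PR itself takes.

```rust
use std::fmt::Debug;

// Simplified, hypothetical stand-ins for library types; only `Debug` is derived,
// so the core crate carries no binding-specific attributes.
#[derive(Debug)]
struct Strip {
    strip_left: bool,
    strip_right: bool,
}

#[derive(Debug)]
enum Normalizer {
    Strip(Strip),
    Lowercase,
}

/// What a binding-side `__repr__` could return for any `Debug` value.
fn repr<T: Debug>(value: &T) -> String {
    format!("{:?}", value)
}
```

The trade-off is visible in the output: derived `Debug` prints `Strip(Strip { strip_left: false, strip_right: true })`, repeating the variant name and using Rust struct syntax rather than the Python-style `Strip(strip_left=false, strip_right=true)` shown in the PR description.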

Narsil · 2024-08-02T12:51:58Z

tokenizers/src/tokenizer/mod.rs

@@ -791,7 +862,7 @@ where
            EncodeInput::Single(s1) => (s1, None),
            EncodeInput::Dual(s1, s2) => (s1, Some(s2)),
        };
-
+        println!("thread id: {:?}", current_thread_index());


Narsil · 2024-08-02T12:52:02Z

tokenizers/src/tokenizer/mod.rs

@@ -880,6 +951,7 @@ where
        word_idx: Option<u32>,
        offsets_type: OffsetType,
    ) -> Result<Encoding> {
+        println!("do tokenizer {:?}", current_thread_index());


initial commit

61804d9

ArthurZucker added 15 commits June 3, 2024 15:25

will this work?

a56da5f

make it work for the model for now

f1a6a97

updates

4a49530

update

f4af616

add metaspace

88630dc

update

b9d44da

does not work

a90ec22

current modifications

2224275

current status

4d9204e

working shit

4c2aca1

this kinda works

904ce70

finallllly!

6413810

nits

fda66f5

updates

20c9fc4

almost there

86c77b6

ArthurZucker mentioned this pull request Jun 5, 2024

Adding pretty print of tokenizer #1540

Closed

ArthurZucker commented Jun 5, 2024

bindings/python/src/pre_tokenizers.rs (outdated)

ArthurZucker mentioned this pull request Jun 5, 2024

How can I get the mapping relationship between byte values and Unicode characters of the fast tokenizer? #1545

Closed

ArthurZucker and others added 6 commits June 6, 2024 16:34

update

a429642

updates

more nits

3cec010

nit

8d77286

Update bindings/python/src/pre_tokenizers.rs

e48cd3a

ips

27576e5

Merge branch 'add-display' of github.com:huggingface/tokenizers into …

0d9a452

…add-display fix git suggestion nit __repr__ should use Debug? small updates

ArthurZucker force-pushed the add-display branch from f43cefc to 0d9a452 Compare June 7, 2024 09:03

Merge branch 'add-display' of github.com:huggingface/tokenizers into …

35373de

…add-display fix git suggestion nit __repr__ should use Debug? small updates Simple lazygit test

ArthurZucker force-pushed the add-display branch from 0d9a452 to 35373de Compare June 7, 2024 14:36

update

1c6d272

ArthurZucker and others added 7 commits July 19, 2024 21:14

nit

011340b

nice

3fc31d0

updates

51d3f61

Merge branch 'main' into add-display

acb8196

deos

951b6e6

Merge branch 'add-display' of github.com:huggingface/tokenizers into …

c2a320c

…add-display

fix build

2048c02

EricLBuehler mentioned this pull request Jul 20, 2024

Use pyo3 smd v0.21 #1574

Merged

EricLBuehler and others added 15 commits July 20, 2024 07:28

Use pyo3 smd v0.21 (#1574)

104fe0c

stash commit, wanna make sure this is recorded

7db6109

what works a bit ?

c7cd927

update

e4cf65a

fix tokenizer's wrapping

39ffc28

fix normalizer display

0a3bb18

fix!

c436b23

final touch?

e5b059f

full autodebug

ff825a7

remove dict and dir as it's gonna be a bit more involved

c30df0c

remove pub where it is not necessary

b78e11c

fmt =

a99c645

formating

9022470

remove non needed fm

64b8df0

so we only need format when the visibility is not pub but pub(crate)

27cad45

ArthurZucker commented Aug 2, 2024


Merge branch 'main' into add-display

ceabef3

Narsil reviewed Aug 2, 2024


ArthurZucker mentioned this pull request Aug 7, 2024

Using serde (serde_pyo3) to get __str__ and __repr__ easily. #1588

Merged

ArthurZucker closed this Aug 7, 2024

ArthurZucker deleted the add-display branch August 7, 2024 10:20
