
Improve PreTrainedTokenizerFast loading time when there are many added tokens #31404

Merged: 3 commits merged into main on Jun 18, 2024

Conversation

@ydshieh ydshieh (Collaborator) commented Jun 13, 2024

What does this PR do?

The condition check

token not in self.added_tokens_decoder

is very slow, especially when there are many added tokens.

Loading time (see code snippet below), with 2048 added tokens:

before this PR: 3.909789 seconds
after this PR: 0.23008 seconds

import datetime

from transformers import XLMRobertaTokenizerFast

# A tokenizer checkpoint with many added tokens.
ckpt = "ydshieh/dummy_tok"

s = datetime.datetime.now()
tokenizer = XLMRobertaTokenizerFast.from_pretrained(ckpt)
e = datetime.datetime.now()
print((e - s).total_seconds())
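
The gist of the fix: instead of repeatedly evaluating `token not in self.added_tokens_decoder` inside the list comprehension that collects tokens to add, build a set of hash(repr(token)) once and test membership against that set. The snippet below is a minimal, self-contained illustration of the pattern (using a toy dataclass rather than the real AddedToken, and illustrative variable names), not the actual code in tokenization_utils_fast.py:

from dataclasses import dataclass

# Toy stand-in for AddedToken, carrying the same distinguishing attributes.
@dataclass(frozen=True)
class Tok:
    content: str
    lstrip: bool = False
    rstrip: bool = False
    single_word: bool = False
    normalized: bool = True
    special: bool = False

existing = [Tok(f"<extra_{i}>", special=True) for i in range(2048)]
candidates = [Tok(f"<extra_{i}>", special=True) for i in range(4096)]

# Slow pattern: every `tok not in existing` scans the whole collection.
slow_new = [tok for tok in candidates if tok not in existing]

# Fast pattern (what the PR does conceptually): hash the attribute-carrying
# repr once into a set, then each membership test is a constant-time lookup.
existing_hash = {hash(repr(tok)) for tok in existing}
fast_new = [tok for tok in candidates if hash(repr(tok)) not in existing_hash]

assert slow_new == fast_new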


@ydshieh ydshieh requested a review from ArthurZucker June 13, 2024 12:20
Comment on lines 175 to 176:

# Use hash to speed up the very slow operation `token not in added_tokens_decoder`.
added_tokens_decoder_hash = {hash(token) for token in self.added_tokens_decoder}

ydshieh (Author):
there is a very, very tiny chance of hash collision. Do we want to address that possibility?

Collaborator:
no 😉


@@ -172,10 +172,12 @@ def __init__(self, *args, **kwargs):
# allows converting a slow -> fast, non-legacy: if the `tokenizer.json` does not have all the added tokens
# uses the information stored in `added_tokens_decoder`.
# this is costly for fast tokenizers as we re-compute the regex again. But not all tokens are added tokens
# Use hash to speed up the very slow operation `token not in added_tokens_decoder`.
added_tokens_decoder_hash = {hash(token) for token in self.added_tokens_decoder}
Collaborator:

Suggested change:
- added_tokens_decoder_hash = {hash(token) for token in self.added_tokens_decoder}
+ added_tokens_decoder_hash = {hash(token.__str__()) for token in self.added_tokens_decoder}

would that be even faster? hash the string rep?

Collaborator:
Also, if we implement this at the class level, I am wondering if it is not faster? It would be computed when you init the object, most probably.

ydshieh (Author):
it's on par :-) OK for me to do both, but if using str, I would just do str(token).

Collaborator:
No, you need to make sure normalized, lstrip, rstrip and single_word are not the same.

ydshieh (Author), Jun 13, 2024:

str and __str__ will give the same thing. Do you mean to use __repr__?

That will give

'AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True)'

while str and __str__ will give, for example,

</s>

Note that str will call __str__ under the hood.
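
To make the distinction concrete, here is a small illustration (assuming, as described above, that str(token) returns just the content while repr(token) includes the rstrip/lstrip/single_word/normalized/special attributes; the constructor kwargs mirror those shown in the repr above):

from tokenizers import AddedToken

a = AddedToken("<extra>", lstrip=False, special=True)
b = AddedToken("<extra>", lstrip=True, special=True)

# Same text content: the string form cannot tell the two tokens apart,
# so hash(str(token)) would wrongly treat `b` as already present.
print(str(a) == str(b))              # True
print(hash(str(a)) == hash(str(b)))  # True

# The repr carries the extra attributes, so hashing it keeps them distinct.
print(repr(a))  # AddedToken("<extra>", rstrip=False, lstrip=False, ...)
print(repr(b))  # AddedToken("<extra>", rstrip=False, lstrip=True, ...)
print(hash(repr(a)) == hash(repr(b)))  # False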

Collaborator:
yeah sorry, repr

@ArthurZucker (Collaborator):

Super nice BTW

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ydshieh ydshieh requested a review from ArthurZucker June 13, 2024 12:55
@ArthurZucker ArthurZucker left a comment:

is it on par to have the init at the AddedTokens class level? (make sure to not import it from tokenizers)


@ydshieh
ydshieh (Author) commented Jun 13, 2024

> is it on par to have the init at the AddedTokens class level? (make sure to not import it from tokenizers)

sorry, I don't understand this. Could you elaborate a bit more?

@ArthurZucker (Collaborator):

What I mean is to implement the hash for this class.
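
For illustration, a class-level __hash__ could look roughly like the sketch below. This is a hypothetical stand-in (the attribute names mirror those discussed above), not the transformers implementation; the merged change ultimately kept the explicit hash(repr(token)) set instead:

class AddedToken:
    """Illustrative pure-Python AddedToken-like class (not imported from `tokenizers`)."""

    def __init__(self, content, single_word=False, lstrip=False, rstrip=False,
                 normalized=True, special=False):
        self.content = content
        self.single_word = single_word
        self.lstrip = lstrip
        self.rstrip = rstrip
        self.normalized = normalized
        self.special = special

    def __eq__(self, other):
        return isinstance(other, AddedToken) and self.__dict__ == other.__dict__

    def __hash__(self):
        # Hash every distinguishing attribute, not just the content, so that
        # tokens that differ only in lstrip/rstrip/normalized/... stay distinct.
        return hash((self.content, self.single_word, self.lstrip,
                     self.rstrip, self.normalized, self.special))

With __eq__ and __hash__ defined consistently like this, the tokens could live directly in a set (or be dict keys), and `token not in existing_tokens` would be a constant-time lookup with no separate hash set to maintain.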

@@ -172,10 +172,12 @@ def __init__(self, *args, **kwargs):
# allows converting a slow -> fast, non-legacy: if the `tokenizer.json` does not have all the added tokens
# uses the information stored in `added_tokens_decoder`.
# this is costly for fast tokenizers as we re-compute the regex again. But not all tokens are added tokens
# Use hash to speed up the very slow operation `token not in added_tokens_decoder`.
added_tokens_decoder_hash = {hash(repr(token)) for token in self.added_tokens_decoder}
ydshieh (Author):
repr used

@ydshieh ydshieh requested a review from ArthurZucker June 13, 2024 15:56
@ydshieh
ydshieh (Collaborator, Author) commented Jun 14, 2024

@ArthurZucker Everything is addressed.

@ArthurZucker ArthurZucker left a comment:

sounds good! Thanks for this update

@ArthurZucker (Collaborator):

huggingface/tokenizers#1521 is a related PR.

@ydshieh ydshieh merged commit 1c7c34b into main Jun 18, 2024
21 checks passed
@ydshieh ydshieh deleted the speedy_fast_token_loading branch June 18, 2024 13:20
itazap pushed a commit that referenced this pull request Jun 18, 2024

Improve PreTrainedTokenizerFast loading time when there are many added tokens (#31404)

* use hash
* use hash
* update

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
itazap pushed a commit that referenced this pull request Jun 20, 2024

Improve PreTrainedTokenizerFast loading time when there are many added tokens (#31404)

* use hash
* use hash
* update

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>