Improve PreTrainedTokenizerFast loading time when there are many added tokens #31404
Conversation
# Use hash to speed up the very slow operation `token not in added_tokens_decoder`.
added_tokens_decoder_hash = {hash(token) for token in self.added_tokens_decoder}
there is a very very tiny chance of hash collision. Do we want to address that possibility?
no 😉
@@ -172,10 +172,12 @@ def __init__(self, *args, **kwargs):
# allows converting a slow -> fast, non-legacy: if the `tokenizer.json` does not have all the added tokens
# uses the information stored in `added_tokens_decoder`.
# this is costly for fast tokenizers as we re-compute the regex again. But not all tokens are added tokens
# Use hash to speed up the very slow operation `token not in added_tokens_decoder`.
added_tokens_decoder_hash = {hash(token) for token in self.added_tokens_decoder}
added_tokens_decoder_hash = {hash(token) for token in self.added_tokens_decoder}
added_tokens_decoder_hash = {hash(token.__str__()) for token in self.added_tokens_decoder}
would that be even faster? hash the string rep?
Also, if we implement this at the class level, I am wondering if it is not faster? Would be computed when you init the object, most probably.
it's on par :-) OK for me to do both, but if using str, I would just do str(token)
No, you need to make sure tokens whose normalized, lstrip, rstrip and single_word flags differ don't hash the same
`str` and `__str__` will give the same thing. Do you mean to use `__repr__`? That will give
'AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True)'
while `str` and `__str__` will give, for example, `</s>`. Note `str` will call `__str__` under the hood.
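
For illustration, a minimal sketch of the distinction (assuming transformers' own `AddedToken`; the flag values here are made up):

```python
# Illustrative only: shows why hashing str(token) loses information that
# repr(token) keeps. Flag values are made-up examples.
from transformers import AddedToken

a = AddedToken("<s>", lstrip=False, rstrip=False, single_word=False, normalized=False, special=True)
b = AddedToken("<s>", lstrip=True, rstrip=False, single_word=False, normalized=False, special=True)

print(str(a) == str(b))    # True  -> str drops the flags, so the two tokens would collide
print(repr(a) == repr(b))  # False -> repr encodes the flags, keeping them distinct
```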
yeah sorry, `repr`
Super nice BTW
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
is it on par to have the init at the `AddedToken` class level? (make sure to not import it from tokenizers)
@@ -172,10 +172,12 @@ def __init__(self, *args, **kwargs):
# allows converting a slow -> fast, non-legacy: if the `tokenizer.json` does not have all the added tokens
# uses the information stored in `added_tokens_decoder`.
# this is costly for fast tokenizers as we re-compute the regex again. But not all tokens are added tokens
# Use hash to speed up the very slow operation `token not in added_tokens_decoder`.
added_tokens_decoder_hash = {hash(token) for token in self.added_tokens_decoder}
> No, you need to make sure tokens whose normalized, lstrip, rstrip and single_word flags differ don't hash the same

sorry, I don't understand this. Could you elaborate a bit more?
What I mean is: implement the hash for this class
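
A minimal sketch of what that suggestion could look like (hypothetical; this is not the code this PR merges), assuming a simplified version of transformers' own `AddedToken`:

```python
# Hypothetical sketch: a class-level __hash__ on a simplified AddedToken, so the
# hash covers every flag and not just the string content.
class AddedToken:
    def __init__(self, content, single_word=False, lstrip=False, rstrip=False,
                 special=False, normalized=None):
        self.content = content
        self.single_word = single_word
        self.lstrip = lstrip
        self.rstrip = rstrip
        self.special = special
        self.normalized = normalized if normalized is not None else not special

    def __hash__(self):
        # Two tokens with the same content but different flags must not collide.
        return hash((self.content, self.single_word, self.lstrip,
                     self.rstrip, self.special, self.normalized))
```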
@@ -172,10 +172,12 @@ def __init__(self, *args, **kwargs):
# allows converting a slow -> fast, non-legacy: if the `tokenizer.json` does not have all the added tokens
# uses the information stored in `added_tokens_decoder`.
# this is costly for fast tokenizers as we re-compute the regex again. But not all tokens are added tokens
# Use hash to speed up the very slow operation `token not in added_tokens_decoder`.
added_tokens_decoder_hash = {hash(repr(token)) for token in self.added_tokens_decoder}
`repr` used
@ArthurZucker Everything is addressed.
sounds good! Thanks for this update
huggingface/tokenizers#1521 is a related PR
…ded tokens (#31404)

* use hash
* use hash
* update

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
What does this PR do?
The condition check `token not in added_tokens_decoder` is very slow, especially when there are many added tokens.
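
A minimal, self-contained sketch of the idea (hypothetical names, not the exact merged code): precompute a set of hashes once, then test each candidate against that set instead of scanning the existing tokens every time.

```python
# Sketch of the speed-up: replace repeated linear membership tests with a
# precomputed hash set. repr() is hashed so tokens that share content but
# differ in flags (lstrip/rstrip/normalized/single_word) stay distinct.
from transformers import AddedToken

existing = [AddedToken(f"<tok_{i}>") for i in range(2048)]          # tokens already on the tokenizer
candidates = [AddedToken(f"<tok_{i}>") for i in range(1024, 3072)]  # tokens to maybe add

# Before: each check scans `existing` linearly -> O(len(existing)) per candidate.
to_add_slow = [t for t in candidates if t not in existing]

# After: hash repr() once per existing token, then O(1) set lookups per candidate.
existing_hash = {hash(repr(t)) for t in existing}
to_add_fast = [t for t in candidates if hash(repr(t)) not in existing_hash]
```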
Loading time (see code snippet below), with 2048 added tokens:

- before this PR: 3.909789 seconds
- after this PR: 0.23008 seconds
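
The PR's own snippet is not included in this excerpt; a hypothetical way to reproduce such a timing (the base checkpoint, token count, and save path are all assumptions) could look like:

```python
# Hypothetical timing harness; not the snippet from the PR itself.
import time
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # assumed base checkpoint
tokenizer.add_tokens([f"<extra_token_{i}>" for i in range(2048)])  # many added tokens
tokenizer.save_pretrained("tok_with_many_added_tokens")

start = time.perf_counter()
AutoTokenizer.from_pretrained("tok_with_many_added_tokens")
print(f"loading took {time.perf_counter() - start:.6f} seconds")
```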