Improve PreTrainedTokenizerFast loading time when there are many added tokens (#31404)

* use hash

* use hash

* update

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2 people authored and itazap committed Jun 18, 2024
1 parent bb60771 commit 787ebd5
Showing 1 changed file with 3 additions and 1 deletion.
src/transformers/tokenization_utils_fast.py: 4 changes (3 additions, 1 deletion)
@@ -173,10 +173,12 @@ def __init__(self, *args, **kwargs):
         # allows converting a slow -> fast, non-legacy: if the `tokenizer.json` does not have all the added tokens
         # uses the information stored in `added_tokens_decoder`.
         # this is costly for fast tokenizers as we re-compute the regex again. But not all tokens are added tokens
+        # Use hash to speed up the very slow operation `token not in added_tokens_decoder`.
+        added_tokens_decoder_hash = {hash(repr(token)) for token in self.added_tokens_decoder}
         tokens_to_add = [
             token
             for index, token in sorted(added_tokens_decoder.items(), key=lambda x: x[0])
-            if token not in self.added_tokens_decoder
+            if hash(repr(token)) not in added_tokens_decoder_hash
         ]
         encoder = list(self.added_tokens_encoder.keys()) + [str(token) for token in tokens_to_add]
         # if some of the special tokens are strings, we check if we don't already have a token
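
For context, a minimal, self-contained sketch of the pattern this change applies: hash the repr() of every already-known token once, so each candidate in the list comprehension is checked with a single set lookup instead of the very slow `token not in added_tokens_decoder` test mentioned in the added comment. The names below (FakeAddedToken, get_existing_tokens) are hypothetical stand-ins, not the transformers API.

    # Hypothetical illustration of the speed-up, not the transformers implementation.
    from dataclasses import dataclass


    @dataclass(frozen=True)
    class FakeAddedToken:
        """Stand-in for tokenizers.AddedToken; repr() identifies a token uniquely."""
        content: str
        special: bool = False


    def get_existing_tokens():
        # Stand-in for the already-registered tokens (e.g. `self.added_tokens_decoder`),
        # a mapping that is costly to consult on every membership test.
        return {i: FakeAddedToken(f"tok_{i}") for i in range(10_000)}


    added_tokens_decoder = {i: FakeAddedToken(f"tok_{i}") for i in range(5_000, 15_000)}

    # One-time pass: hash the repr of every token that is already known.
    existing_hashes = {hash(repr(tok)) for tok in get_existing_tokens().values()}

    # Each candidate is now filtered with a single O(1) set lookup.
    tokens_to_add = [
        token
        for index, token in sorted(added_tokens_decoder.items(), key=lambda x: x[0])
        if hash(repr(token)) not in existing_hashes
    ]

    print(len(tokens_to_add))  # 5000: only the genuinely new tokens remain

With 10,000 existing tokens and 10,000 candidates, the precomputed hash set does one lookup per candidate, which is the "use hash" idea the commit message refers to.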
