
AutoTokenizer not enforcing use_fast=True #20817

Closed

stas00 opened this issue Dec 18, 2022 · 9 comments

@stas00
Contributor

stas00 commented Dec 18, 2022

This issue is about AutoTokenizer not enforcing use_fast=True.

This works:

$ python -c "from transformers import AutoTokenizer; t=AutoTokenizer.from_pretrained('facebook/opt-13b', use_fast=True); \
assert t.is_fast, 'tokenizer is not fast'; print('Success')" 
Success

now the same code, but with a different model, 'facebook/opt-1.3b', that doesn't have a fast tokenizer:

$ python -c "from transformers import AutoTokenizer; t=AutoTokenizer.from_pretrained('facebook/opt-1.3b', use_fast=True); \
assert t.is_fast, 'tokenizer is not fast'; print('Success')" 
Traceback (most recent call last):
  File "<string>", line 1, in <module>
AssertionError: tokenizer is not fast

now the doc says:

use_fast (bool, optional, defaults to True) — Whether or not to try to load the fast version of the tokenizer.

so it sort of hints with "try to load" that it won't enforce it. But would you be open to a less ambiguous definition? something like:

use_fast (bool, optional, defaults to True) — Will try to load the fast version of the tokenizer if there is one and will quietly fall back to the normal (slower) tokenizer if the model doesn't provide a fast one.

I think the use_fast arg name is ambiguous - I'd have named it try_to_use_fast, since currently, if one must use the fast tokenizer, one has to additionally check whether AutoTokenizer.from_pretrained returned the slow version (see the sketch below).
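For illustration, the check this currently forces on the caller looks something like the following (a minimal sketch; raising a ValueError is just one possible policy):

from transformers import AutoTokenizer

# Minimal sketch: use_fast=True is only a preference, so a caller that
# *requires* a fast tokenizer has to verify is_fast after loading.
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b", use_fast=True)
if not tokenizer.is_fast:
    raise ValueError("a fast tokenizer is required, but only a slow one is available")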

not sure, open to suggestions.

context: in m4 the codebase currently requires a fast tokenizer.

Thank you!

cc: @ArthurZucker

@sgugger
Collaborator

sgugger commented Dec 19, 2022

The name has been around for so long that we won't change it. It's not ideal, but it is what it is 🤷‍♂️ We can definitely improve the documentation, however!

Unrelated: why does OPT not create the fast tokenizer on the fly from the slow one, @ArthurZucker? This seems like a bug.

@ArthurZucker
Collaborator

It is indeed a bug, and people seem to be confused. IMO we should add a warning when use_fast is set to True but a fast tokenizer does not exist. Will have a look at why OPT does not create the fast tokenizer 😉
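For the record, a hypothetical sketch of what such a warning could look like; the function and names below are illustrative, not the actual transformers internals:

import logging

logger = logging.getLogger(__name__)

# Illustrative only: warn when use_fast=True was requested but the model
# only ships a slow tokenizer class.
def warn_if_no_fast(use_fast, fast_tokenizer_class):
    if use_fast and fast_tokenizer_class is None:
        logger.warning(
            "`use_fast=True` was passed, but no fast tokenizer is available "
            "for this model; falling back to the slow tokenizer."
        )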

@stas00
Contributor Author

stas00 commented Dec 19, 2022

If you have to use a warning in this situation, it's a sign that the API needs to be improved. Warnings rarely work, as there are dozens/hundreds of them emitted by most applications and a user is unlikely to notice one. That's just my experience-based opinion, of course.

If the old name can't be deprecated, I'd leave it alone, update the doc as I suggested in the OP, and add a new arg require_fast=True which would assert if the requirement can't be met (see the sketch below). So the first one is a preference, the second one is a requirement. That would make for an unambiguous yet flexible API.
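A minimal sketch of the proposed semantics as a standalone wrapper (require_fast is not an existing transformers argument; this is purely illustrative):

from transformers import AutoTokenizer

# Purely illustrative: emulates the proposed require_fast=True behavior
# on top of the existing use_fast preference.
def load_tokenizer(name, use_fast=True, require_fast=False):
    tokenizer = AutoTokenizer.from_pretrained(name, use_fast=use_fast)
    if require_fast and not tokenizer.is_fast:
        raise ValueError(f"{name} does not provide a fast tokenizer, but require_fast=True was set")
    return tokenizer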

Unrelated: why does OPT not create the fast tokenizer on the fly from the slow one, @ArthurZucker? This seems like a bug.

Some of the OPT models do and some don't; as you can see in the OP, both examples are OPT models.

@ArthurZucker
Collaborator

ArthurZucker commented Dec 20, 2022

Agreed, the problem is now the inconsistency between two models. If it is only OPT-related we can leave it as is; otherwise I will have a look.

@ArthurZucker
Collaborator

ArthurZucker commented Dec 20, 2022

It is indeed a bug: the facebook/opt-1.3b tokenizer config is missing the tokenizer_type variable, and the use_fast argument is not passed down properly in that case. The fix is in #20823

huggingface deleted a comment from github-actions bot Jan 18, 2023
huggingface deleted a comment from github-actions bot Feb 12, 2023
ArthurZucker reopened this Feb 13, 2023
@stas00
Contributor Author

stas00 commented Mar 9, 2023

So where are we with this issue, @ArthurZucker? Asking since it will otherwise get closed by the stale bot. Thank you!

huggingface deleted a comment from github-actions bot Mar 9, 2023
@sgugger
Collaborator

sgugger commented Mar 9, 2023

I think the doc has been updated and the OPT model where there was a problem has been fixed, so the issue is ready to be closed, no?

@ArthurZucker
Collaborator

Yes, I re-opened it because I thought we should probably raise an error if the tokenizer is not fast, but feel free to close.

@sgugger
Collaborator

sgugger commented Mar 9, 2023

As was said before here, either raising an error or renaming the argument would be too much of a breaking change for something that has been around for three years.
