Behaviour between slow and fast LLaMa tokenizer not equivalent #23889
Comments
Thanks for reporting, will have a look.
Okay, what's happening here is that you are adding tokens that are already present in the vocabulary of the model.
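For illustration, here is a minimal sketch (using the same `huggyllama/llama-7b` checkpoint as the snippets later in this thread) of how to see that the token is already known, in which case `add_special_tokens` reports that nothing new was added to the vocabulary:

```python
from transformers import LlamaTokenizerFast

tok = LlamaTokenizerFast.from_pretrained("huggyllama/llama-7b")
print("</s>" in tok.get_vocab())                      # True: "</s>" is already in the vocabulary
print(tok.add_special_tokens({"bos_token": "</s>"}))  # 0: no new token was added
```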
The reproduction still fails on the latest version of transformers because you are relying on adding a token that should be ignored (it is already in the vocabulary) but is not. The content on the Rust side gets modified.
(this will update the processor)
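As a rough sketch of what "the content on the Rust side" refers to (the key names come from the `tokenizers` serialization format; this is for inspection only), the post-processor baked into the Rust backend can be looked at like this:

```python
import json
from transformers import LlamaTokenizerFast

tok = LlamaTokenizerFast.from_pretrained("huggyllama/llama-7b")
# The backend tokenizer serializes to the same JSON that is stored in tokenizer.json.
backend = json.loads(tok.backend_tokenizer.to_str())
print(backend["post_processor"])  # shows which special tokens the template processing uses
```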
Thanks for taking a look! However I'm using the latest version of Transformers, and have added
Reproduced on the main branch, here's a Colab notebook: https://colab.research.google.com/drive/1KA_mliTsvjnhOCO3SApVJkgVd2HEeVQZ?usp=sharing.
Actually, with fast tokenizers there is no logic to properly update the template processor if it exists. The default has always been to initialize the model with the correct tokens, meaning something like the snippet shown a couple of comments below.
Hmm ok, so there's no way to have an equivalent fast tokenizer that makes the script above pass? The reason is that for the new InstructBLIP model (#23460), the processor class (
No, there isn't. I am not in favor of introducing a very hacky behaviour when the fix should be done in Rust in that case.

```python
from transformers import LlamaTokenizer, LlamaTokenizerFast
import torch

# Pass the target special tokens at load time, so the fast tokenizer's
# post-processor is built with them from the start.
fast_tokenizer = LlamaTokenizerFast.from_pretrained(
    "huggyllama/llama-7b", truncation_side="left", bos_token="</s>", unk_token="</s>"
)
fast_tokenizer.add_special_tokens({"pad_token": "[PAD]"})

tokenizer = LlamaTokenizer.from_pretrained(
    "huggyllama/llama-7b", truncation_side="left", bos_token="</s>", unk_token="</s>"
)
tokenizer.add_special_tokens({"pad_token": "[PAD]"})

prompt = "What is unusual about this image?"
encoding = tokenizer(prompt, return_tensors="pt")
fast_encoding = fast_tokenizer(prompt, return_tensors="pt")

for k, v in encoding.items():
    assert torch.allclose(fast_encoding[k], v)
```
Also, once you have a tokenizer ready, you can save it and it should have the correct post-processor.
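Continuing from the snippet above (the directory name is illustrative), saving writes out `tokenizer.json`, which includes the post-processor configuration, so reloading gives the same behaviour:

```python
from transformers import LlamaTokenizerFast

# Save the correctly initialized fast tokenizer, then reload it from disk.
fast_tokenizer.save_pretrained("./llama-7b-fixed-tokenizer")
reloaded = LlamaTokenizerFast.from_pretrained("./llama-7b-fixed-tokenizer")
print(reloaded("What is unusual about this image?").input_ids)
```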
Ok, thanks a lot, it now works fine and I can use the fast tokenizer.
Hello @ArthurZucker, thank you for the great package and maintenance. I'm wondering about a notice message that appeared while I was training LLaMA 2:
The above message is pretty much unrelated to this issue; it is just there to help you improve performance when padding the input. Everything should work alright! 😉
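Since the notice itself isn't quoted above, here is only a hedged, illustrative sketch (checkpoint and sentences are placeholders) of the pattern such padding notices generally point towards: padding a whole batch in a single call to the fast tokenizer rather than encoding and padding separately.

```python
from transformers import LlamaTokenizerFast

tok = LlamaTokenizerFast.from_pretrained("huggyllama/llama-7b")
tok.add_special_tokens({"pad_token": "[PAD]"})

# Tokenize and pad the batch in one call; the fast (Rust) path handles this efficiently.
batch = tok(
    ["What is unusual about this image?", "Describe the image."],
    padding=True,
    return_tensors="pt",
)
print(batch["attention_mask"])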
System Info
Transformers v4.29.2
Who can help?
@ArthurZucker
Reproduction
For a new model (#23460), I'd like to get equivalent behaviour between the slow and fast LLaMa tokenizers. The code of the slow tokenizer was taken from the original implementation, and now I'd like to translate it to the fast tokenizer as well.
However, as can be seen below, the behaviour is not equivalent:
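The original reproduction snippet is not preserved in this thread; the following is a hedged reconstruction based on the discussion above and the fixed snippet later in the thread: the special tokens are added after loading, which is where the slow and fast tokenizers diverge.

```python
from transformers import LlamaTokenizer, LlamaTokenizerFast
import torch

# Add the special tokens *after* loading: the slow tokenizer picks them up,
# but the fast tokenizer's post-processor is not updated accordingly.
tokenizer = LlamaTokenizer.from_pretrained("huggyllama/llama-7b", truncation_side="left")
tokenizer.add_special_tokens({"bos_token": "</s>", "unk_token": "</s>", "pad_token": "[PAD]"})

fast_tokenizer = LlamaTokenizerFast.from_pretrained("huggyllama/llama-7b", truncation_side="left")
fast_tokenizer.add_special_tokens({"bos_token": "</s>", "unk_token": "</s>", "pad_token": "[PAD]"})

prompt = "What is unusual about this image?"
encoding = tokenizer(prompt, return_tensors="pt")
fast_encoding = fast_tokenizer(prompt, return_tensors="pt")

for k, v in encoding.items():
    assert torch.allclose(fast_encoding[k], v)  # fails: the encodings differ
```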
Expected behavior
I'd expect that the assertion above passes.