[llama] AutoTokenizer does not add eos_token at the end #23833
Hi, note that it doesn't make sense to pass … In the code snippet above, …
Pinging @ArthurZucker regarding the eos_token issue.
Thank you so much for explaining this ~~~
Hey!

from transformers import LlamaTokenizerFast

fast = LlamaTokenizerFast.from_pretrained("huggyllama/llama-7b", add_eos_token=True, from_slow=True)

This will produce the expected outputs:

>>> fast.encode("auto_tokenizer", add_special_tokens = True)
[1, 4469, 29918, 6979, 3950, 2]

The reason behind this is that the fast tokenizer's post-processor is built once at initialization, so setting add_eos_token afterwards does not update it. I'll open a PR to make sure that changing the eos and bos updates the processor. Thanks for reporting.
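For contrast, a minimal sketch of the behaviour being reported, i.e. the default fast path without from_slow=True (the ids are the ones quoted in this thread):

from transformers import LlamaTokenizerFast

# Default fast path: the post-processor baked into the pre-converted
# tokenizer.json is used, so add_eos_token=True was reported to have no effect.
fast = LlamaTokenizerFast.from_pretrained("huggyllama/llama-7b", add_eos_token=True)
print(fast.encode("auto_tokenizer", add_special_tokens=True))
# reported: [1, 4469, 29918, 6979, 3950] -- no trailing eos id (2)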
For transformers v4.35.0, this still does not seem to work.
Hello, this seems to work fine for me:

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
>>> tokenizer.encode("</s>", add_special_tokens = False)
[2]
>>> tokenizer.encode("Hey</s>sir", add_special_tokens = False)
[18637, 2, 8889]
>>> tokenizer.tokenize("Hey</s>", add_special_tokens = False)
['▁Hey', '</s>']

For such an important model we try to fix issues like this as soon as possible, as they can impact training for example. Would you mind sharing a reproducer? 🤗
I don't have access to meta-llama/Llama-2-7b-hf. Transformers is installed from source:

import transformers
print(transformers.__version__)  # 4.35.0.dev0

from transformers import AutoTokenizer

# s = 'huggyllama/llama-7b'
s = "NousResearch/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(s)

print(tokenizer.encode("</s>", add_special_tokens = False))    # [2]
print(tokenizer.tokenize("</s>", add_special_tokens = False))  # ['▁</s>']
print(tokenizer.encode("Hey</s>sir", add_special_tokens = False))    # [18637, 829, 29879, 29958, 29879, 381]
print(tokenizer.tokenize("Hey</s>sir", add_special_tokens = False))  # ['▁Hey', '</', 's', '>', 's', 'ir']
That's expected if they did not update the normalized flag of the special tokens in that repo.
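One way to check how the special tokens were registered is to inspect the tokenizer's added-token map (a sketch; added_tokens_decoder is available on recent transformers versions and exposes each AddedToken's normalized flag):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-hf")
# Maps token id -> AddedToken; a </s> entry with normalized=True would explain
# why the normalizer splits it instead of matching it as a single token.
print(tokenizer.added_tokens_decoder[tokenizer.eos_token_id])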
Thanks for the information. Just wondering, what is the correct normalisation? I tried setting normalized=False, but it did not seem to take effect.
>>> tokenizer = AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-hf", from_slow=True)
>>> print(tokenizer.tokenize("Hey</s>sir", add_special_tokens = False))
['▁Hey', '</s>', '▁sir']

Alternatively:

>>> from transformers import AddedToken
>>> tokenizer = AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-hf")
>>> tokenizer.add_tokens(AddedToken("</s>", normalized=False, special=True), special_tokens=True)
>>> tokenizer.save_pretrained("/tmp/tokenizer-llama")
>>> tokenizer = AutoTokenizer.from_pretrained("/tmp/tokenizer-llama")
>>> print(tokenizer.tokenize("Hey</s>sir", add_special_tokens = False))
['▁Hey', '</s>', '▁sir']

That is because fast tokenizers are supposed to be fixed after initialization. I'm planning on supporting updates without having to save and reload the tokenizer, but this was never possible before either.
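The save/reload workaround above can be wrapped in a small helper (a sketch; reload_with_unnormalized_special is a hypothetical name, and a temporary directory stands in for /tmp/tokenizer-llama):

import tempfile
from transformers import AddedToken, AutoTokenizer

def reload_with_unnormalized_special(name, token="</s>"):
    # Re-register the special token with normalized=False, then round-trip
    # through save/load so the fast tokenizer is rebuilt with the new flag.
    tok = AutoTokenizer.from_pretrained(name)
    tok.add_tokens(AddedToken(token, normalized=False, special=True), special_tokens=True)
    with tempfile.TemporaryDirectory() as tmp:
        tok.save_pretrained(tmp)
        return AutoTokenizer.from_pretrained(tmp)

tokenizer = reload_with_unnormalized_special("NousResearch/Llama-2-7b-hf")
print(tokenizer.tokenize("Hey</s>sir", add_special_tokens=False))  # ['▁Hey', '</s>', '▁sir']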
It works! Thanks!
System Info
transformers version: 4.29.2

Who can help?
@ArthurZucker
Information
Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
code:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b", add_eos_token=True)
print(tokenizer.encode("auto_tokenizer"))

results:

[1, 4469, 29918, 6979, 3950]

Expected behavior

add eos token like:

[1, 4469, 29918, 6979, 3950, 2]
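Per the workaround discussed above, this expected output can be obtained by forcing a conversion from the slow tokenizer (a sketch using the from_slow=True flag suggested earlier in the thread):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "huggyllama/llama-7b", add_eos_token=True, from_slow=True
)
print(tokenizer.encode("auto_tokenizer"))  # [1, 4469, 29918, 6979, 3950, 2]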