
Behaviour between slow and fast LLaMa tokenizer not equivalent #23889

Closed
NielsRogge opened this issue May 31, 2023 · 13 comments
Labels
Core: Tokenization (Internals of the library; Tokenization)

Comments

@NielsRogge
Contributor

NielsRogge commented May 31, 2023

System Info

Transformers v4.29.2

Who can help?

@ArthurZucker

Reproduction

For a new model (#23460), I'd like to get equivalent behaviour between the slow and fast LLaMa tokenizers. The slow tokenizer was adapted from the original implementation, and now I'd like to translate this to the fast tokenizer as well.

However, as can be seen below, behaviour is not equivalent:

from transformers import LlamaTokenizer, LlamaTokenizerFast
import torch

tokenizer = LlamaTokenizer.from_pretrained("huggyllama/llama-7b", truncation_side="left")
tokenizer.add_special_tokens({"pad_token": "[PAD]"})
tokenizer.add_special_tokens({"bos_token": "</s>"})
tokenizer.add_special_tokens({"eos_token": "</s>"})
tokenizer.add_special_tokens({"unk_token": "</s>"})

fast_tokenizer = LlamaTokenizerFast.from_pretrained("huggyllama/llama-7b", truncation_side="left")
fast_tokenizer.add_special_tokens({"pad_token": "[PAD]"})
fast_tokenizer.add_special_tokens({"bos_token": "</s>"})
fast_tokenizer.add_special_tokens({"eos_token": "</s>"})
fast_tokenizer.add_special_tokens({"unk_token": "</s>"})

prompt = "What is unusual about this image?"

encoding = tokenizer(prompt, return_tensors="pt")

fast_encoding = fast_tokenizer(prompt, return_tensors="pt")

for k,v in encoding.items():
    assert torch.allclose(fast_encoding[k], v)
=> this assertion fails since the input_ids differ:

tensor([[    2,  1724,   338, 22910,  1048,   445,  1967, 29973]])
tensor([[    1,  1724,   338, 22910,  1048,   445,  1967, 29973]])

Expected behavior

I'd expect that the assertion above passes.

@ArthurZucker added the Core: Tokenization label May 31, 2023
@ArthurZucker
Collaborator

Thanks for reporting, will have a look

@NielsRogge mentioned this issue Jun 5, 2023
@ArthurZucker
Collaborator

Okay, what's happening here is that you are adding tokens that are already present in the model's vocabulary: </s> already has id 2.
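For reference, a quick way to see this (a minimal sketch using the same checkpoint as above):

from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("huggyllama/llama-7b")
# "</s>" is already part of the LLaMA vocabulary, so "adding" it does not create a new id
print(tokenizer.convert_tokens_to_ids("</s>"))  # 2, the existing EOS token
print(tokenizer.bos_token_id)                   # 1, the original BOS token "<s>"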

@ArthurZucker
Collaborator

The issue still reproduces on the latest version of transformers because you are relying on adding a token that is already in the vocabulary; the call should be ignored but is not, and the content on the Rust side gets modified.
Use this instead:
fast_tokenizer.bos_token = "</s>"

@ArthurZucker
Collaborator

(this will update the processor)

@NielsRogge
Contributor Author

NielsRogge commented Jun 8, 2023

Thanks for taking a look!

However, I'm using the latest version of Transformers and have added fast_tokenizer.bos_token = "</s>", but the assertion still fails for me.

@NielsRogge NielsRogge reopened this Jun 8, 2023
@NielsRogge
Contributor Author

Reproduced on main branch, here's a Colab notebook: https://colab.research.google.com/drive/1KA_mliTsvjnhOCO3SApVJkgVd2HEeVQZ?usp=sharing.

@ArthurZucker
Collaborator

Actually, with fast tokenizers there is no logic to properly update the template processor once it exists. The expected usage has always been to initialize the tokenizer with the correct tokens, meaning:
fast_tokenizer = LlamaTokenizerFast.from_pretrained("huggyllama/llama-7b", truncation_side="left", bos_token="</s>")
is what you should be using. The template processor only gets updated if you change "add_bos" or "add_eos"; otherwise the logic gets complicated, as we would have to overload the parent setters for bos_token as well as bos_token_id to update the template processing. I'm not in favor of that, so I'm leaving it as is and will improve the documentation for changing bos and eos on the fast tokenizer.
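For completeness, this is the only update path that does exist (a minimal sketch; it assumes the add_bos_token / add_eos_token setters exposed by LlamaTokenizerFast):

from transformers import LlamaTokenizerFast

tok = LlamaTokenizerFast.from_pretrained("huggyllama/llama-7b")
print(tok("hello").input_ids)   # starts with 1, the default BOS token <s>

# Changing add_bos_token goes through the template-processor update
tok.add_bos_token = False
print(tok("hello").input_ids)   # no BOS token is prepended anymore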

@NielsRogge
Contributor Author

Hmm, ok, so there's no way to have an equivalent fast tokenizer that makes the script above pass?

The reason I ask is that for the new InstructBLIP model (#23460), the processor class (InstructBlipProcessor) would normally use the AutoTokenizer class to load files from the hub. And since the AutoTokenizer API returns the fast tokenizer by default, I'm currently not getting results equivalent to those of the slow one.
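For context, this is the default I mean (a minimal sketch; use_fast is only shown to illustrate the behaviour, it is not something InstructBlipProcessor passes explicitly):

from transformers import AutoTokenizer

# AutoTokenizer resolves to the fast tokenizer by default
fast = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
print(type(fast).__name__)  # LlamaTokenizerFast

# The slow tokenizer is only returned when use_fast=False is passed
slow = AutoTokenizer.from_pretrained("huggyllama/llama-7b", use_fast=False)
print(type(slow).__name__)  # LlamaTokenizer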

@ArthurZucker
Collaborator

ArthurZucker commented Jun 9, 2023

No, there isn't. I am not in favor of introducing a very hacky behaviour when the fix should be done in Rust in that case.
The following works:

from transformers import LlamaTokenizer, LlamaTokenizerFast
import torch

fast_tokenizer = LlamaTokenizerFast.from_pretrained("huggyllama/llama-7b", truncation_side="left", bos_token = "</s>", unk_token = "</s>")
fast_tokenizer.add_special_tokens({"pad_token": "[PAD]"})


tokenizer = LlamaTokenizer.from_pretrained("huggyllama/llama-7b", truncation_side="left", bos_token = "</s>", unk_token = "</s>")
tokenizer.add_special_tokens({"pad_token": "[PAD]"})
prompt = "What is unusual about this image?"

encoding = tokenizer(prompt, return_tensors="pt")

fast_encoding = fast_tokenizer(prompt, return_tensors="pt")

for k,v in encoding.items():
    assert torch.allclose(fast_encoding[k], v)

@ArthurZucker
Collaborator

Also, once you have a tokenizer ready, you can save it, and it should have the correct post-processor.
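For example (a minimal sketch; the directory name is arbitrary and fast_tokenizer is the one initialized with bos_token="</s>" above):

# Saving writes tokenizer.json, which keeps the post-processor with the chosen BOS token
fast_tokenizer.save_pretrained("llama-7b-custom-tokenizer")

# Reloading from the saved directory gives back the same behaviour
from transformers import LlamaTokenizerFast
reloaded = LlamaTokenizerFast.from_pretrained("llama-7b-custom-tokenizer")
print(reloaded("What is unusual about this image?").input_ids)  # starts with 2, i.e. "</s>"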

@NielsRogge
Contributor Author

Ok, thanks a lot, it now works fine and I can use the fast tokenizer.

@dsdanielpark

dsdanielpark commented Oct 6, 2023

Hello, ArthurZucker.

Thank you for the great package and maintenance.
I wanted to ask whether Llama 2's fast tokenizer is currently functioning correctly based on the above code, or whether it still has a problem.

I noticed this message while training Llama 2:

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding

@ArthurZucker
Collaborator

The above message is pretty much unrelated; it is just a hint to help you improve performance when padding the input. Everything should work alright! 😉
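In other words, the warning is just pointing at the following pattern (a minimal sketch; the example texts and the pad-token choice are made up for illustration):

from transformers import LlamaTokenizerFast

tokenizer = LlamaTokenizerFast.from_pretrained("huggyllama/llama-7b")
tokenizer.pad_token = tokenizer.eos_token  # LLaMA ships without a pad token

texts = ["What is unusual about this image?", "Describe the scene."]

# Recommended: __call__ tokenizes and pads the whole batch in one go
batch = tokenizer(texts, padding=True, return_tensors="pt")

# Slower pattern the warning refers to: encode each text, then call pad() afterwards
encoded = [tokenizer.encode(t) for t in texts]
padded = tokenizer.pad({"input_ids": encoded}, padding=True, return_tensors="pt")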
