Llava: add default chat templates #31691

Merged
merged 34 commits on Jul 19, 2024
Changes from 11 commits

Commits (34)
f577ecb
add default chat templates
zucchini-nlp Jun 28, 2024
1831329
Update src/transformers/models/llava/processing_llava.py
zucchini-nlp Jun 28, 2024
d684157
Update src/transformers/models/llava_next/processing_llava_next.py
zucchini-nlp Jun 28, 2024
a22c149
more clear docstring and docs
zucchini-nlp Jun 28, 2024
c4c1880
Update docs/source/en/model_doc/llava.md
zucchini-nlp Jun 28, 2024
ac07a33
Update docs/source/en/model_doc/llava_next.md
zucchini-nlp Jun 28, 2024
41afdbd
Update docs/source/en/model_doc/vipllava.md
zucchini-nlp Jun 28, 2024
07a91be
add tests
zucchini-nlp Jun 28, 2024
400a8b2
remove default templates (see #31733)
zucchini-nlp Jul 10, 2024
2be86fc
load chat template from another file
zucchini-nlp Jul 10, 2024
3ff974f
Merge branch 'main' into chat_templates
zucchini-nlp Jul 10, 2024
3ad481b
Update docs/source/en/model_doc/llava_next.md
zucchini-nlp Jul 12, 2024
0f0b8a2
revert some changes in docs
zucchini-nlp Jul 12, 2024
2a3df50
forgot vipllava
zucchini-nlp Jul 12, 2024
dd7aad9
Merge branch 'huggingface:main' into chat_templates
zucchini-nlp Jul 15, 2024
7c215bd
chat template file is not temporary hack
zucchini-nlp Jul 16, 2024
78a5876
Merge remote-tracking branch 'upstream/main' into chat_templates
zucchini-nlp Jul 16, 2024
f9e47b8
Merge remote-tracking branch 'upstream/main' into chat_templates
zucchini-nlp Jul 17, 2024
b4deec5
warn if loading from processor
zucchini-nlp Jul 17, 2024
6abf2d6
not that file
zucchini-nlp Jul 17, 2024
8ef3e2e
similarly modify `save_pretrained`
zucchini-nlp Jul 17, 2024
eba4512
Update tests/models/llava_next/test_processor_llava_next.py
zucchini-nlp Jul 17, 2024
62c4ac6
Update tests/models/vipllava/test_processor_vipllava.py
zucchini-nlp Jul 17, 2024
ac01cdd
Update docs/source/en/model_doc/vipllava.md
zucchini-nlp Jul 17, 2024
109e198
Update src/transformers/processing_utils.py
zucchini-nlp Jul 17, 2024
54451dc
Update src/transformers/processing_utils.py
zucchini-nlp Jul 17, 2024
11fc70c
Update docs/source/en/model_doc/vipllava.md
zucchini-nlp Jul 17, 2024
5cb21ec
Update docs/source/en/model_doc/llava.md
zucchini-nlp Jul 17, 2024
a127f1d
Update docs/source/en/model_doc/llava.md
zucchini-nlp Jul 17, 2024
4850a3b
Update docs/source/en/model_doc/llava_next.md
zucchini-nlp Jul 17, 2024
ca2696f
Update docs/source/en/model_doc/llava_next.md
zucchini-nlp Jul 17, 2024
97b227c
Update src/transformers/processing_utils.py
zucchini-nlp Jul 17, 2024
0809d57
Update docs/source/en/model_doc/llava_next.md
zucchini-nlp Jul 17, 2024
6e2153f
fix
zucchini-nlp Jul 18, 2024
54 changes: 32 additions & 22 deletions docs/source/en/model_doc/llava.md
@@ -40,28 +40,38 @@ The original code can be found [here](https://github.com/haotian-liu/LLaVA/tree/

- Note that the model has not been explicitly trained to process multiple images in the same prompt; although this is technically possible, you may experience inaccurate results.

- For better results, we recommend users to prompt the model with the correct prompt format. Below is a list of prompt formats accepted by each llava checkpoint:

[llava-interleave models](https://huggingface.co/collections/llava-hf/llava-interleave-668e19a97da0036aad4a2f19) requires the following format:
```bash
"<|im_start|>user <image>\nWhat is shown in this image?<|im_end|><|im_start|>assistant"
```

For multiple turns conversation:

```bash
"<|im_start|>user <image>\n<prompt1><|im_end|><|im_start|>assistant <answer1><|im_end|><|im_start|>user <image>\n<prompt1><|im_end|><|im_start|>assistant "
```

[llava-1.5 models](https://huggingface.co/collections/llava-hf/llava-15-65f762d5b6941db5c2ba07e0) requires the following format:
```bash
"USER: <image>\n<prompt> ASSISTANT:"
```

For multiple turns conversation:

```bash
"USER: <image>\n<prompt1> ASSISTANT: <answer1></s>USER: <prompt2> ASSISTANT: <answer2></s>USER: <prompt3> ASSISTANT:"
- For better results, we recommend using the processor's `apply_chat_template()` method to format your prompt correctly. Each message in the conversation history is a dictionary with the keys "role" and "content", and "content" should be a list of dictionaries for the "text" and "image" modalities, as follows:

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")

conversation = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": "What’s shown in this image?"},
],
},
{
"role": "assistant",
"content": [{"type": "text", "text": "This image shows a red stop sign."},]
},
{
"role": "user",
"content": [
{"type": "text", "text": "Describe the image in more detail."},
],
},
]

text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

# Note that the template simply formats your prompt, you still have to tokenize it and obtain pixel values for your images
print(text_prompt)
```
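
To actually run the model, the formatted prompt still has to go through the processor together with an image. A minimal sketch, continuing from the snippet above (the blank placeholder image and the generation settings are illustrative assumptions, not part of the original example):

```python
from PIL import Image
from transformers import LlavaForConditionalGeneration

model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")

# Blank placeholder image; in practice, load your own image here.
image = Image.new("RGB", (336, 336), color="white")

# `processor` and `text_prompt` come from the snippet above; this call tokenizes
# the prompt and computes pixel values for the image in one step.
inputs = processor(text_prompt, image, return_tensors="pt")

# Generate a continuation and decode it back to text.
output = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(output[0], skip_special_tokens=True))
```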

### Using Flash Attention 2
96 changes: 77 additions & 19 deletions docs/source/en/model_doc/llava_next.md
@@ -46,25 +46,42 @@ The original code can be found [here](https://github.com/haotian-liu/LLaVA/tree/

- We advise users to use `padding_side="left"` when computing batched generation as it leads to more accurate results. Simply make sure to call `processor.tokenizer.padding_side = "left"` before generating.

- Note that each checkpoint has been trained with a specific prompt format, depending on which large language model (LLM) was used. Below, we list the correct prompt formats to use for the text prompt "What is shown in this image?":
- Note that each checkpoint has been trained with a specific prompt format, depending on which large language model (LLM) was used. You can use the processor's `apply_chat_template` method to format your prompts correctly. Each message in the conversation history is a dictionary with the keys "role" and "content", and "content" should be a list of dictionaries for the "text" and "image" modalities. Below is an example of how to do that.

[llava-v1.6-mistral-7b-hf](https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf) requires the following format:
We will use [llava-v1.6-mistral-7b-hf](https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf) and a conversation history of text and images. Each content field has to be a list of dicts, as follows:

```bash
"[INST] <image>\nWhat is shown in this image? [/INST]"
```
```python
from transformers import LlavaNextProcessor

processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")

conversation = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": "What’s shown in this image?"},
],
},
{
"role": "assistant",
"content": [{"type": "text", "text": "This image shows a red stop sign."},]
},
{
"role": "user",
"content": [
{"type": "text", "text": "Describe the image in more detail."},
],
},
]

[llava-v1.6-vicuna-7b-hf](https://huggingface.co/llava-hf/llava-v1.6-vicuna-7b-hf) and [llava-v1.6-vicuna-13b-hf](https://huggingface.co/llava-hf/llava-v1.6-vicuna-13b-hf) require the following format:
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

```bash
"A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. USER: <image>\nWhat is shown in this image? ASSISTANT:"
# Note that the template simply formats your prompt, you still have to tokenize it and obtain pixel values for your images
print(text_prompt)
```

[llava-v1.6-34b-hf](https://huggingface.co/llava-hf/llava-v1.6-34b-hf) requires the following format:

```bash
"<|im_start|>system\nAnswer the questions.<|im_end|><|im_start|>user\n<image>\nWhat is shown in this image?<|im_end|><|im_start|>assistant\n"
```

## Usage example

@@ -86,8 +103,17 @@ model.to("cuda:0")
# prepare image and text prompt, using the appropriate prompt template
url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"

conversation = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": "What is shown in this image?"},
],
},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(prompt, image, return_tensors="pt").to("cuda:0")

# autoregressively complete prompt
@@ -120,15 +146,47 @@ image_cats = Image.open(requests.get(url, stream=True).raw)
url = "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.jpg"
image_snowman = Image.open(requests.get(url, stream=True).raw)

# Prepare a batched prompt, where the first one is a multi-turn conversation and the second is not
prompt = [
"[INST] <image>\nWhat is shown in this image? [/INST] There is a red stop sign in the image. [INST] <image>\nWhat about this image? How many cats do you see [/INST]",
"[INST] <image>\nWhat is shown in this image? [/INST]"
# Prepare a batch of two prompts, where the first one is a multi-turn conversation and the second is not
conversation_1 = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": "What is shown in this image?"},
],
},
{
"role": "assistant",
"content": [
{"type": "text", "text": "There is a red stop sign in the image."},
],
},
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": "What about this image? How many cats do you see?"},
],
},
]

conversation_2 = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": "What is shown in this image?"},
],
},
]

prompt_1 = processor.apply_chat_template(conversation_1, add_generation_prompt=True)
prompt_2 = processor.apply_chat_template(conversation_2, add_generation_prompt=True)
prompts = [prompt_1, prompt_2]

# We can simply feed the images in the order they are used in the text prompts
# Each "<image>" token consumes one image, leaving the remaining images for the subsequent "<image>" tokens
inputs = processor(text=prompt, images=[image_stop, image_cats, image_snowman], padding=True, return_tensors="pt").to(model.device)
inputs = processor(text=prompts, images=[image_stop, image_cats, image_snowman], padding=True, return_tensors="pt").to(model.device)

# Generate
generate_ids = model.generate(**inputs, max_new_tokens=30)
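
# A minimal decoding sketch, assuming the `processor` and `generate_ids` defined above:
# `batch_decode` converts the generated token ids for both prompts back to text.
outputs = processor.batch_decode(generate_ids, skip_special_tokens=True)
for decoded in outputs:
    print(decoded)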
42 changes: 32 additions & 10 deletions docs/source/en/model_doc/vipllava.md
@@ -34,16 +34,38 @@ Tips:

- Note that the model has not been explicitly trained to process multiple images in the same prompt; although this is technically possible, you may experience inaccurate results.

- For better results, we recommend users to prompt the model with the correct prompt format:

```bash
A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.###Human: <image>\n<prompt>###Assistant:
```

For multiple turns conversation:

```bash
A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.###Human: <image>\n<prompt1>###Assistant: <answer1>###Human: <prompt2>###Assistant:
- For better results, we recommend using the processor's `apply_chat_template()` method to format your prompt correctly. Each message in the conversation history is a dictionary with the keys "role" and "content", and "content" should be a list of dictionaries for the "text" and "image" modalities, as follows:

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("llava-hf/vip-llava-7b-hf")

conversation = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": "What’s shown in this image?"},
],
},
{
"role": "assistant",
"content": [{"type": "text", "text": "This image shows a red stop sign."},]
},
{
"role": "user",
"content": [
{"type": "text", "text": "Describe the image in more detail."},
],
},
]

text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

# Note that the template simply formats your prompt, you still have to tokenize it and obtain pixel values for your images
print(text_prompt)
```

The original code can be found [here](https://github.com/mu-cai/ViP-LLaVA).
32 changes: 32 additions & 0 deletions src/transformers/processing_utils.py
@@ -554,8 +554,11 @@ def get_processor_dict(

pretrained_model_name_or_path = str(pretrained_model_name_or_path)
is_local = os.path.isdir(pretrained_model_name_or_path)
chat_template_file = None
if os.path.isdir(pretrained_model_name_or_path):
processor_file = os.path.join(pretrained_model_name_or_path, PROCESSOR_NAME)
chat_template_file = os.path.join(pretrained_model_name_or_path, "chat_template.json")

if os.path.isfile(pretrained_model_name_or_path):
resolved_processor_file = pretrained_model_name_or_path
is_local = True
@@ -564,6 +567,7 @@
resolved_processor_file = download_url(pretrained_model_name_or_path)
else:
processor_file = PROCESSOR_NAME
chat_template_file = "chat_template.json"
try:
# Load from local folder or from cache or download from model Hub and cache
resolved_processor_file = cached_file(
@@ -580,6 +584,23 @@
subfolder=subfolder,
_raise_exceptions_for_missing_entries=False,
)

# load chat template from a separate json if exists
# TODO @raushan: remove this hack after a few major releases
Collaborator:

We can't do that for two reasons:

  • Older versions of transformers will still break, as processors won't accept the chat template in the config. We can't assume people will have the newest versions installed.
  • People may have local copies of chat_template.json after this update. Even if we updated all public checkpoints on the Hub to have the templates in the config, this would break things for anyone who is using a local checkpoint or even a custom template.

FWIW - I think it's a lot cleaner having the chat template in a separate file anyway. Templates can be verbose and can easily clutter up the config files - similar to keeping the vocab files separate for the tokenizers.

Member Author:

Hmm, I thought we could deprecate chat_template.json once the old versions are too old to still be in someone's env. I am okay with keeping chat_template.json, but wouldn't that be different from what we have for LLMs?

Another option I proposed was to add the template to the processor's tokenizer. If we don't want to have the template in the processor config, this would be the better option IMO. WDYT?

Member Author:

> Older versions of transformers will still break as processors won't accept the chat template in the config.

Oh, also, continuing on this (we can also discuss it internally on Slack later): we recently added an option for some processors to accept any kwargs, and I was hoping to start integrating new kwargs for VLMs. Does this comment mean that we can't do that and will need another hack to load those kwargs?

Collaborator:

> Hmm I thought we could deprecate the chat_template.json when the old versions are very old to be in someone's env

The problem is - when would this happen? Some people are still installing transformers v3, so we can't assume everyone will be working from the latest versions, even if it's only a few months out. We can deprecate things within the code, which we have full control over. Unfortunately, since the configs sit outside the library, people can use them from the first version they're introduced in until now.

> Another option I proposed was to add the template in processor's tokenizer. If we don't want to have the template in processor.config, this will be the better option IMO. WDYT?

Just to make sure we're talking about the same thing: this would mean adding it to the tokenizer's config for the checkpoint, e.g. the tokenizer config of a llava checkpoint?

It's a solution, but I don't think it semantically makes sense: the tokenizer is for processing text, whereas this introduces information about images.

> We recently added an option for some processors to accept any kwargs and I was hoping to start integrating new kwargs for VLMs. Does this comment mean that we can't do it and will need another hack to load those kwargs?

We definitely want this feature added for our processors, and it gives us flexibility for future compatibility. It won't, however, fix things for older versions. My understanding is that if the chat_template gets added into the config, then older versions would still break when trying to load the config file.

Member Author:

> It's a solution, but I don't think it semantically makes sense: the tokenizer is for processing text, whereas this introduces information about images.

Yes, I meant that checkpoint. Now I see why it's not in the tokenizer's config; this workaround of having a separate JSON for templates is indeed better.

> It won't however fix things for older versions. My understanding is that if the chat_template gets added into the config, then older versions would still break when trying to load the config file.

Yeah, this will be the problem in any case, as we won't have compatibility with older transformers versions given how breaking the new changes are. Dealing with hub-transformers compatibility is harder than it looked 😅 I guess for the VLM processor refactor we'll have to wait a while and change things slowly, to see users' reactions...

resolved_chat_template_file = cached_file(
pretrained_model_name_or_path,
chat_template_file,
cache_dir=cache_dir,
force_download=force_download,
proxies=proxies,
resume_download=resume_download,
local_files_only=local_files_only,
token=token,
user_agent=user_agent,
revision=revision,
subfolder=subfolder,
_raise_exceptions_for_missing_entries=False,
)
except EnvironmentError:
# Raise any environment error raise by `cached_file`. It will have a helpful error message adapted to
# the original exception.
@@ -593,6 +614,14 @@
f" directory containing a {PROCESSOR_NAME} file"
)

# add chat template as kwarg before returning below because most models don't have processor config
chat_template = None
if resolved_chat_template_file is not None:
with open(resolved_chat_template_file, "r", encoding="utf-8") as reader:
text = reader.read()
chat_template = json.loads(text)["chat_template"]
Comment on lines +637 to +638
Collaborator:

Is there a reason for doing this in two steps, i.e. getting the text and then json.loads, instead of json.load on the reader directly?

Member Author:

Ah, will simplify it. I was copying from the processor.

Member Author:

It can't be loaded with json.load because it was saved as a TextIOWrapper. Actually, I don't know why we save it this way; I was copying from processors. Maybe it has something to do with safe saving 🤔

Collaborator:

Huh, interesting. Well, good to know :) Thanks for investigating.

kwargs["chat_template"] = chat_template

# Existing processors on the Hub created before #27761 being merged don't have `processor_config.json` (if not
# updated afterward), and we need to keep `from_pretrained` work. So here it fallbacks to the empty dict.
# (`cached_file` called using `_raise_exceptions_for_missing_entries=False` to avoid exception)
@@ -647,6 +676,7 @@ def from_args_and_dict(cls, args, processor_dict: Dict[str, Any], **kwargs):
"""
processor_dict = processor_dict.copy()
return_unused_kwargs = kwargs.pop("return_unused_kwargs", False)
chat_template = kwargs.pop("chat_template", None)

# Unlike image processors or feature extractors whose `__init__` accept `kwargs`, processor don't have `kwargs`.
# We have to pop up some unused (but specific) arguments to make it work.
@@ -657,6 +687,8 @@
del processor_dict["auto_map"]

processor = cls(*args, **processor_dict)
if chat_template is not None:
setattr(processor, "chat_template", chat_template)

# Update processor with kwargs if needed
for key in set(kwargs.keys()):
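
The `chat_template.json` file that `get_processor_dict` looks for is a JSON object with a `"chat_template"` key holding the Jinja template string, mirroring the `json.loads(text)["chat_template"]` read above. A minimal sketch (the template string here is purely illustrative, not an actual LLaVA template):

```python
import json

# Purely illustrative Jinja template; real checkpoints ship their own template string.
chat_template = (
    "{% for message in messages %}"
    "{{ message['role'] }}: "
    "{% for item in message['content'] %}"
    "{% if item['type'] == 'image' %}<image>\n"
    "{% elif item['type'] == 'text' %}{{ item['text'] }}{% endif %}"
    "{% endfor %}\n"
    "{% endfor %}"
    "{% if add_generation_prompt %}ASSISTANT: {% endif %}"
)

# Write the standalone file next to processor_config.json ...
with open("chat_template.json", "w", encoding="utf-8") as f:
    f.write(json.dumps({"chat_template": chat_template}))

# ... and read it back the same way `get_processor_dict` does.
with open("chat_template.json", "r", encoding="utf-8") as reader:
    loaded = json.loads(reader.read())["chat_template"]
assert loaded == chat_template
```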
17 changes: 17 additions & 0 deletions tests/models/llava/test_processor_llava.py
@@ -28,3 +28,20 @@ def test_can_load_various_tokenizers(self):
processor = LlavaProcessor.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
self.assertEqual(processor.tokenizer.__class__, tokenizer.__class__)

def test_chat_template(self):
processor = LlavaProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
expected_prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"

messages = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": "What is shown in this image?"},
],
},
]

formatted_prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
self.assertEquals(expected_prompt, formatted_prompt)
41 changes: 41 additions & 0 deletions tests/models/llava_next/test_processor_llava_next.py
@@ -0,0 +1,41 @@
# Copyright 2021 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import unittest

from transformers.testing_utils import require_vision
from transformers.utils import is_vision_available


if is_vision_available():
from transformers import AutoProcessor


@require_vision
class LlavaProcessorTest(unittest.TestCase):
def test_chat_template(self):
processor = AutoProcessor.from_pretrained("llava-hf/llava-v1.6-vicuna-7b-hf")
expected_prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"

messages = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": "What is shown in this image?"},
],
},
]

formatted_prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
self.assertEquals(expected_prompt, formatted_prompt)