
Interleaved image support in tokenizers #1138

Merged

31 commits merged into pytorch:main on Jul 4, 2024

Conversation

@RdoubleA (Contributor) commented Jul 2, 2024

Adding support for image special tokens in tokenizers and Message dataclass across the library. Starting with Llama3Tokenizer as an example, as multimodal Flamingo will be using this.

This PR enables support for multiple images per sample, interleaved throughout the text. This is done by overhauling how content is represented in the Message class:

[
    Message(role="user", content=[{"type": "image"}, {"type": "text", "content":"hello"}, ...]),
    ...
]

For backwards compatibility, passing a string into the content field will automatically be cast into the list of dictionaries. Using dictionaries is visually clunky, but it provides the most flexibility for any non-text content that needs to be tokenized differently, as well as for future metadata fields that may affect tokenization.
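The backwards-compatibility cast could look roughly like this (a sketch only; the helper name `normalize_content` is hypothetical, and the real cast lives inside the Message class):

```python
from typing import Any, Dict, List, Union


def normalize_content(
    content: Union[str, List[Dict[str, Any]]]
) -> List[Dict[str, Any]]:
    # Hypothetical sketch of the backwards-compatibility cast: a plain
    # string becomes a single-item list with a text dictionary, while
    # list content is passed through unchanged.
    if isinstance(content, str):
        return [{"type": "text", "content": content}]
    return content
```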

Chat formats also need a different paradigm: they are now treated as extra tags that are prepended and appended to the Message content list based on the role:

for message in sample:
    content = (
        [{"type": "text", "content": cls.template[message.role][0]}]
        + message.content
        + [{"type": "text", "content": cls.template[message.role][1]}]
    )
    formatted_dialogue.append(
        Message(
            role=message.role,
            content=content,
            masked=message.masked,
            ipython=message.ipython,
            eot=message.eot,
        ),
    )

Changelog

  • Overhauled the Message class. The content field is now multimodal: it can hold text or image items in a list of dictionaries. Added properties for easily checking whether a message contains image content and for getting the text-only content
  • Updated all chat format tests
  • Updated all chat formats to use the new Message content structure. Now we simply prepend and append tags as needed
  • If Message content contains an item with type == "image", Llama3Tokenizer inserts the image special token
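The changelog items above can be sketched roughly as follows (a simplified sketch; the field and property names here are illustrative, not necessarily the exact torchtune API):

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class Message:
    # Simplified sketch of the overhauled Message class; property names
    # are illustrative, not the exact torchtune API.
    role: str
    content: List[Dict[str, Any]]
    masked: bool = False
    ipython: bool = False
    eot: bool = True

    @property
    def contains_media(self) -> bool:
        # True if any content item is non-text (e.g. an image placeholder).
        return any(item["type"] == "image" for item in self.content)

    @property
    def text_content(self) -> str:
        # Concatenate only the text pieces, skipping image placeholders.
        return "".join(
            item["content"] for item in self.content if item["type"] == "text"
        )
```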

Test Plan

Compared against a reference implementation token by token, and turned the comparison into new unit tests covering text-only messages, text + image, text + interleaved images, and tool calls.
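The tokenizer-side behavior being tested (emit the image special token for each image item, encode text items normally) can be sketched as below; `encode` and `image_token_id` are stand-ins for the real tokenizer pieces, not the actual Llama3Tokenizer API:

```python
from typing import Any, Callable, Dict, List


def tokenize_content(
    content: List[Dict[str, Any]],
    encode: Callable[[str], List[int]],
    image_token_id: int,
) -> List[int]:
    # Sketch only: append the image special token id for image items and
    # encode text items with the underlying tokenizer.
    tokens: List[int] = []
    for item in content:
        if item["type"] == "image":
            tokens.append(image_token_id)
        else:
            tokens.extend(encode(item["content"]))
    return tokens
```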

Other considered approaches

There are other approaches I considered for enabling interleaved images in a message. The important workflow to consider is that users will have to use our utilities or create their own transforms to get their datasets into our Message format with interleaved image content. We should make this workflow as easy as possible.

Approach 1

Use an enum, Media.IMAGE, and make Message.content a List[Union[str, Media]] that mixes text strings with image markers placed between them

[
    Message(role="user", content=[Media.IMAGE, "hello", Media.IMAGE, "world"]),
    ...
]

Pros: slightly less tedious to convert user datasets into this structure and also process this (versus dictionaries)
Cons: not easy to support image metadata or other fields that may affect image tokenization in the future

Approach 2

Keep everything as a string and allow special tokens in the text itself. The tokenizer splits on the indicated image token, adds the correct special token id, and encodes the remaining strings normally

[
    Message(role="user", content="<img> hello <img> world", media={"image": "<img>"})
    ...
]

Pros: More prompt template / chat format friendly, as we can still format the string as is. In other approaches, since the text is broken up, prompt templating is considerably more difficult. Easier to process string only content, no impact on text-only workflows
Cons: it breaks our rule of no special tokens in the text
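For illustration, the splitting logic this rejected approach would have required might look like the following (hypothetical; not code from this PR):

```python
from typing import Callable, List


def encode_with_image_tags(
    text: str,
    encode: Callable[[str], List[int]],
    image_token_id: int,
    image_tag: str = "<img>",
) -> List[int]:
    # Hypothetical sketch of the rejected approach: split the raw string
    # on the image tag, encode each text chunk normally, and insert the
    # image special token id where the tags appeared.
    tokens: List[int] = []
    chunks = text.split(image_tag)
    for i, chunk in enumerate(chunks):
        if chunk.strip():
            tokens.extend(encode(chunk.strip()))
        if i < len(chunks) - 1:
            tokens.append(image_token_id)
    return tokens
```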


pytorch-bot bot commented Jul 2, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1138

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ No Failures

As of commit fca9031 with merge base f158577 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jul 2, 2024
@RdoubleA RdoubleA changed the title [WIP] Image/tooling support in tokenizers [WIP] Interleaved image support in tokenizers Jul 2, 2024
@@ -14,7 +14,7 @@

 from torchtune import config, utils
 from torchtune.config._utils import _get_component_from_path
-from torchtune.data import ChatFormat, InstructTemplate, Message
+from torchtune.data import apply_chat_format, ChatFormat, InstructTemplate, Message

@RdoubleA RdoubleA changed the title [WIP] Interleaved image support in tokenizers Interleaved image support in tokenizers Jul 3, 2024
@RdoubleA RdoubleA marked this pull request as ready for review July 3, 2024 02:09

codecov-commenter commented Jul 3, 2024

Codecov Report

Attention: Patch coverage is 94.70199% with 8 lines in your changes missing coverage. Please review.

Project coverage is 66.14%. Comparing base (f158577) to head (7da4189).

Files                                    Patch %   Missing lines
tests/test_utils.py                      89.65%    3
torchtune/data/_types.py                 90.00%    2
torchtune/models/llama3/_tokenizer.py    96.42%    1
torchtune/models/phi3/_tokenizer.py      80.00%    1
torchtune/modules/tokenizers/_utils.py   83.33%    1
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1138      +/-   ##
==========================================
+ Coverage   65.98%   66.14%   +0.15%     
==========================================
  Files         194      194              
  Lines        9023     9042      +19     
==========================================
+ Hits         5954     5981      +27     
+ Misses       3069     3061       -8     


@ebsmothers (Contributor) left a comment

Thanks for updating the design based on our discussions. Have a handful of comments but no huge concerns. Can you also fix the doc build job failure and update the PR summary to reflect the design we settled on here?

Files with resolved review comments:
  torchtune/data/_types.py
  tests/torchtune/datasets/test_chat_dataset.py
  tests/torchtune/models/llama3/test_llama3_tokenizer.py
  torchtune/models/llama3/_tokenizer.py
  torchtune/data/_chat_formats.py
@RdoubleA RdoubleA merged commit b317c8f into pytorch:main Jul 4, 2024
29 checks passed
maximegmd pushed a commit to maximegmd/torchtune that referenced this pull request Jul 13, 2024
5 participants