Llama3 ChatFormat? #824

Closed
Broyojo opened this issue Apr 20, 2024 · 14 comments

@Broyojo commented Apr 20, 2024

I've been trying to finetune Llama3 8b with a custom chat dataset, but there seems to be no Llama3 Chat Format class. How can I make a custom one or should I approach this a different way?

@aldialimucaj

@RdoubleA (Contributor)

@Broyojo This is a great question. There is no required "chat format" in the same sense as Llama2, where you needed to format your prompt with instruct tags, as in the Llama2ChatFormat class:

[INST] <<SYS>>
You are a helpful, respectful and honest assistant.
<</SYS>>

I am going to Paris, what should I see? [/INST] Paris, the capital of France, is known for its stunning architecture...

Instead, the tokenizer handles appending all the special tokens. If you look at the official Llama3 prompt format, it's quite different.

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{{ system_prompt }}<|eot_id|><|start_header_id|>user<|end_header_id|>

{{ user_message }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

If you look at our TikTokenTokenizer class, all of these ids are used as special tokens. So as long as you're using this tokenizer via tokenize_messages, like with the chat dataset class @aldialimucaj mentioned above, you don't need to pass in a chat format.
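
For illustration, something along these lines should work (a rough sketch only; the tokenizer builder name and exact signatures may vary by torchtune version):

from torchtune.data import Message
from torchtune.models.llama3 import llama3_tokenizer

# Build the Llama3 tokenizer from a local tiktoken model file (path is a placeholder)
tokenizer = llama3_tokenizer(path="/path/to/llama3/tokenizer.model")

messages = [
    Message(role="system", content="You are a helpful assistant."),
    Message(role="user", content="I am going to Paris, what should I see?"),
    Message(role="assistant", content="Paris, the capital of France, is known for..."),
]

# tokenize_messages inserts <|begin_of_text|>, <|start_header_id|>, <|eot_id|>, etc.,
# so no ChatFormat is needed on top of it
tokens, mask = tokenizer.tokenize_messages(messages)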

I'm in the process of adding a tutorial to our documentation page to explain these concepts and make things clearer.

@musabgultekin (Contributor)

Okay, I was banging my head against the wall over this issue today.

I used the torchtune.datasets.chat_dataset component, but it required the chat_format positional argument, so I had to pass something. I dug into the repo but wasn't able to find Llama3's format, so I created my own Llama3ChatFormat. It worked. Then, after digging in more, I realized I was basically wrapping the header tokens with another set of special header tokens :S

Then I realized that the tiktoken tokenizer class works differently than the sentencepiece tokenizer class. (The sentencepiece tokenizer is format agnostic, while the tiktoken tokenizer has one rigid format, but they both take exactly the same arguments.)
To implement a new format for Llama3, we would have to create a new tokenizer that implements Tokenizer.

I was thinking of going back to the HF Trainer, but I had already spent too much time on this, so I kept going.

Then I created a dummy format for now:

from typing import List
from torchtune.data import ChatFormat, Message

class NoopChatFormat(ChatFormat):
    # Pass-through: the Llama3 tokenizer already adds all the special tokens
    @classmethod
    def format(cls, sample: List[Message]) -> List[Message]:
        return sample

And appended it to the file torchtune/data/_chat_formats.py.

Then I used this YAML config and it worked, haha :)

# Dataset
dataset:
  _component_: torchtune.datasets.chat_dataset
  source: json
  conversation_style: openai
  chat_format: NoopChatFormat
  max_seq_len: 8192
  train_on_input: False
  split: train
  data_files: "conversations.jsonl"
seed: null
shuffle: True

(Note that "openai" conversation_style is something that I also implemented myself, I can open a PR for this)

I can open a PR for this NoopChatFormat, OR we can make the "chat_format" argument optional. Let me know which way you want to proceed. @RdoubleA

@musabgultekin (Contributor)

I think we should also document how to use a local data file as a dataset, because it took me half an hour to configure properly.

This project is extremely promising. Keep up the great work!

@RdoubleA (Contributor) commented Apr 21, 2024

Thank you for the transparent feedback @musabgultekin, I'm sorry you had to struggle to get this to work. Truthfully, we designed the dataset classes a little too much around Llama2, which DOES require a chat format since nothing is handled by the SentencePieceTokenizer, whereas Llama3 moves all the formatting into the tokenizer.

The approach you ended up with is exactly how we did it. On the main branch, chat format should now be optional (thanks to @ebsmothers and @joecummings for anticipating this). If you clone the repo from main, you should be able to use it without a chat format. Your NoopChatFormat should also work as is.

Working with a local dataset is something that will be covered in the tutorial that's in progress; hoping to put it up early this week. If you haven't figured it out already, it's very similar to how you would configure it with load_dataset directly.

chat_dataset(source='csv', data_files='my_data.csv', ...)
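
For example, with a local JSONL file of conversations it would look roughly like this (a sketch only; the tokenizer object comes from your config, the "openai" conversation style is the one you implemented above, and extra keyword arguments are forwarded to load_dataset):

from torchtune.datasets import chat_dataset

ds = chat_dataset(
    tokenizer=tokenizer,               # e.g. the Llama3 tokenizer
    source="json",                     # passed through to datasets.load_dataset
    conversation_style="openai",       # assuming the OpenAI style above lands
    max_seq_len=8192,
    train_on_input=False,
    split="train",
    data_files="conversations.jsonl",  # local file, forwarded to load_dataset
)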

I can open a PR in the meantime to clarify this in the docstrings.

Appreciate your patience on this and sticking through it! Let us know if there's any other way we can make this easier for you or others in the future.

Edit: and a PR for the OpenAI conversation style would be awesome, happy to take a look at that

@jacklanda

(Quoting @RdoubleA's reply above.)

How about adding simple usage instructions to the README or other documents, so that everyone can follow them step by step to finetune with their own custom dataset?

@RdoubleA (Contributor)

Thanks for your patience folks, I've just added a full tutorial on template differences between Llama2 and Llama3 and how to finetune Llama3 on a custom chat dataset here: https://pytorch.org/torchtune/main/tutorials/chat.html.

Hope that brings more clarity. Please do let me know if there's something that's not clear.

@jacklanda @musabgultekin @Broyojo

@jacklanda

(Quoting @RdoubleA's reply above.)

Great work! Thanks so much, @RdoubleA!

@musabgultekin (Contributor) commented Apr 24, 2024

The new doc page looks really great, thank you! It's much clearer now.
It's also great that the chat_format argument issue has been fixed. Thanks @ebsmothers!

@HaisongDing commented Apr 24, 2024

(Quoting @RdoubleA's explanation above.)

If this is the case, I think the generator should stop when an eot_id is generated instead of eos_id here.

@musabgultekin (Contributor)

That's set for Llama2. We probably need to add a config option, something like stop_tokens (since Llama3 instruct has two).
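
Conceptually the stopping check just becomes membership in a set rather than a single comparison. A minimal sketch of the idea (not torchtune's actual generation code; the token ids and the next-token callable are placeholders):

from typing import Callable, List, Set

def generate_with_stops(
    prompt_tokens: List[int],
    next_token_fn: Callable[[List[int]], int],  # wraps the model forward + sampling step
    stop_token_ids: Set[int],                   # e.g. {eot_id, eos_id} for Llama3 instruct
    max_new_tokens: int = 256,
) -> List[int]:
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        token = next_token_fn(tokens)
        tokens.append(token)
        if token in stop_token_ids:  # stop on any of the configured stop tokens
            break
    return tokens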

@ebsmothers (Contributor)

@HaisongDing @musabgultekin thanks for pointing out the multiple stop tokens. I just opened #871 to address this, will work on cleaning it up today so we have proper support here.

@RdoubleA (Contributor)

Closing this as all user questions have been addressed.

@MMM-J commented May 15, 2024

For anyone getting a "module not found" error for the custom dataset when following the tutorial:

You need to "tune cp <recipe_name> ./<recipe_name>.py" and use that local recipe file in the "tune run ..." call, so that it resolves relative to the local directory.
