Llama3 ChatFormat? #824

Closed
Broyojo opened this issue Apr 20, 2024 · 14 comments

@Broyojo commented Apr 20, 2024

I've been trying to finetune Llama3 8b with a custom chat dataset, but there seems to be no Llama3 Chat Format class. How can I make a custom one or should I approach this a different way?

@aldialimucaj

@RdoubleA (Contributor)

@Broyojo This is a great question. There is no required "chat format" in the same sense as Llama2, where you needed to format your prompt with instruct tags, as in the Llama2ChatFormat class:

[INST] <<SYS>>
You are a helpful, respectful and honest assistant.
<</SYS>>

I am going to Paris, what should I see? [/INST] Paris, the capital of France, is known for its stunning architecture...

Instead, the tokenizer handles appending all the special tokens. If you look at the official Llama3 prompt format, it's quite different.

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{{ system_prompt }}<|eot_id|><|start_header_id|>user<|end_header_id|>

{{ user_message }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

If you look at our TikTokenTokenizer class, all of these ids are used as special tokens. So as long as you're using this tokenizer via tokenize_messages, like with the chat dataset class @aldialimucaj mentioned above, you don't need to pass in a chat format.
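
For illustration, something along these lines should work (a rough sketch only; the tokenizer builder name and exact signatures may vary by torchtune version):

from torchtune.data import Message
from torchtune.models.llama3 import llama3_tokenizer

# Build the Llama3 tokenizer from a local tiktoken model file (path is a placeholder)
tokenizer = llama3_tokenizer(path="/path/to/llama3/tokenizer.model")

messages = [
    Message(role="system", content="You are a helpful assistant."),
    Message(role="user", content="I am going to Paris, what should I see?"),
    Message(role="assistant", content="Paris, the capital of France, is known for..."),
]

# tokenize_messages inserts <|begin_of_text|>, <|start_header_id|>, <|eot_id|>, etc.,
# so no ChatFormat is needed on top of it
tokens, mask = tokenizer.tokenize_messages(messages)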

I'm in the process of adding a tutorial to our documentation page to explain these concepts and make things clearer.

@musabgultekin (Contributor)

Okay, I was banging my head against the wall over this issue today.

I used the torchtune.datasets.chat_dataset component, but it required the chat_format positional argument, so I had to pass something. I dug into the repo but wasn't able to find Llama3's format, so I created my own Llama3ChatFormat. It worked. Then, after digging in more, I realized I was basically wrapping the header tokens with another set of special header tokens :S

Then I realized that the tiktoken tokenizer class works differently than the sentencepiece tokenizer class. (The sentencepiece tokenizer is format agnostic, while the tiktoken tokenizer has one rigid format, but they both take exactly the same arguments.)
To implement a new format for Llama3, we would have to create a new tokenizer that implements Tokenizer.

I was thinking of going back to the HF Trainer, but I had already spent too much time on this, so I kept going.

Then I created a dummy format for now:

from typing import List
from torchtune.data import ChatFormat, Message

class NoopChatFormat(ChatFormat):
    # Pass-through: the Llama3 tokenizer already adds all the special tokens
    @classmethod
    def format(cls, sample: List[Message]) -> List[Message]:
        return sample

And appended it to the file torchtune/data/_chat_formats.py.

Then I used this YAML config and it worked, haha :)

# Dataset
dataset:
  _component_: torchtune.datasets.chat_dataset
  source: json
  conversation_style: openai
  chat_format: NoopChatFormat
  max_seq_len: 8192
  train_on_input: False
  split: train
  data_files: "conversations.jsonl"
seed: null
shuffle: True

(Note that "openai" conversation_style is something that I also implemented myself, I can open a PR for this)

I can open a PR for this NoopChatFormat, OR we can make the "chat_format" argument optional. Let me know which way you want to proceed. @RdoubleA

@musabgultekin (Contributor)

I think we should also document how to use a local data file as a dataset, because it took me half an hour to configure properly.

This project is extremely promising. Keep up the great work!

@RdoubleA (Contributor) commented Apr 21, 2024

Thank you for the transparent feedback @musabgultekin, I'm sorry you had to struggle to get this to work. Truthfully, we designed the dataset classes a little too much around Llama2, which DOES require a chat format since nothing is handled by the SentencePieceTokenizer, whereas Llama3 moves all the formatting into the tokenizer.

The approach you ended up with is exactly how we did it. On the main branch, chat format should now be optional (thanks to @ebsmothers and @joecummings for anticipating this). If you clone the repo from main, you should be able to use it without a chat format. Your NoopChatFormat should also work as is.

Working with a local dataset is something that will be covered in the tutorial that's in progress; hoping to put it up early this week. If you haven't figured it out already, it's very similar to how you would configure it with load_dataset directly.

chat_dataset(source='csv', data_files='my_data.csv', ...)
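
For example, with a local JSONL file of conversations it would look roughly like this (a sketch only; the tokenizer object comes from your config, the "openai" conversation style is the one you implemented above, and extra keyword arguments are forwarded to load_dataset):

from torchtune.datasets import chat_dataset

ds = chat_dataset(
    tokenizer=tokenizer,               # e.g. the Llama3 tokenizer
    source="json",                     # passed through to datasets.load_dataset
    conversation_style="openai",       # assuming the OpenAI style above lands
    max_seq_len=8192,
    train_on_input=False,
    split="train",
    data_files="conversations.jsonl",  # local file, forwarded to load_dataset
)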

I can open a PR in the meantime to clarify this in the docstrings.

Appreciate your patience on this and sticking through it! Let us know if there's any other way we can make this easier for you or others in the future.

Edit: and a PR for the OpenAI conversation style would be awesome, happy to take a look at that

@jacklanda

(Quoting @RdoubleA's reply above.)

How about adding simple usage instructions to the README or other documents, so that everyone can follow them step by step to finetune with their own custom dataset?

@RdoubleA (Contributor)

Thanks for your patience folks, I've just added a full tutorial on template differences between Llama2 and Llama3 and how to finetune Llama3 on a custom chat dataset here: https://pytorch.org/torchtune/main/tutorials/chat.html.

Hope that brings more clarity. Please do let me know if there's something that's not clear.

@jacklanda @musabgultekin @Broyojo

@jacklanda

(Quoting @RdoubleA's reply above.)

Great work! Thanks so much, @RdoubleA!

@musabgultekin (Contributor) commented Apr 24, 2024

The new doc page looks really great, thank you! It's much clearer now.
It's also great that the chat_format argument issue has been fixed. Thanks @ebsmothers!

@HaisongDing commented Apr 24, 2024

(Quoting @RdoubleA's explanation above.)

If this is the case, I think the generator should stop when an eot_id is generated instead of eos_id here.

@musabgultekin (Contributor)

That's set for Llama2. We probably need to add a config option, something like stop_tokens (since Llama3 instruct has two).
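
Conceptually the stopping check just becomes membership in a set rather than a single comparison. A minimal sketch of the idea (not torchtune's actual generation code; the token ids and the next-token callable are placeholders):

from typing import Callable, List, Set

def generate_with_stops(
    prompt_tokens: List[int],
    next_token_fn: Callable[[List[int]], int],  # wraps the model forward + sampling step
    stop_token_ids: Set[int],                   # e.g. {eot_id, eos_id} for Llama3 instruct
    max_new_tokens: int = 256,
) -> List[int]:
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        token = next_token_fn(tokens)
        tokens.append(token)
        if token in stop_token_ids:  # stop on any of the configured stop tokens
            break
    return tokens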

@ebsmothers (Contributor)

@HaisongDing @musabgultekin thanks for pointing out the multiple stop tokens. I just opened #871 to address this, will work on cleaning it up today so we have proper support here.

@RdoubleA (Contributor)

Closing this as all user questions have been addressed.

@MMM-J commented May 15, 2024

For anyone getting a "module not found" error for the custom dataset when following the tutorial:

You need to "tune cp <recipe_name> ./<recipe_name>.py" and use that local recipe file in the "tune run ..." call, so that it resolves relative to the local directory.
