Llama3 ChatFormat? #824

I've been trying to finetune Llama3 8b with a custom chat dataset, but there seems to be no Llama3 ChatFormat class. How can I make a custom one, or should I approach this a different way?
Is this what you need? https://github.com/pytorch/torchtune/blob/main/torchtune/datasets/_chat.py
@Broyojo This is a great question. There is no required "chat format" in the same sense as Llama2, where you needed to format your prompt with instruct tags, as in the Llama2ChatFormat class.
Instead, the tokenizer handles appending all the special tokens itself. If you look at the official Llama3 prompt format, it's quite different:
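For reference, the Llama3 instruct template wraps each turn in header tokens and terminates it with `<|eot_id|>`:

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>

{user_message}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{assistant_response}<|eot_id|>
```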
If you look at our TikTokenTokenizer class, all of these ids are registered as special tokens. So as long as you're using this tokenizer via tokenize_messages, like with the chat dataset class @aldialimucaj mentioned above, you don't need to pass in a chat format. I'm in the process of adding a tutorial to our documentation page very soon to explain these concepts and make things clearer.
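A minimal sketch of that flow (the tokenizer path is a placeholder, and the `llama3_tokenizer` builder and `tokenize_messages` signature are assumed from torchtune as of this thread):

```python
from torchtune.data import Message
from torchtune.models.llama3 import llama3_tokenizer

# Placeholder path to the downloaded Llama3 tokenizer model
tokenizer = llama3_tokenizer("/tmp/Meta-Llama-3-8B-Instruct/original/tokenizer.model")

messages = [
    Message(role="user", content="What is torchtune?"),
    Message(role="assistant", content="A PyTorch library for LLM finetuning."),
]

# tokenize_messages inserts <|begin_of_text|>, the header ids, and <|eot_id|>
# on its own, so no ChatFormat is applied beforehand.
tokens, mask = tokenizer.tokenize_messages(messages, max_seq_len=8192)
```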
Okay, I was banging my head against the wall over this issue today. I used this component. Then I realized that the tiktoken tokenizer class works differently than the sentencepiece tokenizer class (the sentencepiece tokenizer is more or less format-agnostic, while the tiktoken tokenizer has one rigid format, even though both take exactly the same arguments). I was thinking of going back to HF Trainer, but I had already spent too much time on this, so I had to continue. For now, I created a dummy format:

```python
from typing import List

from torchtune.data import ChatFormat, Message


class NoopChatFormat(ChatFormat):
    # Pass-through: the Llama3 tokenizer appends all special tokens itself
    @classmethod
    def format(cls, sample: List[Message]) -> List[Message]:
        return sample
```

I appended it to the file and used a yaml config, and it worked haha :)
(Note that the "openai" conversation_style is something I also implemented myself; I can open a PR for that.) I can open a PR for this NoopChatFormat, OR we can make the "chat_format" argument optional. Let me know which way you want to proceed. @RdoubleA
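For illustration, the config could look something like this (a hypothetical reconstruction, not the original yaml; the data file, the module path for NoopChatFormat, and the "openai" style are assumptions based on the comment above):

```yaml
dataset:
  _component_: torchtune.datasets.chat_dataset
  source: json
  data_files: my_chat_data.json      # placeholder local file
  conversation_style: openai         # custom style mentioned above
  chat_format: torchtune.data.NoopChatFormat  # the dummy pass-through format
  max_seq_len: 8192
```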
I think we should also document local data file usage as a dataset, because it took me half an hour to configure properly. This project is extremely promising. Keep up the great work!
Thank you for the transparent feedback @musabgultekin, I'm sorry you had to struggle to get this to work. Truthfully, we designed the dataset classes a little too much around Llama2, which DOES require a chat format since nothing is handled by the SentencePieceTokenizer, but Llama3 moves all the formatting into the tokenizer. The approach you ended up with is exactly how we did it. On the main branch, chat_format is now optional (thanks to @ebsmothers and @joecummings for anticipating this). If you clone the repo from main, you should be able to use it without a chat format; your NoopChatFormat should also work as is. Working with a local dataset is something that will be covered in the tutorial that's in progress, which I'm hoping to put up early this week. If you haven't figured it out already, it's very similar to how you would configure load_dataset directly.
I can open a PR in the meantime to clarify this in the docstrings. I appreciate your patience on this and sticking through it! Let us know if there's any other way we can make this easier for you or others in the future.

Edit: and a PR for the OpenAI conversation style would be awesome, happy to take a look at that.
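Concretely, the mapping to `load_dataset` is direct (a sketch, assuming `chat_dataset` forwards extra keyword arguments to Hugging Face's `load_dataset`; file and tokenizer paths are placeholders):

```python
from datasets import load_dataset
from torchtune.datasets import chat_dataset
from torchtune.models.llama3 import llama3_tokenizer

# Plain Hugging Face usage for a local json file...
hf_ds = load_dataset("json", data_files="my_chat_data.json", split="train")

# ...and the equivalent torchtune dataset: "source" plus the extra kwargs
# are passed straight through to load_dataset.
tokenizer = llama3_tokenizer("/tmp/Meta-Llama-3-8B-Instruct/original/tokenizer.model")
ds = chat_dataset(
    tokenizer=tokenizer,
    source="json",
    data_files="my_chat_data.json",
    split="train",
    conversation_style="sharegpt",
    max_seq_len=8192,
)
```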
How about updating the simple usage instructions in the README or other documents, so that everyone can follow them step by step to implement finetuning with a custom dataset?
Thanks for your patience folks, I've just added a full tutorial on template differences between Llama2 and Llama3 and how to finetune Llama3 on a custom chat dataset here: https://pytorch.org/torchtune/main/tutorials/chat.html. Hope that brings more clarity. Please do let me know if there's something that's not clear.
Great work! Thanks @RdoubleA so much!
The new doc page looks really great, thank you! It's much clearer now.
If this is the case, I think the generator should stop when an `<|eot_id|>` token is generated.
That's set for Llama2. We probably need to add a config option, something like stop_tokens (since Llama3 instruct has two).
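For illustration, a greedy decode loop honoring multiple stop tokens could look like this (a sketch, not torchtune's generation code; `model` is a placeholder returning `[batch, seq, vocab]` logits, and it assumes batch size 1):

```python
import torch

# Llama3 instruct's two stop tokens: <|end_of_text|> (128001) and <|eot_id|> (128009)
STOP_TOKENS = {128001, 128009}

def generate(model, tokens: torch.Tensor, max_new_tokens: int) -> torch.Tensor:
    """Greedy decode that halts on any stop token."""
    for _ in range(max_new_tokens):
        logits = model(tokens)  # assumed shape: [batch, seq, vocab]
        next_token = logits[:, -1].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_token], dim=-1)
        if next_token.item() in STOP_TOKENS:  # assumes batch size 1
            break
    return tokens
```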
@HaisongDing @musabgultekin thanks for pointing out the multiple stop tokens. I just opened #871 to address this and will work on cleaning it up today so we have proper support here.
Closing this as all user questions have been addressed.
For anyone getting a "module not found" error for the custom dataset when following the tutorial: you need to run `tune cp <recipe_name> ./<recipe_name>.py` and use that local recipe file in the `tune run ...` call, so that the custom dataset module resolves relative to the local directory.