Chat dataset + SlimOrca refactor + more templates #576
Conversation
CI: ✅ No failures as of commit e215272 with merge base 6bc450c (see hud.pytorch.org/pr/pytorch/torchtune/576).
Force-pushed from 91aa4ad to b33c3c9.
torchtune/config/_utils.py (outdated)

```python
except InstantiationError:
    # Verify that string can be used as a template, should have variable
    # placeholders
    pattern = r"\{.+?\}"
```
Is this the most robust validation? E.g. I think `\{hello\}` will pass but is not a valid template. Not to mention that we are not validating the number of args or anything like that. Not a huge deal because I know config validation is hard, but I just want to be realistic about how much we can accomplish with this.
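The concern can be made concrete with a quick standalone check (a sketch using the pattern quoted above): the regex accepts escaped braces just as readily as real placeholders.

```python
import re

# The placeholder-detection pattern from the snippet above.
pattern = r"\{.+?\}"

# A real template matches, as intended...
print(bool(re.search(pattern, "Instruction: {instruction}")))  # True
# ...but so does a string with escaped braces, which .format() cannot handle.
print(bool(re.search(pattern, r"\{hello\}")))  # True
```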
Why is `{hello}` not a valid template? It technically is, with the variable placeholder `hello`.
```python
>>> hi = "\{hello\}"
>>> hi.format(hello='a')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'hello\\'
```
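A stricter alternative (not what this PR does; just a sketch using the stdlib's `string.Formatter`) would parse the template and require every field name to be a plain identifier, which rejects the `\{hello\}` case above:

```python
from string import Formatter

def is_valid_template(template: str) -> bool:
    """Sketch: accept only templates whose placeholders are simple identifiers."""
    try:
        # Formatter.parse yields (literal_text, field_name, format_spec, conversion)
        fields = [f for _, f, _, _ in Formatter().parse(template) if f is not None]
    except ValueError:
        # Raised for malformed templates, e.g. an unmatched "{"
        return False
    return bool(fields) and all(f.isidentifier() for f in fields)

print(is_valid_template("{hello}"))     # True
print(is_valid_template(r"\{hello\}"))  # False: field name parses as "hello\"
print(is_valid_template("{"))           # False: malformed
```

Note this sketch deliberately rejects dotted or indexed fields like `{a.b}`; it trades flexibility for a tighter check.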
```python
    ValueError: if the template is not a PromptTemplate class or a proper
        template string
"""
path = "torchtune.data." + template
```
I'm confused, isn't this different from our usual instantiate logic? Why the change here?
This is different from `instantiate` because it works with the string directly instead of a `DictConfig`.
I see. Personally I find that a little bit confusing, but I guess we don't expose this in configs anyway, right? (At least in the current form)
No, this is strictly for the dataset builders. This method was originally in `datasets/utils`, but I moved it to `config` since it was more akin to config functionality.
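For reference, the string-based lookup being discussed can be sketched with `importlib` (the function name here is illustrative; the PR's actual helper and its error handling may differ):

```python
import importlib

def get_component_from_path(path: str):
    """Resolve a dotted path such as 'torchtune.data.SomeTemplate' to the
    object it names. Sketch only; error handling is minimal."""
    module_name, _, attr = path.rpartition(".")
    return getattr(importlib.import_module(module_name), attr)
```

For example, `get_component_from_path("math.sqrt")(9.0)` returns `3.0`.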
Force-pushed from 908da3b to 182ca4a.
Great to see the improved chat dataset support! Left a bunch of comments but no major concerns from my side.
Force-pushed from 182ca4a to e215272.
Context
Chat and conversational data are among the most common dataset types that OSS users want to fine-tune on. Including tools and abstractions that empower users to quickly configure their own chat dataset, without the overhead of data preprocessing, can be immensely valuable.
The challenge here is designing an API that is general enough to apply to many chat datasets but not so rigid that it adds friction to the developer workflow. This is what I would primarily like early feedback on. You can see an example of how it generalizes with the `slimorca_dataset` builder.

Challenge: Conversational data can take many different formats, and it's difficult to anticipate most or all of them
This is the biggest hurdle, but if we engineer a well-designed solution it would make users' lives significantly easier, or at least provide strong guidelines for how to customize to their own dataset. The approach we take here is to define a few lightweight abstractions:
These are not new ideas; they were taken straight from Meta's llama inference repo. Axolotl also does something similar. We need to enforce a particular format so that other components can be easily designed around this assumption, and it's not entirely unreasonable to place the burden on users to format their data this way. This tradeoff is preferable to designing for ANY type of conversation format, or to multiple branching if-else statements.
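As an illustration, a ShareGPT-style converter in the spirit of `sharegpt_to_llama_dialogue` might look like the following sketch (the real transform's signature and the `Dialogue` type may differ; plain dicts are used here as stand-ins):

```python
# Hypothetical mapping: ShareGPT samples use "from"/"value" keys with
# "human"/"gpt" speaker tags, while a dialogue wants role/content messages.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def convert_to_dialogue(sample: dict) -> list[dict]:
    return [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in sample["conversations"]
    ]

sample = {"conversations": [
    {"from": "human", "value": "What is 2 + 2?"},
    {"from": "gpt", "value": "4."},
]}
# convert_to_dialogue(sample) yields user/assistant messages in order.
```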
The user will need to do this via `convert_to_dialogue`, a mandatory `Callable` parameter. The contract is pretty clear: process a `Sample` and return a `Dialogue`. You can see an example in the `sharegpt_to_llama_dialogue` transform. Users may typically want to transform their data anyway as a preprocessing step before templating and tokenization; this parameter simply takes the place of that.

Challenge: Multi-turn conversations
Handling multiple turns requires templating each turn individually while simultaneously respecting max sequence length, which can easily lead to a convoluted for loop. I think the approach here ended up being relatively straightforward, but I need feedback to see if I missed any edge cases.
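The loop described above might be sketched as follows (a simplified stand-in: callables replace the real template and tokenizer, and truncation here is a naive per-token cutoff rather than whatever the PR actually implements):

```python
from typing import Callable

def tokenize_dialogue(
    dialogue: list[dict],
    format_turn: Callable[[dict], str],  # e.g. a chat template applied per turn
    encode: Callable[[str], list[int]],  # tokenizer stand-in
    max_seq_len: int,
) -> list[int]:
    """Template and tokenize each turn, stopping once the length budget is spent."""
    tokens: list[int] = []
    for turn in dialogue:
        remaining = max_seq_len - len(tokens)
        if remaining <= 0:
            break
        tokens.extend(encode(format_turn(turn))[:remaining])
    return tokens
```

With `encode` as a character-level tokenizer and `max_seq_len=5`, a two-turn dialogue of "abc" and "defg" truncates to the tokens for "abcde".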
Challenge: Interop with sample packing
This is still something I'm working through, so it is TBD.
Changelog

- Add `ChatDataset` abstraction and unit tests
- Refactor `SlimOrcaDataset` into a `slimorca_dataset` builder
- Add `Llama2ChatTemplate`, `MistralChatTemplate`, `ChatMLTemplate`
- Add `tokenize_prompt_and_response` and `truncate_if_necessary` to `torchtune/data/`
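For context on the new templates, the publicly documented Llama 2 chat format wraps user turns in `[INST]` tags with an optional `<<SYS>>` system block. A minimal sketch of the kind of output a class like `Llama2ChatTemplate` produces (this is not the PR's class, and its actual API may differ):

```python
class Llama2ChatTemplateSketch:
    """Hypothetical sketch of the Llama 2 chat format, not the PR's class."""

    def format(self, user: str, system: str = "") -> str:
        # Fold the optional system prompt into the first user turn.
        if system:
            user = f"<<SYS>>\n{system}\n<</SYS>>\n\n{user}"
        return f"[INST] {user} [/INST] "

print(Llama2ChatTemplateSketch().format("Hello"))  # -> "[INST] Hello [/INST] "
```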
Test plan

All unit tests and integration tests pass:

```
pytest tests --with-integration
```

E2E test with a recipe: TODO