[2/n] SFTDataset: refactor slimorca and message converters #1270
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1270
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 6d84b31 with merge base 5c7246e.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
from torchtune.modules.tokenizers import ModelTokenizer
from torchtune.datasets._finetune import SFTDataset
from torchtune.datasets._packed import PackedDataset
from torchtune.modules.transforms import Transform


def slimorca_dataset(
This whole thing is so nice.
Returns:
-    ChatDataset: dataset configured with SlimOrca source data and Llama2 chat template
+    Union[SFTDataset, PackedDataset]: dataset configured with SlimOrca source data
Can we use the `|` operator yet for this? Or is that only available in Python 3.11?
Like `SFTDataset | PackedDataset`
3.10 and above, and I think we test 3.9?
Grrr okay. Soon, soon.
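For reference, a minimal self-contained sketch (not from this PR) contrasting the two spellings: `typing.Union` works on Python 3.9, while the bare `X | Y` annotation is a Python 3.10+ feature (PEP 604) unless `from __future__ import annotations` defers evaluation. The stub classes and `build_*` functions below are illustrative stand-ins.

```python
from __future__ import annotations  # annotations become strings, so `X | Y` also works on 3.9

from typing import Union


# Stand-in stubs so the sketch is self-contained; in torchtune these would be
# the real SFTDataset and PackedDataset classes.
class SFTDataset: ...
class PackedDataset: ...


def build_union_style(packed: bool) -> Union[SFTDataset, PackedDataset]:
    # Spelling that works on Python 3.9 with no future import needed.
    return PackedDataset() if packed else SFTDataset()


def build_pipe_style(packed: bool) -> SFTDataset | PackedDataset:
    # PEP 604 spelling; evaluated natively only on Python 3.10+, hence the
    # future import above for 3.9 compatibility.
    return PackedDataset() if packed else SFTDataset()
```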
Codecov Report
Attention: Patch coverage is …

Additional details and impacted files:

@@            Coverage Diff             @@
##             main    #1270       +/-   ##
===========================================
- Coverage   70.49%   27.04%   -43.46%
===========================================
  Files         251      251
  Lines       11596    11640      +44
===========================================
- Hits         8175     3148    -5027
- Misses       3421     8492    +5071

☔ View full report in Codecov by Sentry.
Rafi shows me what a 10x engineer looks like, this is fresh clean
docs/source/api_ref_data.rst (Outdated)

InputOutputToMessages
ShareGPTToMessages
JSONToMessages

Helper funcs
nit: Helper Functions??
One comment, otherwise looks great
    This value needs to be at least 4 though it is generally set to max sequence length accepted by the model.
    Default is 1024.
model_transform (Transform): model specific transform to convert a list of messages
    output by the dataset to tokens. This will always be a :class:`~torchtune.modules.tokenizers.ModelTokenizer`.
I commented this on the other PR (after it landed), but can you clarify why this always has to be a `ModelTokenizer`? And if that's the case, why don't we type it as such?
Discussed offline: keeping this as `Transform` to maintain consistency across `SFTDataset`, text dataset builders, and multimodal dataset builders.
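For what it's worth, a rough sketch of the relationship under discussion, assuming (as in torchtune) that a model tokenizer is itself callable on a sample and therefore satisfies a generic `Transform` protocol; the `DummyTokenizer` and `build_dataset` names below are illustrative, not the PR's code.

```python
from typing import Any, Mapping, Protocol


class Transform(Protocol):
    """Generic sample-to-sample transform, mirroring torchtune's Transform protocol."""

    def __call__(self, sample: Mapping[str, Any]) -> Mapping[str, Any]: ...


class DummyTokenizer:
    """Hypothetical ModelTokenizer-like object: 'tokenizes' the messages in a sample."""

    def __call__(self, sample: Mapping[str, Any]) -> Mapping[str, Any]:
        tokens = [len(str(m)) for m in sample["messages"]]  # fake tokenization
        return {**sample, "tokens": tokens}


def build_dataset(model_transform: Transform) -> None:
    # Typing the argument as Transform keeps text, multimodal, and tokenizer
    # transforms interchangeable, even though SFT text datasets will in
    # practice receive a ModelTokenizer here.
    out = model_transform({"messages": ["hello", "world"]})
    print(out["tokens"])


build_dataset(DummyTokenizer())
```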
Context

What is the purpose of this PR?

As discussed in the RFC in #1186, we will merge instruct and chat datasets into the following unified pipeline that can better support multimodal:

- `message_transform` to create `List[Message]` from the dataset, with full flexibility on columns, ad-hoc modifications, etc. For multimodal, images are additionally loaded from the path
- `prompt_template` as an optional way to add structured text around specific roles in the list of messages
- `model_transform` that takes the list of messages and tokenizes it. For multimodal, it will additionally apply model-specific image transforms to the images associated with the sample

For ease of review, we will stage this as multiple moderate-sized PRs. This PR updates `slimorca_dataset` to use `SFTDataset` and refactors the message converters `get_sharegpt_messages` and `get_openai_messages` into their transform analogues, `ShareGPTToMessages` and `JSONToMessages`. It also renames `_finetune.py` to `_sft.py`. A sketch of the unified pipeline is included below.

Previous PR: #1234
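To make the three-stage pipeline above concrete, here is a minimal, self-contained sketch of how a message transform, an optional prompt template, and a model transform could compose; the class names, field names, and toy "tokenizer" are illustrative stand-ins, not torchtune's actual implementations.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class Message:
    role: str       # "system", "user", or "assistant"
    content: str


def message_transform(sample: Dict[str, Any]) -> List[Message]:
    # Stage 1: map raw dataset columns to a list of Messages.
    # (A real ShareGPT-style transform would also remap "human"/"gpt" roles.)
    return [Message(role=t["role"], content=t["text"]) for t in sample["turns"]]


def prompt_template(messages: List[Message]) -> List[Message]:
    # Stage 2 (optional): wrap structured text around specific roles.
    return [
        Message(m.role, f"User: {m.content}") if m.role == "user" else m
        for m in messages
    ]


def model_transform(messages: List[Message]) -> List[int]:
    # Stage 3: tokenize the messages (a stand-in for a real ModelTokenizer).
    return [len(m.content.split()) for m in messages]


def sft_pipeline(sample: Dict[str, Any]) -> List[int]:
    return model_transform(prompt_template(message_transform(sample)))


sample = {"turns": [{"role": "user", "text": "hello there"},
                    {"role": "assistant", "text": "hi!"}]}
print(sft_pipeline(sample))  # token counts per message in this toy sketch
```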
Test plan

- `ShareGPTToMessages`, `JSONToMessages`
- `slimorca_dataset`
- `pytest tests`
- `pytest tests -m integration_test`
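As an illustration only (not the PR's actual tests), a minimal pytest sketch of what a unit test for a ShareGPT-style message transform could look like. The import path, the input schema ("conversations" entries with "from"/"value" keys), the output key ("messages"), and the role mapping are assumptions about the format, not copied from the PR.

```python
# test_sharegpt_to_messages_sketch.py -- illustrative only, assumed schema.
import pytest

from torchtune.data import ShareGPTToMessages  # assumed import path


@pytest.fixture
def sharegpt_sample():
    return {
        "conversations": [
            {"from": "system", "value": "You are helpful."},
            {"from": "human", "value": "What is 2 + 2?"},
            {"from": "gpt", "value": "4"},
        ]
    }


def test_sharegpt_to_messages_roles(sharegpt_sample):
    transform = ShareGPTToMessages()
    messages = transform(sharegpt_sample)["messages"]  # assumed output key
    # Assumed role mapping: system -> system, human -> user, gpt -> assistant.
    assert [m.role for m in messages] == ["system", "user", "assistant"]
```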