Support for unstructured text corpus datasets for CPT #868
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/868. Note: links to docs will display an error until the docs builds have completed.
✅ No failures as of commit 55a6de3 with merge base 29ae975. This comment was automatically generated by Dr. CI and updates every 15 minutes.
This is cool. Do we plan to support CPT in torchtune in general, possibly with an RFC around it?
I don't know that CPT will be sufficiently different from our finetune recipes to warrant its own recipe, beyond different data handling. @rohan-varma would you know :) or maybe @pbontrager ?
I don't think so initially. I think this dataset is good, and we could add a recipe in the future if there are additional custom things that people want for advanced use cases.
Codecov Report — Attention: Patch coverage is
Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #868       +/-  ##
===========================================
- Coverage   67.10%   27.20%   -39.91%
===========================================
  Files         174      180        +6
  Lines        7423     7518       +95
===========================================
- Hits         4981     2045     -2936
- Misses       2442     5473     +3031
```

☔ View full report in Codecov by Sentry.
```python
column="article",
max_seq_len=max_seq_len,
split="train",
name="3.0.0",
```
What is "name"?
Let me add a comment, but it's basically specifying the subset of the data.
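For context, `name` is forwarded to Hugging Face `datasets.load_dataset`, where it selects a dataset configuration (subset); for `cnn_dailymail` the configurations are versions such as `"3.0.0"`. A minimal sketch of how such kwargs might be assembled — `build_load_kwargs` is a hypothetical helper for illustration, not torchtune API:

```python
# Hypothetical helper: assemble the keyword arguments a dataset wrapper
# might forward to datasets.load_dataset. `name` picks the dataset
# configuration (subset), e.g. the "3.0.0" version of cnn_dailymail.
def build_load_kwargs(source, split, name=None):
    kwargs = {"path": source, "split": split}
    if name is not None:
        kwargs["name"] = name  # dataset config/subset
    return kwargs

kwargs = build_load_kwargs("cnn_dailymail", split="train", name="3.0.0")
print(kwargs)
# {'path': 'cnn_dailymail', 'split': 'train', 'name': '3.0.0'}
```

The actual call would then be `load_dataset(**kwargs)`, which requires the `datasets` package and network access.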
Any update here? I want to train on https://huggingface.co/datasets/allenai/dolma soon.
```python
    return tokens, labels


def text_completion_dataset(
```
Sorry, one last nit: can you add an Examples section like this one?
torchtune/torchtune/datasets/_chat.py, line 137 in 29ae975:
Examples:
It'll make the docs and usage much clearer
🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥
Context
What is the purpose of this PR? Is it to
Please link to any issues this PR addresses: #845, #809
Continued pre-training (CPT) uses an identical data processing pipeline to standard pre-training: the model simply predicts the next token and completes the text. It does not require any templating or prompt formatting. Our existing dataset classes are specifically designed for instruct and chat tuning and don't support free-form text corpora.
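The pipeline described above can be sketched in a few lines. This is a minimal illustration, not the actual torchtune implementation; `ToyTokenizer` and `tokenize_sample` are hypothetical stand-ins:

```python
# Hypothetical character-level tokenizer standing in for a real tokenizer's
# `encode`, which adds BOS/EOS tokens if requested.
class ToyTokenizer:
    bos_id = 1
    eos_id = 2

    def encode(self, text, add_bos=True, add_eos=True):
        tokens = [ord(c) for c in text]
        if add_bos:
            tokens = [self.bos_id] + tokens
        if add_eos:
            tokens = tokens + [self.eos_id]
        return tokens


def tokenize_sample(tokenizer, text, max_seq_len=None):
    # No prompt templating: the raw text is tokenized directly.
    tokens = tokenizer.encode(text, add_bos=True, add_eos=True)
    if max_seq_len is not None:
        tokens = tokens[:max_seq_len]
    # For plain language modeling the labels are a copy of the tokens;
    # the shift-by-one typically happens later in the loss computation.
    labels = tokens.copy()
    return tokens, labels


tokens, labels = tokenize_sample(ToyTokenizer(), "hi", max_seq_len=8)
print(tokens)   # [1, 104, 105, 2]
print(labels)   # [1, 104, 105, 2]
```

Because nothing model-specific beyond BOS/EOS handling happens here, the same path works for any free-form text corpus.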
Here, we add a dataset class `TextDataset` that simply calls `load_dataset` and tokenizes the text directly without any further processing. It uses `encode`, which simply adds BOS/EOS tokens if needed. This should be compatible with llama2, llama3, and other models' formatting requirements.

Changelog
What are the changes made in this PR?
- `TextDataset` and appropriate tests
- `cnn_dailymail_articles_dataset` and appropriate tests

Test plan
Please make sure to do each of the following if applicable to your PR. (If you're not sure about any one of these, just ask and we will happily help.)
- `pre-commit install`
- `pytest tests`
- `pytest tests -m integration_test`
Ran the full finetune distributed recipe with the CNN DailyMail dataset.