Adding SlimOrca Dataset to the datasets collection #116
Conversation
Force-pushed from d621e6d to 443b13a
torchtune/datasets/slimorca.py (outdated)

    def transform(
        self, prompt: str, instruction: str, response: str
    ) -> Tuple[List[int], List[int]]:
        # Add instruction and response tags to construct the input string
Looking to get feedback about these tags. I just followed the Alpaca dataset approach but I am unsure if that is correct. Cc @rohan-varma @joecummings
Not sure if there's a canonical reference implementation, but here's one that may be worth checking out: https://github.com/intel/intel-extension-for-transformers/blob/main/intel_extension_for_transformers/llm/finetuning/data_utils.py#L285
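For reference, here is a minimal sketch of the Alpaca-style tagging being discussed. The tag strings, the free-function form, and the tokenizer's encode() method are assumptions for illustration, not the final torchtune API:

from typing import List, Tuple

# Assumed Alpaca-style tags; the actual strings used by the dataset may differ.
_INSTRUCTION_TAG = "\n\n### Instruction:\n"
_RESPONSE_TAG = "\n\n### Response:\n"

def alpaca_style_transform(
    tokenizer, prompt: str, instruction: str, response: str
) -> Tuple[List[int], List[int]]:
    # Concatenate prompt, instruction, and response tags into one input string,
    # then tokenize the input and the response separately.
    instructions_and_inputs = tokenizer.encode(
        prompt + _INSTRUCTION_TAG + instruction + _RESPONSE_TAG
    )
    labels = tokenizer.encode(response)
    return instructions_and_inputs, labels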
LG, will stamp once there are logs of training w/ loss decreasing. Also, general q: how do we want to think about unit testing for this dataset and datasets in general?
torchtune/datasets/slimorca.py (outdated)

            prompt + instruction_tag + instruction + response_tag
        )
        labels = self._tokenizer.encode(response)
        return instructions_and_inputs, labels
LG, although curious what other libraries such as HF / lit-gpt do here. Might be useful to check lit-gpt and see which (label, target) pairs they return.
@rohan-varma @joecummings @ebsmothers @pbontrager Would like your opinion on this: from reading a bit about HF chat templates (https://huggingface.co/docs/transformers/main/en/chat_templating), it looks like the template used to format a training sample is usually tied to how the model was pre-trained. Llama recommends a particular format - https://fburl.com/83srjcjj (though other formats have also worked well - https://www.pinecone.io/learn/llama-2/). So in this case, I plan to change the formatting to the one prescribed by Llama. I might have to copy/paste sections of code from Llama's generation.py to massage the SlimOrca data into that format. But this also means we would need to switch the template if we use a different pre-trained model. Basically my question is: in cases like the Alpaca dataset, is the instruction/prompt/response formatting selected by the dataset or by the pre-trained model? That is, will the current checked-in Alpaca dataset work out of the box for a non-Llama pre-trained model?
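For context, the Llama 2 chat format referenced above wraps each user turn in [INST] / [/INST] markers, with an optional <<SYS>> block for the system prompt. Below is a rough sketch of applying it to a SlimOrca-style (system, user) pair; the helper name is invented for illustration:

# Llama 2 chat markers as published with the llama-2 chat models.
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

def format_llama2_prompt(system: str, user: str) -> str:
    # Fold the system prompt into the first user turn, then wrap with [INST] markers.
    if system:
        user = B_SYS + system + E_SYS + user
    return f"{B_INST} {user.strip()} {E_INST}"

# Example usage with a SlimOrca-like sample; the assistant response would be
# appended (and tokenized) after this prompt.
prompt = format_llama2_prompt(
    system="You are a helpful assistant.",
    user="Summarize the following paragraph ...",
)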
Minor side note, @joecummings: I noticed that in lit-gpt's prepare-alpaca, the label contains the full prompt + instruction + input + response - https://fburl.com/eko93dyj - instead of just the response. Is there any reason to go one way vs. the other (and for the lit-gpt approach, is the model expected to output the entire prompt + instruction + input + response)?
Great catch! The instruction/prompt/response formatting is selected by the dataset; HOWEVER, the chat template is selected by the pre-trained model. This includes things like […]. One possible solution is to add an API to our […].
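On the label question above, one common convention (used by several fine-tuning setups, though not necessarily what this PR adopts) is to feed the full prompt + response as the input and mask the prompt portion of the labels with an ignore index, so the loss only covers the response tokens. A minimal sketch:

from typing import List, Tuple

CROSS_ENTROPY_IGNORE_IDX = -100  # PyTorch's CrossEntropyLoss ignores this index by default

def build_input_and_labels(
    prompt_tokens: List[int], response_tokens: List[int]
) -> Tuple[List[int], List[int]]:
    # The model sees prompt + response; the loss is computed only on the response.
    input_ids = prompt_tokens + response_tokens
    labels = [CROSS_ENTROPY_IGNORE_IDX] * len(prompt_tokens) + response_tokens
    return input_ids, labels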
Force-pushed from d7fdd9f to d8e283c
Force-pushed from d8e283c to d744fe2
@rohan-varma Please take a look when you get a chance. Also do recommend whether I should add a new recipe for slimorca. Here are the loss values over a bunch of iterations - P1011481434. One thing I want to call out is that training doesn't complete and fails midway, either due to GPU OOM or because the seq length of an input exceeds the max seq length. There are a few options in this case:
i) Pretokenize the dataset in the init method and drop any sample that is beyond max_seq_length (this will lead to a long time to first batch)
ii) […]
iii) […]
We can start with (ii) and implement (iii). Or, if we want to add the pre-prep option as we discussed, we can use this as a motivation to add option (i).
Let's discuss this w/ @pbontrager, but for now I think we should be mostly unblocked in terms of validating correctness in the same recipe. Ideally we should eventually have a separate recipe, yes. Either (i) or (ii) sounds good to me re: exceeding sequence length. Pretokenization is interesting; I wonder if it will prohibitively increase TTFB and/or if pretokenized results can be cached on disk to help with this.
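To make the options above concrete, here is a rough sketch of the two simplest strategies for over-long samples: dropping them up front after pretokenizing, or truncating at access time. The function names and max_seq_len plumbing are illustrative assumptions:

from typing import List

def truncate_to_max_seq_len(tokens: List[int], max_seq_len: int) -> List[int]:
    # Truncate an over-long tokenized sample so it fits the model's context window.
    return tokens[:max_seq_len]

def drop_overlong_samples(
    tokenized_samples: List[List[int]], max_seq_len: int
) -> List[List[int]]:
    # Option (i): pretokenize up front and drop anything beyond max_seq_len.
    # Pays a one-time cost before the first batch, but guarantees no over-long
    # inputs ever reach the training loop.
    return [s for s in tokenized_samples if len(s) <= max_seq_len]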
Code LG, but let's also add unit tests (i.e., that the data format output is as expected) and docs / doc rendering? Thanks!
720 samples have input len longer than 4096 out of 363000 samples (~0.2%)
Force-pushed from 91d7c56 to 1983024
@rohan-varma @kartikayk Can you take another look? I moved the truncation logic to this PR as suggested in #213 (comment), added unit tests, and added loss progress to the description.
Awesome! Have some questions / comments. Also, in general, I notice some APIs like prompt_with_system are implemented but not used or tested, AFAICT. Pls ensure all APIs we're exposing are unit tested. Docstrings and doc rendering will need to be part of this PR as well.
Thanks for the review @rohan-varma! Addressed all comments, please take a look again.
Force-pushed from b0074fb to c6a7e3b
torchtune/datasets/slimorca.py (outdated)

    from torchtune.modules import Tokenizer


    class Llama2ChatFormatConstants:
Is this meant to be public? If it is, it should be documented.
LG overall. Will stamp once we have the relevant documentation, thanks!
@rohan-varma I have ensured the API reference docs for torchtune.datasets (generated from docstrings) render properly as part of this PR (similar to torchtune.models/modules). Are you referring to different documentation? If yes, is there an example of docs you can share that I can follow for this PR?
I can confirm the docs look good - thanks Gokul. I'm still wondering about #116 (comment) though: the comment was resolved but I don't see this class being either private or documented.

@gokulavasan @rohan-varma @kartikayk on a separate note: the tests seem to take quite a long time to run, about 20 seconds on my laptop (times on CI seem to be in the same ballpark). 20 seconds just for testing a single dataset seems quite long, and that time will add up very fast. Comparatively, testing a dataset in torchvision takes ~1s.
I added a docstring, but let me convert it to a private class as we don't want to expose/generalize it just yet. Cc @rohan-varma
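A sketch of what the private, documented version discussed above could look like; the leading underscore and the exact marker strings are assumptions about where the class lands, not the merged code:

class _Llama2ChatFormatConstants:
    """Markers used by the Llama 2 chat format.

    Kept private (leading underscore) since the format is not yet meant to be
    part of the public torchtune.datasets API.
    """

    B_INST, E_INST = "[INST]", "[/INST]"
    B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"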
Force-pushed from 4e60361 to 97f536b
This looks really good overall. I would like, in the future, for all datasets to match a protocol design and use the same method names, but we can leave that to future work on dataset generalization. Also, to Nicolas's note on testing: I think we should think hard about how we can test these without downloading the datasets. Rethinking the unit tests can also be deferred to when we have more than two datasets here, though.
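One way to test without downloading, along the lines of the note above, is to patch the dataset-loading call with a tiny in-memory fixture. A hedged pytest-style sketch; the patch target and sample schema are assumptions about the module layout, not verified against the actual code:

from unittest import mock

# A tiny in-memory stand-in for the real SlimOrca download.
_FAKE_SAMPLES = [
    {
        "conversations": [
            {"from": "system", "value": "You are a helpful assistant."},
            {"from": "human", "value": "What is 2 + 2?"},
            {"from": "gpt", "value": "4"},
        ]
    }
]

def test_slimorca_format_without_download():
    # Patch the loader so the test never touches the network; the target path
    # "torchtune.datasets.slimorca.load_dataset" is a hypothetical example.
    with mock.patch(
        "torchtune.datasets.slimorca.load_dataset", return_value=_FAKE_SAMPLES
    ):
        # Construct the dataset here and assert on the (tokens, labels) output format.
        ...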
Thanks for the pointer @pbontrager. For this particular dataset unit test, created a GitHub issue #238 as a follow-up.
Changelog
Test plan
1|61|Loss: 1.058661937713623: 0%| | 60/181746 [02:35<146:53:52, 2.91s/it]