[2/n] SFTDataset: refactor slimorca and message converters #1270

RdoubleA · 2024-08-06T01:08:42Z

Context

What is the purpose of this PR? Is it to

add a new feature
fix a bug
update tests and/or documentation
other (please add here)

As discussed in the RFC in #1186, we will merge instruct and chat datasets to the following unified pipeline that can better support multimodal:

message_transform to create List[Message] from dataset with full flexibility on columns, ad-hoc modifications, etc. For multimodal, additionally images are loaded from the path
prompt_template as a optional way to add structured text around specific roles in the list of messages
model_transform that takes the list of messages and tokenizes it. For multimodal, it will additionally apply model-specific image transforms to the images associated with the sample

For ease of review, we will stage this as multiple moderate-sized PRs. This PR updates slimorca_dataset to use SFTDataset and refactors the message converters get_sharegpt_messages and get_openai_messages to their transform analogues: ShareGPTToMessages and JSONToMessages

Also renames _finetune.py to _sft.py

Previous PR: #1234

Test plan

update live docs and check rendering
unit tests for ShareGPTToMessages, JSONToMessages
updated unit test for slimorca_dataset
run unit tests via pytest tests
run recipe tests via pytest tests -m integration_test
compare loss curves of slimorca to original versions on main

pytorch-bot · 2024-08-06T01:08:44Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1270

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 6d84b31 with merge base 5c7246e ():
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

joecummings · 2024-08-06T01:30:42Z

torchtune/datasets/_slimorca.py

-from torchtune.modules.tokenizers import ModelTokenizer
+from torchtune.datasets._finetune import SFTDataset
+from torchtune.datasets._packed import PackedDataset
+from torchtune.modules.transforms import Transform


 def slimorca_dataset(


This whole thing is so nice.

joecummings · 2024-08-06T01:38:39Z

torchtune/datasets/_slimorca.py

    Returns:
-        ChatDataset: dataset configured with SlimOrca source data and Llama2 chat template
+        Union[SFTDataset, PackedDataset]: dataset configured with SlimOrca source data


Can we use the | operator yet for this? Or is that only available in Python 3.11?

Like SFTDataset | PackedDataset

3.10 and above, and I think we test 3.9?

Grrr okay. Soon, soon.

codecov-commenter · 2024-08-06T01:39:51Z

Codecov Report

Attention: Patch coverage is 43.66197% with 40 lines in your changes missing coverage. Please review.

Project coverage is 27.04%. Comparing base (9fd5d01) to head (b58995b).

Files	Patch %	Lines
torchtune/data/_messages.py	26.08%	17 Missing ⚠️
tests/torchtune/data/test_messages.py	50.00%	12 Missing ⚠️
tests/torchtune/datasets/test_slimorca_dataset.py	46.66%	8 Missing ⚠️
torchtune/datasets/_slimorca.py	62.50%	3 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##             main    #1270       +/-   ##
===========================================
- Coverage   70.49%   27.04%   -43.46%     
===========================================
  Files         251      251               
  Lines       11596    11640       +44     
===========================================
- Hits         8175     3148     -5027     
- Misses       3421     8492     +5071

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

SalmanMohammadi · 2024-08-06T07:36:24Z

Rafi shows me what a 10x engineer looks like, this is fresh clean

SalmanMohammadi · 2024-08-06T11:08:33Z

docs/source/api_ref_data.rst

+    InputOutputToMessages
+    ShareGPTToMessages
+    JSONToMessages
+
 Helper funcs


nit: Helper Functions??

ebsmothers

One comment, otherwise looks great

ebsmothers · 2024-08-06T13:51:01Z

torchtune/datasets/_slimorca.py

-            This value needs to be at least 4 though it is generally set to max sequence length accepted by the model.
-            Default is 1024.
+        model_transform (Transform): model specific transform to convert a list of messages
+            output by the dataset to tokens. This will always be a :class:`~torchtune.modules.tokenizers.ModelTokenizer`.


I commented this on the other PR (after it landed), but can you clarify why this always has to be a ModelTokenizer? And if that's the case, why don't we type it as such?

Discussed offline, keeping this as Transform to maintain consistency across SFTDataset, text dataset builders, multimodal dataset builders

RdoubleA added 23 commits July 22, 2024 18:44

initial commit

a3fe457

Merge branch 'main' into merged_dataset_1

9da786f

flesh out prompt templates

969909d

Merge branch 'main' into merged_dataset_1

c422a01

refactor samsum

ef79507

Merge branch 'main' into merged_dataset_1

5d2e7f5

add all tests, update live docs

7d54201

Merge branch 'main' into merged_dataset_1

062ff38

fix tests

df00fe1

move converters

f9f3174

change naming

4157dd7

Merge branch 'merged_dataset_1' into merged_dataset_2

ca64903

fix docstrings

d7c2246

fix recipe tests

ba2e2ec

Merge branch 'merged_dataset_1' into merged_dataset_2

364bff6

remove content.strip() in tokenizer

a531e48

Merge branch 'merged_dataset_1' into merged_dataset_2

4befcbf

fix test

c57a26a

Merge branch 'merged_dataset_1' into merged_dataset_2

67ad97a

Merge branch 'main' into merged_dataset_2

5d9cf3d

fix merge

9087027

clean up docstrings

1f930dd

add test

b58995b

facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 6, 2024

joecummings reviewed Aug 6, 2024

View reviewed changes

SalmanMohammadi reviewed Aug 6, 2024

View reviewed changes

ebsmothers approved these changes Aug 6, 2024

View reviewed changes

RdoubleA added 3 commits August 6, 2024 13:38

Merge branch 'main' into merged_dataset_2

b9b7431

nit on docs

22a65aa

rename _finetune to _sft

6d84b31

joecummings approved these changes Aug 6, 2024

View reviewed changes

RdoubleA merged commit 2014dd3 into pytorch:main Aug 6, 2024
29 checks passed

RdoubleA deleted the merged_dataset_2 branch August 6, 2024 21:36

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[2/n] SFTDataset: refactor slimorca and message converters #1270

[2/n] SFTDataset: refactor slimorca and message converters #1270

RdoubleA commented Aug 6, 2024 •

edited

Loading

pytorch-bot bot commented Aug 6, 2024 •

edited

Loading

joecummings Aug 6, 2024

joecummings Aug 6, 2024

joecummings Aug 6, 2024

RdoubleA Aug 6, 2024

joecummings Aug 6, 2024

codecov-commenter commented Aug 6, 2024

SalmanMohammadi commented Aug 6, 2024

SalmanMohammadi Aug 6, 2024

ebsmothers left a comment

ebsmothers Aug 6, 2024

RdoubleA Aug 6, 2024

[2/n] SFTDataset: refactor slimorca and message converters #1270

[2/n] SFTDataset: refactor slimorca and message converters #1270

Conversation

RdoubleA commented Aug 6, 2024 • edited Loading

Context

Test plan

pytorch-bot bot commented Aug 6, 2024 • edited Loading

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1270

✅ No Failures

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

codecov-commenter commented Aug 6, 2024

Codecov Report

SalmanMohammadi commented Aug 6, 2024

Choose a reason for hiding this comment

ebsmothers left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

RdoubleA commented Aug 6, 2024 •

edited

Loading

pytorch-bot bot commented Aug 6, 2024 •

edited

Loading