Add Support for `revision` Dataset Parameter to specify reading from Huggingface Dataset Revision #1912

thomascleberg · 2024-09-12T21:42:46Z

Adds support for an optional dataset revision parameter.

Motivation and Context

This feature adds support for loading from a particular Hugging Face dataset revision, that is, a PR ref or commit hash. This is an important feature of datasets on the Hugging Face Hub; it is illustrated in the second example for `load_dataset`.

Some dataset management strategies or workflows treat these revisions as "candidates": a dataset pull request is merged only after an experiment run against it completes successfully. To support this, we need to be able to specify which revision of the dataset to load.
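
As a rough illustration of the request, the sketch below forwards an optional revision to `datasets.load_dataset`, which accepts a `revision` keyword. The wrapper name `load_with_revision` and the injectable `loader` argument are assumptions made for this example and are not taken from the PR's diff.

```python
from typing import Any, Callable, Optional


def load_with_revision(
    path: str,
    revision: Optional[str] = None,
    loader: Optional[Callable[..., Any]] = None,
    **kwargs: Any,
) -> Any:
    """Load a dataset, pinning it to `revision` when one is given.

    `loader` is a hypothetical hook so the function can be exercised
    without network access; by default it is `datasets.load_dataset`.
    """
    if loader is None:
        # Imported lazily so the sketch does not require `datasets`
        # unless the real loader is actually used.
        from datasets import load_dataset

        loader = load_dataset

    if revision is not None:
        # Only pass the keyword when the user asked for a revision, so
        # the default behaviour (reading from main) is left untouched.
        kwargs["revision"] = revision

    return loader(path, **kwargs)
```

With the real loader, a call such as `load_with_revision("some-org/some-dataset", revision="<commit hash>")` would read from that commit; the repository id here is a placeholder.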

How has this been tested?

Wrote two unit tests. Tested in our current pipelines by reading training data from an existing revision. Also tested without the parameter and ensured that it read from main.

Types of changes

Parameters to HF data loading.

Social Handles (Optional)

@tcleberg

winglian · 2024-09-14T04:07:21Z

@thomascleberg , isn't revision specific to datasets loaded from huggingface hub? datasets loaded from local disk shouldn't need this so we probably shouldn't add this to some of the load_dataset calls just to be safe, otherwise, this PR is a great addition!

tcleberg · 2024-09-14T17:46:13Z

> @thomascleberg , isn't revision specific to datasets loaded from huggingface hub? datasets loaded from local disk shouldn't need this so we probably shouldn't add this to some of the load_dataset calls just to be safe, otherwise, this PR is a great addition!

Yep, perfectly reasonable!
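
The restriction agreed on here could look roughly like the following: `revision` is attached only when the dataset path does not resolve to something on local disk. The function name `build_load_kwargs`, the dict-shaped config, the `is_local` override, and the `Path.exists()` check are all assumptions for illustration; the merged code may detect local datasets differently.

```python
from pathlib import Path
from typing import Any, Dict, Optional


def build_load_kwargs(
    ds_cfg: Dict[str, Any], is_local: Optional[bool] = None
) -> Dict[str, Any]:
    """Build keyword arguments for a `load_dataset`-style call.

    `ds_cfg` is a hypothetical per-dataset config with a required
    "path" and an optional "revision".
    """
    path = ds_cfg["path"]
    if is_local is None:
        # Assumption: a path that exists on disk is a local dataset;
        # anything else is treated as a Hub repository id.
        is_local = Path(path).exists()

    kwargs: Dict[str, Any] = {"path": path}
    revision = ds_cfg.get("revision")
    if revision and not is_local:
        # Revisions only make sense for Hub-backed datasets.
        kwargs["revision"] = revision
    return kwargs
```

The result can be unpacked into the loader, for example `load_dataset(**build_load_kwargs(cfg))`, so local datasets never receive a `revision` argument.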

thomascleberg · 2024-09-18T15:56:50Z

@winglian what does the remaining process to get this merged look like?

thomascleberg · 2024-10-09T17:06:23Z

yo - @winglian can you please help me understand what the process is from this point? We have to stop using this project if I can't get this merged imminently.

winglian · 2024-10-11T14:03:46Z

@thomascleberg sorry! Just now circling back to this on my end. We're working on a process for being more responsive on PRs. @NanoCode012 just joined us full time, so that should help.

NanoCode012

Hey, sorry, I was meaning to get to this PR this week but got blocked by another PR earlier today.

This one should be ready to go. The only missing part would be to extend this to `pretraining_dataset`, but that can be a separate PR.

thomascleberg mentioned this pull request Sep 12, 2024

Add Support for Loading a Specific Dataset Revision #1911

Closed

5 tasks

winglian force-pushed the feature/enable-huggingface-dataset-revision branch from 94f19fc to 1160b5f Compare September 14, 2024 17:03

tcleberg approved these changes Sep 15, 2024

thomascleberg requested a review from tcleberg October 1, 2024 22:03

winglian added the ready to merge label Oct 11, 2024

thomascleberg and others added 4 commits October 11, 2024 11:04

Add support for revision dataset parameter

363f7e0

only use revision on hf hub backed datasets

01aed9d

use revision tied to head

b29ccb8

set download to use revision

a41f8d3

winglian force-pushed the feature/enable-huggingface-dataset-revision branch from f68bf5c to a41f8d3 Compare October 11, 2024 15:04

NanoCode012 added 2 commits October 11, 2024 22:12

feat: add config to model validator class

744b993

feat: add revision config to RL and tests for it

48836ae

NanoCode012 approved these changes Oct 11, 2024

winglian merged commit e73b8df into axolotl-ai-cloud:main Oct 11, 2024
9 checks passed
