Add Support for revision Dataset Parameter to specify reading from Huggingface Dataset Revision #1912
@thomascleberg, isn't revision specific to datasets loaded from the Huggingface Hub? Datasets loaded from local disk shouldn't need this, so we probably shouldn't add it to some of the load_dataset calls, just to be safe. Otherwise, this PR is a great addition!
Yep, perfectly reasonable!
@winglian what does the remaining process to get this merged look like?
yo - @winglian can you please help me understand what the process is from this point? We have to stop using this project if I can't get this merged imminently.
@thomascleberg sorry! Just now circling back on my end on this. We're working on a process for being more responsive on PRs. @NanoCode012 just joined full time with us, so that should help.
Hey, sorry, I was meaning to get to this PR this week but got blocked by another PR earlier today.
This one should be ready to go. The only missing part would be to extend for pretraining_dataset, but that can be a separate PR.
Adds support for an optional dataset revision parameter.

Motivation and Context
This feature adds support for loading from a particular Huggingface dataset revision, that is, a PR or commit hash. This is an important feature of datasets on the Huggingface Hub; it is illustrated in the 2nd example of load_dataset.
Some dataset management strategies or workflows involve using these revisions as "candidates", so that a pull request would be merged when a successful experiment is completed using it. In this case, we need to be able to specify the revision of the dataset.
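
As a rough illustration of the idea (not the code in this PR), the sketch below shows one way the optional revision could be passed through to load_dataset only for Hub datasets, in line with the review comment about local datasets. The helper name build_load_kwargs, the is_local flag, and the example repository name are hypothetical; the revision keyword of datasets.load_dataset and the refs/pr/N naming for Hub pull requests are existing conventions.

```python
from typing import Any, Dict, Optional


def build_load_kwargs(
    path: str,
    revision: Optional[str] = None,
    is_local: bool = False,
) -> Dict[str, Any]:
    """Hypothetical helper: assemble keyword arguments for datasets.load_dataset.

    The revision (branch, tag, commit hash, or PR ref) is only forwarded for
    Hub datasets; local datasets have no revisions, so it is dropped there.
    """
    kwargs: Dict[str, Any] = {"path": path}
    if revision is not None and not is_local:
        kwargs["revision"] = revision
    return kwargs


# Usage (requires the `datasets` package and network access; the repository
# name below is a placeholder, not a real dataset):
#   from datasets import load_dataset
#   ds = load_dataset(**build_load_kwargs("some-org/some-dataset", revision="refs/pr/1"))
```

When no revision is given, nothing is added to the call, so load_dataset falls back to its default (the main branch), which matches the behaviour described under testing.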
How has this been tested?
Wrote two unit tests. Tested in current pipelines to read training data from an existing revision. Tested without the parameter and ensured that it read from main.

Types of changes
Parameters to HF data loading.
Social Handles (Optional)
@tcleberg