Add Support for revision Dataset Parameter to specify reading from Huggingface Dataset Revision #1912

thomascleberg
Contributor

Adds support for an optional dataset revision parameter.

Motivation and Context

This feature adds support for loading from a particular Hugging Face dataset revision, that is, a PR ref or a commit hash. This is an important feature of datasets on the Hugging Face Hub; it is illustrated in the second example in the load_dataset documentation.
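
As a rough illustration of what loading from a revision looks like, here is a minimal sketch. The `revision` keyword of `datasets.load_dataset` is real; the helper names `hub_load_kwargs` and `load_hub_dataset` and the repo id in the usage note are hypothetical and are not taken from this PR's diff.

```python
from typing import Any, Dict, Optional


def hub_load_kwargs(
    path: str, revision: Optional[str] = None, split: str = "train"
) -> Dict[str, Any]:
    """Build keyword arguments for datasets.load_dataset (hypothetical helper)."""
    kwargs: Dict[str, Any] = {"path": path, "split": split}
    # Only pass `revision` when one was requested; omitting it lets
    # load_dataset fall back to its default (the main branch).
    if revision is not None:
        kwargs["revision"] = revision
    return kwargs


def load_hub_dataset(path: str, revision: Optional[str] = None):
    """Load a Hub dataset, optionally pinned to a PR ref or commit hash."""
    # Imported lazily so the kwargs helper can be used without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(**hub_load_kwargs(path, revision))
```

For example, `load_hub_dataset("some-org/some-dataset", revision="refs/pr/1")` would read from the first pull request on that (placeholder) repository, while leaving `revision` unset reads from main.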

Some dataset management strategies or workflows involve using these revisions as "candidates", so that a pull request would be merged when a successful experiment is completed using it. In this case, we need to be able to specify the revision of the dataset.

How has this been tested?

Wrote two unit tests. Tested in our current pipelines by reading training data from an existing revision. Also tested without the parameter to ensure that it reads from main.

Types of changes

Parameters to HF data loading.

Social Handles (Optional)

@tcleberg

@winglian
Collaborator

@thomascleberg , isn't revision specific to datasets loaded from huggingface hub? datasets loaded from local disk shouldn't need this so we probably shouldn't add this to some of the load_dataset calls just to be safe, otherwise, this PR is a great addition!

@winglian force-pushed the feature/enable-huggingface-dataset-revision branch from 94f19fc to 1160b5f on September 14, 2024 17:03
@tcleberg

@thomascleberg , isn't revision specific to datasets loaded from huggingface hub? datasets loaded from local disk shouldn't need this so we probably shouldn't add this to some of the load_dataset calls just to be safe, otherwise, this PR is a great addition!

Yep, perfectly reasonable!
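
The suggestion above, passing `revision` only for Hub loads and never for local-disk loads, could be sketched roughly as follows. This is an assumption about one way to gate the parameter, not the code merged in this PR; `load_kwargs_for` is a hypothetical name, and treating any existing filesystem path as "local" is a simplification.

```python
from pathlib import Path
from typing import Any, Dict, Optional


def load_kwargs_for(path: str, revision: Optional[str] = None) -> Dict[str, Any]:
    """Return load_dataset kwargs, adding `revision` only for non-local paths."""
    # Assumption: a path that exists on disk is a local dataset, anything
    # else is treated as a Hugging Face Hub repo id.
    is_local = Path(path).exists()
    kwargs: Dict[str, Any] = {"path": path}
    if revision is not None and not is_local:
        kwargs["revision"] = revision
    return kwargs
```

With this gating, a configured revision is silently ignored for local datasets, which keeps the local-disk `load_dataset` calls unchanged.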

@thomascleberg
Contributor Author

@winglian what does the remaining process to get this merged look like?

@thomascleberg
Contributor Author

yo - @winglian can you please help me understand what the process is from this point? We have to stop using this project if I can't get this merged imminently.

@winglian
Collaborator

@thomascleberg sorry! just now circling back on my end on this. We're working through getting a process on being more responsive on PRs. @NanoCode012 just joined full time with us so that should help.

@winglian force-pushed the feature/enable-huggingface-dataset-revision branch from f68bf5c to a41f8d3 on October 11, 2024 15:04
Collaborator

@NanoCode012 left a comment

Hey, sorry, I was meaning to get to this PR this week but got blocked by another PR earlier today.

This one should be ready to go. The only missing part would be to extend for pretraining_dataset but that can be a separate PR.

@winglian merged commit e73b8df into axolotl-ai-cloud:main on Oct 11, 2024
9 checks passed

4 participants