Add Support for Loading a Specific Dataset Revision #1911

thomascleberg · 2024-09-12T21:04:13Z

⚠️ Please check that this feature request hasn't been suggested before.

I searched previous Ideas in Discussions and didn't find any similar feature requests.
I searched previous Issues and didn't find any similar feature requests.

🔖 Feature description

This feature would add support for loading from a particular Hugging Face dataset revision - that is, a PR or commit hash. This is an important feature of datasets on the Hugging Face Hub, and it's illustrated in the 2nd example of load_dataset.

Some dataset management workflows treat these revisions as "candidates": a pull request is merged only once an experiment using it has completed successfully. To support that, we need to be able to specify which revision of the dataset to load.
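
To make the "candidate" workflow concrete, here is a minimal sketch of pinning a load to a pull request or commit. The helper names `pr_revision` and `load_candidate` are hypothetical and only for illustration. The sketch assumes the Hub's `refs/pr/<number>` ref convention for pull requests and the existing `revision` argument of `datasets.load_dataset`.

```python
from typing import Optional


def pr_revision(pr_number: int) -> str:
    """Return the Hub git ref for a pull request, e.g. 'refs/pr/3'."""
    if pr_number < 1:
        raise ValueError("pull request numbers start at 1")
    return f"refs/pr/{pr_number}"


def load_candidate(repo_id: str, revision: Optional[str] = None):
    """Load a Hub dataset, optionally pinned to a revision (hypothetical helper)."""
    # Imported lazily so pr_revision() can be used without `datasets` installed.
    from datasets import load_dataset

    # revision may be a commit hash, tag, branch name, or a PR ref;
    # None falls back to the repository's default branch.
    return load_dataset(repo_id, revision=revision)
```

For example, `load_candidate("vicgalle/alpaca-gpt4", revision=pr_revision(1))` would try to read the dataset as proposed in that repo's first pull request, if such a pull request exists.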

✔️ Solution

There would be an optional revision parameter on each datasets entry that lets you specify the revision to load (a commit hash, tag, or branch name).

datasets:
  # HuggingFace dataset repo | s3://,gs:// path | "json" for local dataset, make sure to fill data_files
  - path: vicgalle/alpaca-gpt4
    # The type of prompt to use for training. [alpaca, sharegpt, gpteacher, oasst, reflection]
    type: alpaca # format | format:<prompt_style> (chat/instruct) | <prompt_strategies>.load_<load_fn>
    ds_type: # Optional[str] (json|arrow|parquet|text|csv) defines the datatype when path is a file
    data_files: # Optional[str] path to source data files
    shards: # Optional[int] number of shards to split data into
    name: # Optional[str] name of dataset configuration to load
    train_on_split: train # Optional[str] name of dataset split to load from
    revision: # Optional[str] The specific revision of the dataset to use when loading from the Hugging Face Hub. This can be a commit hash, tag, or branch name. If not specified, the latest version will be used. This parameter is ignored for local datasets.
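
As a rough sketch of how such an entry could be forwarded to `load_dataset`, the hypothetical helper below copies only the keys that are actually set, so an empty `revision:` keeps today's behavior. This is an illustration under those assumptions, not the implementation in the linked PR, and it does not model the "ignored for local datasets" rule.

```python
from typing import Any, Dict


def build_load_kwargs(dataset_cfg: Dict[str, Any]) -> Dict[str, Any]:
    """Map one parsed `datasets:` entry to load_dataset keyword arguments.

    Hypothetical helper: only `path` is required. Empty YAML values parse
    as None and are skipped, so load_dataset keeps its own defaults.
    """
    kwargs: Dict[str, Any] = {"path": dataset_cfg["path"]}
    for key in ("name", "data_files", "revision"):
        value = dataset_cfg.get(key)
        if value:  # skip None and empty strings
            kwargs[key] = value
    return kwargs


# Example entry as it would look after YAML parsing.
entry = {
    "path": "vicgalle/alpaca-gpt4",
    "type": "alpaca",
    "revision": "refs/pr/1",  # assumed PR ref, for illustration only
}
load_kwargs = build_load_kwargs(entry)
# load_kwargs could then be passed on as: datasets.load_dataset(**load_kwargs)
```

Keys such as `type` are deliberately left out of the result because they describe prompt formatting rather than how the data is fetched.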

❓ Alternatives

I considered changing our workflow to have one dataset per experiment instead of one revision per experiment, but that's not a realistic solution: it would mean changing a whole experimental and merging setup just to avoid specifying a dataset revision.

📝 Additional Context

I have a PR ready to go on this one:

Add Support for revision Dataset Parameter to specify reading from Huggingface Dataset Revision #1912

Acknowledgements

My issue title is concise, descriptive, and in title casing.
I have searched the existing issues to make sure this feature has not been requested yet.
I have provided enough information for the maintainers to understand and evaluate this request.

thomascleberg added the enhancement New feature or request label Sep 12, 2024

NanoCode012 closed this as completed Oct 30, 2024
