⚠️ Please check that this feature request hasn't been suggested before.
I searched previous Ideas in Discussions and didn't find any similar feature requests.
I searched previous Issues and didn't find any similar feature requests.
🔖 Feature description
This feature would add support for loading from a particular Hugging Face dataset revision, that is, a PR or commit hash. This is an important feature of datasets on the Hugging Face Hub, and it's illustrated in the 2nd example of load_dataset.
Some dataset management strategies or workflows use these revisions as "candidates": a pull request against the dataset is merged only after an experiment using it completes successfully. In this case, we need to be able to specify the revision of the dataset.
✔️ Solution
There would be an optional revision parameter on each entry under datasets that allows you to specify the revision to load (a commit hash, tag, or branch name).
datasets:
  # HuggingFace dataset repo | s3://,gs:// path | "json" for local dataset, make sure to fill data_files
  - path: vicgalle/alpaca-gpt4
    # The type of prompt to use for training. [alpaca, sharegpt, gpteacher, oasst, reflection]
    type: alpaca # format | format:<prompt_style> (chat/instruct) | <prompt_strategies>.load_<load_fn>
    ds_type: # Optional[str] (json|arrow|parquet|text|csv) defines the datatype when path is a file
    data_files: # Optional[str] path to source data files
    shards: # Optional[int] number of shards to split data into
    name: # Optional[str] name of dataset configuration to load
    train_on_split: train # Optional[str] name of dataset split to load from
    revision: # Optional[str] The specific revision of the dataset to use when loading from the Hugging Face Hub. This can be a commit hash, tag, or branch name. If not specified, the latest version will be used. This parameter is ignored for local datasets.
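As a rough illustration of how such a config entry could be passed through to the Hugging Face datasets library, here is a minimal sketch. It is not the implementation from the linked PR. The helper names build_load_kwargs and load_from_config are hypothetical, the revision value "refs/pr/1" is only an example, and the sketch assumes the entry points at a Hub dataset (it does not handle the local-dataset case, where revision would be ignored). The only library call it relies on is datasets.load_dataset, which accepts a revision argument.

```python
from typing import Any, Dict


def build_load_kwargs(ds_cfg: Dict[str, Any]) -> Dict[str, Any]:
    """Map one entry of the `datasets:` list to keyword arguments for
    `datasets.load_dataset` (hypothetical helper, for illustration only).

    Optional keys that are missing or empty are skipped, so an unset
    `revision` leaves the library's default behaviour in place.
    """
    kwargs: Dict[str, Any] = {"path": ds_cfg["path"]}
    for key in ("name", "data_files", "revision"):
        value = ds_cfg.get(key)
        if value:  # a bare `revision:` in YAML parses to None
            kwargs[key] = value
    split = ds_cfg.get("train_on_split")
    if split:
        kwargs["split"] = split
    return kwargs


def load_from_config(ds_cfg: Dict[str, Any]):
    """Load the dataset described by `ds_cfg`; needs network access."""
    # Imported lazily so the helper above works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(**build_load_kwargs(ds_cfg))


example_cfg = {
    "path": "vicgalle/alpaca-gpt4",
    "type": "alpaca",
    "train_on_split": "train",
    "revision": "refs/pr/1",  # example value only: a PR ref, commit hash, tag, or branch
}
print(build_load_kwargs(example_cfg))
```

Keeping the mapping separate from the actual load call makes it easy to check that revision is forwarded only when it is set, without touching the network.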
❓ Alternatives
I considered changing our workflow to have there be one dataset per experiment instead of one revision per experiment, but that's not a realistic solution because it involves changing a whole experimental and merging setup just to avoid specifying a revision for the dataset.
📝 Additional Context
I have a PR ready to go on this one: revision Dataset Parameter to specify reading from Huggingface Dataset Revision #1912