Add loading from the Datasets Hub + add relative paths in download manager #1860
I just added the steps to share a dataset on the Datasets Hub. They are heavily inspired by the steps to share a model.

Moreover, once the new huggingface_hub is released, we can update the version in setup.py. We also need to update the command to create a dataset repo in the documentation.

I added a few more tests with the "lhoestq/test" dataset I added on the hub and it works fine :)
Here is the PR adding support for datasets repos in
Co-authored-by: Julien Chaumond <julien@huggingface.co>
With the new Datasets Hub on huggingface.co it's now possible to have a dataset repo with your own script and data.
For example: https://huggingface.co/datasets/lhoestq/custom_squad/tree/main contains one script and two json files.
You can load it using
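A call along these lines should work, assuming the `datasets` library from this PR is installed and the repo is reachable. The repo id comes from the URL above; the wrapper function `load_custom_squad` is a hypothetical name used only for illustration.

```python
def load_custom_squad():
    """Load the example dataset repo by its `user/name` identifier."""
    # Imported lazily so this sketch can be read without `datasets` installed.
    from datasets import load_dataset

    # "lhoestq/custom_squad" is the repo id taken from the URL above.
    return load_dataset("lhoestq/custom_squad")
```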
To be able to use the data files that live right next to the dataset script in the repo on the hub, I added relative path support to the DownloadManager. For example, the repo mentioned above contains two JSON files that the dataset script can download by relative path.
To make it work, I set the base_path of the DownloadManager to be the parent path of the dataset script (which comes from either a local path or a remote URL).

I also had to add the auth header to the requests to huggingface.co for private datasets repos. The token is fetched from huggingface_hub.
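The two ideas above can be sketched roughly as follows. This is not the code from the PR: `resolve_against_base` and `auth_headers` are hypothetical helpers, the `example.com` URLs are placeholders, and the token is passed in as a plain argument rather than fetched from huggingface_hub.

```python
import os
import posixpath
from urllib.parse import urlparse


def is_remote_url(path):
    """Return True for http(s) URLs, False for local paths."""
    return urlparse(path).scheme in ("http", "https")


def resolve_against_base(base_path, url_or_filename):
    """Resolve a possibly relative path against base_path (hypothetical helper)."""
    # Absolute URLs and absolute local paths are used unchanged.
    if is_remote_url(url_or_filename) or os.path.isabs(url_or_filename):
        return url_or_filename
    # Remote base: join with forward slashes regardless of the local OS.
    if is_remote_url(base_path):
        return posixpath.join(base_path, url_or_filename)
    # Local base: use the platform's own path separator.
    return os.path.join(base_path, url_or_filename)


def auth_headers(url, token):
    """Build an auth header only for requests that go to huggingface.co."""
    if token and urlparse(url).netloc == "huggingface.co":
        return {"authorization": f"Bearer {token}"}
    # Never send the token to other hosts.
    return {}


# Placeholder base path standing in for the parent of a remote dataset script.
base = "https://example.com/datasets/user/repo/resolve/main"
train_url = resolve_against_base(base, "train.json")
```

Restricting the header to huggingface.co keeps the token from leaking to third-party hosts that a dataset script may also download from.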