Add loading from the Datasets Hub + add relative paths in download manager #1860

Merged
merged 25 commits into from
Feb 12, 2021

@lhoestq lhoestq commented Feb 10, 2021

With the new Datasets Hub on huggingface.co it's now possible to have a dataset repo with your own script and data.
For example: https://huggingface.co/datasets/lhoestq/custom_squad/tree/main contains one script and two json files.

You can load it using

from datasets import load_dataset

d = load_dataset("lhoestq/custom_squad")

To use the data files that live right next to the dataset script in the repo on the hub, I added relative path support to the DownloadManager. For example, in the repo mentioned above, there are two json files that can be downloaded via

_URLS = {
    "train": "train-v1.1.json",
    "dev": "dev-v1.1.json",
}
downloaded_files = dl_manager.download_and_extract(_URLS)

To make it work, I set the base_path of the DownloadManager to be the parent path of the dataset script (which comes from either a local path or a remote url).
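As a rough illustration of the idea (not the actual DownloadManager code), relative paths could be resolved against a base path along these lines. The helper names `is_remote_url` and `resolve_against_base`, and the `example.com` URL, are hypothetical and only serve the sketch:

```python
import os
from urllib.parse import urlparse


def is_remote_url(url_or_path: str) -> bool:
    """Return True if the string looks like an http(s) URL."""
    return urlparse(url_or_path).scheme in ("http", "https")


def resolve_against_base(base_path: str, url_or_path: str) -> str:
    """Hypothetical helper: resolve a relative path against base_path.

    Absolute URLs and absolute local paths are returned unchanged.
    base_path may itself be a remote URL or a local directory.
    """
    if is_remote_url(url_or_path) or os.path.isabs(url_or_path):
        return url_or_path
    if is_remote_url(base_path):
        # URLs always use forward slashes, whatever the OS
        return base_path.rstrip("/") + "/" + url_or_path
    return os.path.join(base_path, url_or_path)


# Assumed base path: the parent "directory" of a remote dataset script
base = "https://example.com/my_dataset/main"
urls = {"train": "train-v1.1.json", "dev": "dev-v1.1.json"}
resolved = {split: resolve_against_base(base, name) for split, name in urls.items()}
```

With a local dataset script, the same helper would join the file name onto the script's parent directory instead.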

I also had to add the auth header to requests sent to huggingface.co so that private dataset repos can be accessed. The token is fetched from huggingface_hub.
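A minimal sketch of what attaching the header could look like, assuming a bearer-token scheme and that the token has already been obtained (here it is just passed in as an argument). The function name `build_auth_headers` is hypothetical and is not part of datasets or huggingface_hub:

```python
from typing import Dict, Optional
from urllib.parse import urlparse


def build_auth_headers(url: str, token: Optional[str]) -> Dict[str, str]:
    """Hypothetical helper: return an auth header only for huggingface.co URLs.

    The "Bearer" scheme is an assumption of this sketch. No header is
    returned when there is no token or when the URL points to another host,
    so the token is never sent to third-party servers.
    """
    host = urlparse(url).hostname
    if token and host == "huggingface.co":
        return {"authorization": f"Bearer {token}"}
    return {}


headers = build_auth_headers("https://huggingface.co/datasets/lhoestq/custom_squad", "my-token")
```

Restricting the header to the hub's hostname matters because dataset scripts can also download files from arbitrary external URLs.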

@lhoestq lhoestq changed the title Allow relative paths in download manager Add loading from the Datasets Hub + add relative paths in download manager Feb 11, 2021

lhoestq commented Feb 12, 2021

I just added the steps to share a dataset on the datasets hub. They're heavily inspired by the steps to share a model in the transformers doc.

Moreover, once the new huggingface_hub is released, we can update the version in setup.py. We also need to update the command to create a dataset repo in the documentation.

I added a few more tests with the "lhoestq/test" dataset I added on the hub and it works fine :)

lhoestq commented Feb 12, 2021

Here is the PR adding support for datasets repos in huggingface_hub: huggingface/huggingface_hub#14

Co-authored-by: Julien Chaumond <julien@huggingface.co>
@lhoestq lhoestq merged commit 66f2a7e into master Feb 12, 2021
@lhoestq lhoestq deleted the allow-relative-paths-in-download-manager branch February 12, 2021 19:13
2 participants