Update docs once dataset scripts transferred to the Hub #5136

Merged · 7 commits · Oct 20, 2022
15 changes: 15 additions & 0 deletions .github/ISSUE_TEMPLATE/add-dataset.md
@@ -0,0 +1,15 @@
---
name: "Add Dataset"
about: Request the addition of a specific dataset to the library.
title: ''
labels: 'dataset request'
assignees: ''

---

## Adding a Dataset
- **Name:** *name of the dataset*
- **Description:** *short description of the dataset (or link to social media or blog post)*
- **Paper:** *link to the dataset paper if available*
- **Data:** *link to the Github repository or current dataset location*
- **Motivation:** *what are some good reasons to have this dataset*
2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/config.yml
@@ -1,7 +1,7 @@
contact_links:
- name: Datasets on the Hugging Face Hub
url: https://huggingface.co/datasets
about: Open a Pull request / Discussion related to a specific dataset on the Hugging Face Hub (PRs for datasets with no namespace still have to be on GitHub though)
about: Please use the "Community" tab of the dataset on the Hugging Face Hub to open a discussion or a pull request
- name: Forum
url: https://discuss.huggingface.co/c/datasets/10
about: Please ask and answer questions here, and engage with other community members
365 changes: 8 additions & 357 deletions ADD_NEW_DATASET.md

Large diffs are not rendered by default.

4 changes: 1 addition & 3 deletions README.md
@@ -32,7 +32,7 @@

[🎓 **Documentation**](https://huggingface.co/docs/datasets/) [🕹 **Colab tutorial**](https://colab.research.google.com/github/huggingface/datasets/blob/main/notebooks/Overview.ipynb)

[🔎 **Find a dataset in the Hub**](https://huggingface.co/datasets) [🌟 **Add a new dataset to the Hub**](https://github.com/huggingface/datasets/blob/main/ADD_NEW_DATASET.md)
[🔎 **Find a dataset in the Hub**](https://huggingface.co/datasets) [🌟 **Add a new dataset to the Hub**](https://huggingface.co/docs/datasets/share.html)

<h3 align="center">
<a href="https://hf.co/course"><img src="https://raw.githubusercontent.com/huggingface/datasets/main/docs/source/imgs/course_banner.png"></a>
@@ -127,8 +127,6 @@ We have a very detailed step-by-step guide to add a new dataset to the ![number

You will find [the step-by-step guide here](https://huggingface.co/docs/datasets/share.html) to add a dataset on the Hub.

However if you prefer to add your dataset in this repository, you can find the guide [here](https://github.com/huggingface/datasets/blob/main/ADD_NEW_DATASET.md).

# Main differences between 🤗 Datasets and `tfds`

If you are familiar with the great TensorFlow Datasets, here are the main differences between 🤗 Datasets and `tfds`:
4 changes: 2 additions & 2 deletions docs/source/about_dataset_load.mdx
@@ -102,12 +102,12 @@ To ensure a dataset is complete, [`load_dataset`] will perform a series of tests
If the dataset doesn't pass the verifications, it is likely that the original host of the dataset made some changes in the data files.
In this case, an error is raised to alert that the dataset has changed.
To ignore the error, one needs to specify `ignore_verifications=True` in [`load_dataset`].
Anytime you see a verification error, feel free to [open an issue on GitHub](https://github.com/huggingface/datasets/issues) so that we can update the integrity checks for this dataset.
Anytime you see a verification error, feel free to open a discussion or pull request in the corresponding dataset "Community" tab, so that the integrity checks for that dataset are updated.
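The verification step amounts to comparing checksums and sizes recorded at dataset-creation time against the files actually downloaded. A minimal stdlib sketch of the idea (not the library's actual implementation; the function name and error message are illustrative):

```python
import hashlib

def verify(expected_sha256: str, payload: bytes, ignore_verifications: bool = False) -> None:
    # Compare the checksum recorded when the dataset was created against the
    # checksum of the bytes actually downloaded from the host.
    actual = hashlib.sha256(payload).hexdigest()
    if actual != expected_sha256 and not ignore_verifications:
        raise ValueError("checksum mismatch: the data files changed upstream")

shard = b"some dataset shard"
verify(hashlib.sha256(shard).hexdigest(), shard)    # checksums match: no error
verify("0" * 64, shard, ignore_verifications=True)  # mismatch, but check skipped
```

Passing `ignore_verifications=True` to [`load_dataset`] skips the check in the same spirit as the last call above, at the cost of not noticing upstream changes.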

## Security

The dataset repositories on the Hub are scanned for malware, see more information [here](https://huggingface.co/docs/hub/security#malware-scanning).

Moreover the datasets that were constributed on our GitHub repository have all been reviewed by our maintainers.
Moreover the datasets without a namespace (originally contributed on our GitHub repository) have all been reviewed by our maintainers.
The code of these datasets is considered **safe**.
It concerns datasets that are not under a namespace, e.g. "squad" or "glue", unlike the other datasets that are named "username/dataset_name" or "org/dataset_name".
2 changes: 1 addition & 1 deletion docs/source/loading.mdx
@@ -281,7 +281,7 @@ An object data type in [pandas.Series](https://pandas.pydata.org/docs/reference/

## Offline

Even if you don't have an internet connection, it is still possible to load a dataset. As long as you've downloaded a dataset from the Hub or 🤗 Datasets GitHub repository before, it should be cached. This means you can reload the dataset from the cache and use it offline.
Even if you don't have an internet connection, it is still possible to load a dataset. As long as you've downloaded a dataset from the Hub repository before, it should be cached. This means you can reload the dataset from the cache and use it offline.

If you know you won't have internet access, you can run 🤗 Datasets in full offline mode. This saves time because instead of waiting for the Dataset builder download to time out, 🤗 Datasets will look directly in the cache. Set the environment variable `HF_DATASETS_OFFLINE` to `1` to enable full offline mode.
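A minimal sketch of enabling full offline mode programmatically. The environment variable must be set before 🤗 Datasets is imported; the commented lines assume the `datasets` package is installed and the dataset is already cached:

```python
import os

# Must be set before `datasets` is imported, since the library reads this
# environment variable at import time.
os.environ["HF_DATASETS_OFFLINE"] = "1"

# from datasets import load_dataset        # assumes `datasets` is installed
# ds = load_dataset("squad")               # served from the local cache only

print(os.environ["HF_DATASETS_OFFLINE"])  # → 1
```

Setting the variable in your shell (`HF_DATASETS_OFFLINE=1 python script.py`) achieves the same thing without touching the script.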

14 changes: 5 additions & 9 deletions docs/source/share.mdx
@@ -144,18 +144,14 @@ Members of the Hugging Face team will be happy to review your dataset script and
## Datasets on GitHub (legacy)

Datasets used to be hosted on our GitHub repository, but all datasets have now been migrated to the Hugging Face Hub.
The legacy GitHub datasets were added originally on our GitHub repository and therefore don't have a namespace: "squad", "glue", etc. unlike the other datasets that are named "username/dataset_name" or "org/dataset_name".
Those datasets are still maintained on GitHub, and if you'd like to edit them, please open a Pull Request on the huggingface/datasets repository.
Sharing your dataset to the Hub is the recommended way of adding a dataset.

The legacy GitHub datasets were added originally on our GitHub repository and therefore don't have a namespace on the Hub: "squad", "glue", etc. unlike the other datasets that are named "username/dataset_name" or "org/dataset_name".

<Tip>

The distinction between a Hub dataset and a dataset from GitHub only comes from the legacy sharing workflow. It does not involve any ranking, decisioning, or opinion regarding the contents of the dataset itself.
The distinction between Hub datasets with and without a namespace comes only from the legacy sharing workflow. It does not involve any ranking, decisioning, or opinion regarding the contents of the dataset itself.

</Tip>

The code of these datasets are reviewed by the Hugging Face team, and they require test data in order to be regularly tested.

In some rare cases it makes more sense to open a PR on GitHub. For example when you are not the author of the dataset and there is no clear organization / namespace that you can put the dataset under.

For more info, please take a look at the documentation on [How to add a new dataset in the huggingface/datasets repository](https://github.com/huggingface/datasets/blob/main/ADD_NEW_DATASET.md).
Those datasets are now maintained on the Hub: if you think a fix is needed, please use their "Community" tab to open a discussion or create a Pull Request.
The code of these datasets is reviewed by the Hugging Face team.
18 changes: 6 additions & 12 deletions src/datasets/inspect.py
@@ -341,12 +341,9 @@ def get_dataset_config_info(
data_files (:obj:`str` or :obj:`Sequence` or :obj:`Mapping`, optional): Path(s) to source data file(s).
download_config (:class:`~download.DownloadConfig`, optional): Specific download configuration parameters.
download_mode (:class:`DownloadMode`, default ``REUSE_DATASET_IF_EXISTS``): Download/generate mode.
revision (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load:

- For datasets in the `huggingface/datasets` library on GitHub like "squad", the default version of the module is the local version of the lib.
You can specify a different version from your local version of the lib (e.g. "main" or "1.2.0") but it might cause compatibility issues.
- For community datasets like "lhoestq/squad" that have their own git repository on the Datasets Hub, the default version "main" corresponds to the "main" branch.
You can specify a different version that the default "main" by using a commit sha or a git tag of the dataset repository.
revision (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load.
As datasets have their own git repository on the Datasets Hub, the default version "main" corresponds to their "main" branch.
You can specify a different version than the default "main" by using a commit SHA or a git tag of the dataset repository.
use_auth_token (``str`` or :obj:`bool`, optional): Optional string or boolean to use as Bearer token for remote files on the Datasets Hub.
If True, will get token from `"~/.huggingface"`.
**config_kwargs (additional keyword arguments): optional attributes for builder class which will override the attributes if supplied.
@@ -405,12 +402,9 @@ def get_dataset_split_names(
data_files (:obj:`str` or :obj:`Sequence` or :obj:`Mapping`, optional): Path(s) to source data file(s).
download_config (:class:`~download.DownloadConfig`, optional): Specific download configuration parameters.
download_mode (:class:`DownloadMode`, default ``REUSE_DATASET_IF_EXISTS``): Download/generate mode.
revision (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load:

- For datasets in the `huggingface/datasets` library on GitHub like "squad", the default version of the module is the local version of the lib.
You can specify a different version from your local version of the lib (e.g. "main" or "1.2.0") but it might cause compatibility issues.
- For community datasets like "lhoestq/squad" that have their own git repository on the Datasets Hub, the default version "main" corresponds to the "main" branch.
You can specify a different version that the default "main" by using a commit sha or a git tag of the dataset repository.
revision (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load.
As datasets have their own git repository on the Datasets Hub, the default version "main" corresponds to their "main" branch.
You can specify a different version than the default "main" by using a commit SHA or a git tag of the dataset repository.
use_auth_token (``str`` or :obj:`bool`, optional): Optional string or boolean to use as Bearer token for remote files on the Datasets Hub.
If True, will get token from `"~/.huggingface"`.
**config_kwargs (additional keyword arguments): optional attributes for builder class which will override the attributes if supplied.
30 changes: 9 additions & 21 deletions src/datasets/load.py
@@ -1079,12 +1079,9 @@ def dataset_module_factory(
-> load the dataset builder from the dataset script in the dataset repository
e.g. ``glue``, ``squad``, ``'username/dataset_name'``, a dataset repository on the HF hub containing a dataset script `'dataset_name.py'`.

revision (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load:

- For datasets in the `huggingface/datasets` library on GitHub like "squad", the default version of the module is the local version of the lib.
You can specify a different version from your local version of the lib (e.g. "main" or "1.2.0") but it might cause compatibility issues.
- For community datasets like "lhoestq/squad" that have their own git repository on the Datasets Hub, the default version "main" corresponds to the "main" branch.
You can specify a different version that the default "main" by using a commit sha or a git tag of the dataset repository.
revision (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load.
As datasets have their own git repository on the Datasets Hub, the default version "main" corresponds to their "main" branch.
You can specify a different version than the default "main" by using a commit SHA or a git tag of the dataset repository.
download_config (:class:`DownloadConfig`, optional): Specific download configuration parameters.
download_mode (:class:`DownloadMode`, default ``REUSE_DATASET_IF_EXISTS``): Download/generate mode.
dynamic_modules_path (Optional str, defaults to HF_MODULES_CACHE / "datasets_modules", i.e. ~/.cache/huggingface/modules/datasets_modules):
@@ -1121,9 +1118,6 @@ def dataset_module_factory(
# - if path is a local directory (but no python file)
# -> use a packaged module (csv, text etc.) based on content of the directory
#
# - if path has no "/" and is a module on GitHub (in /datasets)
# -> use the module from the python file on GitHub
# Note that this case will be removed in favor of loading from the HF Hub instead eventually
# - if path has one "/" and is dataset repository on the HF hub with a python file
# -> the module from the python file in the dataset repository
# - if path has one "/" and is dataset repository on the HF hub without a python file
@@ -1459,12 +1453,9 @@ def load_dataset_builder(
features (:class:`Features`, optional): Set the features type to use for this dataset.
download_config (:class:`~utils.DownloadConfig`, optional): Specific download configuration parameters.
download_mode (:class:`DownloadMode`, default ``REUSE_DATASET_IF_EXISTS``): Download/generate mode.
revision (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load:

- For datasets in the `huggingface/datasets` library on GitHub like "squad", the default version of the module is the local version of the lib.
You can specify a different version from your local version of the lib (e.g. "main" or "1.2.0") but it might cause compatibility issues.
- For community datasets like "lhoestq/squad" that have their own git repository on the Datasets Hub, the default version "main" corresponds to the "main" branch.
You can specify a different version that the default "main" by using a commit sha or a git tag of the dataset repository.
revision (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load.
As datasets have their own git repository on the Datasets Hub, the default version "main" corresponds to their "main" branch.
You can specify a different version than the default "main" by using a commit SHA or a git tag of the dataset repository.
use_auth_token (``str`` or :obj:`bool`, optional): Optional string or boolean to use as Bearer token for remote files on the Datasets Hub.
If True, will get token from `"~/.huggingface"`.
**config_kwargs (additional keyword arguments): Keyword arguments to be passed to the :class:`BuilderConfig`
@@ -1629,12 +1620,9 @@ def load_dataset(
will not be copied in-memory unless explicitly enabled by setting `datasets.config.IN_MEMORY_MAX_SIZE` to
nonzero. See more details in the :ref:`load_dataset_enhancing_performance` section.
save_infos (:obj:`bool`, default ``False``): Save the dataset information (checksums/size/splits/...).
revision (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load:

- For datasets in the `huggingface/datasets` library on GitHub like "squad", the default version of the module is the local version of the lib.
You can specify a different version from your local version of the lib (e.g. "main" or "1.2.0") but it might cause compatibility issues.
- For community datasets like "lhoestq/squad" that have their own git repository on the Datasets Hub, the default version "main" corresponds to the "main" branch.
You can specify a different version that the default "main" by using a commit sha or a git tag of the dataset repository.
revision (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load.
As datasets have their own git repository on the Datasets Hub, the default version "main" corresponds to their "main" branch.
You can specify a different version than the default "main" by using a commit SHA or a git tag of the dataset repository.
use_auth_token (``str`` or :obj:`bool`, optional): Optional string or boolean to use as Bearer token for remote files on the Datasets Hub.
If True, will get token from `"~/.huggingface"`.
task (``str``): The task to prepare the dataset for during training and evaluation. Casts the dataset's :class:`Features` to standardized column names and types as detailed in :py:mod:`datasets.tasks`.