Update doc links to point to new docs (#3116)
* Update README links

* Update docs

* Update add dataset template

* Add Features docstring
mariosasko authored Oct 22, 2021
1 parent 1a9380a commit ac0d1d1
Showing 8 changed files with 44 additions and 17 deletions.
8 changes: 4 additions & 4 deletions ADD_NEW_DATASET.md
@@ -86,7 +86,7 @@ Now let's get coding :-)

The dataset script is the main entry point to load and process the data. It is a python script under `datasets/<your_dataset_name>/<your_dataset_name>.py`.

There is a detailed explanation on how the library and scripts are organized [here](https://huggingface.co/docs/datasets/add_dataset.html).
There is a detailed explanation on how the library and scripts are organized [here](https://huggingface.co/docs/datasets/master/about_dataset_load.html).

Note on naming: the dataset class should be camel case, while the dataset short_name is its snake case equivalent (ex: `class BookCorpus` for the dataset `book_corpus`).
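To make that layout concrete, here is a minimal, hedged sketch of the general shape such a script takes (the class and file names follow the naming note above; the download URL and features are purely illustrative, and the real template contains many more `TODO`s):

```python
# datasets/book_corpus/book_corpus.py -- illustrative sketch only
import datasets


class BookCorpus(datasets.GeneratorBasedBuilder):
    """CamelCase class for the snake_case dataset name `book_corpus`."""

    def _info(self):
        # Declare the dataset's features and metadata
        return datasets.DatasetInfo(
            features=datasets.Features({"text": datasets.Value("string")}),
        )

    def _split_generators(self, dl_manager):
        # Download the data and define the splits (the URL is hypothetical)
        path = dl_manager.download_and_extract("https://example.com/books.txt")
        return [datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"filepath": path})]

    def _generate_examples(self, filepath):
        # Yield (key, example) pairs matching the features declared in _info
        with open(filepath, encoding="utf-8") as f:
            for idx, line in enumerate(f):
                yield idx, {"text": line.strip()}
```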

@@ -96,7 +96,7 @@ To add a new dataset, you can start from the empty template which is [in the `te
cp ./templates/new_dataset_script.py ./datasets/<your_dataset_name>/<your_dataset_name>.py
```

And then go progressively through all the `TODO` in the template 🙂. If it's your first dataset addition and you are a bit lost among the information to fill in, you can take some time to read the [detailed explanation here](https://huggingface.co/docs/datasets/add_dataset.html).
And then go progressively through all the `TODO` in the template 🙂. If it's your first dataset addition and you are a bit lost among the information to fill in, you can take some time to read the [detailed explanation here](https://huggingface.co/docs/datasets/master/dataset_script.html).

You can also start (or copy any part) from one of the reference datasets listed below. The main criterion for choosing among these reference datasets is the format of the data files (JSON/JSONL/CSV/TSV/text) and whether you need or don't need several configurations (see above explanations on configurations). Feel free to reuse any parts of the following examples and adapt them to your case:

@@ -137,7 +137,7 @@ Sometimes you need to use several *configurations* and/or *splits* (usually at l
**Some rules to follow when adding the dataset**:

- try to give access to all the data, columns, features and information in the dataset. If the dataset contains various sub-parts with differing formats, create several configurations to give access to all of them.
- datasets in the `datasets` library are typed. Take some time to carefully think about the `features` (see an introduction [here](https://huggingface.co/docs/datasets/exploring.html#features-and-columns) and the full list of possible features [here](https://huggingface.co/docs/datasets/features.html))
- datasets in the `datasets` library are typed. Take some time to carefully think about the `features` (see an introduction [here](https://huggingface.co/docs/datasets/about_dataset_features.html) and the full list of possible features [here](https://huggingface.co/docs/datasets/package_reference/main_classes.html#features))
- if some of your dataset features are in a fixed set of classes (e.g. labels), you should use a `ClassLabel` feature (a sketch follows just below).
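For instance, a minimal sketch of typed features for a text-classification dataset (the column names here are invented for illustration):

```python
import datasets

# Illustrative columns: free text plus a label drawn from a fixed set of classes
features = datasets.Features(
    {
        "text": datasets.Value("string"),
        "label": datasets.ClassLabel(names=["negative", "positive"]),
    }
)
```

Declaring the label column as a `ClassLabel` stores the values as integers while keeping the human-readable class names attached to the dataset.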


@@ -179,7 +179,7 @@ Now that your dataset script runs and create a dataset with the format you expec
datasets-cli dummy_data datasets/<your-dataset-folder>
```

If this doesn't work, more information on how to add dummy data can be found in the documentation [here](https://huggingface.co/docs/datasets/share_dataset.html#adding-dummy-data).
If this doesn't work, more information on how to add dummy data can be found in the documentation [here](https://huggingface.co/docs/datasets/dataset_script.html#dummy-data).

If you've been fighting with dummy data creation without success for some time and can't seem to make it work, go to the next step (open a Pull Request) and we'll help you cross the finish line 🙂.

16 changes: 8 additions & 8 deletions README.md
@@ -46,7 +46,7 @@
- Lightweight and fast with a transparent and pythonic API (multi-processing/caching/memory-mapping).
- Built-in interoperability with NumPy, pandas, PyTorch, Tensorflow 2 and JAX.

🤗 Datasets originated from a fork of the awesome [TensorFlow Datasets](https://github.com/tensorflow/datasets) and the HuggingFace team want to deeply thank the TensorFlow Datasets team for building this amazing library. More details on the differences between 🤗 Datasets and `tfds` can be found in the section [Main differences between 🤗 Datasets and `tfds`](#main-differences-between-🤗-datasets-and-tfds).
🤗 Datasets originated from a fork of the awesome [TensorFlow Datasets](https://github.com/tensorflow/datasets) and the HuggingFace team want to deeply thank the TensorFlow Datasets team for building this amazing library. More details on the differences between 🤗 Datasets and `tfds` can be found in the section [Main differences between 🤗 Datasets and `tfds`](#main-differences-between--datasets-and-tfds).

# Installation

@@ -74,7 +74,7 @@ For more details on installation, check the installation page in the documentati

If you plan to use 🤗 Datasets with PyTorch (1.0+), TensorFlow (2.2+) or pandas, you should also install PyTorch, TensorFlow or pandas.

For more details on using the library with NumPy, pandas, PyTorch or TensorFlow, check the quick tour page in the documentation: https://huggingface.co/docs/datasets/quicktour.html
For more details on using the library with NumPy, pandas, PyTorch or TensorFlow, check the quick start page in the documentation: https://huggingface.co/docs/datasets/quickstart.html

# Usage

@@ -113,12 +113,12 @@ tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
tokenized_dataset = squad_dataset.map(lambda x: tokenizer(x['context']), batched=True)
```
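For a rough idea of the next step (a sketch, assuming PyTorch is installed and reusing the `tokenized_dataset` from the snippet above), the tokenized columns can then be exposed as framework tensors:

```python
# Sketch: format the tokenized columns as PyTorch tensors
tokenized_dataset.set_format(type="torch", columns=["input_ids", "attention_mask"])
print(tokenized_dataset["train"][0]["input_ids"][:10])
```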

For more details on using the library, check the quick tour page in the documentation: https://huggingface.co/docs/datasets/quicktour.html and the specific pages on:
For more details on using the library, check the quick start page in the documentation: https://huggingface.co/docs/datasets/quickstart.html and the specific pages on:

- Loading a dataset https://huggingface.co/docs/datasets/loading_datasets.html
- What's in a Dataset: https://huggingface.co/docs/datasets/exploring.html
- Processing data with 🤗 Datasets: https://huggingface.co/docs/datasets/processing.html
- Writing your own dataset loading script: https://huggingface.co/docs/datasets/add_dataset.html
- Loading a dataset https://huggingface.co/docs/datasets/loading.html
- What's in a Dataset: https://huggingface.co/docs/datasets/access.html
- Processing data with 🤗 Datasets: https://huggingface.co/docs/datasets/process.html
- Writing your own dataset loading script: https://huggingface.co/docs/datasets/dataset_script.html
- etc.

Another introduction to 🤗 Datasets is the tutorial on Google Colab here:
@@ -130,7 +130,7 @@ We have a very detailed step-by-step guide to add a new dataset to the ![number

You will find [the step-by-step guide here](https://github.com/huggingface/datasets/blob/master/ADD_NEW_DATASET.md) to add a dataset to this repository.

You can also have your own repository for your dataset on the Hub under your or your organization's namespace and share it with the community. More information in [the documentation section about dataset sharing](https://huggingface.co/docs/datasets/share_dataset.html).
You can also have your own repository for your dataset on the Hub under your or your organization's namespace and share it with the community. More information in [the documentation section about dataset sharing](https://huggingface.co/docs/datasets/share.html).

# Main differences between 🤗 Datasets and `tfds`

2 changes: 1 addition & 1 deletion docs/source/installation.md
@@ -3,7 +3,7 @@
Before you start, you will need to set up your environment and install the appropriate packages. 🤗 Datasets is tested on **Python 3.6+**.

```{seealso}
If you want to use 🤗 Datasets with TensorFlow or PyTorch, you will need to install them separately. Refer to the [TensorFlow](https://www.tensorflow.org/install/pip#tensorflow-2.0-rc-is-available) or the [PyTorch installation page](https://pytorch.org/get-started/locally/#start-locally) for the specific install command for your framework.
If you want to use 🤗 Datasets with TensorFlow or PyTorch, you will need to install them separately. Refer to the [TensorFlow](https://www.tensorflow.org/install/pip#tensorflow-2-packages-are-available) or the [PyTorch installation page](https://pytorch.org/get-started/locally/#start-locally) for the specific install command for your framework.
```

## Virtual environment
2 changes: 1 addition & 1 deletion docs/source/process.rst
@@ -408,7 +408,7 @@ Data augmentation

With batch processing, you can even augment your dataset with additional examples. In the following example, you will generate additional words for a masked token in a sentence.

Load the `RoBERTa <https://huggingface.co/roberta-base>`_ model for use in the 🤗 Transformers `FillMaskPipeline <https://huggingface.co/transformers/main_classes/pipelines.html?#transformers.FillMaskPipeline>`_:
Load the `RoBERTa <https://huggingface.co/roberta-base>`_ model for use in the 🤗 Transformers `FillMaskPipeline <https://huggingface.co/transformers/main_classes/pipelines.html#transformers.FillMaskPipeline>`_:

.. code-block::
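The body of this code block is collapsed in the hunk above; a minimal sketch of loading such a pipeline, assuming the 🤗 Transformers `pipeline` factory, could be:

```python
from transformers import pipeline

# Fill-mask pipeline backed by the roberta-base checkpoint
fillmask = pipeline("fill-mask", model="roberta-base")
mask_token = fillmask.tokenizer.mask_token  # "<mask>" for RoBERTa
```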
2 changes: 1 addition & 1 deletion docs/source/quickstart.rst
@@ -76,7 +76,7 @@ Format the dataset

Depending on whether you are using PyTorch, TensorFlow, or JAX, you will need to format the dataset accordingly. There are three changes you need to make to the dataset:

1. Rename the ``label`` column to ``labels``, the expected input name in `BertForSequenceClassification <https://huggingface.co/transformers/model_doc/bert.html?#transformers.BertForSequenceClassification.forward>`__ or `TFBertForSequenceClassification <https://huggingface.co/transformers/model_doc/bert.html?#tfbertforsequenceclassification>`__:
1. Rename the ``label`` column to ``labels``, the expected input name in `BertForSequenceClassification <https://huggingface.co/transformers/model_doc/bert.html#transformers.BertForSequenceClassification.forward>`__ or `TFBertForSequenceClassification <https://huggingface.co/transformers/model_doc/bert.html#tfbertforsequenceclassification>`__:

.. code::
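The block's contents are collapsed above; one way to perform the rename from step 1 (a sketch, assuming a `dataset` object loaded earlier in the quickstart) is the `rename_column` method:

```python
# Sketch: rename the column so the model's forward() receives the expected `labels` argument
dataset = dataset.rename_column("label", "labels")
```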
2 changes: 1 addition & 1 deletion docs/source/share.rst
@@ -71,7 +71,7 @@ Create the repository
^^^^^^^^^^^^^^^^^^^^^

Sharing a community dataset will require you to create an account on `hf.co <https://huggingface.co/join>`_ if you don't have one yet.
You can directly create a `new dataset repository <https://huggingface.co/new-dataset>`_ from your account on the Hugging Face Hub, but this guide will show you how to upload a dataset from the terminal.
You can directly create a `new dataset repository <https://huggingface.co/login?next=%2Fnew-dataset>`_ from your account on the Hugging Face Hub, but this guide will show you how to upload a dataset from the terminal.

1. Make sure you are in the virtual environment where you installed Datasets, and run the following command:

27 changes: 27 additions & 0 deletions src/datasets/features/features.py
@@ -888,6 +888,33 @@ def list_of_np_array_to_pyarrow_listarray(l_arr: List[np.ndarray], type: pa.Data


class Features(dict):
    """A special dictionary that defines the internal structure of a dataset.

    Instantiated with a dictionary of type ``dict[str, FieldType]``, where keys are the desired column names,
    and values are the type of that column.

    ``FieldType`` can be one of the following:

    - a :class:`datasets.Value` feature specifies a single typed value, e.g. ``int64`` or ``string``
    - a :class:`datasets.ClassLabel` feature specifies a field with a predefined set of classes which can have labels
      associated to them and will be stored as integers in the dataset
    - a python :obj:`dict` which specifies that the field is a nested field containing a mapping of sub-fields to sub-field
      features. It's possible to have nested fields of nested fields in an arbitrary manner
    - a python :obj:`list` or a :class:`datasets.Sequence` specifies that the field contains a list of objects. The python
      :obj:`list` or :class:`datasets.Sequence` should be provided with a single sub-feature as an example of the feature
      type hosted in this list

      .. note::

         A :class:`datasets.Sequence` with an internal dictionary feature will be automatically converted into a dictionary of
         lists. This behavior is implemented to have a compatibility layer with the TensorFlow Datasets library but may be
         unwanted in some cases. If you don't want this behavior, you can use a python :obj:`list` instead of the
         :class:`datasets.Sequence`.

    - an :class:`Array2D`, :class:`Array3D`, :class:`Array4D` or :class:`Array5D` feature for multidimensional arrays
    - a :class:`datasets.Audio` feature stores the path to an audio file and can extract audio data from it
    - :class:`datasets.Translation` and :class:`datasets.TranslationVariableLanguages`, the two features specific to Machine Translation
    """

    @property
    def type(self):
        """
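To illustrate the field types described by the new docstring, a small hedged sketch (column names invented for the example):

```python
from datasets import ClassLabel, Features, Sequence, Value

# Illustrative only: a scalar column, a nested list-of-dicts column and a class label
features = Features(
    {
        "id": Value("int64"),
        "answers": Sequence({"text": Value("string"), "answer_start": Value("int32")}),
        "category": ClassLabel(names=["news", "sports"]),
    }
)
```

As the note in the docstring says, a `Sequence` wrapping a dictionary is stored as a dictionary of lists.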
2 changes: 1 addition & 1 deletion src/datasets/load.py
@@ -1498,7 +1498,7 @@ def load_dataset(
Processing scripts are small python scripts that define the citation, info and format of the dataset,
contain the URL to the original data files and the code to load examples from the original data files.
You can find some of the scripts here: https://github.com/huggingface/datasets/datasets
You can find some of the scripts here: https://github.com/huggingface/datasets/tree/master/datasets
and easily upload yours to share them using the CLI ``huggingface-cli``.
You can find the complete list of datasets in the Datasets Hub at https://huggingface.co/datasets
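As a hedged illustration of the two cases mentioned here (a script shipped in the repository and a dataset hosted on the Hub), loading could look like the following; the community dataset name is hypothetical:

```python
from datasets import load_dataset

# A dataset defined by a processing script in the `datasets/` folder of the repository
squad_train = load_dataset("squad", split="train")

# A community dataset on the Hub, addressed as <namespace>/<dataset_name> (hypothetical name)
community = load_dataset("username/my_dataset")
```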

1 comment on commit ac0d1d1

@github-actions


PyArrow==3.0.0


Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.009046 / 0.011353 (-0.002307) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.003829 / 0.011008 (-0.007179) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.031726 / 0.038508 (-0.006783) |
| read_batch_unformated after write_array2d | 0.035319 / 0.023109 (0.012209) |
| read_batch_unformated after write_flattened_sequence | 0.294737 / 0.275898 (0.018839) |
| read_batch_unformated after write_nested_sequence | 0.406894 / 0.323480 (0.083414) |
| read_col_formatted_as_numpy after write_array2d | 0.007684 / 0.007986 (-0.000301) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.004721 / 0.004328 (0.000393) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.009009 / 0.004250 (0.004759) |
| read_col_unformated after write_array2d | 0.037858 / 0.037052 (0.000806) |
| read_col_unformated after write_flattened_sequence | 0.295518 / 0.258489 (0.037029) |
| read_col_unformated after write_nested_sequence | 0.334510 / 0.293841 (0.040669) |
| read_formatted_as_numpy after write_array2d | 0.023773 / 0.128546 (-0.104774) |
| read_formatted_as_numpy after write_flattened_sequence | 0.008261 / 0.075646 (-0.067385) |
| read_formatted_as_numpy after write_nested_sequence | 0.256729 / 0.419271 (-0.162543) |
| read_unformated after write_array2d | 0.047031 / 0.043533 (0.003498) |
| read_unformated after write_flattened_sequence | 0.298975 / 0.255139 (0.043836) |
| read_unformated after write_nested_sequence | 0.319067 / 0.283200 (0.035867) |
| write_array2d | 0.084135 / 0.141683 (-0.057548) |
| write_flattened_sequence | 1.699933 / 1.452155 (0.247778) |
| write_nested_sequence | 1.725804 / 1.492716 (0.233087) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.203988 / 0.018006 (0.185982) |
| get_batch_of_1024_rows | 0.435049 / 0.000490 (0.434560) |
| get_first_row | 0.005692 / 0.000200 (0.005492) |
| get_last_row | 0.000130 / 0.000054 (0.000076) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.036560 / 0.037411 (-0.000851) |
| shard | 0.022281 / 0.014526 (0.007756) |
| shuffle | 0.027498 / 0.176557 (-0.149059) |
| sort | 0.126638 / 0.737135 (-0.610497) |
| train_test_split | 0.029593 / 0.296338 (-0.266745) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.423170 / 0.215209 (0.207961) |
| read 50000 | 4.235989 / 2.077655 (2.158334) |
| read_batch 50000 10 | 1.959111 / 1.504120 (0.454991) |
| read_batch 50000 100 | 1.751527 / 1.541195 (0.210332) |
| read_batch 50000 1000 | 1.806348 / 1.468490 (0.337858) |
| read_formatted numpy 5000 | 0.374851 / 4.584777 (-4.209926) |
| read_formatted pandas 5000 | 4.671009 / 3.745712 (0.925297) |
| read_formatted tensorflow 5000 | 0.879231 / 5.269862 (-4.390631) |
| read_formatted torch 5000 | 0.831568 / 4.565676 (-3.734109) |
| read_formatted_batch numpy 5000 10 | 0.041025 / 0.424275 (-0.383250) |
| read_formatted_batch numpy 5000 1000 | 0.004842 / 0.007607 (-0.002765) |
| shuffled read 5000 | 0.533253 / 0.226044 (0.307208) |
| shuffled read 50000 | 5.313452 / 2.268929 (3.044524) |
| shuffled read_batch 50000 10 | 2.377745 / 55.444624 (-53.066880) |
| shuffled read_batch 50000 100 | 2.005812 / 6.876477 (-4.870664) |
| shuffled read_batch 50000 1000 | 2.011469 / 2.142072 (-0.130603) |
| shuffled read_formatted numpy 5000 | 0.480115 / 4.805227 (-4.325113) |
| shuffled read_formatted_batch numpy 5000 10 | 0.102168 / 6.500664 (-6.398496) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.050539 / 0.075469 (-0.024930) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.539189 / 1.841788 (-0.302599) |
| map fast-tokenizer batched | 12.654936 / 8.074308 (4.580628) |
| map identity | 27.700883 / 10.191392 (17.509491) |
| map identity batched | 0.802002 / 0.680424 (0.121578) |
| map no-op batched | 0.520520 / 0.534201 (-0.013681) |
| map no-op batched numpy | 0.225931 / 0.579283 (-0.353352) |
| map no-op batched pandas | 0.503459 / 0.434364 (0.069095) |
| map no-op batched pytorch | 0.190536 / 0.540337 (-0.349802) |
| map no-op batched tensorflow | 0.202975 / 1.386936 (-1.183961) |
PyArrow==latest

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.009164 / 0.011353 (-0.002189) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.003868 / 0.011008 (-0.007140) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.031605 / 0.038508 (-0.006903) |
| read_batch_unformated after write_array2d | 0.034851 / 0.023109 (0.011742) |
| read_batch_unformated after write_flattened_sequence | 0.283199 / 0.275898 (0.007301) |
| read_batch_unformated after write_nested_sequence | 0.320169 / 0.323480 (-0.003311) |
| read_col_formatted_as_numpy after write_array2d | 0.007856 / 0.007986 (-0.000130) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.004821 / 0.004328 (0.000493) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.009264 / 0.004250 (0.005014) |
| read_col_unformated after write_array2d | 0.043346 / 0.037052 (0.006294) |
| read_col_unformated after write_flattened_sequence | 0.280910 / 0.258489 (0.022421) |
| read_col_unformated after write_nested_sequence | 0.324817 / 0.293841 (0.030976) |
| read_formatted_as_numpy after write_array2d | 0.024437 / 0.128546 (-0.104110) |
| read_formatted_as_numpy after write_flattened_sequence | 0.008417 / 0.075646 (-0.067229) |
| read_formatted_as_numpy after write_nested_sequence | 0.255198 / 0.419271 (-0.164074) |
| read_unformated after write_array2d | 0.047787 / 0.043533 (0.004254) |
| read_unformated after write_flattened_sequence | 0.290293 / 0.255139 (0.035154) |
| read_unformated after write_nested_sequence | 0.314677 / 0.283200 (0.031478) |
| write_array2d | 0.086350 / 0.141683 (-0.055333) |
| write_flattened_sequence | 1.717138 / 1.452155 (0.264984) |
| write_nested_sequence | 1.809649 / 1.492716 (0.316933) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.324396 / 0.018006 (0.306389) |
| get_batch_of_1024_rows | 0.438402 / 0.000490 (0.437912) |
| get_first_row | 0.046491 / 0.000200 (0.046291) |
| get_last_row | 0.000403 / 0.000054 (0.000349) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.034565 / 0.037411 (-0.002847) |
| shard | 0.021114 / 0.014526 (0.006588) |
| shuffle | 0.026075 / 0.176557 (-0.150482) |
| sort | 0.123443 / 0.737135 (-0.613692) |
| train_test_split | 0.026923 / 0.296338 (-0.269415) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.418453 / 0.215209 (0.203244) |
| read 50000 | 4.141372 / 2.077655 (2.063718) |
| read_batch 50000 10 | 1.811950 / 1.504120 (0.307830) |
| read_batch 50000 100 | 1.599226 / 1.541195 (0.058032) |
| read_batch 50000 1000 | 1.608073 / 1.468490 (0.139583) |
| read_formatted numpy 5000 | 0.375272 / 4.584777 (-4.209505) |
| read_formatted pandas 5000 | 4.782414 / 3.745712 (1.036702) |
| read_formatted tensorflow 5000 | 0.883024 / 5.269862 (-4.386838) |
| read_formatted torch 5000 | 0.829629 / 4.565676 (-3.736048) |
| read_formatted_batch numpy 5000 10 | 0.041051 / 0.424275 (-0.383224) |
| read_formatted_batch numpy 5000 1000 | 0.004829 / 0.007607 (-0.002778) |
| shuffled read 5000 | 0.521300 / 0.226044 (0.295256) |
| shuffled read 50000 | 5.217215 / 2.268929 (2.948286) |
| shuffled read_batch 50000 10 | 2.215123 / 55.444624 (-53.229501) |
| shuffled read_batch 50000 100 | 1.846728 / 6.876477 (-5.029749) |
| shuffled read_batch 50000 1000 | 1.857763 / 2.142072 (-0.284310) |
| shuffled read_formatted numpy 5000 | 0.481495 / 4.805227 (-4.323732) |
| shuffled read_formatted_batch numpy 5000 10 | 0.102269 / 6.500664 (-6.398395) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.051324 / 0.075469 (-0.024145) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.537262 / 1.841788 (-0.304526) |
| map fast-tokenizer batched | 12.621657 / 8.074308 (4.547349) |
| map identity | 26.272518 / 10.191392 (16.081126) |
| map identity batched | 0.746616 / 0.680424 (0.066193) |
| map no-op batched | 0.515982 / 0.534201 (-0.018219) |
| map no-op batched numpy | 0.226246 / 0.579283 (-0.353037) |
| map no-op batched pandas | 0.505247 / 0.434364 (0.070883) |
| map no-op batched pytorch | 0.183750 / 0.540337 (-0.356587) |
| map no-op batched tensorflow | 0.198922 / 1.386936 (-1.188014) |

