Merge branch 'main' into sharded-save_to_disk

huggingface · Dec 8, 2022 · 16bb14c · 16bb14c · github-actions · Dec 8, 2022
2 parents 24e24bf + 45508f7
commit 16bb14c

Showing 48 changed files with 5,935 additions and 5,949 deletions.
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -37,7 +37,7 @@ jobs:
       matrix:
         test: ['unit', 'integration']
         os: [ubuntu-latest, windows-latest]
-        deps_versions: [latest, minimum]
+        deps_versions: [deps-latest, deps-minimum]
     continue-on-error: ${{ matrix.test == 'integration' }}
     runs-on: ${{ matrix.os }}
     steps:
@@ -52,25 +52,23 @@ jobs:
       - name: Set up Python 3.7
         uses: actions/setup-python@v4
         with:
-          python-version: 3.7
+          python-version: "3.7"
       - name: Upgrade pip
         run: python -m pip install --upgrade pip
       - name: Pin setuptools-scm
         if: ${{ matrix.os == 'ubuntu-latest' }}
         run: echo "installing pinned version of setuptools-scm to fix seqeval installation on 3.7" && pip install "setuptools-scm==6.4.2"
       - name: Install dependencies
         run: |
-          pip install .[tests]
+          pip install .[tests,metrics-tests]
           pip install -r additional-tests-requirements.txt --no-deps
           python -m spacy download en_core_web_sm
           python -m spacy download fr_core_news_sm
       - name: Install dependencies (latest versions)
-        if: ${{ matrix.deps_versions == 'latest' }}
-        run: |
-          pip uninstall -y apache-beam
-          pip install --upgrade pyarrow huggingface-hub dill
-      - name: Install dependencies (minimum versions)
-        if: ${{ matrix.deps_versions != 'latest' }}
+        if: ${{ matrix.deps_versions == 'deps-latest' }}
+        run: pip install --upgrade pyarrow huggingface-hub dill
+      - name: Install dependencies (minimum versions)
+        if: ${{ matrix.deps_versions != 'deps-latest' }}
         run: pip install pyarrow==6.0.1 huggingface-hub==0.2.0 transformers dill==0.3.1.1
       - name: Test with pytest
         run: |
@@ -85,3 +83,43 @@ jobs:
         if: ${{ matrix.os == 'ubuntu-latest' }}
         run: |
           python -m pytest -rfExX -m torchaudio_latest -n 2 --dist loadfile -sv ./tests/features/test_audio.py
+
+  test_py310:
+    needs: check_code_quality
+    strategy:
+      matrix:
+        test: ['unit']
+        os: [ubuntu-latest, windows-latest]
+        deps_versions: [deps-latest]
+    continue-on-error: false
+    runs-on: ${{ matrix.os }}
+    steps:
+      - name: Install OS dependencies
+        if: ${{ matrix.os == 'ubuntu-latest' }}
+        run: |
+          sudo apt-get -y update
+          sudo apt-get -y install libsndfile1 sox
+      - uses: actions/checkout@v3
+        with:
+          fetch-depth: 0
+      - name: Set up Python 3.10
+        uses: actions/setup-python@v4
+        with:
+          python-version: "3.10"
+      - name: Upgrade pip
+        run: python -m pip install --upgrade pip
+      - name: Install dependencies
+        run: pip install .[tests]
+      - name: Test with pytest
+        run: |
+          python -m pytest -rfExX -m ${{ matrix.test }} -n 2 --dist loadfile -sv ./tests/
+      - name: Install dependencies to test torchaudio>=0.12 on Ubuntu
+        if: ${{ matrix.os == 'ubuntu-latest' }}
+        run: |
+          pip uninstall -y torchaudio torch
+          pip install "torchaudio>=0.12"
+          sudo apt-get -y install ffmpeg
+      - name: Test torchaudio>=0.12 on Ubuntu
+        if: ${{ matrix.os == 'ubuntu-latest' }}
+        run: |
+          python -m pytest -rfExX -m torchaudio_latest -n 2 --dist loadfile -sv ./tests/features/test_audio.py
diff --git a/README.md b/README.md
@@ -27,8 +27,8 @@
 
 🤗 Datasets is a lightweight library providing **two** main features:
 
-- **one-line dataloaders for many public datasets**: one-liners to download and pre-process any of the ![number of datasets](https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen) major public datasets (text datasets in 467 languages and dialects, image datasets, audio datasets, etc.) provided on the [HuggingFace Datasets Hub](https://huggingface.co/datasets). With a simple command like `squad_dataset = load_dataset("squad")`, get any of these datasets ready to use in a dataloader for training/evaluating a ML model (Numpy/Pandas/PyTorch/TensorFlow/JAX),
-- **efficient data pre-processing**: simple, fast and reproducible data pre-processing for the above public datasets as well as your own local datasets in CSV/JSON/text/PNG/JPEG/etc. With simple commands like `processed_dataset = dataset.map(process_example)`, efficiently prepare the dataset for inspection and ML model evaluation and training.
+- **one-line dataloaders for many public datasets**: one-liners to download and pre-process any of the ![number of datasets](https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen) major public datasets (image datasets, audio datasets, text datasets in 467 languages and dialects, etc.) provided on the [HuggingFace Datasets Hub](https://huggingface.co/datasets). With a simple command like `squad_dataset = load_dataset("squad")`, get any of these datasets ready to use in a dataloader for training/evaluating a ML model (Numpy/Pandas/PyTorch/TensorFlow/JAX),
+- **efficient data pre-processing**: simple, fast and reproducible data pre-processing for the public datasets as well as your own local datasets in CSV, JSON, text, PNG, JPEG, WAV, MP3, Parquet, etc. With simple commands like `processed_dataset = dataset.map(process_example)`, efficiently prepare the dataset for inspection and ML model evaluation and training.
 
 [🎓 **Documentation**](https://huggingface.co/docs/datasets/) [🕹 **Colab tutorial**](https://colab.research.google.com/github/huggingface/datasets/blob/main/notebooks/Overview.ipynb)
 
@@ -46,6 +46,8 @@
 - Smart caching: never wait for your data to process several times.
 - Lightweight and fast with a transparent and pythonic API (multi-processing/caching/memory-mapping).
 - Built-in interoperability with NumPy, pandas, PyTorch, Tensorflow 2 and JAX.
+- Native support for audio and image data.
+- Enable streaming mode to save disk space and start iterating over the dataset immediately.
 
 🤗 Datasets originated from a fork of the awesome [TensorFlow Datasets](https://github.com/tensorflow/datasets) and the HuggingFace team want to deeply thank the TensorFlow Datasets team for building this amazing library. More details on the differences between 🤗 Datasets and `tfds` can be found in the section [Main differences between 🤗 Datasets and `tfds`](#main-differences-between--datasets-and-tfds).
 
@@ -108,13 +110,24 @@ tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
 tokenized_dataset = squad_dataset.map(lambda x: tokenizer(x['context']), batched=True)
 ```
 
+If your dataset is bigger than your disk or if you don't want to wait to download the data, you can use streaming:
+
+```python
+# If you want to use the dataset immediately and efficiently stream the data as you iterate over the dataset
+image_dataset = load_dataset('cifar100', streaming=True)
+for example in image_dataset["train"]:
+    break
+```
+
 For more details on using the library, check the quick start page in the documentation: https://huggingface.co/docs/datasets/quickstart.html and the specific pages on:
 
-- Loading a dataset https://huggingface.co/docs/datasets/loading
+- Loading a dataset: https://huggingface.co/docs/datasets/loading
 - What's in a Dataset: https://huggingface.co/docs/datasets/access
 - Processing data with 🤗 Datasets: https://huggingface.co/docs/datasets/process
-- Processing audio data: https://huggingface.co/docs/datasets/audio_process
-- Processing image data: https://huggingface.co/docs/datasets/image_process
+    - Processing audio data: https://huggingface.co/docs/datasets/audio_process
+    - Processing image data: https://huggingface.co/docs/datasets/image_process
+    - Processing text data: https://huggingface.co/docs/datasets/nlp_process
+- Streaming a dataset: https://huggingface.co/docs/datasets/stream
 - Writing your own dataset loading script: https://huggingface.co/docs/datasets/dataset_script
 - etc.
 
@@ -125,7 +138,9 @@ Another introduction to 🤗 Datasets is the tutorial on Google Colab here:
 
 We have a very detailed step-by-step guide to add a new dataset to the ![number of datasets](https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen) datasets already provided on the [HuggingFace Datasets Hub](https://huggingface.co/datasets).
 
-You will find [the step-by-step guide here](https://huggingface.co/docs/datasets/share.html) to add a dataset on the Hub.
+You can find:
+- [how to upload a dataset to the Hub using your web browser or Python](https://huggingface.co/docs/datasets/upload_dataset) and also
+- [how to upload it using Git](https://huggingface.co/docs/datasets/share).
 
 # Main differences between 🤗 Datasets and `tfds`
 
@@ -140,7 +155,11 @@ If you are familiar with the great TensorFlow Datasets, here are the main differ
 
 Similar to TensorFlow Datasets, 🤗 Datasets is a utility library that downloads and prepares public datasets. We do not host or distribute most of these datasets, vouch for their quality or fairness, or claim that you have license to use them. It is your responsibility to determine whether you have permission to use the dataset under the dataset's license.
 
-If you're a dataset owner and wish to update any part of it (description, citation, etc.), or do not want your dataset to be included in this library, please get in touch through a [GitHub issue](https://github.com/huggingface/datasets/issues/new). Thanks for your contribution to the ML community!
+Moreover, 🤗 Datasets may run Python code defined by the dataset authors to parse certain data formats or structures. For security reasons, we ask users to:
+- check the dataset scripts they're going to run beforehand and
+- pin the `revision` of the repositories they use.
+
+If you're a dataset owner and wish to update any part of it (description, citation, license, etc.), or do not want your dataset to be included in the Hugging Face Hub, please get in touch by opening a discussion or a pull request in the Community tab of the dataset page. Thanks for your contribution to the ML community!
 
 ## BibTeX
 

diff --git a/docs/source/audio_dataset.mdx b/docs/source/audio_dataset.mdx
@@ -10,11 +10,11 @@ dataset = load_dataset("<username>/my_dataset")
 
 There are several methods for creating and sharing an audio dataset:
 
-1. Create an audio dataset from local files in python with [`Dataset.push_to_hub`]. This is an easy way that requires only a few steps in python.
+* Create an audio dataset from local files in python with [`Dataset.push_to_hub`]. This is an easy way that requires only a few steps in python.
 
-1. Create an audio dataset repository with the `AudioFolder` builder. This is a no-code solution for quickly creating small dataset to experiment with.
+* Create an audio dataset repository with the `AudioFolder` builder. This is a no-code solution for quickly creating an audio dataset with several thousand audio files.
 
-1. Create an audio dataset by writing a loading script. This method is for advanced users and requires more effort and coding, but you have greater flexibility over how a dataset is defined, downloaded, and generated.
+* Create an audio dataset by writing a loading script. This method is for advanced users and requires more effort and coding, but you have greater flexibility over how a dataset is defined, downloaded, and generated, which can be useful for more complex or large-scale audio datasets.
 
 
 <Tip>
@@ -53,7 +53,7 @@ my_dataset/
 
 ## AudioFolder
 
-The `AudioFolder` is a dataset builder designed to quickly load an audio dataset without requiring you to write any code.
+The `AudioFolder` is a dataset builder designed to quickly load an audio dataset with several thousand audio files without requiring you to write any code.
 Any additional information about your dataset - such as transcription, speaker accent, or speaker intent - is automatically loaded by `AudioFolder` as long as you include this information in a metadata file (`metadata.csv`/`metadata.jsonl`). 
 
 Create a dataset repository on the Hugging Face Hub and upload your dataset directory following the `AudioFolder` structure:
@@ -192,9 +192,9 @@ This directory structure allows your dataset to be loaded in one line:
 ```
 
 This guide will show you how to create a dataset loading script for audio datasets, which is a bit different from <a class="underline decoration-green-400 decoration-2 font-semibold" href="./dataset_script">creating a loading script for text datasets</a>.
-Audio datasets are commonly stored in `tar.gz` archives which requires a particular approach to support streaming mode. While streaming is not required, we highly encourage enabling streaming support in your audio dataset because:
+Audio datasets are commonly stored in `tar.gz` archives which requires a particular approach to support streaming mode. While streaming is not required, we highly encourage implementing streaming support in your audio dataset because:
 
-1. Users without a lot of disk space can use your dataset without waiting for the entire dataset to be downloaded. Learn more about streaming in the [Stream](./stream) guide!
+1. Users without a lot of disk space can use your dataset without downloading it. Learn more about streaming in the [Stream](./stream) guide!
 2. Users can preview a dataset in the dataset viewer.
 
 Here is an example using TAR archives:
@@ -367,7 +367,7 @@ def _info(self):
 
 Now that you've added some information about your dataset, the next step is to download the dataset and define the splits.
 
-1. Use the [`~DownloadManager.download`] method to download metadata file at `_PROMPTS_URLS` and audio TAR archive at `_DATA_URL`. This method returns the path to the local file/archive. In streaming mode, it returns a URL to stream the data from. This method accepts:
+1. Use the [`~DownloadManager.download`] method to download the metadata file at `_PROMPTS_URLS` and the audio TAR archive at `_DATA_URL`. This method returns the path to the local file/archive. In streaming mode, it doesn't download the file(s) and just returns a URL to stream the data from. This method accepts:
 
     * a relative path to a file inside a Hub dataset repository (for example, in the `data/` folder)
     * a URL to a file hosted somewhere else
@@ -504,21 +504,15 @@ To explain how to do the extraction in a way that it also supports streaming, we
    local_extracted_archive = dl_manager.extract(audio_path) if not dl_manager.is_streaming else None
    ```
 
-3. Use the [`~DownloadManager.iter_archive`] method to iterate over the archive at `audio_path` after it's downloaded, just like in the Vivos example above. [`~DownloadManager.iter_archive`] doesn't provide any information about the full paths of files from the archive, even if it has been extracted. As a result, you need to pass the `local_extracted_archive` path to the next step in `gen_kwargs`, in order to preserve information about where the archive was extracted to. This is required to construct the correct paths to the local files when you generate the examples.
+3. Use the [`~DownloadManager.iter_archive`] method to iterate over the archive at `audio_path`, just like in the Vivos example above. [`~DownloadManager.iter_archive`] doesn't provide any information about the full paths of files from the archive, even if it has been extracted. As a result, you need to pass the `local_extracted_archive` path to the next step in `gen_kwargs`, in order to preserve information about where the archive was extracted to. This is required to construct the correct paths to the local files when you generate the examples.
 
-<Tip>
+<Tip warning={true}>
 
-The reason you need to use a combination of [`~DownloadManager.download`] and [`~DownloadManager.iter_archive`] is because data in TAR archives can't be accessed directly from their paths. Instead, you'll need to download it first and then sequentially iterate over the files within the archive!
+The reason you need to use a combination of [`~DownloadManager.download`] and [`~DownloadManager.iter_archive`] is that files in TAR archives can't be accessed directly by their paths. Instead, you'll need to iterate over the files within the archive! You can use [`~DownloadManager.download_and_extract`] and [`~DownloadManager.extract`] with TAR archives only in non-streaming mode; in streaming mode, they throw an error.
 
 </Tip>
 
-4. Use the [`~DownloadManager.download_and_extract`] method to download the metadata file specified in `_METADATA_URL`. This method returns a path to a local file in non-streaming mode. In streaming mode, it opens the file at the URL remotely and returns this URL.
-
-<Tip>
-
-    You can use [`~DownloadManager.download_and_extract`] to download and extract TAR archives too, but this method, as well as [`~DownloadManager.extract`], would throw an error if you run [`~Datasets.load_dataset`] in streaming mode, i.e. with `streaming=True`. This is the reason you need to use a combination of [`~DownloadManager.download`] and [`~DownloadManager.iter_archive`]. Files in TAR archives can't be accessed directly by their paths. Instead, you have to sequentially iterate over the files within the archive to find a specific file.
-
-</Tip>
+4. Use the [`~DownloadManager.download_and_extract`] method to download the metadata file specified in `_METADATA_URL`. This method returns a path to a local file in non-streaming mode. In streaming mode, it doesn't download the file locally and returns the same URL.
 
 5. Now use the [`SplitGenerator`] to organize the audio files and metadata in each split. Name each split with a standard name like: `Split.TRAIN`, `Split.TEST`, and `Split.VALIDATION`.
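
Note on the TAR tip in the hunk above: the point that files in a TAR archive can only be reached by iterating over the archive can be illustrated with the standard library alone. The sketch below is not the `datasets` implementation of `iter_archive`; `build_demo_archive` and `iter_tar_members` are hypothetical helpers, and the two fake `.wav` payloads are made-up demo data. It only assumes that a streamed archive is read front to back, yielding one `(path, file object)` pair at a time.

```python
import io
import tarfile


def build_demo_archive() -> bytes:
    """Create a small in-memory tar.gz with two fake audio files (demo data only)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for name, payload in [("clips/a.wav", b"AAAA"), ("clips/b.wav", b"BBBBBB")]:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def iter_tar_members(fileobj):
    """Yield (path_inside_archive, file_object) pairs in archive order.

    Mode "r|*" opens the archive as a forward-only stream, so a member
    must be read before the loop advances to the next one.
    """
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:
        for member in tar:
            if member.isfile():
                yield member.name, tar.extractfile(member)


archive_bytes = build_demo_archive()
# Read each member as soon as it is yielded; there is no random access by path.
sizes = {path: len(f.read()) for path, f in iter_tar_members(io.BytesIO(archive_bytes))}
print(sizes)
```

Because the stream is forward-only, looking up one specific file means walking the members until its path appears, which is why a loading script matches metadata to audio files while iterating rather than opening files by path.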
 

diff --git a/docs/source/audio_load.mdx b/docs/source/audio_load.mdx
@@ -24,7 +24,7 @@ You can load your own dataset using the paths to your audio files. Use the [`~Da
 
 ## AudioFolder
 
-You can also load a dataset with an `AudioFolder` dataset builder. It does not require writing a custom dataloader, making it useful for quickly loading audio data.
+You can also load a dataset with an `AudioFolder` dataset builder. It does not require writing a custom dataloader, making it useful for quickly creating and loading audio datasets with several thousand audio files.
 
 ## AudioFolder with metadata
 

diff --git a/docs/source/image_dataset.mdx b/docs/source/image_dataset.mdx
@@ -2,8 +2,8 @@
 
 There are two methods for creating and sharing an image dataset. This guide will show you how to:
 
-* Create an image dataset with `ImageFolder` and some metadata. This is a no-code solution for quickly creating an image dataset. 
-* Create an image dataset by writing a loading script. This method is a bit more involved, but you have greater flexibility over how a dataset is defined, downloaded, and generated.
+* Create an image dataset with `ImageFolder` and some metadata. This is a no-code solution for quickly creating an image dataset with several thousand images.
+* Create an image dataset by writing a loading script. This method is a bit more involved, but you have greater flexibility over how a dataset is defined, downloaded, and generated, which can be useful for more complex or large-scale image datasets.
 
 <Tip>
 
@@ -13,7 +13,7 @@ You can control access to your dataset by requiring users to share their contact
 
 ## ImageFolder
 
-The `ImageFolder` is a dataset builder designed to quickly load an image dataset without requiring you to write any code. `ImageFolder` automatically infers the class labels of your dataset based on the directory name. Just store your dataset in a directory structure like:
+The `ImageFolder` is a dataset builder designed to quickly load an image dataset with several thousand images without requiring you to write any code. `ImageFolder` automatically infers the class labels of your dataset based on the directory name. Just store your dataset in a directory structure like:
 
 ```
 folder/train/dog/golden_retriever.png