Merge branch 'main' into sharded-save_to_disk
lhoestq committed Dec 8, 2022
2 parents 24e24bf + 45508f7 commit 16bb14c
Showing 48 changed files with 5,935 additions and 5,949 deletions.
56 changes: 47 additions & 9 deletions .github/workflows/ci.yml
@@ -37,7 +37,7 @@ jobs:
matrix:
test: ['unit', 'integration']
os: [ubuntu-latest, windows-latest]
deps_versions: [latest, minimum]
deps_versions: [deps-latest, deps-minimum]
continue-on-error: ${{ matrix.test == 'integration' }}
runs-on: ${{ matrix.os }}
steps:
@@ -52,25 +52,23 @@ jobs:
- name: Set up Python 3.7
uses: actions/setup-python@v4
with:
python-version: 3.7
python-version: "3.7"
- name: Upgrade pip
run: python -m pip install --upgrade pip
- name: Pin setuptools-scm
if: ${{ matrix.os == 'ubuntu-latest' }}
run: echo "installing pinned version of setuptools-scm to fix seqeval installation on 3.7" && pip install "setuptools-scm==6.4.2"
- name: Install dependencies
run: |
pip install .[tests]
pip install .[tests,metrics-tests]
pip install -r additional-tests-requirements.txt --no-deps
python -m spacy download en_core_web_sm
python -m spacy download fr_core_news_sm
- name: Install dependencies (latest versions)
if: ${{ matrix.deps_versions == 'latest' }}
run: |
pip uninstall -y apache-beam
pip install --upgrade pyarrow huggingface-hub dill
- name: Install dependencies (minimum versions)
if: ${{ matrix.deps_versions != 'latest' }}
if: ${{ matrix.deps_versions == 'deps-latest' }}
run: pip install --upgrade pyarrow huggingface-hub dill
- name: Install dependencies (minimum versions)
if: ${{ matrix.deps_versions != 'deps-latest' }}
run: pip install pyarrow==6.0.1 huggingface-hub==0.2.0 transformers dill==0.3.1.1
- name: Test with pytest
run: |
@@ -85,3 +83,43 @@ jobs:
if: ${{ matrix.os == 'ubuntu-latest' }}
run: |
python -m pytest -rfExX -m torchaudio_latest -n 2 --dist loadfile -sv ./tests/features/test_audio.py
test_py310:
needs: check_code_quality
strategy:
matrix:
test: ['unit']
os: [ubuntu-latest, windows-latest]
deps_versions: [deps-latest]
continue-on-error: false
runs-on: ${{ matrix.os }}
steps:
- name: Install OS dependencies
if: ${{ matrix.os == 'ubuntu-latest' }}
run: |
sudo apt-get -y update
sudo apt-get -y install libsndfile1 sox
- uses: actions/checkout@v3
with:
fetch-depth: 0
- name: Set up Python 3.10
uses: actions/setup-python@v4
with:
python-version: "3.10"
- name: Upgrade pip
run: python -m pip install --upgrade pip
- name: Install dependencies
run: pip install .[tests]
- name: Test with pytest
run: |
python -m pytest -rfExX -m ${{ matrix.test }} -n 2 --dist loadfile -sv ./tests/
- name: Install dependencies to test torchaudio>=0.12 on Ubuntu
if: ${{ matrix.os == 'ubuntu-latest' }}
run: |
pip uninstall -y torchaudio torch
pip install "torchaudio>=0.12"
sudo apt-get -y install ffmpeg
- name: Test torchaudio>=0.12 on Ubuntu
if: ${{ matrix.os == 'ubuntu-latest' }}
run: |
python -m pytest -rfExX -m torchaudio_latest -n 2 --dist loadfile -sv ./tests/features/test_audio.py
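
GitHub Actions runs one job per combination of the axes in a `strategy.matrix`. The sketch below expands the two matrices from the workflow above (using the renamed `deps-latest` / `deps-minimum` values) to show how many jobs each one produces. The names `main_test_matrix`, `py310_matrix` and `expand_matrix` are hypothetical helpers for illustration, not part of the workflow.

```python
from itertools import product

# Matrix axes copied from the workflow above (values after this commit).
main_test_matrix = {
    "test": ["unit", "integration"],
    "os": ["ubuntu-latest", "windows-latest"],
    "deps_versions": ["deps-latest", "deps-minimum"],
}
py310_matrix = {
    "test": ["unit"],
    "os": ["ubuntu-latest", "windows-latest"],
    "deps_versions": ["deps-latest"],
}


def expand_matrix(matrix):
    """Return one dict per job: the Cartesian product of all matrix axes."""
    keys = list(matrix)
    return [dict(zip(keys, combo)) for combo in product(*(matrix[k] for k in keys))]


main_jobs = expand_matrix(main_test_matrix)
py310_jobs = expand_matrix(py310_matrix)

for job in main_jobs:
    # Mirrors `continue-on-error: ${{ matrix.test == 'integration' }}`.
    job["continue_on_error"] = job["test"] == "integration"

print(len(main_jobs), len(py310_jobs))
```

Under these assumptions, the Python 3.7 matrix yields eight jobs, of which the four integration jobs are allowed to fail, while the new `test_py310` matrix yields two.
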
33 changes: 26 additions & 7 deletions README.md
@@ -27,8 +27,8 @@

🤗 Datasets is a lightweight library providing **two** main features:

- **one-line dataloaders for many public datasets**: one-liners to download and pre-process any of the ![number of datasets](https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen) major public datasets (text datasets in 467 languages and dialects, image datasets, audio datasets, etc.) provided on the [HuggingFace Datasets Hub](https://huggingface.co/datasets). With a simple command like `squad_dataset = load_dataset("squad")`, get any of these datasets ready to use in a dataloader for training/evaluating a ML model (Numpy/Pandas/PyTorch/TensorFlow/JAX),
- **efficient data pre-processing**: simple, fast and reproducible data pre-processing for the above public datasets as well as your own local datasets in CSV/JSON/text/PNG/JPEG/etc. With simple commands like `processed_dataset = dataset.map(process_example)`, efficiently prepare the dataset for inspection and ML model evaluation and training.
- **one-line dataloaders for many public datasets**: one-liners to download and pre-process any of the ![number of datasets](https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen) major public datasets (image datasets, audio datasets, text datasets in 467 languages and dialects, etc.) provided on the [HuggingFace Datasets Hub](https://huggingface.co/datasets). With a simple command like `squad_dataset = load_dataset("squad")`, get any of these datasets ready to use in a dataloader for training/evaluating an ML model (Numpy/Pandas/PyTorch/TensorFlow/JAX),
- **efficient data pre-processing**: simple, fast and reproducible data pre-processing for the public datasets as well as your own local datasets in CSV, JSON, text, PNG, JPEG, WAV, MP3, Parquet, etc. With simple commands like `processed_dataset = dataset.map(process_example)`, efficiently prepare the dataset for inspection and ML model evaluation and training.

[🎓 **Documentation**](https://huggingface.co/docs/datasets/) [🕹 **Colab tutorial**](https://colab.research.google.com/github/huggingface/datasets/blob/main/notebooks/Overview.ipynb)

@@ -46,6 +46,8 @@
- Smart caching: never wait for your data to process several times.
- Lightweight and fast with a transparent and pythonic API (multi-processing/caching/memory-mapping).
- Built-in interoperability with NumPy, pandas, PyTorch, Tensorflow 2 and JAX.
- Native support for audio and image data.
- Enable streaming mode to save disk space and start iterating over the dataset immediately.

🤗 Datasets originated from a fork of the awesome [TensorFlow Datasets](https://github.com/tensorflow/datasets) and the HuggingFace team want to deeply thank the TensorFlow Datasets team for building this amazing library. More details on the differences between 🤗 Datasets and `tfds` can be found in the section [Main differences between 🤗 Datasets and `tfds`](#main-differences-between--datasets-and-tfds).

@@ -108,13 +110,24 @@ tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
tokenized_dataset = squad_dataset.map(lambda x: tokenizer(x['context']), batched=True)
```

If your dataset is bigger than your disk or if you don't want to wait to download the data, you can use streaming:

```python
# If you want to use the dataset immediately and efficiently stream the data as you iterate over the dataset
image_dataset = load_dataset('cifar100', streaming=True)
for example in image_dataset["train"]:
break
```

For more details on using the library, check the quick start page in the documentation: https://huggingface.co/docs/datasets/quickstart.html and the specific pages on:

- Loading a dataset https://huggingface.co/docs/datasets/loading
- Loading a dataset: https://huggingface.co/docs/datasets/loading
- What's in a Dataset: https://huggingface.co/docs/datasets/access
- Processing data with 🤗 Datasets: https://huggingface.co/docs/datasets/process
- Processing audio data: https://huggingface.co/docs/datasets/audio_process
- Processing image data: https://huggingface.co/docs/datasets/image_process
- Processing audio data: https://huggingface.co/docs/datasets/audio_process
- Processing image data: https://huggingface.co/docs/datasets/image_process
- Processing text data: https://huggingface.co/docs/datasets/nlp_process
- Streaming a dataset: https://huggingface.co/docs/datasets/stream
- Writing your own dataset loading script: https://huggingface.co/docs/datasets/dataset_script
- etc.

@@ -125,7 +138,9 @@ Another introduction to 🤗 Datasets is the tutorial on Google Colab here:

We have a very detailed step-by-step guide to add a new dataset to the ![number of datasets](https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen) datasets already provided on the [HuggingFace Datasets Hub](https://huggingface.co/datasets).

You will find [the step-by-step guide here](https://huggingface.co/docs/datasets/share.html) to add a dataset on the Hub.
You can find:
- [how to upload a dataset to the Hub using your web browser or Python](https://huggingface.co/docs/datasets/upload_dataset) and also
- [how to upload it using Git](https://huggingface.co/docs/datasets/share).

# Main differences between 🤗 Datasets and `tfds`

@@ -140,7 +155,11 @@ If you are familiar with the great TensorFlow Datasets, here are the main differ

Similar to TensorFlow Datasets, 🤗 Datasets is a utility library that downloads and prepares public datasets. We do not host or distribute most of these datasets, vouch for their quality or fairness, or claim that you have license to use them. It is your responsibility to determine whether you have permission to use the dataset under the dataset's license.

If you're a dataset owner and wish to update any part of it (description, citation, etc.), or do not want your dataset to be included in this library, please get in touch through a [GitHub issue](https://github.com/huggingface/datasets/issues/new). Thanks for your contribution to the ML community!
Moreover, 🤗 Datasets may run Python code defined by the dataset authors to parse certain data formats or structures. For security reasons, we ask users to:
- check the dataset scripts they're going to run beforehand and
- pin the `revision` of the repositories they use.

If you're a dataset owner and wish to update any part of it (description, citation, license, etc.), or do not want your dataset to be included in the Hugging Face Hub, please get in touch by opening a discussion or a pull request in the Community tab of the dataset page. Thanks for your contribution to the ML community!

## BibTeX

28 changes: 11 additions & 17 deletions docs/source/audio_dataset.mdx
@@ -10,11 +10,11 @@ dataset = load_dataset("<username>/my_dataset")

There are several methods for creating and sharing an audio dataset:

1. Create an audio dataset from local files in python with [`Dataset.push_to_hub`]. This is an easy way that requires only a few steps in python.
* Create an audio dataset from local files in Python with [`Dataset.push_to_hub`]. This is an easy way that requires only a few steps in Python.

1. Create an audio dataset repository with the `AudioFolder` builder. This is a no-code solution for quickly creating small dataset to experiment with.
* Create an audio dataset repository with the `AudioFolder` builder. This is a no-code solution for quickly creating an audio dataset with several thousand audio files.

1. Create an audio dataset by writing a loading script. This method is for advanced users and requires more effort and coding, but you have greater flexibility over how a dataset is defined, downloaded, and generated.
* Create an audio dataset by writing a loading script. This method is for advanced users and requires more effort and coding, but you have greater flexibility over how a dataset is defined, downloaded, and generated, which can be useful for more complex or large-scale audio datasets.


<Tip>
@@ -53,7 +53,7 @@ my_dataset/

## AudioFolder

The `AudioFolder` is a dataset builder designed to quickly load an audio dataset without requiring you to write any code.
The `AudioFolder` is a dataset builder designed to quickly load an audio dataset with several thousand audio files without requiring you to write any code.
Any additional information about your dataset - such as transcription, speaker accent, or speaker intent - is automatically loaded by `AudioFolder` as long as you include this information in a metadata file (`metadata.csv`/`metadata.jsonl`).
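
As a minimal sketch of such a metadata file, the snippet below writes a `metadata.csv` with Python's standard `csv` module. It assumes the `file_name` column convention described in the `AudioFolder` guide (not shown in this excerpt) for linking each row to an audio file; the `transcription` column, the directory name and the file names are hypothetical values chosen for illustration.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical dataset directory, created in a temporary location.
dataset_dir = Path(tempfile.mkdtemp()) / "my_dataset"
(dataset_dir / "data").mkdir(parents=True)

# One row per audio file. `file_name` is assumed to be the column that links
# a row to its audio file; `transcription` is an illustrative extra column.
rows = [
    {"file_name": "data/first_audio_file.mp3", "transcription": "hello world"},
    {"file_name": "data/second_audio_file.mp3", "transcription": "good morning"},
]

metadata_path = dataset_dir / "metadata.csv"
with metadata_path.open("w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["file_name", "transcription"])
    writer.writeheader()
    writer.writerows(rows)

print(metadata_path.read_text(encoding="utf-8"))
```

The audio files themselves are not created here; only the metadata layout is shown.
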

Create a dataset repository on the Hugging Face Hub and upload your dataset directory following the `AudioFolder` structure:
@@ -192,9 +192,9 @@ This directory structure allows your dataset to be loaded in one line:
```

This guide will show you how to create a dataset loading script for audio datasets, which is a bit different from <a class="underline decoration-green-400 decoration-2 font-semibold" href="./dataset_script">creating a loading script for text datasets</a>.
Audio datasets are commonly stored in `tar.gz` archives which requires a particular approach to support streaming mode. While streaming is not required, we highly encourage enabling streaming support in your audio dataset because:
Audio datasets are commonly stored in `tar.gz` archives, which requires a particular approach to support streaming mode. While streaming is not required, we highly encourage implementing streaming support in your audio dataset because:

1. Users without a lot of disk space can use your dataset without waiting for the entire dataset to be downloaded. Learn more about streaming in the [Stream](./stream) guide!
1. Users without a lot of disk space can use your dataset without downloading it. Learn more about streaming in the [Stream](./stream) guide!
2. Users can preview a dataset in the dataset viewer.

Here is an example using TAR archives:
@@ -367,7 +367,7 @@ def _info(self):

Now that you've added some information about your dataset, the next step is to download the dataset and define the splits.

1. Use the [`~DownloadManager.download`] method to download metadata file at `_PROMPTS_URLS` and audio TAR archive at `_DATA_URL`. This method returns the path to the local file/archive. In streaming mode, it returns a URL to stream the data from. This method accepts:
1. Use the [`~DownloadManager.download`] method to download the metadata file at `_PROMPTS_URLS` and the audio TAR archive at `_DATA_URL`. This method returns the path to the local file/archive. In streaming mode, it doesn't download the file(s) and just returns a URL to stream the data from. This method accepts:

* a relative path to a file inside a Hub dataset repository (for example, in the `data/` folder)
* a URL to a file hosted somewhere else
@@ -504,21 +504,15 @@ To explain how to do the extraction in a way that it also supports streaming, we
local_extracted_archive = dl_manager.extract(audio_path) if not dl_manager.is_streaming else None
```

3. Use the [`~DownloadManager.iter_archive`] method to iterate over the archive at `audio_path` after it's downloaded, just like in the Vivos example above. [`~DownloadManager.iter_archive`] doesn't provide any information about the full paths of files from the archive, even if it has been extracted. As a result, you need to pass the `local_extracted_archive` path to the next step in `gen_kwargs`, in order to preserve information about where the archive was extracted to. This is required to construct the correct paths to the local files when you generate the examples.
3. Use the [`~DownloadManager.iter_archive`] method to iterate over the archive at `audio_path`, just like in the Vivos example above. [`~DownloadManager.iter_archive`] doesn't provide any information about the full paths of files from the archive, even if it has been extracted. As a result, you need to pass the `local_extracted_archive` path to the next step in `gen_kwargs`, in order to preserve information about where the archive was extracted to. This is required to construct the correct paths to the local files when you generate the examples.

<Tip>
<Tip warning={true}>

The reason you need to use a combination of [`~DownloadManager.download`] and [`~DownloadManager.iter_archive`] is because data in TAR archives can't be accessed directly from their paths. Instead, you'll need to download it first and then sequentially iterate over the files within the archive!
You need to combine [`~DownloadManager.download`] and [`~DownloadManager.iter_archive`] because files in TAR archives can't be accessed directly by their paths; instead, you have to iterate over the files within the archive. [`~DownloadManager.download_and_extract`] and [`~DownloadManager.extract`] work with TAR archives only in non-streaming mode; in streaming mode they throw an error.

</Tip>

4. Use the [`~DownloadManager.download_and_extract`] method to download the metadata file specified in `_METADATA_URL`. This method returns a path to a local file in non-streaming mode. In streaming mode, it opens the file at the URL remotely and returns this URL.

<Tip>

You can use [`~DownloadManager.download_and_extract`] to download and extract TAR archives too, but this method, as well as [`~DownloadManager.extract`], would throw an error if you run [`~Datasets.load_dataset`] in streaming mode, i.e. with `streaming=True`. This is the reason you need to use a combination of [`~DownloadManager.download`] and [`~DownloadManager.iter_archive`]. Files in TAR archives can't be accessed directly by their paths. Instead, you have to sequentially iterate over the files within the archive to find a specific file.

</Tip>
4. Use the [`~DownloadManager.download_and_extract`] method to download the metadata file specified in `_METADATA_URL`. This method returns a path to a local file in non-streaming mode. In streaming mode, it doesn't download the file locally and returns the same URL.

5. Now use the [`SplitGenerator`] to organize the audio files and metadata in each split. Name each split with a standard name like: `Split.TRAIN`, `Split.TEST`, and `Split.VALIDATION`.
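
The sequential-access constraint described in the steps above can be illustrated with Python's standard `tarfile` module. This is only a sketch under stated assumptions, not the library's `iter_archive` implementation: `build_demo_archive`, `iter_tar_members` and `resolve_path` are hypothetical helpers, and the archive contents and the extraction directory are made up. It shows members being visited one at a time as `(path, content)` pairs, and how a `local_extracted_archive` path, when available, can be joined with the in-archive path to obtain a full local path.

```python
import io
import os
import tarfile


def build_demo_archive():
    """Build a small in-memory TAR archive (stand-in for a downloaded one)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, payload in [("clips/a.wav", b"AAAA"), ("clips/b.wav", b"BBBBBB")]:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


def iter_tar_members(fileobj):
    """Yield (path inside the archive, file bytes), one member at a time.

    Mode "r|" reads the archive as a forward-only stream, so members can
    only be visited in order; there is no random access by path.
    """
    with tarfile.open(fileobj=fileobj, mode="r|") as tar:
        for member in tar:
            if member.isfile():
                yield member.name, tar.extractfile(member).read()


def resolve_path(path_in_archive, local_extracted_archive):
    """Return the full local path if the archive was extracted, else the relative path."""
    if local_extracted_archive:
        return os.path.join(local_extracted_archive, path_in_archive)
    return path_in_archive


# Streaming-like case: nothing was extracted, so only in-archive paths exist.
streaming_examples = [
    {"path": resolve_path(path, None), "num_bytes": len(data)}
    for path, data in iter_tar_members(build_demo_archive())
]

# Non-streaming case: a (hypothetical) extraction directory is known.
extracted_dir = os.path.join("cache", "extracted")
local_paths = [
    resolve_path(path, extracted_dir)
    for path, _ in iter_tar_members(build_demo_archive())
]

print(streaming_examples)
print(local_paths)
```

Passing `None` when streaming mirrors the `local_extracted_archive = ... if not dl_manager.is_streaming else None` pattern shown earlier in the guide.
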

2 changes: 1 addition & 1 deletion docs/source/audio_load.mdx
@@ -24,7 +24,7 @@ You can load your own dataset using the paths to your audio files. Use the [`~Da

## AudioFolder

You can also load a dataset with an `AudioFolder` dataset builder. It does not require writing a custom dataloader, making it useful for quickly loading audio data.
You can also load a dataset with an `AudioFolder` dataset builder. It does not require writing a custom dataloader, making it useful for quickly creating and loading audio datasets with several thousand audio files.

## AudioFolder with metadata

6 changes: 3 additions & 3 deletions docs/source/image_dataset.mdx
@@ -2,8 +2,8 @@

There are two methods for creating and sharing an image dataset. This guide will show you how to:

* Create an image dataset with `ImageFolder` and some metadata. This is a no-code solution for quickly creating an image dataset.
* Create an image dataset by writing a loading script. This method is a bit more involved, but you have greater flexibility over how a dataset is defined, downloaded, and generated.
* Create an image dataset with `ImageFolder` and some metadata. This is a no-code solution for quickly creating an image dataset with several thousand images.
* Create an image dataset by writing a loading script. This method is a bit more involved, but you have greater flexibility over how a dataset is defined, downloaded, and generated, which can be useful for more complex or large-scale image datasets.

<Tip>

@@ -13,7 +13,7 @@ You can control access to your dataset by requiring users to share their contact

## ImageFolder

The `ImageFolder` is a dataset builder designed to quickly load an image dataset without requiring you to write any code. `ImageFolder` automatically infers the class labels of your dataset based on the directory name. Just store your dataset in a directory structure like:
The `ImageFolder` is a dataset builder designed to quickly load an image dataset with several thousand images without requiring you to write any code. `ImageFolder` automatically infers the class labels of your dataset based on the directory name. Just store your dataset in a directory structure like:

```
folder/train/dog/golden_retriever.png

1 comment on commit 16bb14c

@github-actions
Show benchmarks

PyArrow==6.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.009337 / 0.011353 (-0.002015) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.005082 / 0.011008 (-0.005926) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.098187 / 0.038508 (0.059679) |
| read_batch_unformated after write_array2d | 0.035545 / 0.023109 (0.012435) |
| read_batch_unformated after write_flattened_sequence | 0.299004 / 0.275898 (0.023106) |
| read_batch_unformated after write_nested_sequence | 0.352877 / 0.323480 (0.029397) |
| read_col_formatted_as_numpy after write_array2d | 0.007892 / 0.007986 (-0.000094) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.004041 / 0.004328 (-0.000287) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.075650 / 0.004250 (0.071400) |
| read_col_unformated after write_array2d | 0.044868 / 0.037052 (0.007815) |
| read_col_unformated after write_flattened_sequence | 0.312597 / 0.258489 (0.054108) |
| read_col_unformated after write_nested_sequence | 0.340515 / 0.293841 (0.046674) |
| read_formatted_as_numpy after write_array2d | 0.037524 / 0.128546 (-0.091023) |
| read_formatted_as_numpy after write_flattened_sequence | 0.011949 / 0.075646 (-0.063698) |
| read_formatted_as_numpy after write_nested_sequence | 0.333244 / 0.419271 (-0.086028) |
| read_unformated after write_array2d | 0.047109 / 0.043533 (0.003576) |
| read_unformated after write_flattened_sequence | 0.295913 / 0.255139 (0.040774) |
| read_unformated after write_nested_sequence | 0.315653 / 0.283200 (0.032454) |
| write_array2d | 0.108617 / 0.141683 (-0.033066) |
| write_flattened_sequence | 1.483163 / 1.452155 (0.031008) |
| write_nested_sequence | 1.515666 / 1.492716 (0.022950) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.202708 / 0.018006 (0.184702) |
| get_batch_of_1024_rows | 0.440728 / 0.000490 (0.440238) |
| get_first_row | 0.001022 / 0.000200 (0.000822) |
| get_last_row | 0.000080 / 0.000054 (0.000025) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.026173 / 0.037411 (-0.011238) |
| shard | 0.106830 / 0.014526 (0.092304) |
| shuffle | 0.116486 / 0.176557 (-0.060070) |
| sort | 0.155094 / 0.737135 (-0.582042) |
| train_test_split | 0.122726 / 0.296338 (-0.173613) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.400548 / 0.215209 (0.185339) |
| read 50000 | 3.992175 / 2.077655 (1.914520) |
| read_batch 50000 10 | 1.790516 / 1.504120 (0.286396) |
| read_batch 50000 100 | 1.599167 / 1.541195 (0.057972) |
| read_batch 50000 1000 | 1.646753 / 1.468490 (0.178263) |
| read_formatted numpy 5000 | 0.690536 / 4.584777 (-3.894241) |
| read_formatted pandas 5000 | 3.780077 / 3.745712 (0.034365) |
| read_formatted tensorflow 5000 | 3.239050 / 5.269862 (-2.030812) |
| read_formatted torch 5000 | 1.722358 / 4.565676 (-2.843318) |
| read_formatted_batch numpy 5000 10 | 0.083568 / 0.424275 (-0.340708) |
| read_formatted_batch numpy 5000 1000 | 0.011926 / 0.007607 (0.004319) |
| shuffled read 5000 | 0.512682 / 0.226044 (0.286637) |
| shuffled read 50000 | 5.064178 / 2.268929 (2.795249) |
| shuffled read_batch 50000 10 | 2.222592 / 55.444624 (-53.222033) |
| shuffled read_batch 50000 100 | 1.880327 / 6.876477 (-4.996150) |
| shuffled read_batch 50000 1000 | 1.977748 / 2.142072 (-0.164324) |
| shuffled read_formatted numpy 5000 | 0.834533 / 4.805227 (-3.970694) |
| shuffled read_formatted_batch numpy 5000 10 | 0.164521 / 6.500664 (-6.336143) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.062170 / 0.075469 (-0.013299) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.403466 / 1.841788 (-0.438322) |
| map fast-tokenizer batched | 14.714412 / 8.074308 (6.640103) |
| map identity | 11.914744 / 10.191392 (1.723352) |
| map identity batched | 0.789690 / 0.680424 (0.109266) |
| map no-op batched | 0.524933 / 0.534201 (-0.009268) |
| map no-op batched numpy | 0.430875 / 0.579283 (-0.148408) |
| map no-op batched pandas | 0.408122 / 0.434364 (-0.026242) |
| map no-op batched pytorch | 0.252650 / 0.540337 (-0.287688) |
| map no-op batched tensorflow | 0.250462 / 1.386936 (-1.136474) |

PyArrow==latest
Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.007481 / 0.011353 (-0.003872) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.005123 / 0.011008 (-0.005885) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.095380 / 0.038508 (0.056872) |
| read_batch_unformated after write_array2d | 0.033062 / 0.023109 (0.009953) |
| read_batch_unformated after write_flattened_sequence | 0.368551 / 0.275898 (0.092652) |
| read_batch_unformated after write_nested_sequence | 0.416462 / 0.323480 (0.092982) |
| read_col_formatted_as_numpy after write_array2d | 0.005703 / 0.007986 (-0.002283) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.003938 / 0.004328 (-0.000390) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.071322 / 0.004250 (0.067071) |
| read_col_unformated after write_array2d | 0.041011 / 0.037052 (0.003958) |
| read_col_unformated after write_flattened_sequence | 0.378818 / 0.258489 (0.120328) |
| read_col_unformated after write_nested_sequence | 0.431621 / 0.293841 (0.137780) |
| read_formatted_as_numpy after write_array2d | 0.036781 / 0.128546 (-0.091766) |
| read_formatted_as_numpy after write_flattened_sequence | 0.011855 / 0.075646 (-0.063791) |
| read_formatted_as_numpy after write_nested_sequence | 0.329296 / 0.419271 (-0.089976) |
| read_unformated after write_array2d | 0.047820 / 0.043533 (0.004287) |
| read_unformated after write_flattened_sequence | 0.370935 / 0.255139 (0.115796) |
| read_unformated after write_nested_sequence | 0.399883 / 0.283200 (0.116683) |
| write_array2d | 0.104981 / 0.141683 (-0.036702) |
| write_flattened_sequence | 1.460105 / 1.452155 (0.007950) |
| write_nested_sequence | 1.518015 / 1.492716 (0.025299) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.223805 / 0.018006 (0.205799) |
| get_batch_of_1024_rows | 0.444366 / 0.000490 (0.443877) |
| get_first_row | 0.001049 / 0.000200 (0.000849) |
| get_last_row | 0.000082 / 0.000054 (0.000027) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.028924 / 0.037411 (-0.008487) |
| shard | 0.111306 / 0.014526 (0.096781) |
| shuffle | 0.121257 / 0.176557 (-0.055299) |
| sort | 0.161741 / 0.737135 (-0.575395) |
| train_test_split | 0.126669 / 0.296338 (-0.169670) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.438361 / 0.215209 (0.223152) |
| read 50000 | 4.353685 / 2.077655 (2.276031) |
| read_batch 50000 10 | 2.185084 / 1.504120 (0.680964) |
| read_batch 50000 100 | 2.005412 / 1.541195 (0.464217) |
| read_batch 50000 1000 | 2.046318 / 1.468490 (0.577828) |
| read_formatted numpy 5000 | 0.686906 / 4.584777 (-3.897871) |
| read_formatted pandas 5000 | 3.755917 / 3.745712 (0.010205) |
| read_formatted tensorflow 5000 | 2.067455 / 5.269862 (-3.202407) |
| read_formatted torch 5000 | 1.316099 / 4.565676 (-3.249578) |
| read_formatted_batch numpy 5000 10 | 0.085162 / 0.424275 (-0.339113) |
| read_formatted_batch numpy 5000 1000 | 0.011922 / 0.007607 (0.004315) |
| shuffled read 5000 | 0.547884 / 0.226044 (0.321839) |
| shuffled read 50000 | 5.518501 / 2.268929 (3.249573) |
| shuffled read_batch 50000 10 | 2.720449 / 55.444624 (-52.724176) |
| shuffled read_batch 50000 100 | 2.359559 / 6.876477 (-4.516918) |
| shuffled read_batch 50000 1000 | 2.477571 / 2.142072 (0.335498) |
| shuffled read_formatted numpy 5000 | 0.836104 / 4.805227 (-3.969123) |
| shuffled read_formatted_batch numpy 5000 10 | 0.167490 / 6.500664 (-6.333174) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.063989 / 0.075469 (-0.011480) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.535444 / 1.841788 (-0.306343) |
| map fast-tokenizer batched | 14.895141 / 8.074308 (6.820833) |
| map identity | 12.202024 / 10.191392 (2.010632) |
| map identity batched | 0.904579 / 0.680424 (0.224155) |
| map no-op batched | 0.592471 / 0.534201 (0.058270) |
| map no-op batched numpy | 0.415791 / 0.579283 (-0.163492) |
| map no-op batched pandas | 0.409800 / 0.434364 (-0.024564) |
| map no-op batched pytorch | 0.247866 / 0.540337 (-0.292472) |
| map no-op batched tensorflow | 0.252835 / 1.386936 (-1.134101) |
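
Each benchmark cell above is reported as `new / old (diff)`. A quick way to sanity-check that reading is to parse a row and compare `new - old` with the printed diff. The sketch below does this for the `benchmark_getitem_100B.json` row reported under PyArrow==6.0.0; the assumption that diff equals new minus old (up to rounding of the printed values) is inferred from the numbers, and `CELL_PATTERN` and `parse_cells` are hypothetical helper names.

```python
import re

# Matches one "new / old (diff)" cell; diff may be negative.
CELL_PATTERN = re.compile(r"(-?\d+\.\d+) / (-?\d+\.\d+) \((-?\d+\.\d+)\)")


def parse_cells(row):
    """Return (new, old, diff) float triples found in a 'new / old (diff)' row."""
    return [tuple(float(x) for x in match) for match in CELL_PATTERN.findall(row)]


# Values copied from the benchmark_getitem_100B.json row for PyArrow==6.0.0.
row = (
    "0.202708 / 0.018006 (0.184702) 0.440728 / 0.000490 (0.440238) "
    "0.001022 / 0.000200 (0.000822) 0.000080 / 0.000054 (0.000025)"
)

checks = []
for new, old, diff in parse_cells(row):
    # The printed values are rounded to six decimals, so allow a small tolerance.
    consistent = abs((new - old) - diff) < 2e-6
    checks.append(consistent)
    print(f"new={new:.6f} old={old:.6f} diff={diff:+.6f} consistent={consistent}")
```
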
