Add docs for audio processing (#3222)
* ✨ add docs for audio processing

* add new doc to toctree

* minor fixes

* add feedback from review

* improve gif

* add feedback from review
stevhliu authored Nov 24, 2021
1 parent ace1d76 commit a8f96b3
Showing 4 changed files with 127 additions and 1 deletion.
125 changes: 125 additions & 0 deletions docs/source/audio_process.rst
@@ -0,0 +1,125 @@
Process audio data
==================

🤗 Datasets supports an :class:`datasets.Audio` feature, enabling users to load and process raw audio files for training. This guide will show you how to:

* Load your own custom audio dataset.
* Resample audio files.
* Use :func:`datasets.Dataset.map` with audio files.

Installation
------------

The :class:`datasets.Audio` feature is experimental and is shipped as an extra dependency of 🤗 Datasets. Install it with pip:

.. code::

    pip install datasets[audio]

You should also install `torchaudio <https://pytorch.org/audio/stable/index.html>`_ and `librosa <https://librosa.org/doc/latest/index.html>`_, two common libraries used by 🤗 Datasets for handling audio data:

.. code::

    pip install librosa
    pip install torchaudio

.. important::

    torchaudio's ``sox_io`` `backend <https://pytorch.org/audio/stable/backend.html#>`_ supports decoding ``mp3`` files. Unfortunately, the ``sox_io`` backend is only available on Linux/macOS and is not supported on Windows.
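
If you are unsure which backends your environment provides, you can list them with torchaudio. This is a minimal sketch (assuming torchaudio is already installed); the exact output depends on your platform:

.. code-block::

    >>> import torchaudio
    >>> torchaudio.list_audio_backends()  # 'sox_io' is expected on Linux/macOS
    ['sox_io', 'soundfile']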

Then you can load an audio dataset the same way you would load a text dataset. For example, load the `Common Voice <https://huggingface.co/datasets/common_voice>`_ dataset with the Turkish configuration:

.. code-block::

    >>> from datasets import load_dataset, load_metric, Audio
    >>> common_voice = load_dataset("common_voice", "tr", split="train")

Audio datasets
--------------

Audio datasets commonly have an ``audio`` column and a ``path`` or ``file`` column.

``audio`` contains the actual audio data, which is decoded and resampled on-the-fly when you access it:

.. code::

    >>> common_voice[0]["audio"]
    {'array': array([ 0.0000000e+00, 0.0000000e+00, 0.0000000e+00, ...,
            -8.8930130e-05, -3.8027763e-05, -2.9146671e-05], dtype=float32),
     'path': '/root/.cache/huggingface/datasets/downloads/extracted/05be0c29807a73c9b099873d2f5975dae6d05e9f7d577458a2466ecb9a2b0c6b/cv-corpus-6.1-2020-12-11/tr/clips/common_voice_tr_21921195.mp3',
     'sampling_rate': 48000}

When you access an audio file, it is automatically decoded and resampled. Generally, you should query an audio file like: ``common_voice[0]["audio"]``. If you query an audio file with ``common_voice["audio"][0]`` instead, **all** the audio files in your dataset will be decoded and resampled. This process can take a long time if you have a large dataset.

``path`` or ``file`` is an absolute path to an audio file.

.. code::

    >>> common_voice[0]["path"]
    '/root/.cache/huggingface/datasets/downloads/extracted/05be0c29807a73c9b099873d2f5975dae6d05e9f7d577458a2466ecb9a2b0c6b/cv-corpus-6.1-2020-12-11/tr/clips/common_voice_tr_21921195.mp3'

The ``path`` is useful if you want to load your own audio dataset. In this case, provide a column of audio file paths to :meth:`datasets.Dataset.cast_column`:

.. code::

    >>> my_audio_dataset = my_audio_dataset.cast_column("paths_to_my_audio_files", Audio())

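For example, a minimal sketch of building a dataset from your own files could look like the following; the ``clip_*.wav`` file names are hypothetical placeholders for your own audio paths:

.. code-block::

    >>> from datasets import Dataset, Audio
    >>> # hypothetical file paths; replace them with your own audio files
    >>> my_audio_dataset = Dataset.from_dict({"paths_to_my_audio_files": ["clip_1.wav", "clip_2.wav"]})
    >>> my_audio_dataset = my_audio_dataset.cast_column("paths_to_my_audio_files", Audio())
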
Resample
--------

Some models expect the audio data to have a certain sampling rate because of how the model was pretrained. For example, the `XLSR-Wav2Vec2 <https://huggingface.co/facebook/wav2vec2-large-xlsr-53>`_ model expects its input to have a sampling rate of 16kHz, but audio files from the Common Voice dataset have a sampling rate of 48kHz. You can use :meth:`datasets.Dataset.cast_column` to downsample the audio to 16kHz:

.. code::

    >>> common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16_000))

The next time you access the audio file, the :class:`datasets.Audio` feature decodes and resamples it to 16kHz:

.. code::

    >>> common_voice[0]["audio"]
    {'array': array([ 0.0000000e+00, 0.0000000e+00, 0.0000000e+00, ...,
            -7.4556941e-05, -1.4621433e-05, -5.7861507e-05], dtype=float32),
     'path': '/root/.cache/huggingface/datasets/downloads/extracted/05be0c29807a73c9b099873d2f5975dae6d05e9f7d577458a2466ecb9a2b0c6b/cv-corpus-6.1-2020-12-11/tr/clips/common_voice_tr_21921195.mp3',
     'sampling_rate': 16000}

.. image:: /imgs/resample.gif
:align: center

``Map``
-------

Just like text datasets, you can apply a preprocessing function over an entire dataset with :func:`datasets.Dataset.map`, which is useful for preprocessing all of your audio data at once. Start with a `speech recognition model <https://huggingface.co/models?pipeline_tag=automatic-speech-recognition&sort=downloads>`_ of your choice, and load a ``processor`` object that contains:

1. A feature extractor to convert the speech signal to the model's input format. Every speech recognition model on the 🤗 `Hub <https://huggingface.co/models?pipeline_tag=automatic-speech-recognition&sort=downloads>`_ contains a predefined feature extractor that can be easily loaded with ``AutoFeatureExtractor.from_pretrained(...)``.

2. A tokenizer to convert the model's output format to text. Fine-tuned speech recognition models, such as `facebook/wav2vec2-base-960h <https://huggingface.co/facebook/wav2vec2-base-960h>`_, contain a predefined tokenizer that can be easily loaded with ``AutoTokenizer.from_pretrained(...)``.

For pretrained speech recognition models, such as `facebook/wav2vec2-large-xlsr-53 <https://huggingface.co/facebook/wav2vec2-large-xlsr-53>`_, a tokenizer needs to be created from the target text as explained `here <https://huggingface.co/blog/fine-tune-wav2vec2-english>`_. The following example demonstrates how to load a feature extractor, tokenizer and processor for a pretrained speech recognition model:

.. code-block::

    >>> from transformers import AutoFeatureExtractor, Wav2Vec2CTCTokenizer, Wav2Vec2Processor
    >>> model_checkpoint = "facebook/wav2vec2-large-xlsr-53"
    >>> # after defining a vocab.json file you can instantiate a tokenizer object:
    >>> tokenizer = Wav2Vec2CTCTokenizer("./vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")
    >>> feature_extractor = AutoFeatureExtractor.from_pretrained(model_checkpoint)
    >>> processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

For fine-tuned speech recognition models, you can simply load a predefined ``processor`` object with:

.. code-block::

    >>> from transformers import Wav2Vec2Processor
    >>> processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")

Make sure to access the ``audio`` column in your preprocessing function when you call :func:`datasets.Dataset.map` so that the audio data is actually decoded and resampled:

.. code-block::

    >>> def prepare_dataset(batch):
    ...     audio = batch["audio"]
    ...     batch["input_values"] = processor(audio["array"], sampling_rate=audio["sampling_rate"]).input_values[0]
    ...     batch["input_length"] = len(batch["input_values"])
    ...     with processor.as_target_processor():
    ...         batch["labels"] = processor(batch["sentence"]).input_ids
    ...     return batch
    >>> common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names)
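
As a quick sanity check, the mapped dataset should now only contain the columns created in ``prepare_dataset`` (the exact ordering of the column names may differ):

.. code-block::

    >>> common_voice.column_names
    ['input_values', 'input_length', 'labels']
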
Binary file added docs/source/imgs/resample.gif
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -71,6 +71,7 @@ Find your dataset today on the `Hugging Face Hub <https://huggingface.co/dataset
how_to
loading
process
audio_process
stream
share
dataset_script
2 changes: 1 addition & 1 deletion docs/source/use_dataset.rst
@@ -132,6 +132,6 @@ means they can be passed directly to methods like `model.fit()`. `to_tf_dataset(
.. tip::

-    ``to_tf_dataset`` is the easiest way to create a TensorFlow compatible dataset. If you don't want a `tf.data.Dataset` and would rather the dataset emit `tf.Tensor` objects, take a look at the :ref:`format` section instead!
+    ``to_tf_dataset`` is the easiest way to create a TensorFlow compatible dataset. If you don't want a ``tf.data.Dataset`` and would rather the dataset emit ``tf.Tensor`` objects, take a look at the :ref:`format` section instead!

Your dataset is now ready for use in a training loop!

1 comment on commit a8f96b3

@github-actions posted automated benchmark results (new vs. old timings) for PyArrow==3.0.0 and PyArrow==latest, covering benchmark_array_xd.json, benchmark_getitem_100B.json, benchmark_indices_mapping.json, benchmark_iterating.json, and benchmark_map_filter.json.