Make several audio datasets streamable #3290

lhoestq · 2021-11-17T17:43:41Z

~~Needs #3129 to be merged first~~

Make the following audio datasets streamable:

- common_voice
- openslr
- vivos
- librispeech_asr ~~(still has some issues to read FLAC)~~ actually it's OK
- ~~multilingual_librispeech (yet to be converted)~~ TODO in a separate PR
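
For context, streaming a TAR-based dataset roughly means reading archive members one after another instead of extracting the archive to disk first. The sketch below shows that idea with Python's standard `tarfile` module only. `iter_tar_members` and `build_demo_archive` are hypothetical helpers written for illustration; they are not the `datasets` implementation behind `dl_manager.iter_archive`, and the payloads are placeholder bytes rather than real audio.

```python
import io
import tarfile


def iter_tar_members(fileobj):
    """Yield (path_inside_archive, file_object) pairs in archive order.

    Hypothetical helper for illustration. The TAR is opened in stream mode
    ("r|*"), so members are read sequentially without seeking backwards
    and without extracting anything to disk.
    """
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:
        for member in tar:
            if member.isfile():
                yield member.name, tar.extractfile(member)


def build_demo_archive():
    """Build a tiny in-memory TAR holding two fake audio files."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, payload in [("clips/a.wav", b"AAAA"), ("clips/b.wav", b"BB")]:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


# Each file object must be consumed before advancing to the next member,
# because a streamed archive cannot be rewound.
sizes = {path: len(f.read()) for path, f in iter_tar_members(build_demo_archive())}
print(sizes)
```

Note that the paths yielded here are relative to the archive root, not locations on the local filesystem.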


albertvillanova · 2021-11-18T06:20:10Z

Reading FLAC (for librispeech_asr) works OK for me (soundfile version: 0.10.3):

In [2]: ds = load_dataset("datasets/librispeech_asr/librispeech_asr.py", "clean", streaming=True, split="train.100")

In [3]: item = next(iter(ds))

In [4]: item.keys()
Out[4]: dict_keys(['file', 'audio', 'text', 'speaker_id', 'chapter_id', 'id'])

In [5]: item["file"]
Out[5]: '374-180298-0000.flac'

In [6]: item["audio"].keys()
Out[6]: dict_keys(['path', 'array', 'sampling_rate'])

In [7]: item["audio"]["sampling_rate"]
Out[7]: 16000

In [8]: item["audio"]["path"]
Out[8]: '374-180298-0000.flac'

In [9]: item["audio"]["array"].shape
Out[9]: (232480,)
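
The `audio` field in the session above is a dict with `path`, `array` and `sampling_rate`. As a rough illustration of how such a dict can be produced from in-memory bytes rather than from a file on disk, here is a sketch that uses only the standard-library `wave` module as a stand-in decoder. It is an assumption-laden toy: `make_wav_bytes` and `decode_wav_bytes` are hypothetical names, the input is a generated silent WAV (not FLAC), and the real library relies on other backends such as soundfile.

```python
import io
import struct
import wave


def make_wav_bytes(num_samples=1600, sampling_rate=16000):
    """Create a silent 16-bit mono WAV file in memory (demo data only)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sampling_rate)
        wav.writeframes(b"\x00\x00" * num_samples)
    return buffer.getvalue()


def decode_wav_bytes(path, data):
    """Decode WAV bytes into a dict shaped like the `audio` field above.

    Hypothetical helper: `path` is kept only as a label, all decoding
    happens from the in-memory `data`.
    """
    with wave.open(io.BytesIO(data), "rb") as wav:
        sampling_rate = wav.getframerate()
        frames = wav.readframes(wav.getnframes())
    # Interpret the raw frames as little-endian signed 16-bit integers.
    samples = struct.unpack(f"<{len(frames) // 2}h", frames)
    return {"path": path, "array": list(samples), "sampling_rate": sampling_rate}


audio = decode_wav_bytes("example.wav", make_wav_bytes())
duration = len(audio["array"]) / audio["sampling_rate"]
print(audio["path"], audio["sampling_rate"], len(audio["array"]), duration)
```

The same division applied to the values in the session (232480 samples at 16000 Hz) gives a clip of about 14.53 seconds.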

lhoestq · 2021-11-18T11:40:24Z

Oh cool! Then I think the problem came from my local soundfile installation.

lhoestq · 2021-11-18T16:42:43Z

I'll do multilingual_librispeech in a separate PR, since it requires the data to be in another format (in particular, the train/dev/test splits need to be separated into different files).
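
One plausible reading of why separate files per split help with streaming (an assumption, not something stated in this thread): a streamed archive is read sequentially, so a single archive containing every split has to be scanned in full even when only one split is wanted. The sketch below compares the two layouts on toy in-memory archives. `make_tar`, `count_members_read` and the file names are hypothetical.

```python
import io
import tarfile


def make_tar(files):
    """Build an in-memory TAR from a {path: bytes} mapping (demo data only)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


def count_members_read(archive, prefix=""):
    """Stream an archive; return (members scanned, members matching prefix)."""
    scanned = kept = 0
    with tarfile.open(fileobj=archive, mode="r|*") as tar:
        for member in tar:
            scanned += 1
            if member.name.startswith(prefix):
                kept += 1
    return scanned, kept


# Layout A (hypothetical): one archive holding every split.
combined = make_tar(
    {"train/a.flac": b"x", "train/b.flac": b"x", "dev/c.flac": b"x", "test/d.flac": b"x"}
)
# Layout B (hypothetical): one archive per split.
dev_only = make_tar({"dev/c.flac": b"x"})

combined_result = count_members_read(combined, prefix="dev/")
per_split_result = count_members_read(dev_only)
print(combined_result, per_split_result)  # (4, 1) (1, 1)
```

In the combined layout all four members are scanned to find the single `dev/` file; with one archive per split only that split's members are read.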

patrickvonplaten · 2022-02-01T20:58:16Z

@lhoestq @albertvillanova - I think it would have been nice to add a big message at the top stating that this is a breaking change, and to ping the transformers people a bit more here.

patrickvonplaten · 2022-02-01T21:00:51Z

datasets/common_voice/common_voice.py

-                    "filepath": os.path.join(abs_path_to_data, "train.tsv"),
-                    "path_to_clips": abs_path_to_clips,
+                    "files": dl_manager.iter_archive(archive),
+                    "filepath": "/".join([path_to_data, "train.tsv"]),

This is breaking, no?
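
To make the concern concrete, here is a sketch of what generating examples from archive members (rather than from absolute local paths) might look like. It is not the actual `common_voice.py` code: `iter_members`, `make_archive`, `generate_examples`, the archive layout and the `bytes` key are all assumptions for illustration. It also assumes the TSV metadata file appears before the clips inside the archive.

```python
import io
import os
import tarfile


def make_archive():
    """Build a toy archive: a TSV metadata file followed by one fake clip."""
    files = {
        "cv/train.tsv": b"path\tsentence\nclip1.mp3\thello\n",
        "cv/clips/clip1.mp3": b"fake-mp3-bytes",
    }
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


def iter_members(fileobj):
    """Yield (path_inside_archive, file_object) pairs in archive order."""
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:
        for member in tar:
            if member.isfile():
                yield member.name, tar.extractfile(member)


def generate_examples(files, filepath, path_to_clips):
    """Yield examples in a single pass over the archive members."""
    metadata = {}
    for path, f in files:
        if path == filepath:
            # Parse the TSV and index rows by the clip's path inside the archive.
            lines = f.read().decode("utf-8").splitlines()
            header = lines[0].split("\t")
            for line in lines[1:]:
                row = dict(zip(header, line.split("\t")))
                metadata["/".join([path_to_clips, row["path"]])] = row
        elif path in metadata:
            row = metadata[path]
            yield {
                "path": path,  # relative to the archive, not a file on disk
                "sentence": row["sentence"],
                "audio": {"path": path, "bytes": f.read()},
            }


examples = list(generate_examples(iter_members(make_archive()), "cv/train.tsv", "cv/clips"))
example = examples[0]
print(example["path"], os.path.isabs(example["path"]))
```

Under these assumptions, code that previously opened `example["path"]` directly would no longer find a file at that location and would need to read the decoded audio data instead, which matches the issue reported in #3663.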

albertvillanova and others added 30 commits October 21, 2021 10:37

Add test fixture for TAR WAV file

25dca5f

Add test iter_archive

52cc44d

Test dataset with Audio feature for TAR archive

8ff699d

Add Audio method to decode from bytes instead of path

3d20ee5

Add Audio support for bytes besides path

105ead7

Fix docstring

a869469

Stream TAR-based Audio datasets

f0911cd

Merge remote-tracking branch 'upstream/master' into audio-tar

79465af

Remove archived attribute from test audio with TAR archive

f224b68

Remove archived attribute from Audio feature

ebb1a1c

Implement Audio.encode_example

1cc27a0

Call Audio.encode_example from encode_nested_example

4579b76

Fix docs

0d2a3d8

Enhance Audio.decode_example to accept a string

3d35ada

Fix docs

ec5f7b0

Implement private Audio._storage_dtype to specify cached dtype

21488c0

Change Audio._storage_dtype dynamically when encoding a string

83f04cd

Update test of Audio instantiation

7a3f066

Set ArrowWriter.schema property dynamically calculated from features

ece5b97

Update ArrowWriter.write_examples_on_file

38c80cc

Update ArrowWriter._build_writer

7787985

Fix code quality

090723e

Replace _schema with schema and condition on schema in ArrowWriter

7f58777

Add test for MP3 TAR audio file

583be77

Refactor Audio decode_example

8dbe0d7

Pass raw bytes to torchaudio.load

c973209

Revert "Pass raw bytes to torchaudio.load"

7363e9a

This reverts commit c973209.

Pass format to load in _decode_example_with_torchaudio

9f61ab8

Fix filename extension in test

efa4c25

Fix Audio tests CI

659fb78

albertvillanova and others added 13 commits November 16, 2021 14:10

Fix Audio tests CI

2fc997a

Fix audio test CI by checking out PR HEAD commit instead of merge commit

416d1bf

Merge remote-tracking branch 'upstream/master' into audio-tar

1e5dc25

Change default Audio storage dtype to string

5f16240

Rename Audio decode functions

488b74a

Refactor Audio decode_example

0ae5d44

Force CI re-run

4679d8e

Refactor and rename

e178cc7

Fix docstring

4c4a687

Merge branch 'master' into stream-tar-audio

eb923d2

put back the Audio feature

adbcc25

Merge branch 'audio-tar' into stream-tar-audio

de4d5f9

fix openslr

25f1806

lhoestq changed the title ~~Stream tar audio~~ Make several audio datasets streamable Nov 17, 2021

Merge branch 'master' into stream-tar-audio

1b441dd

Quentin Lhoest and others added 3 commits November 19, 2021 09:56

fix common_voice

7f67477

update infos

45ed8cd

fix dummy data

63d0d47

lhoestq marked this pull request as ready for review November 19, 2021 15:08

lhoestq merged commit 0534a87 into master Nov 19, 2021

lhoestq deleted the stream-tar-audio branch November 19, 2021 15:08

patrickvonplaten mentioned this pull request Feb 1, 2022

[Audio] Path of Common Voice cannot be used for audio loading anymore #3663

Closed

patrickvonplaten reviewed Feb 1, 2022
