Add Librispeech ASR #1767

patrickvonplaten · 2021-01-22T14:54:37Z

This PR adds the librispeech asr dataset: https://www.tensorflow.org/datasets/catalog/librispeech

There are 2 configs: "clean" and "other". The "clean" config has two "train" splits, hence the names "train.100" and "train.360".

As suggested by @lhoestq, due to the enormous size of the dataset in .arrow format, the speech files are not directly converted to a float32 array; instead, only the path to the audio file is stored.
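To give a rough sense of why decoded audio would be so large, here is a back-of-the-envelope sketch. It assumes that "train.100" and "train.360" hold roughly 100 and 360 hours of audio (inferred from the split names) and that the audio is sampled at 16 kHz; the function name `float32_size_gb` is hypothetical and not part of this PR.

```python
# Rough size estimate for storing decoded audio as float32 samples.
# Assumptions (not taken from the PR): 16 kHz mono audio, and split
# durations of about 100 h and 360 h as suggested by the split names.

SAMPLE_RATE_HZ = 16_000
BYTES_PER_FLOAT32 = 4
SECONDS_PER_HOUR = 3600


def float32_size_gb(hours: float, sample_rate_hz: int = SAMPLE_RATE_HZ) -> float:
    """Return the approximate size in GB (10**9 bytes) of `hours` of mono
    audio stored as uncompressed float32 samples."""
    num_samples = hours * SECONDS_PER_HOUR * sample_rate_hz
    return num_samples * BYTES_PER_FLOAT32 / 1e9


if __name__ == "__main__":
    for split, hours in {"train.100": 100, "train.360": 360}.items():
        print(f"{split}: ~{float32_size_gb(hours):.1f} GB as float32")
```

By comparison, a file path is a short string per example, so storing only the path keeps the .arrow file small and leaves decoding to whoever actually needs the waveform.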

datasets/librispeech_asr/librispeech_asr.py

src/datasets/load.py

lhoestq

Nice thank you !

I added a few comments

datasets/librispeech_asr/README.md

datasets/librispeech_asr/librispeech_asr.py

lhoestq

Awesome thank you !

The dummy data are quite big but it was expected given that the raw files are flac files.
Given that the script doesn't even read the flac files I think we can remove them. Or maybe use empty flac files (see here for example). What do you think ?

We'll find a better solution to be able to have bigger dummy_data (max 1MB instead of a few KB), maybe using git LFS.

patrickvonplaten · 2021-01-25T20:32:46Z

> Awesome thank you !
>
> The dummy data are quite big but it was expected given that the raw files are flac files.
> Given that the script doesn't even read the flac files I think we can remove them. Or maybe use empty flac files (see here for example). What do you think ?
>
> We'll find a better solution to be able to have bigger dummy_data (max 1MB instead of a few KB), maybe using git LFS.

Hmm, I already made the dummy data as small as possible (only a single flac file per split). I'd like to keep them, at least to have complete dummy data, and I don't think 500KB for all datasets together is a problem (the long-range summarization datasets are similarly heavy). The moment we allow dummy data to be loaded directly for testing, we need the flac files IMO.

But I agree that in the long term we need a better solution for the dummy data (maybe stop hosting it on GitHub so the repo doesn't get too heavy).

add script

73f5a49

patrickvonplaten changed the title ~~add script~~ [WIP] add script Jan 22, 2021

patrickvonplaten added 4 commits January 24, 2021 20:44

correct librispeech

3df6394

correct description

9549cc7

add dataset_infos json

8ea5011

finish librispeech asr

373d628

patrickvonplaten changed the title ~~[WIP] add script~~ Add Librispeech ASR Jan 25, 2021

patrickvonplaten added 4 commits January 25, 2021 10:06

reduce size dummy data

43f4913

fix encoding

18ce4dc

add readme

ad182f8

finish

897ed5c

patrickvonplaten commented Jan 25, 2021

datasets/librispeech_asr/librispeech_asr.py

fix docstring regex

c304e63

patrickvonplaten commented Jan 25, 2021

src/datasets/load.py

lhoestq reviewed Jan 25, 2021

patrickvonplaten and others added 6 commits January 25, 2021 15:45

Update datasets/librispeech_asr/README.md

bad151f

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

Update datasets/librispeech_asr/README.md

0e6a11e

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

Update datasets/librispeech_asr/librispeech_asr.py

942b38e

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

apply quentins commens

9b60219

Merge branch 'add_librispeech_asr' of https://github.com/patrickvonpl…

c0f628d

…aten/datasets-1 into add_librispeech_asr

correct librispeech

b2e897e

patrickvonplaten requested a review from lhoestq January 25, 2021 16:58

lhoestq reviewed Jan 25, 2021

patrickvonplaten merged commit 312a2d6 into huggingface:master Jan 25, 2021

patrickvonplaten deleted the add_librispeech_asr branch January 25, 2021 20:38

anton-l mentioned this pull request Feb 15, 2021

Add LJ Speech dataset #1878

Merged
