Add Librispeech ASR #1767

Merged

patrickvonplaten
Contributor

@patrickvonplaten patrickvonplaten commented Jan 22, 2021

This PR adds the LibriSpeech ASR dataset: https://www.tensorflow.org/datasets/catalog/librispeech

There are two configs, "clean" and "other". The "clean" config has two "train" datasets, hence the names "train.100" and "train.360".
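
As a rough sketch of how the config and split names above might be used, assuming the dataset is registered under the script's directory name `librispeech_asr` and that the installed `datasets` version still supports it. `load_clean_train` is a hypothetical helper, not part of this PR:

```python
# Names taken from the PR description; the splits of the "other" config
# are not listed there, so they are left out here.
CONFIGS = ("clean", "other")
CLEAN_TRAIN_SPLITS = ("train.100", "train.360")


def load_clean_train(split: str):
    """Load one of the two "clean" training splits (hypothetical helper)."""
    if split not in CLEAN_TRAIN_SPLITS:
        raise ValueError(f"unknown clean training split: {split!r}")
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    # Assumption: the dataset name matches the script directory added in this PR.
    return load_dataset("librispeech_asr", "clean", split=split)
```

Calling `load_clean_train("train.100")` would download and prepare the data, so it is not run here.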

As suggested by @lhoestq, due to the enormous size of the dataset in .arrow format, the speech files are not decoded into float32 arrays up front; instead, only the path to each audio file is stored.
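
To give a feel for the size concern, here is a back-of-the-envelope estimate of what the decoded audio would occupy as float32. It assumes a 16 kHz sampling rate and that the numbers in the split names are hours of mono audio; neither is stated in this PR, and `decoded_size_gb` is an illustrative name:

```python
SAMPLE_RATE_HZ = 16_000  # assumption: audio is sampled at 16 kHz
BYTES_PER_SAMPLE = 4     # one float32 value per sample


def decoded_size_gb(hours: float) -> float:
    """Approximate size in GB (10**9 bytes) of `hours` of mono float32 audio."""
    samples = hours * 3600 * SAMPLE_RATE_HZ
    return samples * BYTES_PER_SAMPLE / 1e9


# Assumption: "train.100" and "train.360" hold roughly 100 and 360 hours.
for split, hours in {"train.100": 100, "train.360": 360}.items():
    print(f"{split}: ~{decoded_size_gb(hours):.1f} GB as float32")
```

Under those assumptions the two "clean" training splits alone come to roughly 23 GB and 83 GB uncompressed, which is why storing only the file path is attractive.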

@patrickvonplaten patrickvonplaten changed the title add script [WIP] add script Jan 22, 2021
@patrickvonplaten patrickvonplaten changed the title [WIP] add script Add Librispeech ASR Jan 25, 2021
Member

@lhoestq lhoestq left a comment

Nice thank you !

I added a few comments

Resolved review threads:
datasets/librispeech_asr/README.md (2 threads, 1 outdated)
datasets/librispeech_asr/librispeech_asr.py (4 threads, 2 outdated)
patrickvonplaten and others added 6 commits January 25, 2021 15:45
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Member

@lhoestq lhoestq left a comment

Awesome thank you !

The dummy data are quite big but it was expected given that the raw files are flac files.
Given that the script doesn't even read the flac files I think we can remove them. Or maybe use empty flac files (see here for example). What do you think ?

We'll find a better solution to be able to have bigger dummy_data (max 1MB instead of a few KB), maybe using git LFS.

@patrickvonplaten
Contributor Author

patrickvonplaten commented Jan 25, 2021

> Awesome thank you !
>
> The dummy data are quite big but it was expected given that the raw files are flac files.
> Given that the script doesn't even read the flac files I think we can remove them. Or maybe use empty flac files (see here for example). What do you think ?
>
> We'll find a better solution to be able to have bigger dummy_data (max 1MB instead of a few KB), maybe using git LFS.

Hmm, I already made the dummy data as small as possible (a single flac file per split only). I'd like to keep them, at least to have complete dummy data, and I don't think 500KB for all datasets together is a problem (the long-range summarization datasets are similarly heavy). The moment we allow dummy data to be loaded directly for testing, we need the flac files IMO.

But I agree that in the long term we need a better solution for the dummy data (maybe stop hosting it on GitHub so the repo doesn't get too heavy).

@patrickvonplaten patrickvonplaten merged commit 312a2d6 into huggingface:master Jan 25, 2021
@patrickvonplaten patrickvonplaten deleted the add_librispeech_asr branch January 25, 2021 20:38
@anton-l anton-l mentioned this pull request Feb 15, 2021