Add Librispeech ASR #1767
Nice thank you !
I added a few comments
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
…aten/datasets-1 into add_librispeech_asr
Awesome thank you !
The dummy data are quite big but it was expected given that the raw files are flac files.
Given that the script doesn't even read the flac files I think we can remove them. Or maybe use empty flac files (see here for example). What do you think ?
We'll find a better solution to be able to have bigger dummy_data (max 1MB instead of a few KB), maybe using git LFS.
Hmm, I already made the dummy data as small as possible (a single flac file per split only). I'd like to keep them, at least to have complete dummy data, and I don't think 500KB for all datasets together is a problem (the long-range summarization datasets are similarly heavy). The moment we allow dummy data to be loaded directly for testing, we need the flac files IMO. But I agree that long term, we need a better solution for the dummy data (maybe stop hosting it on GitHub to not make the repo too heavy).
This PR adds the librispeech asr dataset: https://www.tensorflow.org/datasets/catalog/librispeech
There are 2 configs: "clean" and "other". The "clean" config has two "train" splits, hence the names "train.100" and "train.360".
As suggested by @lhoestq, due to the enormous size of the dataset in `.arrow` format, the speech files are not decoded into a float32 array during preparation; instead, only the path to the speech file is stored.
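To illustrate the "store the path, decode later" idea, here is a minimal sketch of how a user could decode audio on demand from a stored path. It is not the dataset script from this PR. The field names `file`, `text`, `speech` and `sampling_rate` and the helpers `write_dummy_wav` and `decode_audio` are hypothetical, and a tiny WAV file read with the standard-library `wave` module stands in for a real flac file (decoding flac would need a third-party library).

```python
import tempfile
import wave
from array import array
from pathlib import Path


def write_dummy_wav(path, samples, sample_rate=16000):
    """Write 16-bit mono PCM samples to `path` (stand-in for a real .flac file)."""
    with wave.open(str(path), "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        f.setframerate(sample_rate)
        f.writeframes(array("h", samples).tobytes())


def decode_audio(example):
    """Read the file behind example["file"] and return a copy of the example
    with the decoded samples added as a float32 array scaled to [-1.0, 1.0)."""
    with wave.open(example["file"], "rb") as f:
        raw = f.readframes(f.getnframes())
        rate = f.getframerate()
    pcm = array("h")
    pcm.frombytes(raw)
    # typecode "f" is a 4-byte C float, i.e. float32
    speech = array("f", (sample / 32768.0 for sample in pcm))
    return {**example, "speech": speech, "sampling_rate": rate}


# The stored example only carries the path (hypothetical field names).
tmp_dir = Path(tempfile.mkdtemp())
wav_path = tmp_dir / "dummy.wav"
write_dummy_wav(wav_path, [0, 16384, -16384, 32767])
example = {"file": str(wav_path), "text": "HELLO WORLD"}

# Decoding happens only when the audio is actually needed.
decoded = decode_audio(example)
```

With this layout the prepared table only holds short strings, and the cost of decoding is paid per example at the point of use, e.g. inside a mapping or preprocessing step.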