Add LibriLightLimited dataset #2302

Closed
wants to merge 11 commits
Conversation

nateanl
Member

@nateanl nateanl commented Mar 31, 2022

The LibriLightLimited dataset is created for fine-tuning SSL models such as Wav2Vec2 and HuBERT. It is the supervised subset of the Libri-Light dataset. To distinguish the unsupervised subset from the supervised one, it is clearer to put the latter in a separate dataset class dedicated to fine-tuning.
It contains the "10 min", "1 hour", and "10 hour" splits.
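
A minimal usage sketch, assuming the constructor follows the other torchaudio datasets (root, subset, download) and that each item mirrors the LIBRISPEECH tuple; the subset names and return format here are assumptions, not necessarily the final API:

from torchaudio.datasets import LibriLightLimited

# Assumed subset names: "10min", "1h", "10h".
dataset = LibriLightLimited("data/", subset="10h", download=True)

# Assumed to mirror LIBRISPEECH:
# (waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id)
waveform, sample_rate, transcript, *_ = dataset[0]
print(sample_rate, transcript)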

@nateanl changed the title from "Add LibriSpeechFineTune dataset" to "Add LibriLightLimited dataset" on Apr 22, 2022
@nateanl nateanl marked this pull request as ready for review April 22, 2022 13:27
root = os.fspath(root)
self._path = os.path.join(root, "librispeech_finetuning")
archive = os.path.join(root, "librispeech_finetuning" + ".tgz")
if download:

Contributor

Could you throw a runtime error if the dataset cannot be found locally and download=False?

Member Author

Thanks. I noticed that the quesst14 dataset checks whether the archive file is available; users often delete the archive and keep only the extracted files. Maybe change it to check _path instead?

Member Author
@nateanl May 11, 2022

My previous comment doesn't make sense. After checking the code of quesst14:

# If the extracted directory is missing, fall back to the archive.
if not os.path.isdir(self._path):
    if not os.path.isfile(archive):
        if not download:
            raise RuntimeError("Dataset not found. Please use `download=True` to download")
        download_url_to_file(URL, archive, hash_prefix=_CHECKSUM)
    # Extract whatever archive is present, whether downloaded or pre-existing.
    extract_archive(archive, root)

If the archive file is on the user's local disk and self._path is not there, it will extract the archive for the user. That is correct when the archive is the desired one. However, if the archive is a different file that happens to have the same name, we may not get the expected extracted files. What do you think?
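
One way the risk of a mismatched local archive could be reduced (an illustration under assumptions, not something this PR implements) is to verify the existing archive against the published checksum before extracting it:

import hashlib

def archive_matches_checksum(path, hash_prefix):
    # Illustration only: reject a same-named but different archive by
    # comparing its SHA-256 digest against the known prefix, the same
    # value passed to download_url_to_file as hash_prefix.
    sha = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            sha.update(chunk)
    return sha.hexdigest().startswith(hash_prefix)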

Contributor

@nateanl hmm ok, so currently this is what happens for quesst14:

  • if neither the folder nor the archive (.tgz) for the dataset exists, it will either throw a download error or correctly download and extract the files, which is the intended behavior
  • if the folder doesn't exist but the archive exists, it will extract the archive files and therefore create the folder

For the second case, you're saying the archive file may not be correct, so we would extract incorrect files. To get around this, we could raise the download error whenever self._path does not exist, which means the user would download and extract the files themselves and then pass in the corresponding root. Should we expect users to extract the files themselves and take the approach above?

I do think this is an issue we can't really predict, since even if they do have a folder with the correct name, we have no way of telling if the contents of the folder are expected without performing added checks for folder structure/files.

Member Author

even if they do have a folder with the correct name, we have no way of telling if the contents of the folder are expected without performing added checks for folder structure/files.

This is true. I have run into this issue myself. For example, I once set a wrong root directory for the LibriSpeech dataset, and there was an empty LibriSpeech folder under it, so the length of the dataset was 0. We can add some assertion checks, like checking whether the file list has the expected number of audio files.
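
A hedged sketch of that kind of sanity check; the exact condition and message are illustrative only, not part of this PR:

dataset = LibriLightLimited(root, subset="10min")
if len(dataset) == 0:
    # An empty file list usually means `root` points at the wrong place.
    raise RuntimeError(f"No audio files found under {root}; is the root directory correct?")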

For the second case, your workaround is good. If self._path doesn't exist and the archive file is there, we don't want to download the archive again, so we can throw an error like "archive file detected, please extract the existing archive and set download=False". Does that sound reasonable?
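
A sketch of the flow being converged on here, assuming _URL and _CHECKSUM are module-level constants as discussed above; the error wording and ordering are suggestions from this thread, not the merged code:

if not os.path.isdir(self._path):
    if os.path.isfile(archive):
        # A same-named archive already exists locally; do not re-download
        # or blindly extract it. Ask the user to handle it explicitly.
        raise RuntimeError(
            "The archive file exists but is not extracted. "
            "Please extract the archive and set `download=False`."
        )
    if not download:
        raise RuntimeError("Dataset not found. Please use `download=True` to download")
    download_url_to_file(_URL, archive, hash_prefix=_CHECKSUM)
    extract_archive(archive, root)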

(Resolved review threads on torchaudio/datasets/librilight_limited.py)
Comment on lines +76 to +58
_ext_txt = ".trans.txt"
_ext_audio = ".flac"

Contributor

what's the reasoning behind making these values class variables while making _URL and _CHECKSUM module-level variables?

Collaborator

The .flac part in particular is there so that in the unit tests we can mock the dataset with WAV files. In the tests, we do not want to use torchaudio's I/O module, because that would make the test depend not only on the dataset implementation but also on the I/O module.

Now, without our own I/O module, there aren't many tools that provide nice FLAC support. (PySoundFile can, but it requires a separate installation.)

So in the tests, we generate mock data in WAV format and override the audio extension for the duration of the test.
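
A hedged sketch of how a test might use that override; the fixture name and test scaffolding are hypothetical, and the real test suite may patch the attribute differently:

from torchaudio.datasets import librilight_limited

def test_dataset_with_wav_mocks(mocked_root):
    # The mocked assets are generated as WAV, so point the class-level
    # audio extension at them for the duration of the test.
    original_ext = librilight_limited.LibriLightLimited._ext_audio
    librilight_limited.LibriLightLimited._ext_audio = ".wav"
    try:
        dataset = librilight_limited.LibriLightLimited(mocked_root, subset="10min")
        assert len(dataset) > 0
    finally:
        librilight_limited.LibriLightLimited._ext_audio = original_ext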

Collaborator

I do not see why .trans.txt should be a class variable.

Member Author

For .trans.txt, we can make a separate PR to address it for all datasets, such as LibriSpeech and CommonVoice.

(Resolved review threads on torchaudio/datasets/librilight_limited.py)
@facebook-github-bot
Contributor

@nateanl has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
