-
Notifications
You must be signed in to change notification settings - Fork 657
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add LibriLightLimited dataset #2302
Conversation
b29931f
to
2326856
Compare
root = os.fspath(root) | ||
self._path = os.path.join(root, "librispeech_finetuning") | ||
archive = os.path.join(root, "librispeech_finetuning" + ".tgz") | ||
if download: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you throw a runtime error if the dataset can not be found locally and download = False?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I notice in quesst14 dataset, it checks if the archive file is available, usually users may delete it and only keep the extracted files. Maybe change it to check the _path
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My previous comment doesn't make sense. after checking the code of quesst14,
if not os.path.isdir(self._path):
if not os.path.isfile(archive):
if not download:
raise RuntimeError("Dataset not found. Please use `download=True` to download")
download_url_to_file(URL, archive, hash_prefix=_CHECKSUM)
extract_archive(archive, root)
if the archived file is in user's local disk and self._path
is not here, it will extract the archived files for the user. It's correct when the archived file is the desired one. However, if the archived file is different, but with the same file name by coincidence, we may not get the expected extracted files. what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nateanl hmm ok, so currently this is what happens for quesst14:
- if neither the folder or archive (.tgz) for the dataset exist, it will either throw a download error or correctly download and extract the files, which is the intended behavior
- if the folder doesn't exist but the archive exists, it will extract the archive files and therefore create the folder
for the second case, you're saying the archive file may not be correct and we extract incorrect files. To get around this, we could raise the download error if self._path
does not exist, which means the user would download and extract the files themselves, then pass in the root corresponding to it. should we expect the users to extract the files themselves, and take the approach above?
I do think this is an issue we can't really predict, since even if they do have a folder with the correct name, we have no way of telling if the contents of the folder are expected without performing added checks for folder structure/files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
even if they do have a folder with the correct name, we have no way of telling if the contents of the folder are expected without performing added checks for folder structure/files.
This is true. I sometimes met this issue myself. For example, I set a wrong root directory for LibriSpeech
dataset and there is an empty LibriSpeech
folder under it. Hence the length of the dataset is 0
. We can add some assertion checks, like check if the file list has the same number of audios as expected.
For the second case, your workaround is good. if self._path
doesn't exist and the archive file is there, we don't want to download the archive file again, so we can throw an error "archive file detected, please extract the existing archive and set download=False", which sounds intelligent?
c511aed
to
8da9c1c
Compare
_ext_txt = ".trans.txt" | ||
_ext_audio = ".flac" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the reasoning behind making these values class variables while making _URL
and _CHECKSUM
module-level variables?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The flac part particularly is so that in unittest we can mock the dataset with WAV files. In test, we do not want to use torchaudio's I/O module because it will make the test depend on not only the dataset implementation but also on I/O module.
Now, without our own I/O module, there aren't many tools that provide nice FLAC support. (PySoundFile can, but it also depends on installation)
So in test, we generate mock data with WAV format and overwrite the audio extension for the duration of test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not see why .trans.txt
should be class variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For .trans.txt
we can make a separate PR to address for all datasets, such as LibriSpeech and CommonVoice.
8da9c1c
to
dc3d9ce
Compare
@nateanl has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@nateanl has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@nateanl has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
The
LibriLightLimited
dataset is created for fine-tuning SSL models, such as Wav2Vec2 and HuBERT. It is a supervised subset of Libri-Light dataset. To distinguish the unsupervised subset and the supervised one, it's clearer to put it in a separate dataset class for fine-tuning purpose.It contains "10 min", "1 hour", "10 hour" splits.