Add LibriLightLimited dataset #2302

Closed
wants to merge 11 commits
Conversation

nateanl
Member

@nateanl nateanl commented Mar 31, 2022

The LibriLightLimited dataset is created for fine-tuning SSL models such as Wav2Vec2 and HuBERT. It is the supervised subset of the Libri-Light dataset. To distinguish the unsupervised subset from the supervised one, it is clearer to put the latter in a separate dataset class dedicated to fine-tuning.
It contains the "10 min", "1 hour", and "10 hour" splits.
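
A minimal usage sketch, assuming the constructor follows the other torchaudio datasets (root, subset, download) and that each item mirrors the LIBRISPEECH tuple; the subset names and return format here are assumptions, not necessarily the final API:

from torchaudio.datasets import LibriLightLimited

# Assumed subset names: "10min", "1h", "10h".
dataset = LibriLightLimited("data/", subset="10h", download=True)

# Assumed to mirror LIBRISPEECH:
# (waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id)
waveform, sample_rate, transcript, *_ = dataset[0]
print(sample_rate, transcript)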

@nateanl changed the title from "Add LibriSpeechFineTune dataset" to "Add LibriLightLimited dataset" on Apr 22, 2022
@nateanl nateanl marked this pull request as ready for review April 22, 2022 13:27
root = os.fspath(root)
self._path = os.path.join(root, "librispeech_finetuning")
archive = os.path.join(root, "librispeech_finetuning" + ".tgz")
if download:

Contributor

Could you throw a runtime error if the dataset cannot be found locally and download=False?

Member Author

Thanks. I noticed that the quesst14 dataset checks whether the archive file is available; users often delete the archive and keep only the extracted files. Maybe change it to check _path instead?

Member Author
@nateanl May 11, 2022

My previous comment doesn't make sense. After checking the code of quesst14:

# If the extracted directory is missing, fall back to the archive.
if not os.path.isdir(self._path):
    if not os.path.isfile(archive):
        if not download:
            raise RuntimeError("Dataset not found. Please use `download=True` to download")
        download_url_to_file(URL, archive, hash_prefix=_CHECKSUM)
    # Extract whatever archive is present, whether downloaded or pre-existing.
    extract_archive(archive, root)

If the archive file is on the user's local disk and self._path is not there, it will extract the archive for the user. That is correct when the archive is the desired one. However, if the archive is a different file that happens to have the same name, we may not get the expected extracted files. What do you think?
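
One way the risk of a mismatched local archive could be reduced (an illustration under assumptions, not something this PR implements) is to verify the existing archive against the published checksum before extracting it:

import hashlib

def archive_matches_checksum(path, hash_prefix):
    # Illustration only: reject a same-named but different archive by
    # comparing its SHA-256 digest against the known prefix, the same
    # value passed to download_url_to_file as hash_prefix.
    sha = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            sha.update(chunk)
    return sha.hexdigest().startswith(hash_prefix)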

Contributor

@nateanl hmm ok, so currently this is what happens for quesst14:

  • if neither the folder nor the archive (.tgz) for the dataset exists, it will either throw a download error or correctly download and extract the files, which is the intended behavior
  • if the folder doesn't exist but the archive exists, it will extract the archive files and therefore create the folder

For the second case, you're saying the archive file may not be correct, so we would extract incorrect files. To get around this, we could raise the download error whenever self._path does not exist, which means the user would download and extract the files themselves and then pass in the corresponding root. Should we expect users to extract the files themselves and take the approach above?

I do think this is an issue we can't really predict, since even if they do have a folder with the correct name, we have no way of telling if the contents of the folder are expected without performing added checks for folder structure/files.

Member Author

even if they do have a folder with the correct name, we have no way of telling if the contents of the folder are expected without performing added checks for folder structure/files.

This is true. I have run into this issue myself. For example, I once set a wrong root directory for the LibriSpeech dataset, and there was an empty LibriSpeech folder under it, so the length of the dataset was 0. We can add some assertion checks, like checking whether the file list has the expected number of audio files.
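
A hedged sketch of that kind of sanity check; the exact condition and message are illustrative only, not part of this PR:

dataset = LibriLightLimited(root, subset="10min")
if len(dataset) == 0:
    # An empty file list usually means `root` points at the wrong place.
    raise RuntimeError(f"No audio files found under {root}; is the root directory correct?")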

For the second case, your workaround is good. If self._path doesn't exist and the archive file is there, we don't want to download the archive again, so we can throw an error like "archive file detected, please extract the existing archive and set download=False". Does that sound reasonable?
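
A sketch of the flow being converged on here, assuming _URL and _CHECKSUM are module-level constants as discussed above; the error wording and ordering are suggestions from this thread, not the merged code:

if not os.path.isdir(self._path):
    if os.path.isfile(archive):
        # A same-named archive already exists locally; do not re-download
        # or blindly extract it. Ask the user to handle it explicitly.
        raise RuntimeError(
            "The archive file exists but is not extracted. "
            "Please extract the archive and set `download=False`."
        )
    if not download:
        raise RuntimeError("Dataset not found. Please use `download=True` to download")
    download_url_to_file(_URL, archive, hash_prefix=_CHECKSUM)
    extract_archive(archive, root)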

(Resolved review threads on torchaudio/datasets/librilight_limited.py)
Comment on lines +76 to +58
_ext_txt = ".trans.txt"
_ext_audio = ".flac"

Contributor

what's the reasoning behind making these values class variables while making _URL and _CHECKSUM module-level variables?

Collaborator

The .flac part in particular is there so that in the unit tests we can mock the dataset with WAV files. In the tests, we do not want to use torchaudio's I/O module, because that would make the test depend not only on the dataset implementation but also on the I/O module.

Now, without our own I/O module, there aren't many tools that provide nice FLAC support. (PySoundFile can, but it requires a separate installation.)

So in the tests, we generate mock data in WAV format and override the audio extension for the duration of the test.
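
A hedged sketch of how a test might use that override; the fixture name and test scaffolding are hypothetical, and the real test suite may patch the attribute differently:

from torchaudio.datasets import librilight_limited

def test_dataset_with_wav_mocks(mocked_root):
    # The mocked assets are generated as WAV, so point the class-level
    # audio extension at them for the duration of the test.
    original_ext = librilight_limited.LibriLightLimited._ext_audio
    librilight_limited.LibriLightLimited._ext_audio = ".wav"
    try:
        dataset = librilight_limited.LibriLightLimited(mocked_root, subset="10min")
        assert len(dataset) > 0
    finally:
        librilight_limited.LibriLightLimited._ext_audio = original_ext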

Collaborator

I do not see why .trans.txt should be a class variable.

Member Author

For .trans.txt, we can make a separate PR to address it for all datasets, such as LibriSpeech and CommonVoice.

(Resolved review threads on torchaudio/datasets/librilight_limited.py)
@facebook-github-bot
Contributor

@nateanl has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
