
Speech recognition data loading and augmentation #724

Closed
flozi00 opened this issue Sep 3, 2021 · 6 comments · Fixed by #726
Labels: enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

@flozi00 (Contributor) commented Sep 3, 2021

🚀 Feature

https://github.com/PyTorchLightning/lightning-flash/blob/4ebc45dd74df50e0a8b8d8e92efb97c6c0b9f6cb/flash/audio/speech_recognition/data.py#L72

At the moment the audio is loaded with soundfile. If it were replaced with librosa, more audio types would be supported and resampling would be done by default.

Motivation

More audio formats, like MP3, would be supported.
Furthermore, data augmentation would be useful for training, especially for low-resource domains or languages.

Pitch

Replace soundfile with librosa and integrate the audiomentations library.
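
A minimal sketch of what the loading could look like (illustrative only; the function name and the 16 kHz target rate are assumptions, not the actual Flash code):

import librosa

def load_audio(path: str, target_sr: int = 16000):
    # librosa.load decodes formats such as MP3 (via audioread/ffmpeg) and
    # resamples to the requested rate by default; soundfile returns the
    # file's native rate and supports fewer formats.
    samples, sample_rate = librosa.load(path, sr=target_sr)
    return samples, sample_rate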

Alternatives

Additional context

I am not familiar with the Lightning Flash API; in particular, I don't know how to add augmentation for training only, when the same method is used for both prediction and training.
I would like to add these two features, if you can give some advice.
In the same step I could look into adding HuBERT, or replacing the hardcoded classes with the AutoModel class, if one exists.

One more question:
What happens when I load a pretrained-only model without a feature extractor? Does Lightning Flash build one automatically?

flozi00 added the enhancement and help wanted labels on Sep 3, 2021
@ethanwharris (Collaborator) commented Sep 3, 2021

Hi @flozi00, thanks for your suggestions! Regarding augmentation support, transforms in Flash are generally represented as a dict which maps hook names to the callable transforms to run on those hooks. Any transform you provide to the SpeechRecognitionData classmethod bound to the pre_tensor_transform hook will receive as input the output of load_sample here: https://github.com/PyTorchLightning/lightning-flash/blob/4ebc45dd74df50e0a8b8d8e92efb97c6c0b9f6cb/flash/audio/speech_recognition/data.py#L75

So your data loading with some train augmentations would look like this:

from flash.core.data.transforms import ApplyToKeys

datamodule = SpeechRecognitionData.from_json(
    ...,
    train_transform={"pre_tensor_transform": ApplyToKeys("input", my_augmentation)},
)

Currently, your augmentation would have to apply to the output of soundfile, but if librosa enables more augmentations then that would be a welcome change 😃 The main other change needed would be to convert back to the required numpy format before applying the processor from HF here: https://github.com/PyTorchLightning/lightning-flash/blob/4ebc45dd74df50e0a8b8d8e92efb97c6c0b9f6cb/flash/audio/speech_recognition/collate.py#L70
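
For illustration, a hedged sketch of that conversion (the helper name is hypothetical; audiomentations already returns numpy arrays, so this only matters if an upstream transform produced tensors):

import numpy as np
import torch

def ensure_numpy(samples):
    # The HF processor expects numpy input; convert back if an
    # augmentation returned a torch.Tensor.
    if isinstance(samples, torch.Tensor):
        samples = samples.detach().cpu().numpy()
    return np.asarray(samples, dtype=np.float32)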

What happens when I load a pretrained-only model without a feature extractor? Does Lightning Flash build one automatically?

Not sure on that one, perhaps @SeanNaren can answer?

Looking forward to your PR 😃

@flozi00 (Contributor, Author) commented Sep 3, 2021

Do you have an example of such a callable?
I would use audiomentations: https://github.com/iver56/audiomentations

but if librosa enables more augmentations then that would be a welcome change

librosa is the same as soundfile but powered by ffmpeg in the background for files like .mp3.
soundfile only supports formats like .wav.

audiomentations, soundfile, and librosa all produce the correct format for the processor.
During the Hugging Face wav2vec sprint I tried a lot of this and everything worked fine.

@ethanwharris (Collaborator) commented Sep 3, 2021

audiomentations, soundfile, and librosa all produce the correct format for the processor.
During the Hugging Face wav2vec sprint I tried a lot of this and everything worked fine.

That's awesome, the switch to librosa should definitely work then; happy to have a PR for it 😃

Do you have an example of such a callable?
I would use audiomentations: https://github.com/iver56/audiomentations

So in that case, in the above example you could have:

from audiomentations import AddGaussianNoise, Compose, PitchShift, Shift, TimeStretch

my_augmentation = Compose([
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.5),
    TimeStretch(min_rate=0.8, max_rate=1.25, p=0.5),
    PitchShift(min_semitones=-4, max_semitones=4, p=0.5),
    Shift(min_fraction=-0.5, max_fraction=0.5, p=0.5),
])
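
As a quick sanity check, a usage sketch with synthetic audio (the 16 kHz sample rate is just an assumption):

import numpy as np

samples = np.random.uniform(low=-0.2, high=0.2, size=(32000,)).astype(np.float32)
augmented = my_augmentation(samples=samples, sample_rate=16000)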

Hope that helps 😃

@flozi00 (Contributor, Author) commented Sep 3, 2021

That callable takes two inputs, the audio data and the sample rate.
Is it possible to pass both arguments to the function, or do I need to build my own wrapper around it?

https://github.com/iver56/audiomentations/blob/8bbd26537545c946d306b099aed46edc7dad727a/audiomentations/core/composition.py#L41-L54

@ethanwharris (Collaborator) commented Sep 3, 2021

Ah, in that case you could drop the ApplyToKeys in the above example and instead make a wrapper like this:

from torch import nn

class Wrapper(nn.Module):
    def __init__(self, transform):
        super().__init__()
        self.transform = transform

    def forward(self, x):
        # Call the augmentation with both the samples and the sample rate
        # taken from the sample dict, then return the updated dict.
        x["input"] = self.transform(samples=x["input"], sample_rate=x["metadata"]["sampling_rate"])
        return x
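
To tie it back to the earlier example, the wrapped transform would then be passed in place of ApplyToKeys (a sketch; the remaining from_json arguments are elided as above):

from audiomentations import AddGaussianNoise, Compose

datamodule = SpeechRecognitionData.from_json(
    ...,
    train_transform={"pre_tensor_transform": Wrapper(Compose([AddGaussianNoise(p=0.5)]))},
)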

@flozi00 (Contributor, Author) commented Sep 3, 2021

Thank you, I will try that tomorrow when the currently running training is done.
