
Speech recognition data loading and augmentation #724

Closed
flozi00 opened this issue Sep 3, 2021 · 6 comments · Fixed by #726
Labels: enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

@flozi00 (Contributor) commented Sep 3, 2021

🚀 Feature

https://github.com/PyTorchLightning/lightning-flash/blob/4ebc45dd74df50e0a8b8d8e92efb97c6c0b9f6cb/flash/audio/speech_recognition/data.py#L72

At the moment the audio is loaded with soundfile. If it were replaced with librosa, more audio types would be supported and resampling would be done by default.

Motivation

More audio formats, like MP3, would be supported.
Furthermore, data augmentation would be useful for training, especially for low-resource domains or languages.

Pitch

Replace soundfile with librosa and integrate the audiomentations library.
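
A minimal sketch of what the loading could look like (illustrative only; the function name and the 16 kHz target rate are assumptions, not the actual Flash code):

import librosa

def load_audio(path: str, target_sr: int = 16000):
    # librosa.load decodes formats such as MP3 (via audioread/ffmpeg) and
    # resamples to the requested rate by default; soundfile returns the
    # file's native rate and supports fewer formats.
    samples, sample_rate = librosa.load(path, sr=target_sr)
    return samples, sample_rate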

Alternatives

Additional context

I am not familiar with the Lightning Flash API; in particular, I don't know how to add augmentation for training only, when the same method is used for both prediction and training.
I would like to add these two features, if you can give some advice.
In the same step I could look into adding HuBERT, or replacing the hardcoded classes with the AutoModel class, if one exists.

One more question:
What happens when I load a pretrained-only model without a feature extractor? Does Lightning Flash build one automatically?

flozi00 added the enhancement and help wanted labels on Sep 3, 2021
@ethanwharris (Collaborator) commented Sep 3, 2021

Hi @flozi00, thanks for your suggestions! Regarding augmentation support, transforms in Flash are generally represented as a dict which maps hook names to the callable transforms to run on those hooks. Any transform you provide to the SpeechRecognitionData classmethod bound to the pre_tensor_transform hook will receive as input the output of load_sample here: https://github.com/PyTorchLightning/lightning-flash/blob/4ebc45dd74df50e0a8b8d8e92efb97c6c0b9f6cb/flash/audio/speech_recognition/data.py#L75

So your data loading with some train augmentations would look like this:

from flash.core.data.transforms import ApplyToKeys

datamodule = SpeechRecognitionData.from_json(
    ...,
    train_transform={"pre_tensor_transform": ApplyToKeys("input", my_augmentation)},
)

Currently, your augmentation would have to apply to the output of soundfile, but if librosa enables more augmentations then that would be a welcome change 😃 The main other change needed would be to convert back to the required numpy format before applying the processor from HF here: https://github.com/PyTorchLightning/lightning-flash/blob/4ebc45dd74df50e0a8b8d8e92efb97c6c0b9f6cb/flash/audio/speech_recognition/collate.py#L70
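
For illustration, a hedged sketch of that conversion (the helper name is hypothetical; audiomentations already returns numpy arrays, so this only matters if an upstream transform produced tensors):

import numpy as np
import torch

def ensure_numpy(samples):
    # The HF processor expects numpy input; convert back if an
    # augmentation returned a torch.Tensor.
    if isinstance(samples, torch.Tensor):
        samples = samples.detach().cpu().numpy()
    return np.asarray(samples, dtype=np.float32)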

What happens when I load a pretrained-only model without a feature extractor? Does Lightning Flash build one automatically?

Not sure on that one, perhaps @SeanNaren can answer?

Looking forward to your PR 😃

@flozi00 (Contributor, Author) commented Sep 3, 2021

Do you have an example of such a callable?
I would use audiomentations: https://github.com/iver56/audiomentations

but if librosa enables more augmentations then that would be a welcome change

librosa is the same as soundfile but powered by ffmpeg in the background for files like .mp3.
soundfile only supports formats like .wav.

audiomentations, soundfile, and librosa all produce the correct format for the processor.
During the Hugging Face wav2vec sprint I tried a lot of this and everything worked fine.

@ethanwharris (Collaborator) commented Sep 3, 2021

audiomentations, soundfile, and librosa all produce the correct format for the processor.
During the Hugging Face wav2vec sprint I tried a lot of this and everything worked fine.

That's awesome, the switch to librosa should definitely work then; happy to have a PR for it 😃

Do you have an example of such a callable?
I would use audiomentations: https://github.com/iver56/audiomentations

So in that case, in the above example you could have:

from audiomentations import AddGaussianNoise, Compose, PitchShift, Shift, TimeStretch

my_augmentation = Compose([
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.5),
    TimeStretch(min_rate=0.8, max_rate=1.25, p=0.5),
    PitchShift(min_semitones=-4, max_semitones=4, p=0.5),
    Shift(min_fraction=-0.5, max_fraction=0.5, p=0.5),
])
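
As a quick sanity check, a usage sketch with synthetic audio (the 16 kHz sample rate is just an assumption):

import numpy as np

samples = np.random.uniform(low=-0.2, high=0.2, size=(32000,)).astype(np.float32)
augmented = my_augmentation(samples=samples, sample_rate=16000)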

Hope that helps 😃

@flozi00 (Contributor, Author) commented Sep 3, 2021

That callable takes two inputs, the audio data and the sample rate.
Is it possible to pass both arguments to the function, or do I need to build my own wrapper around it?

https://github.com/iver56/audiomentations/blob/8bbd26537545c946d306b099aed46edc7dad727a/audiomentations/core/composition.py#L41-L54

@ethanwharris (Collaborator) commented Sep 3, 2021

Ah, in that case you could drop the ApplyToKeys in the above example and instead make a wrapper like this:

from torch import nn

class Wrapper(nn.Module):
    def __init__(self, transform):
        super().__init__()
        self.transform = transform

    def forward(self, x):
        # Call the augmentation with both the samples and the sample rate
        # taken from the sample dict, then return the updated dict.
        x["input"] = self.transform(samples=x["input"], sample_rate=x["metadata"]["sampling_rate"])
        return x
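
To tie it back to the earlier example, the wrapped transform would then be passed in place of ApplyToKeys (a sketch; the remaining from_json arguments are elided as above):

from audiomentations import AddGaussianNoise, Compose

datamodule = SpeechRecognitionData.from_json(
    ...,
    train_transform={"pre_tensor_transform": Wrapper(Compose([AddGaussianNoise(p=0.5)]))},
)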

@flozi00 (Contributor, Author) commented Sep 3, 2021

Thank you, I will try that tomorrow when the currently running training is done.
