Speech recognition data loading and augmentation #724
Comments
Hi @flozi00, thanks for your suggestions! Regarding augmentation support, transforms in flash are generally represented as a dict which maps hook names to the callable transforms to run on that hook. Any transforms you provide to the So your data loading with some train augmentations would look like this:
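A minimal sketch of the "dict of hook names to callables" idea, in plain Python rather than Flash's actual API. The hook name `pre_tensor_transform` is used for illustration only, and `add_noise` and `run_hook` are hypothetical helpers, not Flash functions. It also shows how a train-only augmentation can be kept out of validation and prediction by giving those stages a different dict:

```python
import random


def add_noise(samples, scale=0.005):
    """Hypothetical augmentation: add small uniform noise to each audio sample."""
    return [x + random.uniform(-scale, scale) for x in samples]


# Transforms as a dict: hook name -> callable to run on that hook.
# "pre_tensor_transform" is an illustrative hook name (assumption).
train_transform = {"pre_tensor_transform": add_noise}
val_transform = {}  # no augmentation outside of training


def run_hook(transforms, hook_name, sample):
    """Hypothetical helper: apply the callable registered for a hook, if any."""
    transform = transforms.get(hook_name)
    return sample if transform is None else transform(sample)


audio = [0.0, 0.1, -0.1, 0.05]
augmented = run_hook(train_transform, "pre_tensor_transform", audio)
unchanged = run_hook(val_transform, "pre_tensor_transform", audio)
```

Because each stage gets its own dict, the same loading code can serve training and prediction while only the training dict carries the augmentation.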
Currently, your augmentation would have to apply to the output of soundfile, but if librosa enables more augmentations then that would be a welcome change 😃 The main other change that would be needed is to convert back to the required numpy format before applying the processor from HF here: https://github.com/PyTorchLightning/lightning-flash/blob/4ebc45dd74df50e0a8b8d8e92efb97c6c0b9f6cb/flash/audio/speech_recognition/collate.py#L70
Not sure on that one, perhaps @SeanNaren can answer? Looking forward to your PR 😃
Do you have an example of such a callable?
librosa is the same as soundfile, but powered by ffmpeg in the background for files like .mp3. audiomentations, soundfile and librosa all have the correct format for the processor.
That's awesome, switching to librosa should definitely work then, happy to have a PR for it 😃
So in that case in the above example you could have:
Hope that helps 😃
That callable takes two inputs, the audio data and the sample rate
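One way to bridge that gap, as a rough sketch: wrap the two-argument callable so it can be used where a single-argument transform is expected. The `samples=` / `sample_rate=` keyword call mirrors how audiomentations-style augmenters are usually invoked; `halve_gain` is a hypothetical stand-in so the example runs without that library, and the fixed sample rate of 16000 is an assumption.

```python
def wrap_two_arg_augmentation(augment, sample_rate):
    """Adapt augment(samples=..., sample_rate=...) into a one-argument transform."""

    def transform(samples):
        return augment(samples=samples, sample_rate=sample_rate)

    return transform


def halve_gain(samples, sample_rate):
    """Hypothetical stand-in for a two-argument augmentation callable."""
    return [0.5 * x for x in samples]


# The sample rate is fixed once, so the resulting transform only needs the audio.
transform = wrap_two_arg_augmentation(halve_gain, sample_rate=16000)
result = transform([0.2, -0.4, 1.0])
```

`functools.partial(augment, sample_rate=16000)` would do the same job when the audio is passed as the `samples` keyword.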
Ah, in that case you could drop the
Thank you, I will try that tomorrow when the currently running training is done.
🚀 Feature
https://github.com/PyTorchLightning/lightning-flash/blob/4ebc45dd74df50e0a8b8d8e92efb97c6c0b9f6cb/flash/audio/speech_recognition/data.py#L72
At the moment the audio is loaded with soundfile. If it were replaced with librosa, more audio types would be supported and resampling would be done by default.
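A rough sketch of what the librosa-based loading could look like, not the actual Flash implementation. `load_audio` is a hypothetical helper name, and the 16 kHz target rate is an assumption (a common rate for speech models). `librosa.load` resamples to the requested `sr` and returns mono float samples, whereas `soundfile.read` returns the data at the file's native sample rate:

```python
def load_audio(path, target_sr=16000):
    """Load an audio file as mono float samples, resampled to target_sr.

    Hypothetical helper; target_sr=16000 is an assumed default.
    """
    # Imported lazily so this module can be imported without librosa installed.
    import librosa

    # librosa.load returns (samples, sample_rate); passing sr=None instead
    # would keep the file's native sample rate.
    samples, sample_rate = librosa.load(path, sr=target_sr)
    return samples, sample_rate
```

Usage would be along the lines of `samples, sr = load_audio("clip.mp3")`, where the file name is a placeholder.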
Motivation
More audio formats like MP3 would be supported.
Furthermore, data augmentation would be useful for training, especially for low-resource domains or languages.
Pitch
Replace soundfile with librosa and integrate the audiomentations library.
Alternatives
Additional context
I am not familiar with the Lightning Flash API. In particular, I don't know how to add augmentation for training only when the same method is used for both prediction and training.
I would like to add these two features if you can give me some advice.
In the same step I could look into adding Hubert, or replacing the hardcoded classes with an automodel class, if one exists.
One more question:
What happens when I load a model that is only pretrained and has no feature extractor? Does LF build it automatically?