This is a restructured and rewritten version of bshall/UniversalVocoding. The main difference is that the model is exported as a TorchScript module during training, so it can be loaded for inference anywhere without Python dependencies.
Since the pretrained models are exported as TorchScript, you can load a trained model anywhere. You can also generate multiple waveforms in parallel, e.g.
```python
import torch

vocoder = torch.jit.load("vocoder.pt")
mels = [
    torch.randn(100, 80),
    torch.randn(200, 80),
    torch.randn(300, 80),
]  # (length, mel_dim)
with torch.no_grad():
    wavs = vocoder.generate(mels)
```
Empirically, with the default architecture you can generate around 30 waveforms at the same time on a GTX 1080 Ti.
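If you have more utterances than fit in a single batch, you can chunk the list and save the results as you go. Below is a minimal sketch, assuming `generate` returns one 1-D waveform tensor per input mel and that the sample rate matches your preprocessing settings (16 kHz here is an assumption; check your config):

```python
import torch
import torchaudio

vocoder = torch.jit.load("vocoder.pt")
mels = [torch.randn(100 + 10 * i, 80) for i in range(100)]  # dummy inputs

batch_size = 30  # roughly what fits on a GTX 1080 Ti with the default model
sample_rate = 16000  # assumption: use the rate from your preprocessing config

with torch.no_grad():
    for start in range(0, len(mels), batch_size):
        wavs = vocoder.generate(mels[start:start + batch_size])
        for offset, wav in enumerate(wavs):
            # torchaudio.save expects a (channels, samples) tensor
            torchaudio.save(f"wav_{start + offset}.wav",
                            wav.unsqueeze(0).cpu(), sample_rate)
```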
Multiple directories containing audio files can be preprocessed in a single run, e.g.
```bash
python preprocess.py \
    VCTK-Corpus \
    LibriTTS/train-clean-100 \
    preprocessed  # the output directory of preprocessed data
```
Then train the model on the preprocessed data, e.g.
```bash
python train.py preprocessed
```
With the default settings, training to 100K steps takes around 12 hours on an RTX 2080 Ti.