Eren Gölge edited this page Aug 10, 2020 · 7 revisions

TTS is a deep-learning-based text-to-speech solution. It favors simplicity over large, complex models, yet it aims to achieve state-of-the-art results.

Based on a user study, TTS performs on par with or better than other commercial and open-source text-to-speech solutions. It also supports multiple languages, and our community has already applied it to more than 13 different languages.

The TTS system we use comprises two separate deep neural networks. The first network computes acoustic features from the given text input. The second network produces the voice from the computed acoustic features. We call the first model "text2feat" and the second "vocoder".
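The two-stage design can be sketched as a simple function composition. This is only a minimal illustration in plain Python: `dummy_text2feat`, `dummy_vocoder`, `synthesize`, the 80 mel bands, and the hop length of 256 are all hypothetical stand-ins chosen for the example, not the project's actual API or settings.

```python
from typing import Callable, List

# Hypothetical type aliases for illustration only.
MelFrames = List[List[float]]  # one list of mel-band values per frame
Waveform = List[float]         # audio samples


def dummy_text2feat(text: str, n_mels: int = 80) -> MelFrames:
    """Stand-in for a text2feat model: emits one fake frame per character."""
    return [[float(ord(ch) % 7)] * n_mels for ch in text]


def dummy_vocoder(mel: MelFrames, hop_length: int = 256) -> Waveform:
    """Stand-in for a vocoder: emits hop_length samples per frame."""
    wav: Waveform = []
    for frame in mel:
        level = sum(frame) / len(frame) / 7.0  # fake amplitude in [0, 1)
        wav.extend([level] * hop_length)
    return wav


def synthesize(
    text: str,
    text2feat: Callable[[str], MelFrames],
    vocoder: Callable[[MelFrames], Waveform],
) -> Waveform:
    """Stage 1 maps text to acoustic features; stage 2 maps features to audio."""
    mel = text2feat(text)
    return vocoder(mel)


wav = synthesize("hello", dummy_text2feat, dummy_vocoder)
print(len(wav))  # 5 frames * 256 samples per frame = 1280
```

Because the two stages only share the acoustic-feature interface, either one can be swapped independently, which is what makes mixing different text2feat models and vocoders possible.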

Currently, we offer two text2feat model architectures, based on Tacotron and Tacotron2. We also add many other advanced techniques to the original models to improve overall model performance.

The Tacotron-based model is smaller and targets faster training and inference, whereas the Tacotron2-based model is almost 3 times larger but achieves better results by using a neural vocoder (WaveRNN, WaveNet, etc.). Choose the architecture that fits your needs.

We provide several vocoder networks: currently MelGAN, Multi-Band MelGAN, and ParallelWaveGAN. We chose these models because they are much easier to train and faster at inference than their counterparts (WaveNet, WaveRNN, etc.). By combining these vocoders with our text2feat models, you can achieve real-time speech synthesis on both GPU and CPU platforms.
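One common way to check whether a setup is "real-time" is the real-time factor (RTF): synthesis time divided by the duration of the generated audio, where a value below 1 means audio is produced faster than it plays back. The sketch below assumes this definition; the function names and the 22050 Hz sample rate are illustrative choices, not values taken from the project.

```python
def real_time_factor(synth_seconds: float, num_samples: int, sample_rate: int) -> float:
    """Return synthesis time divided by the duration of the generated audio."""
    if num_samples <= 0 or sample_rate <= 0:
        raise ValueError("num_samples and sample_rate must be positive")
    audio_seconds = num_samples / sample_rate
    return synth_seconds / audio_seconds


def is_real_time(synth_seconds: float, num_samples: int, sample_rate: int) -> bool:
    """True when audio is generated faster than it would take to play."""
    return real_time_factor(synth_seconds, num_samples, sample_rate) < 1.0


# Example: 0.5 s of compute for 44100 samples at an assumed 22050 Hz (2 s of audio).
rtf = real_time_factor(0.5, 44100, 22050)
print(rtf)  # 0.25
```

To measure your own setup, time the full text2feat plus vocoder call (for example with `time.perf_counter()`) and pass the elapsed seconds together with the length of the output waveform.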

MelGAN and ParallelWaveGAN trade off quality against speed. MelGAN is almost 5 times faster than ParallelWaveGAN; however, depending on the dataset, ParallelWaveGAN provides better quality. As with the text2feat model, pick the vocoder that matches your requirements. To sum up, Multi-Band MelGAN with Tacotron provides the fastest run-time, and ParallelWaveGAN with Tacotron2 provides the best quality.
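The summary above can be captured as a small lookup. This is only a sketch of the decision rule stated on this page: `ModelPair`, `RECOMMENDED`, and `pick_models` are hypothetical names for illustration, and the strings are plain labels rather than identifiers the library necessarily accepts.

```python
from typing import Dict, NamedTuple


class ModelPair(NamedTuple):
    text2feat: str
    vocoder: str


# Pairings taken from the summary on this page; the keys are illustrative.
RECOMMENDED: Dict[str, ModelPair] = {
    "speed": ModelPair("Tacotron", "Multi-Band MelGAN"),
    "quality": ModelPair("Tacotron2", "ParallelWaveGAN"),
}


def pick_models(priority: str) -> ModelPair:
    """Return the suggested text2feat/vocoder pair for 'speed' or 'quality'."""
    try:
        return RECOMMENDED[priority]
    except KeyError:
        options = ", ".join(sorted(RECOMMENDED))
        raise ValueError(f"unknown priority {priority!r}; choose one of: {options}") from None


pair = pick_models("speed")
print(pair.text2feat, "+", pair.vocoder)  # Tacotron + Multi-Band MelGAN
```

Other combinations are possible as well; the table only encodes the two extremes named here, so treat it as a starting point and evaluate on your own dataset.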
