StyleTTS2 Training from Scratch Notebooks #144
-
Thank you for this! Can't wait to try it. How large a corpus did you need to train from scratch (minutes/hours of audio), and how long did it take to get a production-quality result (number of epochs? hours? and on what hardware?)
-
I thought I'd shed some light on the progress so far. It's been a bit slow, especially since I was preparing all of the WAV files and training/validation text by hand to minimize the error rate. I currently have around 10 hours of audio ready in 8,534 WAV files, ranging from 1 to 22 seconds each. In the end, I decided to give it a try with what I have, since the weeks of cutting and syncing WAVs to text were getting really tedious. I used VAST.ai for GPU rental and got the first stage trained up to 200 epochs in about 3.5 hours. The hardware and settings I used for this training were as follows:
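As a side note on the by-hand list preparation mentioned above: here is a minimal sketch (my own illustration, not taken from the Notebooks) of how the pipe-separated file|text|speaker lists that StyleTTS2 reads from Data/ can be generated once the clips and transcripts exist. The transcripts.txt layout is my assumption, and the text column is expected to be phonemized already.

```python
# Minimal sketch: filter prepared clips to the 1-22 s range and write
# Data/train_list.txt and Data/val_list.txt in StyleTTS2's
# "filename|phonemized text|speaker" format. Adjust paths to your layout.
import random
from pathlib import Path

import soundfile as sf

WAV_DIR = Path("Data/wavs")  # assumed location of the prepared clips

# assumed format: one "filename<TAB>phonemized text" pair per line
transcripts = dict(
    line.rstrip("\n").split("\t", 1)
    for line in open("transcripts.txt", encoding="utf-8")
)

rows = []
for wav in sorted(WAV_DIR.glob("*.wav")):
    duration = sf.info(str(wav)).duration  # clip length in seconds
    if 1.0 <= duration <= 22.0 and wav.name in transcripts:
        rows.append(f"{wav.name}|{transcripts[wav.name]}|0")  # single speaker

random.seed(0)
random.shuffle(rows)
split = int(len(rows) * 0.95)  # hold out ~5% for validation
Path("Data/train_list.txt").write_text("\n".join(rows[:split]) + "\n", encoding="utf-8")
Path("Data/val_list.txt").write_text("\n".join(rows[split:]) + "\n", encoding="utf-8")
```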
The second stage is still in progress, currently at epoch 67. It's taking considerably longer because it can only run on a single GPU (due to bug #7). I've therefore been running it every day for about 12-18 hours, which has brought the budget up to almost $200 by now. I decided to run it only when I have control over the process, since I didn't know how many epochs I'd need, nor what settings the A100's memory could handle. In the end, I used the following parameters for training:
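For anyone tuning the same knobs: the memory-relevant settings live in Configs/config.yml (batch_size, max_len and epochs_2nd, as of the repo version I used). Below is a hedged sketch of trimming them before a single-GPU run; the values are placeholders, not my actual parameters.

```python
# Hedged sketch: lower batch_size / max_len in the StyleTTS2 config so the
# second stage fits into a single A100's memory. Values are placeholders.
import yaml

CONFIG = "Configs/config.yml"

with open(CONFIG) as f:
    cfg = yaml.safe_load(f)

cfg["batch_size"] = 8    # per-GPU batch; the main memory lever
cfg["max_len"] = 400     # max frames kept per sample in a batch
cfg["epochs_2nd"] = 100  # second-stage epoch budget

with open(CONFIG, "w") as f:
    yaml.safe_dump(cfg, f)
```

With the config saved, the second stage is then started on the single GPU with `python train_second.py --config_path ./Configs/config.yml`, as per the repo's README.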
Example of how inference sounds at epoch 65, compared to the original TTS voice: https://jmp.sh/aWMQe69G
-
I couldn't increase the batch size per A100 beyond 8 when training. How did you manage to train with such a large batch size?
-
Hello @martinambrus, nice to meet you. I have just seen your old posts here.
-
I'm currently learning how to train a custom StyleTTS2 model from scratch.
I'm very new to this and thanks to this amazing project and its community, I've already gained a considerable amount of knowledge. Here, I'd like to share that knowledge with you.
To that end, I created 2 Jupyter Notebooks that I use for my own model training and audio sample preparation.
I'd like to stress that these are only my own methods and findings, and there is probably a better way to do things, especially in the audio preparation part. But since I tried - and failed - to find a reliable automated method there, some manual fine-tuning steps are still required to perfect the audio input.
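For completeness, here is the kind of automated splitting one might try before falling back to manual work - a minimal silence-based sketch using pydub. The input filename and all thresholds are illustrative and need tuning per recording; in my case the results still required the manual clean-up described above.

```python
# Minimal sketch: split a long recording on silence and keep only clips of
# a trainable length. All thresholds are illustrative guesses.
from pathlib import Path

from pydub import AudioSegment
from pydub.silence import split_on_silence

audio = AudioSegment.from_file("chapter01.wav")  # hypothetical input file

chunks = split_on_silence(
    audio,
    min_silence_len=600,             # ms of silence that ends a clip
    silence_thresh=audio.dBFS - 16,  # threshold relative to overall loudness
    keep_silence=200,                # ms of padding kept on each side
)

out_dir = Path("clips")
out_dir.mkdir(exist_ok=True)
for i, chunk in enumerate(chunks):
    if 1_000 <= len(chunk) <= 22_000:  # len() is in milliseconds
        chunk.export(out_dir / f"clip_{i:05d}.wav", format="wav")
```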
My 2 Notebooks:
These notebooks can be used on Google Colab but also outside of it, on a dedicated cloud machine running Jupyter.
Perhaps I should clarify that I'm still in the process of learning and, as such, have yet to create a production-ready model. See my last update below for more information on production-ready model training and results.
I previously used these Notebooks to create a low-quality proof-of-concept model from ~150 WAV files of 1-2.5 seconds each. At present, I use them to finalize my production-ready model training.
To that end, I used 2 TensorDock Cloud GPU Machines:
for the 1st training phase, a beefy one (costing approx. $3.38/hr) with:
I've done 1000 epochs, of which only the first 400 were needed - at least from what I can see in the validation loss (around 0.38-0.4; see the sketch after this list). Those 400 epochs were done very quickly (in less than 1 or 2 hours, if I recall correctly) on this small data set.
for the 2nd training phase, I used a similar machine but with only a single GPU, since there is still a bug in the code that prevents DDP (accelerate) training in this phase. The cost went down to approx. $0.65/hr, and it took about 2-4 hours to finish 100 epochs on this small data set.
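Since the usable-epoch count above came from watching the validation loss flatten out, here is a small hedged helper (not from the Notebooks) for spotting that plateau programmatically, assuming you've collected (epoch, val_loss) pairs from the training logs in whatever way suits you.

```python
# Hedged helper: given (epoch, val_loss) pairs gathered from the training
# logs, return the last epoch that still improved validation loss by more
# than min_delta, so later epochs can be skipped to save GPU budget.
def plateau_epoch(history, min_delta=0.005, patience=20):
    best_loss, best_epoch, stale = float("inf"), None, 0
    for epoch, loss in history:
        if loss < best_loss - min_delta:
            best_loss, best_epoch, stale = loss, epoch, 0
        else:
            stale += 1
            if stale >= patience:  # no real gain for `patience` epochs
                break
    return best_epoch
```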
I'm currently working towards the creation of a large high-quality corpus, spanning approx. 45 hours of audio.
The source for this corpus is 4 audiobooks read by another high-quality (now decommissioned) TTS voice for which I had a commercial license.
My goal is to try to train a similar-sounding voice model using StyleTTS2.
Here is an example of how the original TTS voice sounds: https://jmp.sh/zOrGoel3
I should also mention that I already used the set of those ~150 WAV files to fine-tune StyleTTS2 to the new voice, and even with as little data as those files provide, I was able to achieve very good voice-transfer quality.
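For anyone who wants to reproduce that fine-tuning step: the entry point documented in the StyleTTS2 README is train_finetune.py with its own config file. I launch it from Python below only to keep these examples in one language; running the command in a terminal works just the same.

```python
# Fine-tuning entry point as documented in the StyleTTS2 README, launched
# via subprocess; point Configs/config_ft.yml at your own train/val lists.
import subprocess

subprocess.run(
    ["python", "train_finetune.py", "--config_path", "./Configs/config_ft.yml"],
    check=True,  # raise if training exits with an error
)
```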
I hope these Notebooks will help someone to automate their training, too :)