Would training StyleTTS2 using the new LibriTTS-R dataset help quality? #123
-
Considering that the raw LibriTTS dataset has quite bad audio quality - if you want a more professional-sounding voice, sure. But I do wonder if the dirty original dataset isn't a better benchmark for TTS model robustness. If a model learns to produce sensible speech from those input audio samples, that indicates a certain tolerance for bad input data. And if the goal is to do zero-shot voice cloning later on, it is reasonable to assume that the audio clips people use as references will be far from ideal quality too. Perhaps mixing some dirty original speakers with cleaned speakers could be advantageous. Someone would have to test it.

What I've noticed with various TTS architectures over the years is that if a model can find alignment on challenging datasets, it usually handles unexpected inputs it didn't see during training better later on.

One thing that is noteworthy is the libritts_r_failed_speech_restoration_examples.tar.gz file they offer alongside the cleaned dataset. That list flags bad samples that may have misaligned transcripts or poor cleaning results, so it might be reasonable to exclude those (see the sketch below). Misaligned or incorrect transcripts can mess up a model far worse than poor audio quality. And if the same errors also exist in the uncleaned LibriTTS dataset, that's worth checking for anyone who uses it. You could fine-tune on the cleaned dataset if you're curious and see how that influences things. 🤔
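As a minimal sketch of that exclusion step: this filters a StyleTTS2-style filelist (lines of the form `path.wav|text|speaker`) against the flagged utterances. It assumes you've already reduced the failed-restoration tarball to a plain text file of utterance IDs, one per line; the file names here are made up, not anything shipped with the dataset or the repo.

```python
from pathlib import Path

# Hypothetical inputs/outputs - adjust to your own layout.
FAILED_IDS_FILE = Path("failed_restoration_ids.txt")   # one utterance ID per line
TRAIN_LIST_IN = Path("Data/train_list.txt")            # "path.wav|text|speaker" lines
TRAIN_LIST_OUT = Path("Data/train_list_filtered.txt")

# Set of flagged utterance IDs, e.g. "84_121123_000007_000001".
failed_ids = {
    line.strip()
    for line in FAILED_IDS_FILE.read_text().splitlines()
    if line.strip()
}

kept, dropped = [], 0
for line in TRAIN_LIST_IN.read_text().splitlines():
    wav_path = line.split("|", 1)[0]
    # LibriTTS(-R) file stems double as utterance IDs.
    if Path(wav_path).stem in failed_ids:
        dropped += 1
    else:
        kept.append(line)

TRAIN_LIST_OUT.write_text("\n".join(kept) + "\n")
print(f"kept {len(kept)} lines, dropped {dropped} flagged utterances")
```

Point the training config at the filtered list afterwards; the same filter works for the val list.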
-
I found a model that uses the LibriTTS-R data.
-
https://google.github.io/df-conformer/librittsr/ Any plans to train a model using this dataset?