Would training StyleTTS2 using the new LibriTTS-R dataset help quality? #123
-
Considering that the raw LibriTTS dataset has quite bad audio quality - if you want a more professional-sounding voice, sure. But I do wonder if the dirty original dataset isn't a better benchmark for TTS model robustness. If a model learns to produce sensible speech from those input audio samples, that indicates a certain tolerance for bad input data. And if the goal is to do zero-shot voice cloning later on, it is reasonable to assume that the audio clips people use as references will be far from ideal quality too. Perhaps mixing some dirty original speakers with cleaned speakers could be advantageous. Someone would have to test it.

What I've noticed with various TTS architectures over the years is that if a model can find alignment on challenging datasets, it usually handles unexpected inputs it didn't see during training better later on.

One thing that is noteworthy is the libritts_r_failed_speech_restoration_examples.tar.gz file they offer alongside the cleaned dataset. That list flags bad samples that may have misaligned transcripts or poor cleaning results, so it might be reasonable to exclude those (see the sketch below). Misaligned or incorrect transcripts can mess up a model far worse than poor audio quality. And if the same errors also exist in the uncleaned LibriTTS dataset, that's worth checking for anyone who uses it. You could fine-tune on the cleaned dataset if you're curious and see how that influences things. 🤔
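As a minimal sketch of that exclusion step: this filters a StyleTTS2-style filelist (lines of the form `path.wav|text|speaker`) against the flagged utterances. It assumes you've already reduced the failed-restoration tarball to a plain text file of utterance IDs, one per line; the file names here are made up, not anything shipped with the dataset or the repo.

```python
from pathlib import Path

# Hypothetical inputs/outputs - adjust to your own layout.
FAILED_IDS_FILE = Path("failed_restoration_ids.txt")   # one utterance ID per line
TRAIN_LIST_IN = Path("Data/train_list.txt")            # "path.wav|text|speaker" lines
TRAIN_LIST_OUT = Path("Data/train_list_filtered.txt")

# Set of flagged utterance IDs, e.g. "84_121123_000007_000001".
failed_ids = {
    line.strip()
    for line in FAILED_IDS_FILE.read_text().splitlines()
    if line.strip()
}

kept, dropped = [], 0
for line in TRAIN_LIST_IN.read_text().splitlines():
    wav_path = line.split("|", 1)[0]
    # LibriTTS(-R) file stems double as utterance IDs.
    if Path(wav_path).stem in failed_ids:
        dropped += 1
    else:
        kept.append(line)

TRAIN_LIST_OUT.write_text("\n".join(kept) + "\n")
print(f"kept {len(kept)} lines, dropped {dropped} flagged utterances")
```

Point the training config at the filtered list afterwards; the same filter works for the val list.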
-
I found a model that uses the LibriTTS-R data.
-
https://google.github.io/df-conformer/librittsr/ Any plans to train a model using this dataset?