Train other models in the pipeline #3
Might be relevant: https://github.com/yuan1615/AdaVocoder
DIFFUSION TRAINING PROGRESS
Did the new configs and changes improve the diffusion model training?
What I did was try to train the diffusion model on top of a fairly broken GPT fine-tune... which was evidently a bad idea; I couldn't tell whether it was significantly better or not. I vaguely think "it works", but honestly I should figure out how to enable the FID eval metrics first.
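(For reference, an FID-style eval comes down to a Fréchet distance between the embedding distributions of real and generated audio. A minimal sketch of that statistic, assuming some audio feature extractor produces the embeddings — this is not this repo's eval code:)

```python
# Minimal sketch of the Frechet distance between two embedding sets, the
# statistic behind FID-style eval metrics. The feature extractor producing
# `real_feats` / `gen_feats` is assumed and not part of this repo.
import numpy as np
from scipy import linalg

def frechet_distance(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """real_feats, gen_feats: (num_samples, feature_dim) arrays."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    # Matrix square root of the covariance product; drop tiny imaginary noise.
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```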
Hi, is this still ongoing? I was trying to train the diffusion model from the template yaml( |
Nope, this entire repo + project is dead (I got poached). XTTS seems at least marginally better; I'd just ask around Coqui about how to train stuff.
Apart from the GPT model (which has been implemented), there are 4 other models in TorToiSe that could be fine-tuned:
IMO, the diffusion model + vocoder are obvious targets. Vocoders are often fine-tuned in other TTS pipelines, and the diffusion model serves roughly the same purpose...
...but, the diffusion model is the only other model that takes the conditioning latents into account. I suspect that fine-tuning both the autoregressive & diffuser models on a single speaker would lead to a kind of 'mode collapse' (bear with this inaccurate phrasing), where the conditioning latents fail to affect the output speech substantially. Ideally, some form of mixed speaker training would account for this, but I'm not sure how to accomplish that yet.
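One rough way to do "mixed speaker training" (purely a sketch of the idea, not code from this repo; `target_ds`, `multispeaker_ds`, and the mix ratio are placeholders) would be to keep a fraction of every batch drawn from a generic multi-speaker pool, so the conditioning latents still have to carry speaker information:

```python
# Sketch of mixed-speaker sampling: most samples come from the target
# speaker, but a slice of each batch is drawn from a multi-speaker pool.
import random
from torch.utils.data import Dataset

class MixedSpeakerDataset(Dataset):
    def __init__(self, target_ds: Dataset, multispeaker_ds: Dataset, target_ratio: float = 0.7):
        self.target_ds = target_ds              # single-speaker fine-tune set
        self.multispeaker_ds = multispeaker_ds  # e.g. a LibriTTS-style pool
        self.target_ratio = target_ratio

    def __len__(self) -> int:
        return len(self.target_ds) + len(self.multispeaker_ds)

    def __getitem__(self, idx: int):
        # Ignore idx and sample stochastically, so the ratio holds per batch
        # in expectation; a deterministic interleaving would also work.
        if random.random() < self.target_ratio:
            return self.target_ds[random.randrange(len(self.target_ds))]
        return self.multispeaker_ds[random.randrange(len(self.multispeaker_ds))]
```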
Training the VQVAE could be good for datasets that are emotional and substantially different from the usual LJSpeech+LibriTTS+CommonVoice+VoxPopuli+... pile of monotonic speech. But I think it would necessitate parallel training of the GPT model + the CLVP model as well, to account for the change in the tokens the VQVAE outputs.
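(The dependency is just that the GPT and CLVP consume discrete mel codes produced by the VQVAE's codebook lookup, so a new codebook changes what each token id means. A toy stand-in for that lookup, not TorToiSe's actual DVAE:)

```python
# Toy nearest-neighbour quantizer: the GPT is trained on these integer code
# ids, so retraining the VQVAE (new codebook) redefines the token vocabulary.
import torch

def quantize(mel_frames: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """mel_frames: (T, D) continuous features; codebook: (K, D) learned codes.
    Returns (T,) integer token ids."""
    dists = torch.cdist(mel_frames, codebook)  # (T, K) pairwise distances
    return dists.argmin(dim=-1)
```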
I also think that keeping the CLVP model untrained (i.e., frozen) could be a good idea, to retain the power of the conditioning latents. Fine-tuning it on a single voice would adjust it to see that specific speaker as more likely than other speakers.
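(Concretely, "keeping it untrained" could just mean freezing its parameters during a fine-tune so only the chosen models receive gradients. A minimal sketch with placeholder model names, not this repo's training code:)

```python
# Freeze the CLVP so a fine-tune only updates the models you want to adapt.
import torch

def freeze(module: torch.nn.Module) -> None:
    """Stop gradients and switch to eval mode so the module stays fixed."""
    module.eval()
    for p in module.parameters():
        p.requires_grad = False

def trainable_params(*modules: torch.nn.Module):
    """Yield only the parameters that should be updated by the optimizer."""
    for m in modules:
        yield from (p for p in m.parameters() if p.requires_grad)

# Usage (models assumed loaded elsewhere):
#   freeze(clvp_model)
#   optimizer = torch.optim.AdamW(trainable_params(gpt_model, diffusion_model), lr=1e-5)
```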