Hi @BigBoF. Training cartoon voices can be difficult, but I would think this voice is close enough to a normal voice to work. Can I ask, are the voice samples you are using like the "Real.mp4" one above? There are only 2-3 seconds of actual audio in that 10-second clip. Typically for training you will need audio samples that are at least 10 seconds long, and that means 10 seconds of speaking, not 2-3 seconds of speaking and then silence for the remaining 7-8 seconds. You would be best providing something like this https://www.youtube.com/watch?v=23HdsSDZMws (but the French version), downloaded and converted, for Step 1 of finetuning to sample and generate your dataset. I'm only suggesting this as I don't know how you have set up your dataset or what audio you have used.

Additionally, when you are generating actual text-to-speech, you will need a reference sample that is 8-30 seconds of speech for the AI model to have enough to sample from. Again, this should generally be a full audio clip without a lot of silence, like the ones provided at https://github.com/erew123/alltalk_tts/tree/main/voices. Decently long audio sections should be generated at the end of finetuning. That being said, you will never get the exact emphasis of the original character voice.

Finally, a couple of us are working on adding an additional tokenizer, which should improve the quality and output of training, though it's not ready yet, so I guess watch this space. Thanks
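If it helps while checking your dataset, here is a rough sketch of the kind of check I mean (nothing AllTalk-specific, just an illustration using pydub with ffmpeg on the PATH; the folder name is a placeholder):

```python
# Rough check of how much actual speech each training clip contains,
# so clips that are mostly silence can be spotted and trimmed or removed.
# Assumes: pip install pydub, and ffmpeg available on PATH for decoding.
from pathlib import Path
from pydub import AudioSegment
from pydub.silence import detect_nonsilent

DATASET_DIR = Path("my_dataset_wavs")   # placeholder: your folder of clips

for clip in sorted(DATASET_DIR.glob("*.wav")):
    audio = AudioSegment.from_file(clip)
    # Treat anything quieter than 16 dB below the clip's average loudness as silence.
    voiced = detect_nonsilent(audio,
                              min_silence_len=300,              # ms
                              silence_thresh=audio.dBFS - 16)
    speech_ms = sum(end - start for start, end in voiced)
    print(f"{clip.name}: {len(audio) / 1000:.1f}s total, "
          f"{speech_ms / 1000:.1f}s of speech")
```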
Hello,
Sorry for my English, I use DeepL ^^
I've been searching for weeks and trying several finetuning methods to train French voices; I've asked ChatGPT for advice, but nothing works. Maybe someone can explain to me what's wrong?
Most voices work when they sound like a human voice.
When I try to train a high-pitched voice that sounds like a cartoon girl's voice (for example D.Va from Overwatch, or Tracer), the output saturates whenever it emphasizes a sentence, even without an exclamation mark. Occasionally it's inaudible.
I tried several epoch settings (20 to 250), several learning rates, and different batch sizes and grad steps.
Every time, it saturates.
I'm using the Alltalk beta version.
I have an RTX 4070 Ti with 12 GB of VRAM.
I have an i5-13600KF.
If I increase the batch size beyond 4, it doesn't work anymore: VRAM crash (out of memory).
I can either use a batch size of 4 with 2 grad steps, or a batch size of 2 with 8 grad steps.
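If I understand gradient accumulation correctly (this is just my rough understanding, not something AllTalk-specific), those two combinations give different effective batch sizes:

```python
# Effective batch size = batch_size * grad_accumulation_steps (my rough understanding).
print(4 * 2)  # batch size 4, grad steps 2 -> effective batch of 8
print(2 * 8)  # batch size 2, grad steps 8 -> effective batch of 16
```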
The samples are of high studio quality, taken directly from the game.
I can use anywhere from 10 to 40 minutes of samples. Whether at 22050 Hz, mono, 16-bit, the problem remains the same.
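For reference, this is roughly the kind of conversion I mean (a minimal sketch using pydub with ffmpeg on the PATH; the file names are placeholders, not my actual files):

```python
# Convert a sample to 22050 Hz, mono, 16-bit WAV (placeholder file names).
from pydub import AudioSegment

audio = AudioSegment.from_file("sample_original.wav")
audio = (audio.set_frame_rate(22050)  # 22050 Hz
              .set_channels(1)        # mono
              .set_sample_width(2))   # 16-bit = 2 bytes per sample
audio.export("sample_22050_mono_16bit.wav", format="wav")
```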
What can I do?
I'm a beginner in this field, and I can't find anything about this on the Internet.
Maybe I just don't have enough VRAM for these cartoon voices?
Or maybe Coqui can't handle this type of voice?
I should point out that for non-cartoon voices, it works without a hitch even with Alltalk's default settings.
Thank you very much for reading.
Studio quality:
Real.mp4
My trained voice, which has the problem no matter what the settings:
Test.mp4