Hi @BigBoF. Training cartoon voices can be difficult, but I would think this voice is close enough to a normal voice to work. Can I ask, are the voice samples you are using like the "Real.mp4" one above? There are only 2-3 seconds of actual audio in that 10-second clip. Typically for training you will need audio samples that are at least 10 seconds long, and that means 10 seconds of speaking, not 2-3 seconds of speaking and then silence for the remaining 7-8 seconds. You would be best providing something like this https://www.youtube.com/watch?v=23HdsSDZMws (but the French version), downloaded and converted, for Step 1 of finetuning to sample and generate your dataset. I'm only suggesting this as I don't know how you have set up your dataset or what audio you have used.

Additionally, when you are generating actual text-to-speech, you will need a reference sample that is 8-30 seconds of speech for the AI model to have enough to sample from. Again, this should generally be a full audio clip without a lot of silence, like the ones provided at https://github.com/erew123/alltalk_tts/tree/main/voices. Decently long audio sections should be generated at the end of finetuning. That being said, you will never get the exact emphasis of the original character voice.

Finally, a couple of us are working on adding an additional tokenizer, which should improve the quality and output of training, though it's not ready yet, so I guess watch this space. Thanks
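If it helps while checking your dataset, here is a rough sketch of the kind of check I mean (nothing AllTalk-specific, just an illustration using pydub with ffmpeg on the PATH; the folder name is a placeholder):

```python
# Rough check of how much actual speech each training clip contains,
# so clips that are mostly silence can be spotted and trimmed or removed.
# Assumes: pip install pydub, and ffmpeg available on PATH for decoding.
from pathlib import Path
from pydub import AudioSegment
from pydub.silence import detect_nonsilent

DATASET_DIR = Path("my_dataset_wavs")   # placeholder: your folder of clips

for clip in sorted(DATASET_DIR.glob("*.wav")):
    audio = AudioSegment.from_file(clip)
    # Treat anything quieter than 16 dB below the clip's average loudness as silence.
    voiced = detect_nonsilent(audio,
                              min_silence_len=300,              # ms
                              silence_thresh=audio.dBFS - 16)
    speech_ms = sum(end - start for start, end in voiced)
    print(f"{clip.name}: {len(audio) / 1000:.1f}s total, "
          f"{speech_ms / 1000:.1f}s of speech")
```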
Hello,
Sorry for my English, I use DeepL ^^
I've been searching for weeks and trying several finetuning methods to train French voices; I've asked ChatGPT for advice, but nothing works. Maybe someone can explain to me what's wrong?
Most voices work when they sound like a human voice.
When I try to train a high-pitched voice that sounds like a cartoon girl's voice (for example D.Va from Overwatch, or Tracer), the output saturates whenever it emphasizes a sentence, even without an exclamation mark. Occasionally it's inaudible.
I tried several epoch settings (20 to 250), several learning rates, and different batch sizes and grad steps.
Every time, it saturates.
I'm using the Alltalk beta version.
I have an RTX 4070 Ti with 12 GB of VRAM.
I have an i5-13600KF.
If I increase the batch size beyond 4, it doesn't work anymore: VRAM crash (out of memory).
I can either use a batch size of 4 with 2 grad steps, or a batch size of 2 with 8 grad steps.
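If I understand gradient accumulation correctly (this is just my rough understanding, not something AllTalk-specific), those two combinations give different effective batch sizes:

```python
# Effective batch size = batch_size * grad_accumulation_steps (my rough understanding).
print(4 * 2)  # batch size 4, grad steps 2 -> effective batch of 8
print(2 * 8)  # batch size 2, grad steps 8 -> effective batch of 16
```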
The samples are of high studio quality, taken directly from the game.
I can use anywhere from 10 to 40 minutes of samples. Whether at 22050 Hz, mono, 16-bit, the problem remains the same.
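For reference, this is roughly the kind of conversion I mean (a minimal sketch using pydub with ffmpeg on the PATH; the file names are placeholders, not my actual files):

```python
# Convert a sample to 22050 Hz, mono, 16-bit WAV (placeholder file names).
from pydub import AudioSegment

audio = AudioSegment.from_file("sample_original.wav")
audio = (audio.set_frame_rate(22050)  # 22050 Hz
              .set_channels(1)        # mono
              .set_sample_width(2))   # 16-bit = 2 bytes per sample
audio.export("sample_22050_mono_16bit.wav", format="wav")
```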
What can I do?
I'm a beginner in this field, and I can't find anything about this on the Internet.
Maybe I just don't have enough VRAM for these cartoon voices?
Or maybe Coqui can't handle this type of voice?
I should point out that for non-cartoon voices, it works without a hitch even with Alltalk's default settings.
Thank you very much for reading.
Studio quality:
Real.mp4
My trained voice, which has the problem no matter what the settings:
Test.mp4