How to Make Finetuning Dataset #92

fakerybakery · 2023-11-23T02:21:55Z

fakerybakery
Nov 23, 2023

Hi, for the finetuning dataset, should we use Whisper -> Phonemizer to make it from a list of audio files?

Kreevoz · 2023-11-23T03:04:23Z

Kreevoz
Nov 23, 2023

The one thing that would be of a real utility value is to give users the option to provide non-phonemized audio.wav|transcript in normal english list files, and then handle the phonemization (and maybe caching) for them so it matches the requirements of StyleTTS2 exactly.

Splitting out the phonemization into its own utility function and not repeating it verbatim in the first 4 lines of every inference function definition also would make sense. Then it could be called from anywhere including at auto-dataset generation.

You can't blindly trust whisper's transcripts though. I've run a bunch of larger datasets over it and it makes enough insanely stupid mistakes at times (even with the large model) that you will negatively impact your training results if you don't fix it up by hand and with careful listening.
You definitely want to double-check the punctuation it generates also and terminate sentences properly.

0 replies

yl4579 · 2023-11-23T20:10:04Z

yl4579
Nov 23, 2023
Maintainer

@Kreevoz One can use https://github.com/jaywalnut310/vits/blob/main/preprocess.py to generate the phonemes.

0 replies

devidw · 2023-11-23T21:02:14Z

devidw
Nov 23, 2023

working on a pipeline to easily allow building a compatible dataset, https://github.com/devidw/dswav

its a gradio ui that allows you to transcribe an input audio and have it be split into samples based on detected sentences and also builds required files for training

as @Kreevoz noted, whisper is a source of potential issues, if not carefully checking

also splitting at sentences seems not ideal, since sometimes there will be artifacts at the end in the chucked audio samples, doing some sort of splitting based on silence would prob be the better approach

0 replies

fakerybakery · 2023-11-23T21:07:16Z

fakerybakery
Nov 23, 2023
Author

Thanks for sharing your tool! Would you mind adding a license to it?

0 replies

devidw · 2023-11-23T21:12:05Z

devidw
Nov 23, 2023

sure, added @fakerybakery

0 replies

devidw · 2023-11-25T13:43:03Z

devidw
Nov 25, 2023

One can use jaywalnut310/vits@main/preprocess.py to generate the phonemes.

Hey @yl4579, thx sharing this. I'm trying to replicate this in order to build custom fine-tune datasets, however, when I use the shared script the output looks different from the training data shared in this repo.

For example, for LJ015-0030, in Data/train_list.txt its:

ðə bˈæŋk hɐdbɪŋ kəndˈʌktᵻd ˌɔn fˈɔls pɹˈɪnsɪpəlz ;

While when I look up the source text

The bank had been conducted on false principles;

And pipe it into the vits scripts, I get this:

- ðə bˈæŋk hɐdbɪŋ kəndˈʌktᵻd ˌɔn fˈɔls pɹˈɪnsɪpəlz ;
+ ðə bˈæŋk hɐdbɪn kəndˈʌktᵻd ˌɑːn fˈɑːls pɹˈɪnsɪpəlz;

Using default arguments, in this case english_cleaners2, english_cleaners produces:

- ðə bˈæŋk hɐdbɪŋ kəndˈʌktᵻd ˌɔn fˈɔls pɹˈɪnsɪpəlz ;
+ ðə bæŋk hɐdbɪn kəndʌktᵻd ɑːn fɑːls pɹɪnsɪpəlz

Any idea why the output might look different to the one in the repo?

I guess it's quite important that we exactly match the formatting like you used, when we fine-tune on the shared checkpoints, right?

0 replies

Kreevoz · 2023-11-25T15:01:06Z

Kreevoz
Nov 25, 2023

I can chime in on that. You need to modify the english_cleaners functions if you want to use them as they are. Remove the phonemization step from them and instead split that out and re-use the code from StyleTTS2.

StyleTTS2 does next to no text cleaning before the input texts are sent to the phonemizer. The phonemizer strips out most of the junk by itself (that's generally not what you want because you have little control over that).

This is what you'd need to phonemize correctly, taken from StyleTTS2:

from nltk.tokenize import word_tokenize

import phonemizer
global_phonemizer = phonemizer.backend.EspeakBackend(language='en-us', preserve_punctuation=True,  with_stress=True)

def text_to_phonemes(text):
  text = text.strip()
  ps = global_phonemizer.phonemize([text])
  ps = word_tokenize(ps[0])
  ps = ' '.join(ps)
  return ps

You then run the english_cleaners function on your input text first, followed by the text_to_phonemes. The result will be identical phonemization to the texts you see in the demo files that ship with StyleTTS2 (though you may find different punctuation if you keep the example cleaners from the other repo. If you want a 100% match, send the text directly to the text_to_phonemes step).

As for having to match the exact same formatting, that is not necessary. As long as you phonemize it in the same way you can preprocess the text differently. I've run a couple of finetunes over the past day and replaced the text cleaning with my own more aggressive implementation which adjusts punctuation to a format that the phonemizer likes and preserves.

It only matters that you have the exact same text cleaners at inference time too. The finetuned model will adapt to the new style of punctuation and formatting. That way it will also shed the habit of ignoring punctuation that the current pretrained models exhibit. That's because the datasets they were trained on don't have pauses in the audio when there is punctuation to indicate there should be. That behavior can be finetuned out of the model again to increase controllability.

0 replies

devidw · 2023-11-25T15:28:21Z

devidw
Nov 25, 2023

Thanks a ton @Kreevoz 🙌

I just run the input through the StyleTTS2 phonemes function, and it looks close.

https://gist.github.com/devidw/1bb5cd4d9d524218db22d6b0b10b6712

There is a minor difference in 2 phonemes tho I think, no idea where that might be coming from. Not sure if this could be something due to different version/os builds of espeak

Tested on

speak text-to-speech: 1.48.03 04.Mar.14 Data at: /opt/homebrew/Cellar/espeak/1.48.04_1/share/espeak-data

and

eSpeak text-to-speech: 1.48.03 04.Mar.14 Data at: /usr/lib/x86_64-linux-gnu/espeak-data

However, if the inference code is producing the same as in the custom dataset with the same function this should not be an issue I guess.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to Make Finetuning Dataset #92

{{title}}

Replies: 8 comments

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

How to Make Finetuning Dataset #92

fakerybakery Nov 23, 2023

Replies: 8 comments

Kreevoz Nov 23, 2023

yl4579 Nov 23, 2023 Maintainer

devidw Nov 23, 2023

fakerybakery Nov 23, 2023 Author

devidw Nov 23, 2023

devidw Nov 25, 2023

Kreevoz Nov 25, 2023

devidw Nov 25, 2023

fakerybakery
Nov 23, 2023

Kreevoz
Nov 23, 2023

yl4579
Nov 23, 2023
Maintainer

devidw
Nov 23, 2023

fakerybakery
Nov 23, 2023
Author

devidw
Nov 23, 2023

devidw
Nov 25, 2023

Kreevoz
Nov 25, 2023

devidw
Nov 25, 2023