How to Make Finetuning Dataset #69
Comments
The one thing that would be of real utility value is to give users the option to provide non-phonemized text. Splitting the phonemization out into its own utility function, instead of repeating it verbatim in the first four lines of every inference function definition, would also make sense. It could then be called from anywhere, including during auto-dataset generation. You can't blindly trust Whisper's transcripts though. I've run a bunch of larger datasets through it, and it makes enough insanely stupid mistakes at times (even with the large model) that you will negatively impact your training results if you don't fix it up by hand, with careful listening.
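A minimal sketch of what such a shared helper could look like, using the phonemizer package; the backend settings here are an assumption based on the discussion below, and the function name is illustrative:

```python
import phonemizer

# Create the espeak backend once, instead of rebuilding it inside every
# inference function. Settings assumed: en-us, punctuation and stress kept.
_backend = phonemizer.backend.EspeakBackend(
    language='en-us', preserve_punctuation=True, with_stress=True
)

def phonemize_text(text: str) -> str:
    """Return the espeak phonemization of a raw (non-phonemized) string."""
    return _backend.phonemize([text])[0].strip()
```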
@Kreevoz One can use https://github.com/jaywalnut310/vits/blob/main/preprocess.py to generate the phonemes.
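For context, that script maps each transcript through a text cleaner; the default, english_cleaners2, does roughly the following (paraphrased, so check vits/text/cleaners.py for the exact code, which also expands abbreviations and collapses whitespace):

```python
from unidecode import unidecode
from phonemizer import phonemize

def english_cleaners2(text):
    # vits ASCII-folds and lowercases before phonemizing, which already
    # differs from pipelines that pass raw text straight to espeak.
    text = unidecode(text).lower()
    return phonemize(
        text,
        language='en-us',
        backend='espeak',
        strip=True,
        preserve_punctuation=True,
        with_stress=True,
    )
```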
Working on a pipeline to make it easy to build a compatible dataset: https://github.com/devidw/dswav. It's a Gradio UI that lets you transcribe an input audio file, split it into samples based on detected sentences, and build the files required for training. As @Kreevoz noted, Whisper is a source of potential issues if the output isn't carefully checked. Splitting at sentences also seems not ideal, since there will sometimes be artifacts at the end of the chunked audio samples; some sort of splitting based on silence would probably be the better approach.
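A silence-based splitter along those lines could be sketched with pydub; the thresholds below are placeholder values to tune per dataset:

```python
from pydub import AudioSegment
from pydub.silence import split_on_silence

audio = AudioSegment.from_file("input.wav")

# Split wherever there is at least ~300 ms of audio quieter than -40 dBFS,
# keeping a little silence padding so samples don't end abruptly.
chunks = split_on_silence(
    audio,
    min_silence_len=300,   # ms of silence that counts as a boundary
    silence_thresh=-40,    # dBFS threshold for "silence"
    keep_silence=100,      # ms of padding kept on each side
)

for i, chunk in enumerate(chunks):
    chunk.export(f"sample_{i:04d}.wav", format="wav")
```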
Thanks for sharing your tool! Would you mind adding a license to it?
Sure, added @fakerybakery
Hey @yl4579, thanks for sharing this. I'm trying to replicate it in order to build custom fine-tuning datasets; however, when I use the shared script, the output looks different from the training data shared in this repo. For example, when I look up the source text of one of the entries in the training data and pipe it into the vits scripts, I get this (-: repo, +: mine):

```diff
- ðə bˈæŋk hɐdbɪŋ kəndˈʌktᵻd ˌɔn fˈɔls pɹˈɪnsɪpəlz ;
+ ðə bˈæŋk hɐdbɪn kəndˈʌktᵻd ˌɑːn fˈɑːls pɹˈɪnsɪpəlz;
```

Using default arguments, the difference is even larger:

```diff
- ðə bˈæŋk hɐdbɪŋ kəndˈʌktᵻd ˌɔn fˈɔls pɹˈɪnsɪpəlz ;
+ ðə bæŋk hɐdbɪn kəndʌktᵻd ɑːn fɑːls pɹɪnsɪpəlz
```

Any idea why the output might look different from the one in the repo? I guess it's quite important that we exactly match the formatting you used when we fine-tune on the shared checkpoints, right?
I can chime in on that. StyleTTS2 does next to no text cleaning before the input texts are sent to the phonemizer; the phonemizer strips out most of the junk by itself (which is generally not what you want, because you have little control over it). This is what you'd need to phonemize correctly, taken from StyleTTS2:
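Presumably the snippet being referred to is the phonemizer setup from StyleTTS2's inference code, along these lines:

```python
import phonemizer

# espeak-based phonemizer as set up in StyleTTS2's inference code:
# keep punctuation, keep stress marks.
global_phonemizer = phonemizer.backend.EspeakBackend(
    language='en-us', preserve_punctuation=True, with_stress=True
)
```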
You then run the text through that phonemizer. As for having to match the exact same formatting: that is not necessary. As long as you phonemize it in the same way, you can preprocess the text differently. I've run a couple of finetunes over the past day and replaced the text cleaning with my own, more aggressive implementation, which adjusts punctuation to a format that the phonemizer likes and preserves. It only matters that you have the exact same text cleaners at inference time too. The finetuned model will adapt to the new style of punctuation and formatting. That way it will also shed the habit of ignoring punctuation that the current pretrained models exhibit; that's because the datasets they were trained on don't have pauses in the audio where the punctuation indicates there should be. That behavior can be finetuned out of the model to increase controllability.
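Purely as an illustration of that idea (not Kreevoz's actual code), an aggressive cleaner could normalize punctuation before phonemization like this:

```python
import re

def aggressive_clean(text: str) -> str:
    # Illustrative only: map punctuation the phonemizer handles poorly
    # onto forms it reliably preserves.
    text = text.replace('\u2014', ', ')           # em dash to comma pause
    text = text.replace('\u2026', '.')            # ellipsis char to period
    text = re.sub(r'[\u201c\u201d"]', '', text)   # drop quotation marks
    text = re.sub(r'\s+([,.;:!?])', r'\1', text)  # no space before punctuation
    text = re.sub(r'\s+', ' ', text)              # collapse whitespace
    return text.strip()
```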
Thanks a ton @Kreevoz 🙌 I just ran the input through the StyleTTS2 phonemize function, and it looks close: https://gist.github.com/devidw/1bb5cd4d9d524218db22d6b0b10b6712. There is a minor difference in two phonemes though, and no idea where that might be coming from. Not sure if it could be due to different versions/OS builds of espeak (I tested on two different systems). However, if the inference code produces the same output as in the custom dataset when using the same function, this should not be an issue, I guess.
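To pinpoint exactly which phonemes differ between two environments, a stdlib-only comparison like this works on the strings quoted earlier in the thread:

```python
import difflib

repo = "ðə bˈæŋk hɐdbɪŋ kəndˈʌktᵻd ˌɔn fˈɔls pɹˈɪnsɪpəlz ;"
mine = "ðə bˈæŋk hɐdbɪn kəndˈʌktᵻd ˌɑːn fˈɑːls pɹˈɪnsɪpəlz;"

# ndiff marks tokens present on only one side, so each differing
# phoneme shows up as a -/+ pair.
for token in difflib.ndiff(repo.split(), mine.split()):
    if token.startswith(('-', '+')):
        print(token)
```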
Original post:
Hi, for the finetuning dataset, should we use Whisper -> Phonemizer to make it from a list of audio files?
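A minimal sketch of that pipeline, assuming the openai-whisper and phonemizer packages; the file name and the train-list format are illustrative, and per the comments above the transcripts still need manual review:

```python
import whisper
import phonemizer

backend = phonemizer.backend.EspeakBackend(
    language='en-us', preserve_punctuation=True, with_stress=True
)

# Even the large model makes mistakes; proofread transcripts before training.
model = whisper.load_model("large")
result = model.transcribe("sample_0001.wav")
text = result["text"].strip()

phonemes = backend.phonemize([text])[0].strip()

# One hypothetical line of a train-list file: path|phonemes
print(f"sample_0001.wav|{phonemes}")
```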