maskgct zero shot #5
-
I suppose this will end up more as a nitpick over the provided code and literature than my previous F5-TTS evaluation was. Thankfully there is an HF demo page I can use to run some evals against, since the "public demo" page is cumbersome to navigate.
I feel it's a bit weird to make note of, since TTS systems have moved beyond trying to "solve" the problem of needing explicit alignment information; treating TTS as a language-modelling problem solves it with attention mechanisms. I could be wrong if this is still a problem for pure-NAR models, but it seemed like a solved problem for Meta's Voicebox, MS's NaturalSpeech 2, etc. A pure NAR is still rather desirable (I still need to revisit it, since my training/tests with my approach were too inconsistent). Anyways.
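For context on how a pure-NAR model sidesteps explicit alignment: MaskGCT's masked generative decoding (in the MaskGIT family) starts with every position masked and fills them in parallel over a few refinement steps, so no phoneme-to-frame alignment is ever computed. A minimal sketch of that loop, with a mock random scorer standing in for the real transformer (the codebook size, schedule, and scoring here are all illustrative assumptions, not MaskGCT's actual hyperparameters):

```python
import math
import random

MASK = "<mask>"

def mock_model(seq):
    """Stand-in for the NAR transformer: for each masked slot, return a
    (token, confidence) guess. A real model would condition on the text
    and the already-unmasked audio tokens; here we just draw random
    codebook ids (hypothetical scoring, for illustration only)."""
    return {i: (random.randrange(1024), random.random())
            for i, t in enumerate(seq) if t is MASK}

def mask_predict(length, steps=8):
    """MaskGIT-style iterative decoding: all positions start masked and
    are filled in parallel, so no explicit alignment is needed."""
    seq = [MASK] * length
    for step in range(steps):
        guesses = mock_model(seq)
        # cosine schedule: fraction of positions left masked next round
        keep_masked = math.cos(math.pi / 2 * (step + 1) / steps)
        n_unmask = len(guesses) - int(keep_masked * length)
        # commit the most confident guesses; the rest stay masked
        ranked = sorted(guesses.items(), key=lambda kv: kv[1][1], reverse=True)
        for i, (tok, _conf) in ranked[:max(n_unmask, 1)]:
            seq[i] = tok
    return seq
```

On the final step the schedule reaches zero, so every remaining slot gets committed; earlier steps only lock in the high-confidence positions, which is what lets the model revise uncertain regions in parallel instead of attending left-to-right.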
Ick. I guess I can't knock it since, if it works, it works, but models that rely on extracted features always leave a bad taste in my mouth. TorToiSe was at least clever in having its extracted features (the input conditioning latents) live within the respective models themselves, but MaskGCT seems to ask for seven models in its ensemble, which is a bit of a big ask. The paper mentions using its own VQ-VAE and so on. It just seems like unnecessary complexity, but I suppose it's better that they provide their own models than rely on, say, one model to serve every need (EnCodec).
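For what the VQ step in a VQ-VAE actually amounts to: each continuous feature frame is snapped to the nearest codebook entry and represented downstream by that entry's index. A toy sketch of just that lookup (the codebook values here are made up; this is not MaskGCT's actual codec):

```python
def quantize(frame, codebook):
    """Nearest-neighbour lookup: map a continuous feature frame to the
    index of the closest codebook vector, i.e. the discrete token that
    downstream models consume."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(frame, codebook[i]))

# Tiny hypothetical 3-entry codebook of 2-d vectors.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
quantize((0.9, 0.1), codebook)  # -> 1 (closest to (1.0, 0.0))
```

The learned part of a VQ-VAE is the encoder, decoder, and the codebook vectors themselves; the quantization itself is just this argmin, which is why the discrete tokens are cheap to produce once the codec is trained.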
I guess it's standard practice now to use Emilia for datasets over LibriSpeech, and for good reason (as I mentioned before in my Emilia assessment). I suppose this does mean I should probably include the rest of the English corpus and probably some Chinese, although I feel having to account for phonemizing Chinese is a bit of a chore. I think F5 uses pinyin, and MaskGCT seems to do so too. Aside from the code being a little too unstructured and a bit confusing, the output seems... fine.
And I suppose I can't continue evaluating unless I want to painfully set it up on my own system. F5-TTS was a miracle for being lightweight, but after skimming the install notes I don't look forward to setting up MaskGCT myself. From the very small speaker evaluation:
I was a little skeptical given the literature and the code, but the output suggests it's capable. It's fine; an unfortunately milquetoast fine, but that might just be because I haven't had the chance to push it with curveballs to wow me. If I get the free time I'll play with it some more, but my hands are tied right now with falling for the LayerSkip meme.
-
Well, color me impressed then. I don't know if the demo page was there the whole time or was added recently, but it provides a very thorough evaluation.
-
https://github.com/open-mmlab/Amphion/tree/01066a2abe2019a4131c4c6a75f9b6ab1aa1dc83/models/tts/maskgct
Pretty good; sounds better than F5, trained on Emilia.