maskgct zero shot #5
-
I suppose this will end up more as a nitpick over the provided code and literature than my previous F5-TTS evaluation was. Thankfully there is an HF demo page I can use to run some evals against, since the "public demo" page is cumbersome to navigate.
I feel it's a bit weird to make note of, since TTS systems have moved beyond trying to "solve" the problem of needing explicit alignment information; treating TTS as a language-modelling problem solves it with attention mechanisms. I could be wrong if this is still a problem for pure-NAR models, but it seemed like a solved problem for Meta's Voicebox, MS's NaturalSpeech 2, etc. A pure NAR is still rather desirable (I still need to revisit it, since my training/tests with my approach were too inconsistent). Anyways.
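For context on how a pure-NAR model sidesteps explicit alignment: MaskGCT's masked generative decoding (in the MaskGIT family) starts with every position masked and fills them in parallel over a few refinement steps, so no phoneme-to-frame alignment is ever computed. A minimal sketch of that loop, with a mock random scorer standing in for the real transformer (the codebook size, schedule, and scoring here are all illustrative assumptions, not MaskGCT's actual hyperparameters):

```python
import math
import random

MASK = "<mask>"

def mock_model(seq):
    """Stand-in for the NAR transformer: for each masked slot, return a
    (token, confidence) guess. A real model would condition on the text
    and the already-unmasked audio tokens; here we just draw random
    codebook ids (hypothetical scoring, for illustration only)."""
    return {i: (random.randrange(1024), random.random())
            for i, t in enumerate(seq) if t is MASK}

def mask_predict(length, steps=8):
    """MaskGIT-style iterative decoding: all positions start masked and
    are filled in parallel, so no explicit alignment is needed."""
    seq = [MASK] * length
    for step in range(steps):
        guesses = mock_model(seq)
        # cosine schedule: fraction of positions left masked next round
        keep_masked = math.cos(math.pi / 2 * (step + 1) / steps)
        n_unmask = len(guesses) - int(keep_masked * length)
        # commit the most confident guesses; the rest stay masked
        ranked = sorted(guesses.items(), key=lambda kv: kv[1][1], reverse=True)
        for i, (tok, _conf) in ranked[:max(n_unmask, 1)]:
            seq[i] = tok
    return seq
```

On the final step the schedule reaches zero, so every remaining slot gets committed; earlier steps only lock in the high-confidence positions, which is what lets the model revise uncertain regions in parallel instead of attending left-to-right.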
Ick. I guess I can't knock it since, if it works, it works, but models that rely on extracted features always leave a bad taste in my mouth. TorToiSe was at least clever in having its extracted features (the input conditioning latents) live within the respective models themselves, but MaskGCT seems to ask for seven models in its ensemble, which is a bit of a big ask. The paper mentions using its own VQ-VAE and so on. It just seems like unnecessary complexity, but I suppose it's better that they provide their own models than rely on, say, one model to serve every need (EnCodec).
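For what the VQ step in a VQ-VAE actually amounts to: each continuous feature frame is snapped to the nearest codebook entry and represented downstream by that entry's index. A toy sketch of just that lookup (the codebook values here are made up; this is not MaskGCT's actual codec):

```python
def quantize(frame, codebook):
    """Nearest-neighbour lookup: map a continuous feature frame to the
    index of the closest codebook vector, i.e. the discrete token that
    downstream models consume."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(frame, codebook[i]))

# Tiny hypothetical 3-entry codebook of 2-d vectors.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
quantize((0.9, 0.1), codebook)  # -> 1 (closest to (1.0, 0.0))
```

The learned part of a VQ-VAE is the encoder, decoder, and the codebook vectors themselves; the quantization itself is just this argmin, which is why the discrete tokens are cheap to produce once the codec is trained.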
I guess it's standard practice now to use Emilia for datasets over LibriSpeech, and for good reason (as I mentioned before in my Emilia assessment). I suppose this does mean I should probably include the rest of the English corpus and probably some Chinese, although I feel having to account for phonemizing Chinese is a bit of a chore. I think F5 uses pinyin, and MaskGCT seems to do so too. Aside from the code being a little too unstructured and a bit confusing, the output seems... fine.
And I suppose I can't continue evaluating unless I want to painfully set it up on my own system. F5-TTS was a miracle for being lightweight, but after skimming the install notes I don't look forward to setting up MaskGCT myself. From the very small speaker evaluation:
I was a little skeptical given the literature and the code, but the output suggests it's capable. It's fine; an unfortunately milquetoast fine, but that might just be because I haven't had the chance to push it with curveballs to wow me. If I get the free time I'll play with it some more, but my hands are tied right now with falling for the LayerSkip meme.
-
Well, color me impressed then. I don't know if the demo page was there the whole time or was added recently, but it provides a very thorough evaluation.
-
https://github.com/open-mmlab/Amphion/tree/01066a2abe2019a4131c4c6a75f9b6ab1aa1dc83/models/tts/maskgct
Pretty good; sounds better than F5, trained on Emilia.