Speech-to-Speech (Zero-Shot Voice Conversion) #82
Replies: 3 comments 3 replies
-
I think what you mean is voice conversion. I have done this for StyleTTS and it should work for StyleTTS 2 as well. See https://github.com/yl4579/StyleTTS-VC. The idea is to align the text with the input mel-spectrograms, then use the aligned phonemes, the F0 and energy of the input, and a different speaker embedding to reconstruct the speech. The alignment and text can be replaced with some text encoder, as shown in StyleTTS-VC.
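The reconstruction step described above can be sketched roughly as follows. This is a minimal illustration, not the StyleTTS-VC code: `VCDecoder` and its dimensions are hypothetical stand-ins for the actual acoustic decoder, and the inputs are random tensors in place of real aligned phoneme features, prosody curves, and a target-speaker style embedding.

```python
import torch
import torch.nn as nn

class VCDecoder(nn.Module):
    """Hypothetical decoder sketch: reconstruct a mel-spectrogram from
    frame-aligned phoneme features, source F0/energy, and a *target*
    speaker style embedding (the core StyleTTS-VC recipe, simplified)."""
    def __init__(self, phoneme_dim=192, style_dim=128, n_mels=80):
        super().__init__()
        # +2 for the scalar F0 and energy values appended per frame
        self.proj = nn.Linear(phoneme_dim + 2 + style_dim, 256)
        self.out = nn.Linear(256, n_mels)

    def forward(self, aligned_phonemes, f0, energy, style):
        # aligned_phonemes: (B, T, phoneme_dim) -- phonemes aligned to input frames
        # f0, energy:       (B, T)              -- prosody from the source utterance
        # style:            (B, style_dim)      -- embedding of the target speaker
        T = aligned_phonemes.size(1)
        style = style.unsqueeze(1).expand(-1, T, -1)  # broadcast style over time
        x = torch.cat(
            [aligned_phonemes, f0.unsqueeze(-1), energy.unsqueeze(-1), style],
            dim=-1,
        )
        return self.out(torch.relu(self.proj(x)))  # (B, T, n_mels)

# Dummy forward pass: 2 utterances, 120 frames each
dec = VCDecoder()
mel = dec(
    torch.randn(2, 120, 192),  # aligned phoneme features
    torch.randn(2, 120),       # F0
    torch.randn(2, 120),       # energy
    torch.randn(2, 128),       # target speaker style
)
```

Because content and prosody come from the source while the style embedding comes from the target speaker, swapping the style input at inference is what performs the conversion.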
-
I think this is interesting though, so if someone wants to apply the idea of StyleTTS-VC to StyleTTS 2 and use encoders like https://github.com/auspicious3000/contentvec to better disentangle the speaker information, it'd be greatly appreciated. Unfortunately I don't have the time to do this right now.
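One crude way to illustrate what "disentangling speaker information" means here: even before a learned encoder like ContentVec, per-utterance instance normalization strips speaker-dependent statistics (mean/variance per feature channel) from a feature sequence. This is only a sketch of the general idea, not ContentVec's actual method, and the function name is mine.

```python
import torch

def remove_speaker_stats(features: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Per-utterance instance normalization over time.

    features: (B, T, D) content features for a batch of utterances.
    Subtracting each utterance's per-channel mean and dividing by its
    std removes first/second-order speaker statistics, a crude form of
    the speaker disentanglement that ContentVec learns properly.
    """
    mean = features.mean(dim=1, keepdim=True)  # (B, 1, D)
    std = features.std(dim=1, keepdim=True)    # (B, 1, D)
    return (features - mean) / (std + eps)

# Dummy batch: 2 utterances, 120 frames, 192-dim features
normed = remove_speaker_stats(torch.randn(2, 120, 192))
```

In the full StyleTTS-VC-on-StyleTTS-2 idea, such speaker-stripped content features would replace the aligned-phoneme branch, while the target speaker's style embedding supplies the voice identity.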
-
The basic idea would be to train an acoustic model using
-
Is it possible to implement speech-to-speech in a way somewhat similar to this?
Here's an image of how they do it from the website:
![image](https://private-user-images.githubusercontent.com/76186054/285067611-f9317b2e-d457-4469-99b6-9af0f496231f.png)
Thanks!