Thanks for your great work on WhisperSpeech! Extracting semantic tokens from the Whisper encoder and then using them to generate acoustic tokens is very interesting. I am new to TTS and have a few questions about WhisperSpeech; I would really appreciate it if you could answer them.

1) How should the information bottleneck for semantic tokens be designed? The semantic tokens from HuBERT are extracted with a k-means model, while the semantic tokens from the Whisper encoder are extracted with a VQ model. However, the embeddings produced by the k-means or VQ model still contain speaker information, whereas the cluster indices from k-means or VQ keep the semantic information and drop the speaker information, which is why these cluster indices are called semantic tokens. My question is: what exactly acts as the information bottleneck in this process, the VQ itself, or the step that maps the VQ embedding to a single discrete (1-dimensional) token index? (A rough sketch of my understanding of this pipeline is below the questions.)
2) Which semantic tokens are better: those from wav2vec 2.0, HuBERT, w2v-BERT, or the Whisper encoder? For the semantic -> acoustic stage, which semantic tokens lead to higher accuracy when predicting acoustic tokens? For the semantic -> text stage, which semantic tokens lead to a lower WER?
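To make question 1 concrete, here is a minimal sketch (not WhisperSpeech's actual code) of how I understand continuous encoder embeddings are turned into discrete semantic tokens. The names `codebook` and `encoder_features`, the codebook size, and the embedding dimension are all placeholders I made up for illustration. The point I am asking about is whether the bottleneck is this discretization step itself: each frame embedding is replaced by the index of its nearest centroid/codebook entry, so only about log2(K) bits per frame survive.

```python
# Illustrative sketch only: quantizing frame embeddings into discrete semantic tokens.
import torch

K, D = 512, 768                      # assumed codebook size and embedding dimension
codebook = torch.randn(K, D)         # stands in for trained k-means centroids or a VQ codebook

def quantize(encoder_features: torch.Tensor) -> torch.Tensor:
    """Map (T, D) frame embeddings to (T,) integer semantic-token indices."""
    # Distance from every frame to every codebook entry, then take the nearest one.
    dists = torch.cdist(encoder_features, codebook)   # (T, K)
    return dists.argmin(dim=-1)                       # (T,) cluster / codebook indices

# Example: 100 frames of (hypothetical) Whisper-encoder or HuBERT output.
features = torch.randn(100, D)
semantic_tokens = quantize(features)                   # discrete IDs in [0, K)
print(semantic_tokens.shape, semantic_tokens.dtype)    # torch.Size([100]) torch.int64
```

Is this the right mental model, i.e. the speaker information is discarded because only the index (not the quantized embedding) is kept, or does the bottleneck come from somewhere else in the VQ training?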