Thanks for your great work on WhisperSpeech! Extracting semantic tokens from the Whisper encoder and then using them to generate acoustic tokens is very interesting. I am new to TTS and have a few questions about WhisperSpeech; I would really appreciate it if you could answer them.

1) How should the information bottleneck for semantic tokens be designed? The semantic tokens from HuBERT are extracted with a k-means model, while the semantic tokens from the Whisper encoder are extracted with a VQ model. However, the embeddings produced by the k-means or VQ model still contain speaker information, whereas the cluster indices from k-means or VQ keep the semantic information and drop the speaker information, which is why these cluster indices are called semantic tokens. My question is: what exactly acts as the information bottleneck in this process, the VQ itself, or the step that maps the VQ embedding to a single discrete (1-dimensional) token index? (A rough sketch of my understanding of this pipeline is below the questions.)
2) Which semantic tokens are better: those from wav2vec 2.0, HuBERT, w2v-BERT, or the Whisper encoder? For the semantic -> acoustic stage, which semantic tokens lead to higher accuracy when predicting acoustic tokens? For the semantic -> text stage, which semantic tokens lead to a lower WER?
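To make question 1 concrete, here is a minimal sketch (not WhisperSpeech's actual code) of how I understand continuous encoder embeddings are turned into discrete semantic tokens. The names `codebook` and `encoder_features`, the codebook size, and the embedding dimension are all placeholders I made up for illustration. The point I am asking about is whether the bottleneck is this discretization step itself: each frame embedding is replaced by the index of its nearest centroid/codebook entry, so only about log2(K) bits per frame survive.

```python
# Illustrative sketch only: quantizing frame embeddings into discrete semantic tokens.
import torch

K, D = 512, 768                      # assumed codebook size and embedding dimension
codebook = torch.randn(K, D)         # stands in for trained k-means centroids or a VQ codebook

def quantize(encoder_features: torch.Tensor) -> torch.Tensor:
    """Map (T, D) frame embeddings to (T,) integer semantic-token indices."""
    # Distance from every frame to every codebook entry, then take the nearest one.
    dists = torch.cdist(encoder_features, codebook)   # (T, K)
    return dists.argmin(dim=-1)                       # (T,) cluster / codebook indices

# Example: 100 frames of (hypothetical) Whisper-encoder or HuBERT output.
features = torch.randn(100, D)
semantic_tokens = quantize(features)                   # discrete IDs in [0, K)
print(semantic_tokens.shape, semantic_tokens.dtype)    # torch.Size([100]) torch.int64
```

Is this the right mental model, i.e. the speaker information is discarded because only the index (not the quantized embedding) is kept, or does the bottleneck come from somewhere else in the VQ training?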