Questions about the soundstorm paper #11

olup · 2023-06-13T19:11:06Z

If I understand correctly, they apply the MaskGIT technique to audio. MaskGIT predicts tokens in parallel over multiple passes. This works for images because the number of tokens is fixed in advance. How does SoundStorm handle this? Conceptually, for a sequence-to-sequence multimodal transformer, it is hard to get away from the autoregressive approach. Do they arbitrarily choose an audio output length? If so, what happens when the text prompt is too long or too short (in the TTS case)?
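To make the concern concrete, here is a minimal sketch of MaskGIT-style iterative parallel decoding as I understand it. It is not SoundStorm's actual code: `toy_model`, `MASK`, `parallel_decode`, and the cosine schedule details are assumptions for illustration, and the "model" just returns random guesses. The point is only that `length` has to be fixed before the first pass.

```python
import math
import random

MASK = -1  # hypothetical sentinel for a position that is not decided yet


def toy_model(tokens, vocab_size, rng):
    """Stand-in for a bidirectional transformer (an assumption, not a real model).

    Returns one (token, confidence) guess per position.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def parallel_decode(length, vocab_size=8, num_passes=4, seed=0):
    """MaskGIT-style decoding sketch: `length` is chosen before decoding starts."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(1, num_passes + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        guesses = toy_model(tokens, vocab_size, rng)
        # Cosine schedule: fraction of positions left masked after this pass.
        frac_masked = math.cos(math.pi / 2 * step / num_passes)
        n_still_masked = int(length * frac_masked) if step < num_passes else 0
        n_commit = max(1, len(masked) - n_still_masked)
        # Commit the most confident guesses among the masked positions.
        masked.sort(key=lambda i: guesses[i][1], reverse=True)
        for i in masked[:n_commit]:
            tokens[i] = guesses[i][0]
    return tokens


decoded = parallel_decode(length=10)
print(len(decoded), MASK in decoded)
```

Every pass works on the same fixed-size array, so nothing in this loop can lengthen or shorten the output. That is why I am asking where the length comes from in the audio case.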
