
Why is the speaker embedding g used to condition the Posterior Encoder and the Decoder? #88

Open
st-vincent1 opened this issue Jan 19, 2024 · 0 comments


I am confused why the speaker embedding g is used to condition multiple model components (Posterior Encoder, Decoder, Flow) as opposed to just Flow.

From the model diagram in Fig. 1 (a) (Training procedure), the speaker embedding g is used to condition the normalising Flow. This makes sense: at inference time, this information is used in the reversed Flow to map the z' distribution into a speaker-informed z, which was modelled on the real data x_lin by the Posterior Encoder.

To me this seems like enough supervision, and I am confused why g is used in other places too:

  • in the Posterior Encoder, which uses x_lin as input, g is also supplied - but it shouldn't be needed, as x_lin already contains the speaker information! (And g is not mentioned in section 2.2.2 of the paper where this encoder is discussed.)
  • in the Decoder, similarly, z is already informed by the speaker embedding, so why do we need to supply g explicitly here? (A sketch of all three conditioning paths follows below.)
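
For reference, here is a minimal sketch of where g is plumbed in a VITS-style model. The module names (emb_g, enc_q, flow, dec) follow the reference implementation, but the bodies below are simplified stand-ins (plain 1x1 convolutions with g concatenated, and a separate module standing in for the reversed flow), not the actual code; the point is only to show the three conditioning paths in question and the reverse-flow path used at inference.

```python
# Minimal sketch (simplified stand-ins, not the repo's code) of where the
# speaker embedding g enters a VITS-style model.
import torch
import torch.nn as nn


class TinyVitsLike(nn.Module):
    def __init__(self, spec_channels=513, inter_channels=192,
                 gin_channels=256, n_speakers=10):
        super().__init__()
        self.emb_g = nn.Embedding(n_speakers, gin_channels)
        # Stand-ins for the real modules; here each consumes g via concatenation,
        # whereas the real modules add a projected g inside their layers.
        self.enc_q = nn.Conv1d(spec_channels + gin_channels, inter_channels, 1)      # Posterior Encoder
        self.flow = nn.Conv1d(inter_channels + gin_channels, inter_channels, 1)      # Normalising Flow
        self.flow_rev = nn.Conv1d(inter_channels + gin_channels, inter_channels, 1)  # stand-in for the reversed Flow
        self.dec = nn.Conv1d(inter_channels + gin_channels, 1, 1)                    # Decoder / vocoder

    def _g(self, sid, T):
        # [batch, gin_channels, T] speaker conditioning, broadcast over time.
        return self.emb_g(sid).unsqueeze(-1).expand(-1, -1, T)

    def forward(self, x_lin, sid):
        """Training path: g conditions the Posterior Encoder, the Flow and the Decoder."""
        g = self._g(sid, x_lin.size(-1))
        z = self.enc_q(torch.cat([x_lin, g], dim=1))  # (1) Posterior Encoder <- g
        z_p = self.flow(torch.cat([z, g], dim=1))     # (2) Flow <- g
        o = self.dec(torch.cat([z, g], dim=1))        # (3) Decoder <- g
        return o, z, z_p

    def infer(self, z_p, sid):
        """Inference path: the reversed Flow turns a prior sample z_p into a speaker-informed z."""
        g = self._g(sid, z_p.size(-1))
        z = self.flow_rev(torch.cat([z_p, g], dim=1))  # reversed Flow <- g
        return self.dec(torch.cat([z, g], dim=1))      # Decoder <- g


model = TinyVitsLike()
wav, z, z_p = model(torch.randn(2, 513, 40), torch.tensor([0, 3]))     # training forward
wav_gen = model.infer(torch.randn(2, 192, 40), torch.tensor([0, 3]))   # inference
```

The question above is about paths (1) and (3): whether conditioning the Posterior Encoder and the Decoder on g is redundant, given that z is already speaker-informed.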