
Synthesizing own text without style transfer gives poor audio results #120

Open
ocesp98 opened this issue Feb 7, 2023 · 1 comment

ocesp98 commented Feb 7, 2023

When trying to synthesize my own text using the pretrained Mellotron and WaveGlow models, I get poor audio quality (a very croaky voice).
I use the inference method so that no style transfer is performed; however, I am also not sure what to pass in as input_style and f0s.
The following code simply synthesizes with speaker ID 0 of the pretrained model. Is it normal that the audio quality is relatively poor? My end goal is to fine-tune this model on a speech dataset in another language with two speakers.

import torch
import IPython.display as ipd
from text import text_to_sequence  # text module from the Mellotron repo

# mellotron, waveglow, denoiser, hparams and arpabet_dict are assumed to be
# loaded beforehand, as in the repo's inference.ipynb

text = "This is an example sentence."
text_encoded = torch.LongTensor(text_to_sequence(text, hparams.text_cleaners, arpabet_dict))[None, :].cuda()

# no reference mel (style_input = 0) and an all-zero pitch contour
f0 = torch.zeros([1, 1, 32]).cuda()
speaker_id = torch.LongTensor([0]).cuda()

with torch.no_grad():
    mel_outputs, mel_outputs_postnet, gate_outputs, alignments = mellotron.inference(
        (text_encoded, 0, speaker_id, f0))

# vocode the predicted mel with WaveGlow and lightly denoise the result
with torch.no_grad():
    audio = denoiser(waveglow.infer(mel_outputs_postnet, sigma=0.7), 0.01)[:, 0]
ipd.Audio(audio[0].data.cpu().numpy(), rate=hparams.sampling_rate)
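
Regarding the f0s input: below is a minimal sketch of how a pitch contour could be built from a reference recording instead of passing zeros. The file path, pitch range, and the use of librosa.pyin are illustrative assumptions only; Mellotron's own data loader computes f0 with its bundled YIN code, so this is just a stand-in to show the expected [1, 1, T] shape.

import numpy as np
import torch
import librosa

# hypothetical reference recording; any speech clip at the model's
# sampling rate (assumed 22050 Hz here) would do
ref_wav_path = "reference.wav"
y, sr = librosa.load(ref_wav_path, sr=22050)

# frame-level pitch with librosa's pYIN; frame/hop lengths chosen to
# roughly match Mellotron's default STFT settings (1024 / 256) so the
# contour lines up approximately with mel frames
f0_hz, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=80.0, fmax=880.0, sr=sr, frame_length=1024, hop_length=256)

# unvoiced frames come back as NaN; replace them with 0
f0_hz = np.nan_to_num(f0_hz, nan=0.0)

# shape [1, 1, T], matching the f0 tensor used in the snippet above
f0 = torch.from_numpy(f0_hz).float()[None, None, :].cuda()

The resulting tensor could then be passed as the last element of the tuple given to mellotron.inference in place of the zero tensor above.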


mepc36 commented Mar 2, 2023

Just upvoting to say I had the same problem, so that's +1 for the "this might be normal" vote.
