
Synthesizing own text without style transfer gives poor audio results #120

Open
ocesp98 opened this issue Feb 7, 2023 · 1 comment

ocesp98 commented Feb 7, 2023

When trying to synthesize my own text using the pretrained Mellotron and WaveGlow models, I get poor audio quality (a very croaky voice).
I use the inference method so that no style transfer is performed; however, I am also not sure what to pass in as input_style and f0s.
The following code simply synthesizes with speaker ID 0 of the pretrained model. Is it normal that the audio quality is relatively poor? My end goal is to fine-tune this model on a speech dataset in another language with two speakers.

import torch
import IPython.display as ipd
from text import text_to_sequence  # text module from the Mellotron repo

# mellotron, waveglow, denoiser, hparams and arpabet_dict are assumed to be
# loaded beforehand, as in the repo's inference.ipynb

text = "This is an example sentence."
text_encoded = torch.LongTensor(text_to_sequence(text, hparams.text_cleaners, arpabet_dict))[None, :].cuda()

# no reference mel (style_input = 0) and an all-zero pitch contour
f0 = torch.zeros([1, 1, 32]).cuda()
speaker_id = torch.LongTensor([0]).cuda()

with torch.no_grad():
    mel_outputs, mel_outputs_postnet, gate_outputs, alignments = mellotron.inference(
        (text_encoded, 0, speaker_id, f0))

# vocode the predicted mel with WaveGlow and lightly denoise the result
with torch.no_grad():
    audio = denoiser(waveglow.infer(mel_outputs_postnet, sigma=0.7), 0.01)[:, 0]
ipd.Audio(audio[0].data.cpu().numpy(), rate=hparams.sampling_rate)
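
Regarding the f0s input: below is a minimal sketch of how a pitch contour could be built from a reference recording instead of passing zeros. The file path, pitch range, and the use of librosa.pyin are illustrative assumptions only; Mellotron's own data loader computes f0 with its bundled YIN code, so this is just a stand-in to show the expected [1, 1, T] shape.

import numpy as np
import torch
import librosa

# hypothetical reference recording; any speech clip at the model's
# sampling rate (assumed 22050 Hz here) would do
ref_wav_path = "reference.wav"
y, sr = librosa.load(ref_wav_path, sr=22050)

# frame-level pitch with librosa's pYIN; frame/hop lengths chosen to
# roughly match Mellotron's default STFT settings (1024 / 256) so the
# contour lines up approximately with mel frames
f0_hz, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=80.0, fmax=880.0, sr=sr, frame_length=1024, hop_length=256)

# unvoiced frames come back as NaN; replace them with 0
f0_hz = np.nan_to_num(f0_hz, nan=0.0)

# shape [1, 1, T], matching the f0 tensor used in the snippet above
f0 = torch.from_numpy(f0_hz).float()[None, None, :].cuda()

The resulting tensor could then be passed as the last element of the tuple given to mellotron.inference in place of the zero tensor above.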


mepc36 commented Mar 2, 2023

Just upvoting to say I had the same problem, so that's +1 for the "this might be normal" vote.
