Cannot use WaveGAN with Glow-TTS and Nividia's Tacotron2 #169

Charlottecuc · 2020-06-23T08:27:55Z

Hi. I trained the tacotron2 model (https://github.com/NVIDIA/tacotron2) and Glow-TTS model (https://github.com/jaywalnut310/glow-tts) by using the LJ speech dataset and can successfully synthesize voice by using WaveGlow as vocoder.
However, when I turned to the Parallel WaveGan, the synthzised waveform is quite strange:

(In the training time, the hop_size, sample_rate and window_size were set as the same for the tacotron, WaveGlow and waveGan model.)

I successfully synthesized speech using WaveGan with espnet's FastSpeech, but I failed to use waveGan to synthsize intelligible voice with any model derived from Nivida's Tacotron2 implementation (e.g. Glow-TTS). Could you please give me any advice?
(Because in Nivida's Tacotron2, there is no cmvn to the input mel-spectrogram features, so I didn't calculate the cmvn of the training waves and didn't invert it back at the inference time)

Thank you very much!

The text was updated successfully, but these errors were encountered:

kan-bayashi · 2020-06-24T02:11:34Z

Let me confirm some points:

Did you check the range of Mel basis? We use 80-7600.
Not sure but I use librosa to extract mel-spectrogram. Maybe Nvidia's extractor is different. This may cause the mismatch.
In inference, how did you perform normalization? txt -> [Taco2] -> Mel -> ? -> [PWG]?

Charlottecuc · 2020-06-24T05:00:52Z

Thank you for your reply.

Yes. I checked the range of Mel basis. It's 80-7600.
Nvidia uses scipy.io.wavfile.read to read wave files and librosa is not used if input_data.device.type == "cuda" (otherwise, librosa will be used) . (https://github.com/jaywalnut310/glow-tts/blob/00a482d06ebbffbd3518a43480cd79e7b47ebbe2/stft.py#L78)
Since I used different datasets for Text2Mel and Mel2Wav models, I firstly read stats.h5 of pretrained WaveGan and got new_mean and new_square from stats.h5. Then, (Mel - new_mean) / new_square

Thanks!

kan-bayashi · 2020-06-24T07:00:00Z

I checked the following code.
https://github.com/jaywalnut310/glow-tts/blob/00a482d06ebbffbd3518a43480cd79e7b47ebbe2/commons.py#L164-L181
~~It seems that it does not apply log for mel-spectrogram and performs dynamic range compression.~~
They applied log as a dynamic compression. But we use log10.

ParallelWaveGAN/parallel_wavegan/bin/preprocess.py

Line 64 in 72d5f2c

return np.log10(np.maximum(eps, np.dot(spc, mel_basis.T)))

Why don't you try the following procedure?

txt -> [Taco2] -> Mel -> [de-compression] -> [log10] -> [cmvn] -> [PWG]

seantempesta · 2020-06-25T06:25:55Z

I'm having the same problem, but I don't understand the [cmvn] step you referenced above. I tried de-compressing the mel and then applying log10, and I can kind of make out words in the audio, but it's super noisy. Would you mind elaborating?

Here's what I've got:

from audio_processing import dynamic_range_decompression

# generate the MEL using Glow-TTS
with torch.no_grad():
    noise_scale = .667
    length_scale = 1.0
    (c, *r), attn_gen, *_ = model(x_tst, x_tst_lengths, gen=True, noise_scale=noise_scale, length_scale=length_scale)
   
# Decompress and log10 the output
decompressed = dynamic_range_decompression(c)
decompressed_log10 = np.log10(decompressed.cpu()).cuda()

# Run the PWG vocoder and play the output
with torch.no_grad():
    xx = (decompressed_log10,)
    y = pqmf.synthesis(vocoder(*xx)).view(-1)      
    
from IPython.display import display, Audio
display(Audio(y.view(-1).cpu().numpy(), rate=config["sampling_rate"]))

kan-bayashi · 2020-06-25T06:45:50Z

[cmvn] means the mean-var normalization using stats.h5 of PWG model.
In your case,

# load PWG statistics
mu = read_hdf5("/path/to/stats.h5", "mean")
var = read_hdf5("/path/to/stats.h5", "scale")
sigma = np.sqrt(var)

# mean-var normalization
decompressed_log10_norm = (decompressed_log10 - mu) / sigma

# then input to vocoder
...

seantempesta · 2020-06-25T07:41:56Z

@kan-bayashi You are amazing! For anyone else running into this, you have to change the tensor shapes for mu and var to get this to work. This is what I did (please let me know if this isn't right):

from audio_processing import dynamic_range_decompression
from parallel_wavegan.utils import read_hdf5

# config
stats_path = '/path/to/stats.h5'

# generate the MEL using Glow-TTS
with torch.no_grad():
    noise_scale = .667
    length_scale = 1.0
    (c, *r), attn_gen, *_ = model(x_tst, x_tst_lengths, gen=True, noise_scale=noise_scale, length_scale=length_scale)
   
# Decompress and log10 the output
decompressed = dynamic_range_decompression(c)
decompressed_log10 = np.log10(decompressed.cpu()).cuda()

# mean-var normalization
mu = read_hdf5(stats_path, "mean")
var = read_hdf5(stats_path, "scale")
sigma = np.sqrt(var)
decompressed_log10_norm = (decompressed_log10 - torch.from_numpy(mu).view(1, -1, 1).cuda()) / torch.from_numpy(sigma).view(1, -1, 1).cuda()

# Run the PWG vocoder and play the output
with torch.no_grad():
    xx = (decompressed_log10_norm,)
    y = pqmf.synthesis(vocoder(*xx)).view(-1)      
    
from IPython.display import display, Audio
display(Audio(y.view(-1).cpu().numpy(), rate=config["sampling_rate"]))

kan-bayashi · 2020-06-25T08:23:18Z

@seantempesta Great!

kan-bayashi · 2020-06-25T09:31:42Z

Now we can combine with Nividia's tacotron2-based models.
I will close this issue and write some note in README.

Charlottecuc · 2020-06-28T06:55:15Z

Hi @seantempesta. I copied your code and got the following error:

TypeError                                 Traceback (most recent call last)
<ipython-input-60-39f0bece3260> in <module>
     18 with torch.no_grad():
     19     xx = (decompressed_log10_norm,)
---> 20     y = pqmf.synthesis(vocoder(*xx)).view(-1)

/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    545             result = self._slow_forward(*input, **kwargs)
    546         else:
--> 547             result = self.forward(*input, **kwargs)
    548         for hook in self._forward_hooks.values():
    549             hook_result = hook(self, input, result)

TypeError: forward() missing 1 required positional argument: 'c'

Could you give me some advice?(e.g. In pqmf.synthesis(vocoder(*xx)), what's your vocoder function look like? Why don't you use use_noise_input as in https://github.com/kan-bayashi/ParallelWaveGAN/blob/1f5899732f78aac3883441c191b0870466a420f0/parallel_wavegan/bin/decode.py ?)
Thanks!

kan-bayashi · 2020-06-28T09:04:20Z

@Charlottecuc There are three vocoder models in this repository, PWG, MelGAN, and multi-band MelGAN.
The input is different for each model:

PWG: mel-spectrogram and noise
MelGAN and multi-band MelGAN: mel-spectrogram.

And only the multi-band MelGAN needs PQMF filter as a post-processing to convert 4 ch signal into 1 ch signal.

@seantempesta used multi-band MelGAN, so the input is only c and pqmf.synthesis is applied.
In your @Charlottecuc case, you want to use PWG. So you do not need to use PQMF.
Please remove pqmf.synthesis.

Charlottecuc · 2020-06-28T11:24:59Z

@kan-bayashi Great!!!!!Thank you for your advice.

kan-bayashi · 2020-06-28T12:43:55Z

I've never met such a problem.
Maybe you made something wrong (e.g., sampling rate).

Charlottecuc · 2020-07-01T02:49:04Z

Solved. Thank you :)

ly1984 · 2020-07-24T04:22:36Z

@kan-bayashi the inference voice with lots of noise, could you please take a look?

kan-bayashi · 2020-07-24T04:32:20Z

@ly1984 Please check the hyperparameters of mel-spectrogram extraction. Maybe you use different fmax and fmin.

ly1984 · 2020-07-24T04:46:35Z

I've found the WaveGAN fmin: 80 fmax: 7600 and Glow-TTS "mel_fmin": 0.0, "mel_fmax": 8000.0,
should I retrain the glow-its model with the same parameter with WaveGAN?

kan-bayashi · 2020-07-24T04:48:01Z

Yes. You need to retrain PWG or Glow-TTS to match the configuration.

rijulg · 2020-08-19T12:44:19Z

@kan-bayashi I am trying to use your models with mel spectrogram output from Nvidia's models, and although the above suggested methods get some results, the results are rather lackluster. Here's a colab notebook of an experiment
https://colab.research.google.com/drive/1uOLIzWHF4FbRScuIeUQiMQyCE_zpphBe?usp=sharing
In the above experiment I have the same audio processed in following ways:

audio -> pwgan_mel -> pwgan_inference (vctk_multi_band_melgan.v2) -> audio [OK]
audio -> pwgan_mel -> pwgan_inference (libritts_parallel_wavegan.v1.long) -> audio [Good]
audio -> tacotron2_mel -> pwgan_inference (libritts_parallel_wavegan.v1.long) -> audio [Terrible]
audio -> tacotron2_mel -> mean-var normalization -> pwgan_inference (libritts_parallel_wavegan.v1.long) -> audio [Poor]

I observe the following from the experiment:

vctk_multi_band_melgan.v2 trained model performs slightly worse than libritts_parallel_wavegan.v1.long on my selected audio.
tacotron2_mel output does not work with the trained models, as expected
tacotron2_mel output with mean-var normalization does not produce high quality outputs
tacotron2_mel lengths are (1200/1024) times that of pwgan_mel (this is expected because of the window size difference)

Can you please comment on, and help me identify any problems with

Identifying cause of quality difference between the 2 trained models
- I suspect it is a combination of the dataset used to train and the total steps involved in training
Reason for poor quality of tacotron2_mel even after mean-var normalization
- I suspect that the conversion is not able to match up with the difference in fft_size and window_size

kan-bayashi · 2020-08-19T12:51:11Z

@rijulg Did you check this comment?
#169 (comment)
The base of log is different, so you need to convert [taco2_mel] -> exp -> log10 -> mean-var norm.

And of course to the best quality, you need to match the feature extraction setting (e.g. FFT, shift).

rijulg · 2020-08-19T13:00:30Z

@kan-bayashi yes, I am indeed doing the log base conversion; I guess I (mistakenly) considered the log conversion part of your mean-var normalization process so did not mention it separately.

kan-bayashi · 2020-08-19T13:03:48Z

In your code, the range of mel-basis is different.
That is the reason of quality degradation.

rijulg · 2020-08-19T13:08:45Z

Ah, alright. Just to confirm there is no way of scaling right? Leaving the only option of retraining the models?

kan-bayashi · 2020-08-19T13:09:47Z

Unfortunately, you need to retrain :(

Zarbuvit · 2020-10-01T13:36:11Z

@seantempesta I tried your fix for GlowTTS inference with multiband melgan using the Mozilla-TTS multiband melgan, and though it did take away the noisy background but it left me with garbled up words.
Did this ever happen to you?
I tried, and I would prefer, to use the multiband melgan model provided by kan-bayashi but I couldnt figure out how to load the model properly and kept running into tensor size incompatibility issues. If it is possible for you to share how you got vocoder from your inference line y = pqmf.synthesis(vocoder(*xx)).view(-1) I would be very greatful - it will probably solve for me what I have had problems with for the past week.

Zarbuvit · 2020-10-08T13:14:34Z

I ended up trying another method provided in the repo https://github.com/rishikksh20/melgan. I got the same garbled voice results.
I then realized that I had a personal mistake of mixing up my training and inference models and that was the cause for the garbling. Once I fixed this everything worked perfect!
This is all to say, I am not sure if this would have solved the garbling problem I had here, I didn't check, but I am almost certain that @seantempesta code did actually work for me and any problem I had was on my side of things.
I am sorry for any problems or wasted time I may have caused.

lucashueda · 2020-11-23T05:32:53Z

Since the sound itself is not affected by a wav normalization (audio /= (1 << (16 -1)), is there a way to use a PWGAN trained without wav norm to synthetize tacotron2 model output trained with wav norm?

Additionally, someone know if wav norm is needed to converge tacotron2? I tried without wav norm to match a internal PWGAN trained without wav norm, but tacotron2 run 30k steps without attention alignment.

kan-bayashi · 2020-11-23T06:08:19Z

@lucashueda I did not understand ‘wav norm’ you mentioned. Did you use wav with the scale from -66536 to 66536?

lucashueda · 2020-11-23T09:42:55Z

@lucashueda I did not understand ‘wav norm’ you mentioned. Did you use wav with the scale from -66536 to 66536?

With "wav norm" I mean " audio /= (1 << (16-1)) " to make a 16bit PCM file between -1 and +1. But i realize that different wav readers read these files differently, I was just confused with the "bin/preprocess.py" file where if I put the input_dir argument it just calls "load_wav" to read a wav file but if you pass a kaldi style file it performs the "wav norm", but as I saw the soundfile package already performs the normalization.

kan-bayashi · 2020-11-23T10:18:26Z

Both cases will normalize the audio from -1 to 1.
The 'wav norm' is needed sicne kaldiio output is not normalized.

wizardk · 2021-04-27T06:34:02Z

[cmvn] means the mean-var normalization using stats.h5 of PWG model.
In your case,

# load PWG statistics
mu = read_hdf5("/path/to/stats.h5", "mean")
var = read_hdf5("/path/to/stats.h5", "scale")
sigma = np.sqrt(var)

# mean-var normalization
decompressed_log10_norm = (decompressed_log10 - mu) / sigma

# then input to vocoder
...

Hi @kan-bayashi , why do you need to do np.sqrt(var) ? In the compute_statistics.py, you have saved scale_ instead of var_.

kan-bayashi · 2021-04-28T02:44:40Z

Oh, scale is std so we should use scale as sigma here.
Thank you for pointing out.

kan-bayashi added the question Further information is requested label Jun 24, 2020

kan-bayashi closed this as completed Jun 25, 2020

Charlottecuc changed the title ~~Cannot use WaveGAN with Glow-TTS and Nivida's Tacotron2~~ Cannot use WaveGAN with Glow-TTS and Nividia's Tacotron2 Jun 28, 2020

Charlottecuc mentioned this issue Jun 28, 2020

Using Parallel WaveGan as vocoder NVIDIA/tacotron2#379

Closed

mailong25 mentioned this issue Jun 30, 2020

TTS Vocoder for new languages espnet/espnet#2123

Closed

chazo1994 mentioned this issue Jun 30, 2020

Not compatible with nvidia-tacotron? #175

Closed

kan-bayashi mentioned this issue Jul 6, 2020

WaveGAN training on Tacotron outputs. #178

Closed

Zarbuvit mentioned this issue Oct 1, 2020

GlowTTS with MultiBand Melgan jaywalnut310/glow-tts#38

Closed

kan-bayashi mentioned this issue Feb 23, 2021

Using mel spectrogram in different vocoders espnet/espnet#2998

Closed

kan-bayashi mentioned this issue Jun 17, 2021

Using WaveGlow Vocoder with output of Text2Speech in TTS inference espnet/espnet#3295

Closed

kan-bayashi mentioned this issue Mar 22, 2022

Inference of Multiband MelGAN (v2) with ForwardTacotron #346

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cannot use WaveGAN with Glow-TTS and Nividia's Tacotron2 #169

Cannot use WaveGAN with Glow-TTS and Nividia's Tacotron2 #169

Charlottecuc commented Jun 23, 2020

kan-bayashi commented Jun 24, 2020

Charlottecuc commented Jun 24, 2020 •

edited

Loading

kan-bayashi commented Jun 24, 2020 •

edited

Loading

seantempesta commented Jun 25, 2020

kan-bayashi commented Jun 25, 2020

seantempesta commented Jun 25, 2020 •

edited

Loading

kan-bayashi commented Jun 25, 2020

kan-bayashi commented Jun 25, 2020 •

edited

Loading

Charlottecuc commented Jun 28, 2020 •

edited

Loading

kan-bayashi commented Jun 28, 2020

Charlottecuc commented Jun 28, 2020

kan-bayashi commented Jun 28, 2020

Charlottecuc commented Jul 1, 2020

ly1984 commented Jul 24, 2020 •

edited

Loading

kan-bayashi commented Jul 24, 2020 •

edited

Loading

ly1984 commented Jul 24, 2020

kan-bayashi commented Jul 24, 2020 •

edited

Loading

rijulg commented Aug 19, 2020

kan-bayashi commented Aug 19, 2020 •

edited

Loading

rijulg commented Aug 19, 2020

kan-bayashi commented Aug 19, 2020

rijulg commented Aug 19, 2020

kan-bayashi commented Aug 19, 2020

Zarbuvit commented Oct 1, 2020

Zarbuvit commented Oct 8, 2020

lucashueda commented Nov 23, 2020

kan-bayashi commented Nov 23, 2020

lucashueda commented Nov 23, 2020 •

edited

Loading

kan-bayashi commented Nov 23, 2020

wizardk commented Apr 27, 2021

kan-bayashi commented Apr 28, 2021 •

edited

Loading

Cannot use WaveGAN with Glow-TTS and Nividia's Tacotron2 #169

Cannot use WaveGAN with Glow-TTS and Nividia's Tacotron2 #169

Comments

Charlottecuc commented Jun 23, 2020

kan-bayashi commented Jun 24, 2020

Charlottecuc commented Jun 24, 2020 • edited Loading

kan-bayashi commented Jun 24, 2020 • edited Loading

seantempesta commented Jun 25, 2020

kan-bayashi commented Jun 25, 2020

seantempesta commented Jun 25, 2020 • edited Loading

kan-bayashi commented Jun 25, 2020

kan-bayashi commented Jun 25, 2020 • edited Loading

Charlottecuc commented Jun 28, 2020 • edited Loading

kan-bayashi commented Jun 28, 2020

Charlottecuc commented Jun 28, 2020

kan-bayashi commented Jun 28, 2020

Charlottecuc commented Jul 1, 2020

ly1984 commented Jul 24, 2020 • edited Loading

kan-bayashi commented Jul 24, 2020 • edited Loading

ly1984 commented Jul 24, 2020

kan-bayashi commented Jul 24, 2020 • edited Loading

rijulg commented Aug 19, 2020

kan-bayashi commented Aug 19, 2020 • edited Loading

rijulg commented Aug 19, 2020

kan-bayashi commented Aug 19, 2020

rijulg commented Aug 19, 2020

kan-bayashi commented Aug 19, 2020

Zarbuvit commented Oct 1, 2020

Zarbuvit commented Oct 8, 2020

lucashueda commented Nov 23, 2020

kan-bayashi commented Nov 23, 2020

lucashueda commented Nov 23, 2020 • edited Loading

kan-bayashi commented Nov 23, 2020

wizardk commented Apr 27, 2021

kan-bayashi commented Apr 28, 2021 • edited Loading

Charlottecuc commented Jun 24, 2020 •

edited

Loading

kan-bayashi commented Jun 24, 2020 •

edited

Loading

seantempesta commented Jun 25, 2020 •

edited

Loading

kan-bayashi commented Jun 25, 2020 •

edited

Loading

Charlottecuc commented Jun 28, 2020 •

edited

Loading

ly1984 commented Jul 24, 2020 •

edited

Loading

kan-bayashi commented Jul 24, 2020 •

edited

Loading

kan-bayashi commented Jul 24, 2020 •

edited

Loading

kan-bayashi commented Aug 19, 2020 •

edited

Loading

lucashueda commented Nov 23, 2020 •

edited

Loading

kan-bayashi commented Apr 28, 2021 •

edited

Loading