
Here is a bug on linear loss computation #113

Closed · begeekmyfriend opened this issue Jul 25, 2018 · 11 comments

begeekmyfriend (Contributor) commented Jul 25, 2018

In the expression computing the linear loss, num_mels should have been num_freq. See Keith Ito's version. It seems this model does not compute the loss over the full effective bandwidth of the audio.
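For clarity, a minimal sketch of the fix, paraphrasing the line in question (hp.num_mels and hp.num_freq follow the usual hyperparameter names in this repo; the exact surrounding code may differ):

# Buggy: derives the priority band from the mel channel count (80),
# even though the slice is later taken over the linear spectrogram.
n_priority_freq = int(4000 / (hp.sample_rate * 0.5) * hp.num_mels)

# Fixed: derive it from the linear-spectrogram bin count (n_fft / 2 + 1),
# as in Keith Ito's implementation.
n_priority_freq = int(4000 / (hp.sample_rate * 0.5) * hp.num_freq)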

Yeongtae (Contributor) commented Jul 25, 2018

@begeekmyfriend How do we fix it? Just replace num_mels with num_freq?
If we fix it, what improvement do we get compared with the previous version?

In my opinion, the Tacotron part converges well, unlike the wavenet_vocoder part.

begeekmyfriend (Contributor, Author) commented Jul 25, 2018

linear_loss = 0.5 * tf.reduce_mean(l1) + 0.5 * tf.reduce_mean(l1[:,:,0:n_priority_freq])

This expression means we use a weight of 0.5 on the whole frequency bandwidth plus a weight of 0.5 on the priority bandwidth as the complete linear loss for training the model. The num_freq factor controls how much of that bandwidth is covered, so in my humble opinion, with the fix the higher frequencies of the targets also take part in fitting the ground-truth audio. Therefore Keith Ito's version is the one to follow.
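To make the effect concrete, here is a quick numeric check (assuming the LJSpeech-style values sample_rate = 22050, num_mels = 80 and num_freq = 1025 quoted later in this thread):

# Number of the 1025 linear-spectrogram bins that receive the extra 0.5 weight.
sample_rate, num_mels, num_freq = 22050, 80, 1025

buggy = int(4000 / (sample_rate * 0.5) * num_mels)   # 29 bins, i.e. only ~0-312 Hz
fixed = int(4000 / (sample_rate * 0.5) * num_freq)   # 371 bins, i.e. ~0-4 kHz as intended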

Yeongtae (Contributor) commented Jul 25, 2018

@begeekmyfriend Thank you for the explanation.
Did you test it? Did it reduce the noise in results such as 'step-xxxx-wave-from-mels'?

begeekmyfriend (Contributor, Author) commented Jul 25, 2018

It does not affect the mel outputs. That said, the linear outputs typically sound better than the mel ones.
Here are my hyperparameters (still under testing). You can see that I use a 2048 FFT size and 1025 frequency bins with the Griffin-Lim vocoder.

	#Audio
	num_mels = 80, #Number of mel-spectrogram channels and local conditioning dimensionality
	num_freq = 1025, # (= n_fft / 2 + 1) only used when adding linear spectrograms post processing network
	rescale = True, #Whether to rescale audio prior to preprocessing
	rescaling_max = 0.999, #Rescaling value
	trim_silence = True, #Whether to clip silence in Audio (at beginning and end of audio only, not the middle)
	clip_mels_length = True, #For cases of OOM (Not really recommended, working on a workaround)
	max_mel_frames = 960,  #Only relevant when clip_mels_length = True

	# Use LWS (https://github.com/Jonathan-LeRoux/lws) for STFT and phase reconstruction
	# It's preferred to set True to use with https://github.com/r9y9/wavenet_vocoder
	# Does not work if n_fft is not a multiple of hop_size!!
	use_lws=False,
	silence_threshold=2, #silence threshold used for sound trimming for wavenet preprocessing

	#Mel spectrogram
	n_fft = 2048, #Extra window size is filled with 0 paddings to match this parameter
	hop_size = None, #For 22050Hz, 275 ~= 12.5 ms
	win_size = 1100, #For 22050Hz, 1100 ~= 50 ms (If None, win_size = n_fft)
	sample_rate = 22050, #22050 Hz (corresponding to ljspeech dataset)
	frame_shift_ms = 12.5,
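As a quick sanity check on these values (a standalone sketch, not code from the repo), the derived quantities are consistent with the comments above:

# Hypothetical consistency check for the hyperparameters above.
n_fft, sample_rate, frame_shift_ms = 2048, 22050, 12.5

num_freq = n_fft // 2 + 1                             # 1025 linear bins
hop_size = int(frame_shift_ms / 1000 * sample_rate)   # 275 samples ~= 12.5 ms

print(num_freq, hop_size)  # 1025 275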

Yeongtae (Contributor) commented:

[image]
@begeekmyfriend does it affect this part?
Thanks a lot.

begeekmyfriend (Contributor, Author) commented Jul 25, 2018

It definitely does, because I have enlarged both the FFT size and the number of frequency bins of the linear outputs, so the audio signal processing is affected. That means you have to pre-process the whole audio dataset again and train from scratch. Note that these hyperparameters do not match the WaveNet vocoder; they are only for Griffin-Lim.
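As a minimal illustration (assuming librosa for the STFT; the helper name is only for this sketch), the cached features depend directly on these settings, which is why they must be regenerated whenever the settings change:

import numpy as np
import librosa

def linear_spectrogram(wav_path, n_fft=2048, hop_length=275, win_length=1100, sr=22050):
    # Any change to n_fft / hop_length / win_length changes the shape and
    # content of this array, so all pre-computed training features go stale.
    wav, _ = librosa.load(wav_path, sr=sr)
    stft = librosa.stft(wav, n_fft=n_fft, hop_length=hop_length, win_length=win_length)
    return np.abs(stft)  # shape: [n_fft // 2 + 1, frames] == [1025, frames]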

begeekmyfriend (Contributor, Author) commented Jul 25, 2018

We can also use an L2 loss for the linear outputs if you think L2 is better than L1:

n_priority_freq = int(4000 / (hp.sample_rate * 0.5) * hp.num_freq)
linear_loss = 0.5 * tf.losses.mean_squared_error(self.linear_targets, self.linear_outputs) \
        + 0.5 * tf.losses.mean_squared_error(self.linear_targets[:,:,0:n_priority_freq], self.linear_outputs[:,:,0:n_priority_freq])

The reason we prefer L2 is explained here: #4 (comment)

Yeongtae (Contributor) commented Jul 26, 2018

These are my test results.

With num_mels, 10,000 iterations:
step-10000-eval-mel-spectrogram [image]
step-10000-eval-align [image]

With num_freq, 10,000 iterations:
step-10000-eval-mel-spectrogram [image]
step-10000-eval-align [image]

begeekmyfriend (Contributor, Author) commented Jul 26, 2018

I am afraid there might be problems with your dataset. In my test it reached convergence in 4K steps when I adopted the solutions mentioned in the 5th and 8th comments above, i.e. using MSE for the linear loss.
step-4000-align [image]
And below is one of the results from Griffin-Lim at 15K steps:
step-15000-eval-waveform-linear.zip

Starlon87 commented:

@begeekmyfriend Why does your alignment already look so converged at 4,000 steps with loss = 0.59, while on my side at 70,000 steps with loss = 0.37 the alignment curve is still not a clean diagonal? ...

step-70000-eval-align [image]

Rayhane-mamah (Owner) commented:

@begeekmyfriend my good friend, you are correct once more!

I have fixed it. I apologize for the typo :) Thanks for your feedback!
