Here is a bug in linear loss computation #113
@begeekmyfriend How do we fix it? Just replace `num_mels` with `num_freq`? In my opinion, the Tacotron part converges well, unlike the wavenet_vocoder part.
This expression means we put a 0.5 weight on the loss over the whole frequency bandwidth plus a 0.5 weight on the loss over the priority frequency band, and use the sum as the complete linear loss to train the model.
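The actual expression is quoted later in this thread; as a standalone illustration of the weighting, here is a minimal NumPy sketch with dummy tensors, assuming an L1 distance as in Keith Ito's implementation (the real model computes this on TensorFlow tensors, and the 4 kHz priority cutoff comes from the snippet below):

```python
import numpy as np

sample_rate = 22050
num_freq = 1025
# Number of frequency bins in the priority band below 4 kHz
n_priority_freq = int(4000 / (sample_rate * 0.5) * num_freq)

# Dummy stand-ins for the model's [batch, frames, freq_bins] targets/outputs
linear_targets = np.random.rand(2, 100, num_freq).astype(np.float32)
linear_outputs = np.random.rand(2, 100, num_freq).astype(np.float32)

l1 = np.abs(linear_targets - linear_outputs)
# 0.5 weight on the whole bandwidth plus 0.5 weight on the priority band
linear_loss = 0.5 * l1.mean() + 0.5 * l1[:, :, :n_priority_freq].mean()
```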
@begeekmyfriend Thank you for the helpful explanation.
It has nothing to do with the mel outputs. By the way, the quality of the linear outputs is typically better than that of the mel outputs.

```python
# Audio
num_mels = 80,  # Number of mel-spectrogram channels and local conditioning dimensionality
num_freq = 1025,  # (= n_fft / 2 + 1) only used when adding the linear spectrogram post-processing network
rescale = True,  # Whether to rescale audio prior to preprocessing
rescaling_max = 0.999,  # Rescaling value
trim_silence = True,  # Whether to clip silence in audio (at the beginning and end of the audio only, not the middle)
clip_mels_length = True,  # For cases of OOM (not really recommended, working on a workaround)
max_mel_frames = 960,  # Only relevant when clip_mels_length = True

# Use LWS (https://github.com/Jonathan-LeRoux/lws) for STFT and phase reconstruction
# It's preferred to set True to use with https://github.com/r9y9/wavenet_vocoder
# Does not work if n_fft is not a multiple of hop_size!!
use_lws = False,
silence_threshold = 2,  # Silence threshold used for sound trimming in wavenet preprocessing

# Mel spectrogram
n_fft = 2048,  # Extra window size is filled with 0 paddings to match this parameter
hop_size = None,  # For 22050 Hz, 275 ~= 12.5 ms
win_size = 1100,  # For 22050 Hz, 1100 ~= 50 ms (if None, win_size = n_fft)
sample_rate = 22050,  # 22050 Hz (corresponding to the LJSpeech dataset)
frame_shift_ms = 12.5,
```
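Since `hop_size` is left as `None` here, it is presumably derived from `frame_shift_ms` at runtime. A minimal sketch of that derivation, assuming simple truncation (the exact rounding the repo uses is an assumption):

```python
sample_rate = 22050    # Hz, from the hparams above
frame_shift_ms = 12.5  # ms, from the hparams above

# Convert the frame shift from milliseconds to samples:
# 22050 * 12.5 / 1000 = 275.625, truncated to 275 (~12.5 ms),
# matching the "275 ~= 12.5 ms" comment on hop_size above.
hop_size = int(sample_rate * frame_shift_ms / 1000)
print(hop_size)  # 275
```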
It definitely does, because I have expanded both the FFT size and the number of frequencies of the linear outputs, so the audio signal processing is affected. That is to say, you have to preprocess the whole audio dataset and train again from scratch. By the way, these hyperparameters do not match the WaveNet vocoder; they are only for Griffin-Lim.
We can also use L2 loss for the linear outputs if you think L2 is better than L1:

```python
n_priority_freq = int(4000 / (hp.sample_rate * 0.5) * hp.num_freq)
linear_loss = 0.5 * tf.losses.mean_squared_error(self.linear_targets, self.linear_outputs) \
    + 0.5 * tf.losses.mean_squared_error(self.linear_targets[:, :, 0:n_priority_freq],
                                         self.linear_outputs[:, :, 0:n_priority_freq])
```

The reason why we prefer L2 is mentioned here: #4 (comment)
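As a quick sanity check on that bin count, using the `sample_rate` and `num_freq` values from the hparams quoted above:

```python
sample_rate = 22050  # Hz, from the hparams above
num_freq = 1025      # linear-spectrogram bins (= n_fft / 2 + 1)

# Fraction of the Nyquist bandwidth (11025 Hz) that lies below 4 kHz,
# scaled to bins: 4000 / 11025 * 1025 = 371.88..., truncated to 371.
n_priority_freq = int(4000 / (sample_rate * 0.5) * num_freq)
print(n_priority_freq)  # 371
```

So the priority term only covers the first 371 of the 1025 frequency bins, i.e. the perceptually important band below 4 kHz.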
I am afraid there might be problems with your dataset. In my test it converged within 4K steps when I adopted the solutions mentioned in both the 5th and 8th comments above, which use MSE for the linear loss.
@begeekmyfriend Why does your run already look so well converged at 4,000 steps with loss = 0.59, while mine at 70,000 steps with loss = 0.37 still doesn't show a smooth curve? ...
@begeekmyfriend My good friend, you are correct once more! I have fixed that. I apologize for the typo :) Thanks for your feedback!
In the expression computing the linear loss, `num_mels` should have been `num_freq`. See Keith Ito's version. It seems that this model does not compute the loss over the effective bandwidth of the audio.
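For concreteness, here is a minimal before/after sketch of the fix; the buggy line is reconstructed from this discussion rather than copied from the repository:

```python
sample_rate = 22050  # from the hparams quoted above
num_mels = 80        # mel channels
num_freq = 1025      # linear-spectrogram bins

# Buggy: scales the 4 kHz band by the mel channel count,
# so the priority term covers only 29 of the 1025 linear bins.
n_priority_freq_buggy = int(4000 / (sample_rate * 0.5) * num_mels)   # 29

# Fixed: scales by the linear bin count, as in Keith Ito's version.
n_priority_freq_fixed = int(4000 / (sample_rate * 0.5) * num_freq)   # 371
```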