
Is it missing some activation functions between some layers? #9

Closed · BridgetteSong opened this issue Jul 25, 2023 · 8 comments
Labels: enhancement (New feature or request)


BridgetteSong commented Jul 25, 2023

Thanks for your work. I have trained the model on my own dataset and ran into the same problem as issue #7. When I checked the model, I found some differences in the AutoEncoder:

  • Before encoder_output is fed into the Projector, is an activation function needed?
  • Before ConvTranspose1d, is an activation function needed?
  • Should a tanh activation be added to the Decoder's final output?

Other popular implementations all add these, so I added them:
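A minimal sketch of what those placements could look like (assuming a PyTorch encoder/decoder in the Encodec/SoundStream style; the class and variable names here are illustrative, not the actual modules in this repo):

    import torch
    import torch.nn as nn

    class Decoder(nn.Module):
        def __init__(self, channels: int, code_dim: int):
            super().__init__()
            self.act = nn.ELU()   # activation before the upsampling layer
            self.upsample = nn.ConvTranspose1d(code_dim, channels,
                                               kernel_size=8, stride=4, padding=2)
            self.out_conv = nn.Conv1d(channels, 1, kernel_size=7, padding=3)

        def forward(self, z: torch.Tensor) -> torch.Tensor:
            x = self.upsample(self.act(z))        # activation before ConvTranspose1d
            return torch.tanh(self.out_conv(x))   # tanh on the final decoder output

    # analogously, on the encoder side: code = projector(act(encoder_output))
    dec = Decoder(channels=64, code_dim=32)
    print(dec(torch.randn(1, 32, 100)).shape)     # torch.Size([1, 1, 400])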

After adding them and training again, I got some improvement on unseen datasets over your baseline when I only train the AutoEncoder with discriminators and don't finetune it with AudioDec.
BTW, I trained the model only on LibriSpeech and AISHELL at a 16 kHz sampling rate and tested it on another clean TTS dataset after 160K training steps. When my model is finished (800k steps in total), I will compare the final results, upload some demos, and share my training config.


BridgetteSong commented Jul 25, 2023

demo with 160k steps
demo.zip

bigpon added the enhancement label on Jul 25, 2023

bigpon commented Jul 25, 2023

Hi,
Thanks for the interesting experiments!
I think it is reasonable to add more nonlinearity to the model to enhance its modeling ability, as long as training remains stable.
If you have more detailed results in any form (demo page, paper, etc.), please feel free to share them with us, and I will update the README to note that adding activation functions improves robustness to unseen data.


BridgetteSong commented Jul 31, 2023

@bigpon I confirmed that changing the autoencoder model can improve results. I made the following changes:

  1. It is very important to add activation functions as described above (highly recommended). The Encodec and SoundStream papers both add them (I guess they borrowed this from MelGAN). I use the Snake activation function rather than ELU or LeakyReLU.
  2. It is very important to add WeightNorm layers, which significantly improve training stability and model results (highly recommended).
  3. Appropriately increasing code_dim and the model size can improve audio reconstruction quality (mel loss around 15.3 in my version). code_dim=128 is recommended, although I use 256.
  4. I use the non-causal training mode, MPD + ComplexMRD as discriminators, and MultiMelLoss, trained with AdamW and an ExponentialLR schedule.
  5. BTW, there are some errors in your MRD: the intermediate convolution outputs are not returned, so the feature-matching loss cannot be computed, and padding is missing for each Conv2d layer (see the sketch below).
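On point 5, here is a hedged sketch of how an STFT-based sub-discriminator can pad each Conv2d and return its intermediate feature maps so a feature-matching loss can be computed (the layer sizes are illustrative, not this repo's exact configuration):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class STFTSubDiscriminator(nn.Module):
        def __init__(self):
            super().__init__()
            # padding = kernel_size // 2 keeps the (freq, time) size from shrinking
            self.convs = nn.ModuleList([
                nn.Conv2d(1, 32, kernel_size=(3, 9), padding=(1, 4)),
                nn.Conv2d(32, 32, kernel_size=(3, 9), stride=(1, 2), padding=(1, 4)),
                nn.Conv2d(32, 32, kernel_size=(3, 9), stride=(1, 2), padding=(1, 4)),
            ])
            self.post = nn.Conv2d(32, 1, kernel_size=(3, 3), padding=(1, 1))

        def forward(self, spec: torch.Tensor):
            fmaps = []                              # collect every intermediate output
            x = spec
            for conv in self.convs:
                x = F.leaky_relu(conv(x), 0.2)
                fmaps.append(x)                     # needed for the feature-matching loss
            x = self.post(x)
            fmaps.append(x)
            return x, fmaps

    def feature_loss(fmaps_real, fmaps_fake):
        # L1 distance between real and generated feature maps, summed over layers
        return sum(torch.mean(torch.abs(r - f)) for r, f in zip(fmaps_real, fmaps_fake))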

Here is my training config.yaml
symAD_librispeech_16000_hop160_base.txt

Demos for the new config after 200K training steps, using the LibriSpeech and AISHELL datasets but testing on an unseen dataset:

demo.zip


bigpon commented Aug 9, 2023

Hi @BridgetteSong,

  • Thanks for the great investigation effort! I will check the results on the 48 kHz VCTK corpus.
  • Do you have any plans to write a paper about your findings? If you write one, please let me know, and I will add the info to the README for others' reference.
  • You are correct. The MRD indeed has these problems, and I will fix them.
  • Where do you put the WeightNorm layers?
  • Could you also provide the results of the original AudioDec for reference? (I assume the AudioDec results in demo.zip are from the modified version, right?)
  • According to your conclusion, do these modifications increase the quality for arbitrary datasets, or the robustness to unseen data? Since you train and test the model using LibriTTS and AISHELL, I assume that these modifications increase the reconstruction quality for seen data, right?


BridgetteSong commented Aug 10, 2023

@bigpon

  • I don't have plans to write a paper yet, but I'm interested in improving the performance of Encodec.

  • I add WeightNorm to each Conv1d layer like this:

    # original
    self.conv = nn.Conv1d(...)

    # with weight normalization
    self.conv = torch.nn.utils.weight_norm(nn.Conv1d(...))

    I think the stability of the AutoEncoder is very important, whether you train in two stages or train only one stage as I do.

    BTW, I see that WeightNorm is added in the 2nd stage by default via the apply() function, but I can't confirm whether the WeightNorm initialization of the ResidualBlock succeeds through apply(). So I directly use self.conv = torch.nn.utils.weight_norm(nn.Conv1d(...)) as above (a quick way to verify apply() is sketched after this list).

  • I trained the model on the LibriTTS and AISHELL datasets but tested it on an unseen dataset (the audios in demo.zip are unseen; they come from another TTS dataset and even include a singing demo), so these modifications can increase the quality for arbitrary datasets.

  • I can't provide results for the original AudioDec because the model has changed; demo.zip contains the modified version. However, the hifi directory in demo.zip contains the original 16k and 24k audios, so anyone who has trained an original AudioDec model can test it with the real audios in demo.zip.
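Regarding the apply()-based WeightNorm question above, a quick sanity check is possible because the classic torch.nn.utils.weight_norm replaces a module's weight with weight_g and weight_v parameters; this minimal sketch (model and helper names are illustrative) counts how many layers were actually wrapped:

    import torch.nn as nn
    from torch.nn.utils import weight_norm

    def count_weight_normed(model: nn.Module) -> int:
        """Count submodules whose weight was reparameterized into weight_g/weight_v."""
        return sum(1 for m in model.modules()
                   if hasattr(m, "weight_g") and hasattr(m, "weight_v"))

    def _apply_weight_norm(m: nn.Module):
        if isinstance(m, nn.Conv1d):
            weight_norm(m)

    # toy example: apply weight_norm via apply(), then verify it reached every Conv1d
    model = nn.Sequential(nn.Conv1d(1, 8, 3), nn.ELU(), nn.Conv1d(8, 8, 3))
    model.apply(_apply_weight_norm)
    print(count_weight_normed(model))   # -> 2 if both Conv1d layers were wrapped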

bigpon added a commit that referenced this issue Jan 3, 2024
1.	According to issue #9, we implement the codec version (activate_audiodec) with more activations like HiFiGAN and release the pre-trained model “symAAD_vctk_48000_hop300”.
2.	We fix the MSTFT 2D conv padding issues mentioned in issue #9 and release the updated “symADuniv_vctk_48000_hop300” and “AudioDec_v3_symADuniv_vctk_48000_hop300_clean”.
3.	We implement the more flexible CausalConvTranspose1d padding for arbitrary kernel_size and stride according to issue #11.
4.	We release a 24kbps model, “symAD_c16_vctk_48000_hop320”, which achieves better speech quality and robustness to unseen data.
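On point 3, one common way to make a causal ConvTranspose1d work for arbitrary kernel_size and stride is to run the transposed convolution without padding and trim the trailing kernel_size - stride samples, so an input of length T maps to exactly T * stride outputs. This is a hedged sketch of that general technique, not necessarily the exact code in the commit:

    import torch
    import torch.nn as nn

    class CausalConvTranspose1d(nn.Module):
        def __init__(self, in_channels, out_channels, kernel_size, stride):
            super().__init__()
            assert kernel_size >= stride
            self.trim = kernel_size - stride   # samples to drop at the end
            self.deconv = nn.ConvTranspose1d(in_channels, out_channels, kernel_size, stride)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # output length is (T - 1) * stride + kernel_size; trim back to T * stride
            y = self.deconv(x)
            return y[..., :y.size(-1) - self.trim] if self.trim > 0 else y

    # quick shape check with a kernel_size that is not a multiple of the stride
    x = torch.randn(1, 4, 50)
    print(CausalConvTranspose1d(4, 4, kernel_size=7, stride=5)(x).shape)  # (1, 4, 250)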

bigpon commented Jan 3, 2024

Hi,
Thanks for your investigation!

Based on our internal experiments, we have reached the following conclusions.

  1. Adding more activation functions, as in HiFiGAN, slightly increases robustness to unseen data. However, it is very similar to our 2-stage approach, which already uses HiFiGAN as the decoder.
  2. The snake activation doesn't show marked improvements over the ELU activation. In some cases, the snake activation even yields much worse speech quality. We think the instability of the snake activation might cause the problem.
  3. Instead of adding activations, we found that increasing the bitrate to a reasonable scale (e.g., 24 kbps, as in Opus) significantly improves robustness to unseen data, which makes sense since it reduces the modeling difficulty. However, a very low bitrate is essential for some temporally sensitive tasks such as LLM-based speech generation. Therefore, without greatly changing the architecture, adopting more training data is a compromise. (We are investigating a new architecture for unseen-data robustness and hope to release it soon.)
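For reference, the bitrate of a residual-VQ codec follows directly from the frame rate, the number of codebooks, and the bits per codebook. A small sketch of the arithmetic (assuming 1024-entry codebooks, i.e. 10 bits per codebook per frame; reading "c16" as 16 codebooks is my assumption):

    import math

    def codec_bitrate(sample_rate, hop_size, num_codebooks, codebook_size=1024):
        """Bits per second = frames/s * codebooks * bits per codebook."""
        frames_per_second = sample_rate / hop_size
        return frames_per_second * num_codebooks * math.log2(codebook_size)

    # e.g. 48 kHz, hop 320, 16 codebooks -> 150 * 16 * 10 = 24000 bps (24 kbps)
    print(codec_bitrate(48000, 320, 16))   # 24000.0
    # fewer codebooks or a larger hop lower the bitrate at the cost of quality
    print(codec_bitrate(48000, 300, 8))    # 12800.0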

On the other hand, the 2D conv padding issue of the MSTFT discriminator has been fixed, and the corresponding models have been updated. Thanks again for your contributions.


a897456 commented Jun 24, 2024

  1. The snake activation doesn’t show marked improvements over the ELU activation. In some cases, the snake activation even achieves much worse speech quality. We think the instability of the snake activation might cause the problem.

@bigpon Hi bigpon,
I found a paper built on the SoundStream framework that presents the snake activation function as one of its innovations and highlights its benefits, but this is not consistent with your conclusion, so I don't know what went wrong.

[screenshot from the paper]


bigpon commented Jun 24, 2024

According to DAC, snake is claimed to be much better.
However, in the AudioDec architecture, we didn't observe that tendency.
Two possible reasons:

  1. Snake is sensitive to initialization and training, so we might not have optimized the training process of AudioDec with snake (e.g., we didn't apply layer normalization, etc.).
  2. Snake may be better than LeakyReLU, but we use ELU here.

Since we gave it a quick try without carefully tuning the hyperparameters, further investigation is required.
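For context, the snake activation used in BigVGAN/DAC-style models is snake(x) = x + (1/α)·sin²(α·x) with a learnable per-channel α, and that learnable α is part of why it is sensitive to initialization. A minimal sketch (the epsilon guard and the α initialization are my own choices here):

    import torch
    import torch.nn as nn

    class Snake(nn.Module):
        """snake(x) = x + (1/alpha) * sin^2(alpha * x), with a learnable per-channel alpha."""
        def __init__(self, channels: int, alpha: float = 1.0):
            super().__init__()
            self.alpha = nn.Parameter(alpha * torch.ones(1, channels, 1))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # small epsilon keeps the division stable if alpha drifts toward zero
            return x + torch.sin(self.alpha * x) ** 2 / (self.alpha + 1e-9)

    # drop-in replacement for ELU on (batch, channels, time) tensors
    x = torch.randn(2, 64, 100)
    print(Snake(64)(x).shape)   # torch.Size([2, 64, 100])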

bigpon closed this as completed on Jul 2, 2024