g_loss is None in second stage training #11
When did this happen? Was it before or after diffusion model training? Was it before or after SLM adversarial training? I have noticed it happen several times myself, which is why I put a breakpoint there.
It happens within the first epoch of training with train_second.py. I am training on LJSpeech.
What is your config? With the settings in this repo I don't have this issue, so it's probably related to things like the learning rate, batch size, etc.
Also, check whether your first stage model's reconstructions in TensorBoard have reasonable quality. They should be perceptually indistinguishable from the ground truth; otherwise something is wrong with your first stage too.
I kept most of your config, except that I increased the batch size and learning rate since I am using 8 GPUs with larger memory. I set batch_size: 48 and increased the lr by 3 times. By reconstruction you mean the audio, right? I checked the audio in the eval tab and it sounds good to me.
You should not increase the learning rate by 3 times, especially for PL-BERT; I believe this is where the problem is. I suggest you keep the learning rate unchanged even with a higher batch size. The highest batch size I have tried was 32, with the same learning rate. The demo samples on styletts2.github.io were generated with the model trained at batch size 32 and the exact same learning rate (they are slightly different from the one trained at batch size 16, but the quality is pretty much the same). The following is the learning curve I have for the first stage model; if this is what you see in your tensorboard too, it should be fine. The loss increase is mostly caused by the feature matching loss, as the features become harder and harder to match while the discriminator overfits. See Figure 3 of https://dl.acm.org/doi/pdf/10.1145/3573834.3574506; this is normal.
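As a minimal sketch of the advice above, the following checks a training config and warns if the learning rates were scaled up along with the batch size. The file path Configs/config.yml, the key names (batch_size, optimizer_params, lr, bert_lr), and the default values are assumptions about the repo's config layout and should be checked against the actual file.

```python
# Sanity-check a training config: keep the learning rates at their defaults
# even when batch_size is raised. Key names and defaults are illustrative.
import yaml

DEFAULT_BATCH_SIZE = 16
DEFAULT_LRS = {"lr": 1e-4, "bert_lr": 1e-5}  # assumed defaults; verify against the repo config

with open("Configs/config.yml") as f:
    cfg = yaml.safe_load(f)

batch_size = cfg.get("batch_size", DEFAULT_BATCH_SIZE)
opt_params = cfg.get("optimizer_params", {})

for name, default in DEFAULT_LRS.items():
    value = float(opt_params.get(name, default))
    if batch_size > DEFAULT_BATCH_SIZE and value > default:
        print(f"warning: {name}={value} was scaled up along with batch_size={batch_size}; "
              f"the advice above is to keep it at {default}")
```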
Thank you for sharing. Based on the comparison, my stage 1 training loss curve looks good. I am trying what you suggested; so far no issues have shown up in the first several epochs. I will continue training and keep you posted. Thank you again.
Hi, I found the same issue happening again in the 9th epoch of second-stage training: loss_mel is NaN. I use a batch size of 32 with 8 GPUs, and everything else is the same as your config.
This is so weird. Can you try lowering it to 16 instead? Does it still happen with a batch size of 16?
Any update with batch size 16? Or is it because you used a different learning rate for the first stage model?
In the second stage of training I kept the batch size at 16, and the NaN issue did not appear again with 8-GPU training.
The issue happens during the backward pass for loss_gen_lm. My PyTorch version is 2.1.0.
This is likely caused by having too many GPUs but too few samples in a batch. Can you change batch_percentage to 1?
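To illustrate why few samples per GPU can make this step degenerate, here is a rough back-of-the-envelope calculation. It assumes that batch_size is the total batch split across GPUs and that batch_percentage is the fraction of each GPU's samples used for the SLM adversarial step; both are my reading of the discussion, not taken from the code.

```python
# Rough arithmetic for how many samples each GPU sees in the SLM adversarial
# step. The interpretation of batch_size and batch_percentage is an assumption.
def slm_samples_per_gpu(batch_size: int, n_gpus: int, batch_percentage: float) -> int:
    per_gpu = batch_size // n_gpus            # samples per GPU after the data-parallel split
    return int(per_gpu * batch_percentage)    # fraction used for the SLM adversarial loss

print(slm_samples_per_gpu(16, 8, 0.5))  # -> 1 sample per GPU, easy to degenerate
print(slm_samples_per_gpu(16, 8, 1.0))  # -> 2 samples per GPU
print(slm_samples_per_gpu(32, 4, 0.5))  # -> 4 samples per GPU
```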
Hi, the error still exists after setting batch_percentage to 1 (Line 488 in 21f7cb9).
Your errors are so weird. It all works fine for me. Can you use 4 GPUs instead of 8? Or could it be related to the CUDA version?
Or I guess this codebase probably has some bugs with PyTorch, because it has several weird issues: predictor_encoder.train() makes the F0 loss higher, it produces high-frequency background noise on old GPUs, it causes NaN with batch size 32, etc. I hope someone can reimplement everything, because there is probably something wrong in my code. The training pipeline was all written by myself instead of being modified from an existing codebase (except a few modules like iSTFTNet, the diffusion models, etc.), so weird glitches are very likely.
Hi, thank you for sharing your concerns. I don't think this is related to the GPU: after setting a breakpoint, I found that the error happens when d_loss_slm is non-zero and loss_gen_lm is non-zero. When d_loss_slm is 0, it works without errors. I guess it is related to calling backward() twice.
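A minimal sketch of the kind of guard this observation suggests: only run the generator's SLM adversarial backward when both terms exist, are finite, and the discriminator term is non-zero. The variable names loss_gen_lm and d_loss_slm follow the discussion above; the surrounding training loop is not shown and is hypothetical.

```python
import torch

# Hypothetical guard around the SLM adversarial generator update.
def should_run_slm_generator_backward(loss_gen_lm, d_loss_slm) -> bool:
    if loss_gen_lm is None or d_loss_slm is None:
        return False    # the losses were never computed this step
    if not torch.isfinite(loss_gen_lm) or not torch.isfinite(d_loss_slm):
        return False    # NaN/Inf would poison the gradients
    if d_loss_slm.item() == 0.0:
        return False    # degenerate discriminator term, as reported above
    return True

# usage inside the training step (sketch):
# if should_run_slm_generator_backward(loss_gen_lm, d_loss_slm):
#     loss_gen_lm.backward()
```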
Does it cause different behavior, though?
Also, when I tried to train the second stage from the checkpoint on Hugging Face, it worked fine. One thing I noticed is that the checkpoint trained from scratch is about 1.7 GB, but the one on Hugging Face is about 700 MB. Am I doing something wrong with the stage 1 training, or are you not saving the discriminator in the checkpoint on Hugging Face?
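On the size difference: if the locally trained checkpoint also stores the discriminators (and possibly optimizer state), stripping them for inference would explain the gap. A hedged sketch, assuming the checkpoint is a dict with a "net" entry mapping module names to state dicts; the file names and the module names in `drop` are illustrative and should be checked against the actual checkpoint.

```python
import torch

# Keep only the modules needed for inference. The checkpoint layout and the
# discriminator module names below are assumptions, not taken from the repo.
ckpt = torch.load("epoch_2nd_00050.pth", map_location="cpu")  # hypothetical file name

drop = {"mpd", "msd", "wd"}  # illustrative discriminator / SLM-discriminator names
nets = ckpt.get("net", ckpt)
slim = {name: state for name, state in nets.items() if name not in drop}

torch.save({"net": slim}, "model_slim.pth")
print(f"kept {len(slim)} of {len(nets)} modules")
```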
@yl4579 Could you please share your loss charts for the diffusion and duration losses? My model's diffusion loss doesn't seem to be decreasing, and I'm curious what a successful run's diffusion loss looks like.
Thank you for the code and the work.
I'm trying to run the second stage training and I hit the breakpoint because g_loss is None. Any thoughts on that?
StyleTTS2/train_second.py, line 450 in fd3884b
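One way to narrow this down is to log each loss component before they are combined, so the first term to become None or non-finite can be identified. This is a generic debugging helper, not code from the repo; the loss names in the usage line follow the thread and are assumptions about the training script's variables.

```python
import torch

# Report which individual loss term is None or non-finite at a given step.
def report_losses(step: int, **losses) -> None:
    for name, value in losses.items():
        if value is None:
            print(f"step {step}: {name} is None")
        elif torch.is_tensor(value) and not torch.isfinite(value).all():
            print(f"step {step}: {name} is non-finite")

# usage (sketch):
# report_losses(i, loss_mel=loss_mel, loss_gen_lm=loss_gen_lm, d_loss_slm=d_loss_slm)
```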