Why is my hybrid RNN-T - CTC model training hanging at training logging time? #10670
Unanswered
riqiang-dp asked this question in Q&A
Replies: 2 comments 5 replies
-
Hi, -Nithin
1 reply
-
Hi @riqiang-dp, could you confirm whether you are running r2.0.0 with the nvcr.io/nvidia/nemo:24.07 or nvcr.io/nvidia/nemo:dev container? If not, please try with that environment. We are not seeing this issue in our evaluation, so it would be helpful to know more about your environment and training configuration.
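As an aside, here is a quick way to capture the version details requested above from inside whichever container or environment is in use (a minimal sketch; the exact attributes assume a CUDA build of PyTorch and a NeMo release that exposes __version__):

```python
# Print the package versions most relevant to this report.
import numpy
import torch
import nemo

print("NeMo:   ", nemo.__version__)
print("PyTorch:", torch.__version__)
print("NumPy:  ", numpy.__version__)
print("CUDA:   ", torch.version.cuda)          # None if this is a CPU-only build
print("NCCL:   ", torch.cuda.nccl.version())   # tuple such as (2, 19, 3)
print("GPUs:   ", torch.cuda.device_count())
```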
4 replies
-
This looks very similar to an issue from years ago: #532. As in that issue, when I traced where the hang happens by forcing the stuck process to print a traceback, I also ended up in actions.py and backward().
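For anyone trying to reproduce this, a minimal sketch of one way to get a traceback out of a hung rank, using only the standard-library faulthandler module (POSIX signals; the signal choice and timeout are arbitrary):

```python
import faulthandler
import signal

# Dump the Python traceback of every thread when the process receives SIGUSR1,
# so a hung rank can be inspected from another shell with:  kill -USR1 <pid>
faulthandler.register(signal.SIGUSR1, all_threads=True)

# Optionally also dump tracebacks every 30 minutes regardless, which helps when
# the hang point is unknown; cancel with faulthandler.cancel_dump_traceback_later().
faulthandler.dump_traceback_later(timeout=1800, repeat=True)
```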
It also happens at "evaluation time": the hang occurs on steps where log_every_n_steps triggers, i.e., when the trainer decodes some training samples, prints out the hypotheses, and computes the loss and WER. When it hangs, GPU utilization goes to 100% but VRAM is not maxed out, and the job then times out after a few dozen minutes.
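Since the timeout suggests one DDP rank is stuck in a collective, here is a sketch of how the failing collective can be surfaced and the abort delayed, assuming PyTorch Lightning's DDPStrategy (which NeMo uses); the two-hour timeout is just an example value:

```python
import os
from datetime import timedelta

from pytorch_lightning.strategies import DDPStrategy

# Verbose NCCL logging shows which collective each rank last entered.
# Must be set before the process group is created (or exported in the shell).
os.environ["NCCL_DEBUG"] = "INFO"

# Raise the process-group timeout so the job does not abort before a
# traceback can be captured from the stuck rank.
ddp = DDPStrategy(timeout=timedelta(hours=2))
# trainer = pytorch_lightning.Trainer(strategy=ddp, accelerator="gpu", devices=-1, ...)
```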
When I change the data, the hang happens at different points during training. For example, after I removed utterances shorter than 1 s from the dataset, it hung at step 400 instead of step 200 (my log_every_n_steps = 200), so it successfully evaluated one batch but not the next one. Initially I assumed it was a data issue, but CTC-only training on the same data ran smoothly. I also noticed that RNNT+CTC training prints more samples during this evaluation/logging step: CTC-only training prints a single sample, whereas RNNT+CTC prints several. My environment: NeMo toolkit 2.0.0rc1, PyTorch 2.2.2, numpy < 2.0, multiple GPUs with DDP, and the same number of dataloader workers as GPUs.
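For context on the "removed utterances shorter than 1 s" step above, this is roughly the kind of filter involved (a sketch assuming a standard NeMo JSON-lines manifest with a duration field; the file names are placeholders, and the min_duration option of the dataset config can usually achieve the same thing):

```python
import json

# Placeholder paths; each manifest line is a JSON object with a "duration" field in seconds.
src = "train_manifest.json"
dst = "train_manifest_min1s.json"

with open(src) as fin, open(dst, "w") as fout:
    for line in fin:
        entry = json.loads(line)
        if entry["duration"] >= 1.0:  # keep only utterances of at least 1 second
            fout.write(json.dumps(entry) + "\n")
```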
I'm not sure what other information would be relevant for debugging this; please let me know.