Metrics API when using DDP and multi-GPU freezes on compute() at end of validation phase #5930
Hi, I am experiencing the same problem. My model breaks at epoch 5. It freezes and I cannot interrupt; I had to kill the terminal. This only happens when using DDP with multiple GPUs; a single GPU works fine.
I fixed it by …
Oh wow, okay, thanks! Will try.
@SkafteNicki Any idea why?
So the only major change we have made since v1.1.1 (that should influence this) is removing the default …
I rolled back last night to v1.1.1 and it ran fast for 20 epochs before freezing in the middle of train epoch 21. I figured out that it's definitely happening because of the multiclass_auroc call in v1.1.8. Whenever I comment that out and only track loss, it runs smoothly and fast.
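For context, the kind of usage being described looks roughly like the sketch below: accumulate predictions and targets per batch, then call the functional metric once per epoch. This is a minimal illustration, not the actual code from this thread; the import path and signature of multiclass_auroc shown here are from the ~1.1.x metrics package (it has since moved to torchmetrics), so treat them as assumptions.

```python
import torch
import torch.nn.functional as F
import pytorch_lightning as pl
# PL ~1.1.x location; in current releases the functional metrics live in torchmetrics.
from pytorch_lightning.metrics.functional.classification import multiclass_auroc


class LitClassifier(pl.LightningModule):
    """Hypothetical module showing the epoch-level AUROC pattern under discussion."""

    def __init__(self, num_classes: int = 5):
        super().__init__()
        self.num_classes = num_classes
        self.backbone = torch.nn.Linear(32, num_classes)  # stand-in for the real network

    def forward(self, x):
        return self.backbone(x)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        self.log("val_loss", F.cross_entropy(logits, y))
        # Keep per-batch outputs so AUROC can be computed over the whole epoch.
        return {"preds": torch.softmax(logits, dim=1), "targets": y}

    def validation_epoch_end(self, outputs):
        preds = torch.cat([o["preds"] for o in outputs])
        targets = torch.cat([o["targets"] for o in outputs])
        # Note: under DDP each rank only sees its own shard of the validation data here.
        self.log("val_auroc", multiclass_auroc(preds, targets, num_classes=self.num_classes))
```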
I'll try …
One other thing I noticed: when I put breakpoint() in my code and run, it works and I can step through, but without pdb it freezes...
@angadkalra just to be clear: …
Experiencing problems in both, but A LOT less in 1.1.1. Things actually run for entire epochs and at a good speed. In 1.1.8, it never did more than a few epochs before freezing/hanging. I tried only multiclass_auroc and auroc, and it caused problems.
Hey @angadkalra @SkafteNicki, any update on this one? Best,
I used accuracy from PyTorch Lightning. I will publish my source code soon.
Link to my code. Deadlock when using the latest PyTorch Lightning DDP on RTX3090 x 2. Rolled back to v1.1.1 and everything works fine. I used …
Currently, the Metrics API does not work for me in v1.1.1 or v1.1.8 when using multi-GPU. Using 1 GPU works fine.
@angadkalra mind checking the latest 1.2.0?
@Borda I upgraded from v1.1.1 to v1.2.0, didn't change any of my code, tried running in the same working state as 1.1.1, and I keep getting a CUDA out of memory error. I'm using a batch size of 2 per GPU (4 GPUs). I downgraded back to 1.1.1 and it works perfectly. Any idea?
@carmocca Any thoughts on the comment above?
Does it work or not with 1.1.1? Is your issue the same as @kagglesintracking's? Does @kagglesintracking's snippet (https://github.com/kagglesintracking/kaggle-Cassava-Leaf-Disease-Classification/blob/main/src/main.py) also reproduce your issue? To maximize your chances of us solving this as soon as possible, a fully working reproduction snippet would be awesome, ideally using the … You can use this link as a template (https://colab.research.google.com/drive/1HvWVVTK8j2Nj52qU4Q4YCyzOm0_aLQF3?usp=sharing). Naturally, you will not be able to reproduce the bug in colab as they don't provide multiple GPUs, but you can check that it reproduces locally and share the updated colab snippet with us so we can try it locally too.
@carmocca The Metrics API does not work with v1.1.1 or v1.1.8, but returning tensors in a dict from the step function works. When I upgraded to v1.2, nothing works, not even returning tensors in a dict. I didn't change any code and it gives OOM in v1.2. When I downgrade to v1.1.8, it runs.
@angadkalra can you add a reproducible example?
This is my code that I'm running locally: https://colab.research.google.com/drive/1jW559Uc4_Y7TSMvKVlhnX1vYe99XDMbD?usp=sharing If you want to see the v1.1.1 code: https://colab.research.google.com/drive/1VaZsG5O9EfdfaZcHjJWYhv8LmqpMNA7x?usp=sharing Running the v1.1.1 colab, it runs perfectly and I can hit run repeatedly and it'll work. The moment you upgrade to v1.2, it breaks and gives weird errors. EDIT: …
@angadkalra I tried running your colab notebook with v1.2 and that seems to work fine for me.
@SkafteNicki I think we're good! I got the AUROC metric working locally on multiple GPUs, but it seems like v1.2 uses more GPU memory. I had to reduce my image size. Any idea?
@angadkalra from your notebook it seems you are using 16-bit precision for training, right?
Yup you're right, fixed! Thanks so much everyone!
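(For anyone following along: in PL of that era, 16-bit training is requested through the Trainer, roughly as below. A minimal sketch; the gpus/accelerator values are placeholders, not the settings from the notebook above.)

```python
from pytorch_lightning import Trainer

# Hypothetical flags; `precision=16` enables mixed-precision training in PL ~1.1/1.2.
trainer = Trainer(gpus=4, accelerator="ddp", precision=16)
```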
Hi guys, I'd like to re-raise the issue if possible. I have the same problem with the same setting as @kagglesintracking: I'm using …
I am also facing the same issue with … EDIT: I am using two …
Try the latest versions, and if the problem persists, open a new issue.
I made it work with …
I'm having this exact same issue with pytorch-lightning 2.4.0 and torchmetrics 1.4.1. No resolution? |
Also having this issue with lightning 2.2.1 and torchmetrics 1.3.1. |
🐛 Bug
Implemented an AUC metric class to calculate train/valid AUC per epoch, but my progress bar freezes at the end of the first epoch with GPUs at 100%. It works with 1 GPU, but not with more. I basically copied the source code from the ExplainedVariance metric, but it doesn't work in DDP with multiple GPUs for me. The bug happens after the return in compute(), because print statements in compute() successfully print the preds and targets variables.
I'm training ResNet101 on 2700 3D images stored as .npy files.
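For reference, a metric class built the way the report describes (list states, modelled on ExplainedVariance) looks roughly like the sketch below. This is a minimal illustration under the ~1.1/1.2 Metrics API (now torchmetrics), not the reporter's actual code; the class name, num_classes argument, and the use of the functional multiclass_auroc inside compute() are assumptions.

```python
import torch
from pytorch_lightning.metrics import Metric  # `from torchmetrics import Metric` in current releases
from pytorch_lightning.metrics.functional.classification import multiclass_auroc


class EpochAUROC(Metric):
    """Hypothetical AUC metric with list states, following the ExplainedVariance pattern."""

    def __init__(self, num_classes: int, dist_sync_on_step: bool = False):
        super().__init__(dist_sync_on_step=dist_sync_on_step)
        self.num_classes = num_classes
        # List states are concatenated across processes when the metric synchronises.
        self.add_state("preds", default=[], dist_reduce_fx="cat")
        self.add_state("target", default=[], dist_reduce_fx="cat")

    def update(self, preds: torch.Tensor, target: torch.Tensor):
        self.preds.append(preds)
        self.target.append(target)

    def compute(self):
        # After the distributed sync the list states may already be single tensors.
        preds = torch.cat(self.preds) if isinstance(self.preds, list) else self.preds
        target = torch.cat(self.target) if isinstance(self.target, list) else self.target
        return multiclass_auroc(preds, target, num_classes=self.num_classes)
```

DDP hangs of this kind usually mean that not every rank reaches the same collective call (for example, the synchronisation triggered around compute()) at the same point, which is consistent with the freeze at the end of validation described above.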
Additional context