
Metrics API when using DDP and multi-GPU freezes on compute() at end of validation phase #5930

Closed
angadkalra opened this issue Feb 11, 2021 · 31 comments
Labels
bug (Something isn't working) · help wanted (Open to be worked on) · priority: 0 (High priority task)
Milestone

Comments

@angadkalra

angadkalra commented Feb 11, 2021

🐛 Bug

I implemented an AUC metric class to calculate train/valid AUC per epoch, but my progress bar freezes at the end of the first epoch with GPUs at 100% utilization. It works with 1 GPU, but not with more. I essentially copied the source code from the ExplainedVariance metric, but it doesn't work for me in DDP with multiple GPUs. The hang happens after the return in compute(): print statements inside compute() successfully print the preds and targets variables.

I'm training ResNet101 on 2700 3D images stored as .npy files.

import torch
from pytorch_lightning.metrics import Metric
from pytorch_lightning.metrics.functional.classification import multiclass_auroc


class AUC(Metric):
    def __init__(self, dist_sync_on_step=False):
        # compute_on_step=False: only accumulate in update(), compute once per epoch
        super().__init__(compute_on_step=False, dist_sync_on_step=dist_sync_on_step)

        # accumulate predictions and targets as lists of tensors
        self.add_state("preds", default=[], dist_reduce_fx=None)
        self.add_state("targets", default=[], dist_reduce_fx=None)

    def update(self, preds: torch.Tensor, targets: torch.Tensor):
        self.preds.append(preds)
        self.targets.append(targets)

    def compute(self):
        # concatenate everything accumulated over the epoch and score it
        preds = torch.cat(self.preds)
        targets = torch.cat(self.targets)
        return multiclass_auroc(preds, targets)
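
For context, this is roughly how such a metric ends up wired into a LightningModule on the 1.1.x API; the module and variable names below are illustrative, not the reporter's actual training code:

import torch
import pytorch_lightning as pl


class LitClassifier(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.backbone = torch.nn.Linear(32, 4)  # stand-in for the real ResNet101
        self.val_auc = AUC()                    # the metric class defined above

    def validation_step(self, batch, batch_idx):
        x, y = batch
        probs = torch.softmax(self.backbone(x), dim=1)
        # with compute_on_step=False this call only accumulates state
        self.val_auc(probs, y)

    def validation_epoch_end(self, outputs):
        # every DDP rank must reach this call; if one rank skips it, the sync hangs
        self.log("val_auc", self.val_auc.compute())
        self.val_auc.reset()
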
Environment

  • PyTorch Version (e.g., 1.0): 1.7.1+cu101
  • OS (e.g., Linux): Linux
  • How you installed PyTorch (conda, pip, source): pip
  • Build command you used (if compiling from source):
  • Python version: 3.7.6
  • CUDA/cuDNN version: 10.1
  • GPU models and configuration: 4 V100 on Google Cloud VM
  • Any other relevant information: 32 cores, 128 GB memory
  • PyTorch Lightning Version: 1.1.8

Additional context

angadkalra added the bug and help wanted labels on Feb 11, 2021
@kagglesintracking

Hi, I am experiencing the same problem. My model breaks at epoch 5: it freezes and I cannot interrupt it, so I had to kill the terminal. This only happens with DDP on multiple GPUs; a single GPU works fine.

@kagglesintracking

I fixed it with pip install pytorch-lightning==1.1.1. I think it must be something to do with the current 1.1.8 version.

@angadkalra
Author

Oh wow, okay, thanks! Will try.

@angadkalra
Author

@SkafteNicki Any idea why?

@SkafteNicki
Member

So the only major change we have made since v1.1.1 (that should influence this) is removing the default reset that normally happens after compute is called: #5409. I am not sure right now how that would lead to a deadlock in DDP.
Could one of you provide me a full script so I can reproduce the error (a simple model on some dummy data)?

@angadkalra
Author

I rolled back to v1.1.1 last night and it ran quickly for 20 epochs before freezing in the middle of train epoch 21. I figured out that it's definitely happening because of the multiclass_auroc call in v1.1.8. Whenever I comment that out and only track loss, it runs smoothly and fast.

@angadkalra
Author

> So the only major change we have made since v1.1.1 (that should influence this) is removing the default reset that normally happens after compute is called: #5409. I am not sure right now how that would lead to a deadlock in DDP.
> Could one of you provide me a full script so I can reproduce the error (a simple model on some dummy data)?

I'll try

@angadkalra
Author

One other thing I noticed: when I put breakpoint() in my code and run, it works and I can step through, but without pdb it freezes...

@SkafteNicki
Member

@angadkalra just to be clear:

  • are you experiencing problems with both 1.1.1 and 1.1.8?
  • is it only multiclass_auroc or also other metrics?

@angadkalra
Author

> @angadkalra just to be clear:
>
>   • are you experiencing problems with both 1.1.1 and 1.1.8?
>   • is it only multiclass_auroc or also other metrics?

Experiencing problems in both, but a lot less in 1.1.1: training actually runs through entire epochs at a good speed. In 1.1.8, it never did more than a few epochs before freezing/hanging.

I have only tried multiclass_auroc and auroc, and both caused problems.

@tchaton
Contributor

tchaton commented Feb 15, 2021

Hey @angadkalra @SkafteNicki,

Any update on this one?

Best,
T.C

tchaton added the priority: 0 (High priority task) label on Feb 15, 2021
@kagglesintracking

kagglesintracking commented Feb 15, 2021

I used accuracy from PyTorch Lightning. I will publish my source code soon.

@kagglesintracking

Link to my code. Deadlock when using the latest PyTorch Lightning DDP on 2x RTX 3090. Rolled back to v1.1.1 and everything works fine. I used from pytorch_lightning.metrics.functional import accuracy in my code. I am not sure if my implementation is 100% correct; I am also new to PyTorch Lightning, mainly using it for DDP and batch norm sync.
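
A rough sketch of that pattern on the 1.1.x API (illustrative, not the exact code from the linked repo):

import torch
from pytorch_lightning.metrics.functional import accuracy

def validation_step(self, batch, batch_idx):
    x, y = batch
    preds = torch.argmax(self(x), dim=1)
    # functional metric computed per batch; sync_dist=True averages it across DDP ranks
    self.log("val_acc", accuracy(preds, y), sync_dist=True)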

@angadkalra
Author

> Hey @angadkalra @SkafteNicki,
>
> Any update on this one?
>
> Best,
> T.C

Currently, the Metrics API does not work for me in v1.1.1 or v1.1.8 when using multi-GPU. Using 1 GPU works fine.

@Borda
Member

Borda commented Feb 22, 2021

@angadkalra mind checking the latest 1.2.0?

@angadkalra
Author

> @angadkalra mind checking the latest 1.2.0?

@Borda I upgraded from v1.1.1 to v1.2.0 without changing any of my code, tried running in the same working state as 1.1.1, and I keep getting a CUDA out of memory error. I'm using a batch size of 2 per GPU (4 GPUs). I downgraded back to 1.1.1 and it works perfectly. Any idea?

@angadkalra
Author

@carmocca Any thoughts on comment above?

@carmocca
Contributor

> Currently, the Metrics API does not work for me in v1.1.1 or v1.1.8 when using multi-GPU

> I downgraded back to 1.1.1 and it works perfectly

Does it work or not with 1.1.1?

Are your issue and @kagglesintracking's the same? Does @kagglesintracking's snippet (https://github.com/kagglesintracking/kaggle-Cassava-Leaf-Disease-Classification/blob/main/src/main.py) also reproduce your issue?

To maximize your chances of us solving this as soon as possible, a fully working reproduction snippet would be awesome. Ideally using the BoringModel. The data can be random.

You can use this link as a template (https://colab.research.google.com/drive/1HvWVVTK8j2Nj52qU4Q4YCyzOm0_aLQF3?usp=sharing). Logically, you will not be able to reproduce the bug in Colab, as it doesn't provide multiple GPUs, but you can check that it reproduces locally and share the updated Colab snippet with us so we can try it locally too.
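
For reference, a minimal, self-contained skeleton of the kind of script being asked for could look like the following; the data is random, the metric is the AUC class from the top of this issue, and all other names are illustrative:

import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class ReproModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 4)
        self.val_auc = AUC()  # the custom metric from this issue

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        self.val_auc(torch.softmax(self(x), dim=1), y)

    def validation_epoch_end(self, outputs):
        self.log("val_auc", self.val_auc.compute())
        self.val_auc.reset()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    # random data is enough to exercise the metric sync path
    ds = TensorDataset(torch.randn(256, 32), torch.randint(0, 4, (256,)))
    loader = DataLoader(ds, batch_size=8)
    trainer = pl.Trainer(gpus=2, accelerator="ddp", max_epochs=3)
    trainer.fit(ReproModel(), loader, loader)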

carmocca added this to the 1.2.x milestone on Feb 22, 2021
@angadkalra
Author

@carmocca The Metrics API does not work with v1.1.1 or v1.1.8, but returning tensors in a dict from the step functions works. When I upgraded to v1.2, nothing works, not even returning tensors in a dict. I didn't change any code and it gives OOM in v1.2. When I downgrade to v1.1.8, it runs.
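
A sketch of the dict-returning pattern that does work here, reusing the multiclass_auroc functional from the top of the issue (names illustrative):

import torch
from pytorch_lightning.metrics.functional.classification import multiclass_auroc

def validation_step(self, batch, batch_idx):
    x, y = batch
    logits = self(x)
    loss = torch.nn.functional.cross_entropy(logits, y)
    # return plain tensors instead of going through the Metrics API
    return {"loss": loss, "preds": torch.softmax(logits, dim=1), "targets": y}

def validation_epoch_end(self, outputs):
    # each DDP rank aggregates its own outputs; sync_dist averages the per-rank value
    preds = torch.cat([o["preds"] for o in outputs])
    targets = torch.cat([o["targets"] for o in outputs])
    self.log("val_auc", multiclass_auroc(preds, targets), sync_dist=True)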

@edenlightning
Contributor

@angadkalra can you add a reproducible example?

@angadkalra
Author

angadkalra commented Feb 23, 2021

> @angadkalra can you add a reproducible example?

This is my code that I'm running locally: https://colab.research.google.com/drive/1jW559Uc4_Y7TSMvKVlhnX1vYe99XDMbD?usp=sharing

If you want to see v1.1.1 code: https://colab.research.google.com/drive/1VaZsG5O9EfdfaZcHjJWYhv8LmqpMNA7x?usp=sharing

Running the v1.1.1 colab, it runs perfectly, and I can hit run repeatedly and it will keep working. The moment you upgrade to v1.2, it breaks and gives weird errors.

EDIT:
It looks like num_classes is required for the AUROC metric in the multiclass case; otherwise an error is thrown.
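
For reference, on the 1.2-era metrics API that looks roughly like this (values illustrative):

from pytorch_lightning.metrics import AUROC

# num_classes must be given explicitly for multiclass inputs,
# otherwise the metric raises an error
val_auroc = AUROC(num_classes=4)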

@SkafteNicki
Member

@angadkalra I tried running your colab notebook with v1.2 and it seems to work fine for me.
I also copied the code to a local script and ran it in a multi-GPU setting; it also completed using v1.2.
Am I missing something?
Does it only happen after some number of epochs?

@angadkalra
Author

@SkafteNicki I think we're good! I got the AUROC metric working locally on multiple GPUs, but it seems like v1.2 uses more GPU memory; I had to reduce my image size. Any idea?

@SkafteNicki
Member

@angadkalra from your notebook it seems you are using 16-bit precision for training, right?
I know there was a problem with it not working as it should, but it should have been fixed by PR #6080, which is included in v1.2.1 (released a few hours ago). Could you try upgrading one more time to see if this fixes your last problem?
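
For reference, the setup being discussed amounts to a Trainer configuration along these lines (values illustrative):

import pytorch_lightning as pl

# 4-GPU DDP with native 16-bit precision; the precision fix referenced above is in v1.2.1
trainer = pl.Trainer(gpus=4, accelerator="ddp", precision=16, max_epochs=50)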

@angadkalra
Author

> @angadkalra from your notebook it seems you are using 16-bit precision for training, right?
> I know there was a problem with it not working as it should, but it should have been fixed by PR #6080, which is included in v1.2.1 (released a few hours ago). Could you try upgrading one more time to see if this fixes your last problem?

Yup you're right, fixed! Thanks so much everyone!

@adipraja

Hi guys, I'd like to re-raise the issue if possible. I have the same problem with the same setup as @kagglesintracking: deadlock when using the latest PyTorch Lightning DDP on 2x RTX 3090.

I'm using pytorch-lightning==1.3.2 and torchmetrics==0.5.0

@DonkeyShot21

DonkeyShot21 commented Sep 17, 2021

I am also facing the same issue with pytorch-lightning 1.3.8 and torchmetrics 0.4.0. GPUs are stuck at 100% utilization.

EDIT: I am using two Quadro RTX 5000 GPUs.

@carmocca
Contributor

Try the latest versions, and if the problem persists, open a new issue.

@DonkeyShot21

I made it work with pytorch-lightning 1.4+ and torchmetrics 0.5+.
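
For anyone hitting this on the newer stack, the custom metric at the top of this issue maps onto torchmetrics roughly as follows (a sketch assuming torchmetrics 0.5.x; releases from 1.0 onward also require a task argument in the functional auroc):

import torch
from torchmetrics import Metric
from torchmetrics.functional import auroc


class AUC(Metric):
    def __init__(self, dist_sync_on_step=False):
        super().__init__(dist_sync_on_step=dist_sync_on_step)
        # "cat" concatenates the per-process lists when states are synced
        self.add_state("preds", default=[], dist_reduce_fx="cat")
        self.add_state("targets", default=[], dist_reduce_fx="cat")

    def update(self, preds: torch.Tensor, targets: torch.Tensor):
        self.preds.append(preds)
        self.targets.append(targets)

    def compute(self):
        # after syncing, list states may already be concatenated tensors
        preds = torch.cat(self.preds) if isinstance(self.preds, list) else self.preds
        targets = torch.cat(self.targets) if isinstance(self.targets, list) else self.targets
        return auroc(preds, targets, num_classes=preds.shape[1])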

@manavkulshrestha

manavkulshrestha commented Aug 23, 2024

I'm having this exact same issue with pytorch-lightning 2.4.0 and torchmetrics 1.4.1. No resolution?

@Wojtechnology

Also having this issue with lightning 2.2.1 and torchmetrics 1.3.1.
