
Metrics API when using DDP and multi-GPU freezes on compute() at end of validation phase #5930

Closed
angadkalra opened this issue Feb 11, 2021 · 31 comments
Labels
bug (Something isn't working) · help wanted (Open to be worked on) · priority: 0 (High priority task)
Milestone

Comments

@angadkalra

angadkalra commented Feb 11, 2021

🐛 Bug

I implemented an AUC metric class to calculate train/valid AUC per epoch, but my progress bar freezes at the end of the first epoch with GPUs at 100% utilization. It works with 1 GPU, but not with more. I essentially copied the source code from the ExplainedVariance metric, but it doesn't work for me in DDP with multiple GPUs. The hang happens after the return in compute(): print statements inside compute() successfully print the preds and targets variables.

I'm training ResNet101 on 2700 3D images stored as .npy files.

import torch
from pytorch_lightning.metrics import Metric
from pytorch_lightning.metrics.functional.classification import multiclass_auroc


class AUC(Metric):
    def __init__(self, dist_sync_on_step=False):
        # compute_on_step=False: only accumulate in update(), compute once per epoch
        super().__init__(compute_on_step=False, dist_sync_on_step=dist_sync_on_step)

        # accumulate predictions and targets as lists of tensors
        self.add_state("preds", default=[], dist_reduce_fx=None)
        self.add_state("targets", default=[], dist_reduce_fx=None)

    def update(self, preds: torch.Tensor, targets: torch.Tensor):
        self.preds.append(preds)
        self.targets.append(targets)

    def compute(self):
        # concatenate everything accumulated over the epoch and score it
        preds = torch.cat(self.preds)
        targets = torch.cat(self.targets)
        return multiclass_auroc(preds, targets)
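
For context, this is roughly how such a metric ends up wired into a LightningModule on the 1.1.x API; the module and variable names below are illustrative, not the reporter's actual training code:

import torch
import pytorch_lightning as pl


class LitClassifier(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.backbone = torch.nn.Linear(32, 4)  # stand-in for the real ResNet101
        self.val_auc = AUC()                    # the metric class defined above

    def validation_step(self, batch, batch_idx):
        x, y = batch
        probs = torch.softmax(self.backbone(x), dim=1)
        # with compute_on_step=False this call only accumulates state
        self.val_auc(probs, y)

    def validation_epoch_end(self, outputs):
        # every DDP rank must reach this call; if one rank skips it, the sync hangs
        self.log("val_auc", self.val_auc.compute())
        self.val_auc.reset()
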
Environment

  • PyTorch Version (e.g., 1.0): 1.7.1+cu101
  • OS (e.g., Linux): Linux
  • How you installed PyTorch (conda, pip, source): pip
  • Build command you used (if compiling from source):
  • Python version: 3.7.6
  • CUDA/cuDNN version: 10.1
  • GPU models and configuration: 4 V100 on Google Cloud VM
  • Any other relevant information: 32 cores, 128 GB memory
  • PyTorch Lightning Version: 1.1.8

Additional context

angadkalra added the bug and help wanted labels on Feb 11, 2021
@kagglesintracking

Hi, I am experiencing the same problem. My model breaks at epoch 5: it freezes and I cannot interrupt it, so I had to kill the terminal. This only happens with DDP on multiple GPUs; a single GPU works fine.

@kagglesintracking

I fixed it with pip install pytorch-lightning==1.1.1. I think it must be something to do with the current 1.1.8 version.

@angadkalra
Author

Oh wow, okay, thanks! Will try.

@angadkalra
Author

@SkafteNicki Any idea why?

@SkafteNicki
Member

So the only major change we have made since v1.1.1 (that should influence this) is removing the default reset that normally happens after compute is called: #5409. I am not sure right now how that would lead to a deadlock in DDP.
Could one of you provide me a full script so I can reproduce the error (a simple model on some dummy data)?

@angadkalra
Author

I rolled back to v1.1.1 last night and it ran quickly for 20 epochs before freezing in the middle of train epoch 21. I figured out that it's definitely happening because of the multiclass_auroc call in v1.1.8. Whenever I comment that out and only track loss, it runs smoothly and fast.

@angadkalra
Author

> So the only major change we have made since v1.1.1 (that should influence this) is removing the default reset that normally happens after compute is called: #5409. I am not sure right now how that would lead to a deadlock in DDP.
> Could one of you provide me a full script so I can reproduce the error (a simple model on some dummy data)?

I'll try

@angadkalra
Author

One other thing I noticed: when I put breakpoint() in my code and run, it works and I can step through, but without pdb it freezes...

@SkafteNicki
Member

@angadkalra just to be clear:

  • are you experiencing problems with both 1.1.1 and 1.1.8?
  • is it only multiclass_auroc or also other metrics?

@angadkalra
Author

> @angadkalra just to be clear:
>
>   • are you experiencing problems with both 1.1.1 and 1.1.8?
>   • is it only multiclass_auroc or also other metrics?

Experiencing problems in both, but a lot less in 1.1.1: training actually runs through entire epochs at a good speed. In 1.1.8, it never did more than a few epochs before freezing/hanging.

I have only tried multiclass_auroc and auroc, and both caused problems.

@tchaton
Contributor

tchaton commented Feb 15, 2021

Hey @angadkalra @SkafteNicki,

Any update on this one?

Best,
T.C

tchaton added the priority: 0 (High priority task) label on Feb 15, 2021
@kagglesintracking

kagglesintracking commented Feb 15, 2021

I used accuracy from PyTorch Lightning. I will publish my source code soon.

@kagglesintracking

Link to my code. Deadlock when using the latest PyTorch Lightning DDP on 2x RTX 3090. Rolled back to v1.1.1 and everything works fine. I used from pytorch_lightning.metrics.functional import accuracy in my code. I am not sure if my implementation is 100% correct; I am also new to PyTorch Lightning, mainly using it for DDP and batch norm sync.
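
A rough sketch of that pattern on the 1.1.x API (illustrative, not the exact code from the linked repo):

import torch
from pytorch_lightning.metrics.functional import accuracy

def validation_step(self, batch, batch_idx):
    x, y = batch
    preds = torch.argmax(self(x), dim=1)
    # functional metric computed per batch; sync_dist=True averages it across DDP ranks
    self.log("val_acc", accuracy(preds, y), sync_dist=True)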

@angadkalra
Author

> Hey @angadkalra @SkafteNicki,
>
> Any update on this one?
>
> Best,
> T.C

Currently, the Metrics API does not work for me in v1.1.1 or v1.1.8 when using multi-GPU. Using 1 GPU works fine.

@Borda
Member

Borda commented Feb 22, 2021

@angadkalra mind checking the latest 1.2.0?

@angadkalra
Author

> @angadkalra mind checking the latest 1.2.0?

@Borda I upgraded from v1.1.1 to v1.2.0 without changing any of my code, tried running in the same working state as 1.1.1, and I keep getting a CUDA out of memory error. I'm using a batch size of 2 per GPU (4 GPUs). I downgraded back to 1.1.1 and it works perfectly. Any idea?

@angadkalra
Author

@carmocca Any thoughts on comment above?

@carmocca
Contributor

> Currently, the Metrics API does not work for me in v1.1.1 or v1.1.8 when using multi-GPU

> I downgraded back to 1.1.1 and it works perfectly

Does it work or not with 1.1.1?

Are your issue and @kagglesintracking's the same? Does @kagglesintracking's snippet (https://github.com/kagglesintracking/kaggle-Cassava-Leaf-Disease-Classification/blob/main/src/main.py) also reproduce your issue?

To maximize your chances of us solving this as soon as possible, a fully working reproduction snippet would be awesome. Ideally using the BoringModel. The data can be random.

You can use this link as a template (https://colab.research.google.com/drive/1HvWVVTK8j2Nj52qU4Q4YCyzOm0_aLQF3?usp=sharing). Logically, you will not be able to reproduce the bug in Colab, as it doesn't provide multiple GPUs, but you can check that it reproduces locally and share the updated Colab snippet with us so we can try it locally too.
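
For reference, a minimal, self-contained skeleton of the kind of script being asked for could look like the following; the data is random, the metric is the AUC class from the top of this issue, and all other names are illustrative:

import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class ReproModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 4)
        self.val_auc = AUC()  # the custom metric from this issue

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        self.val_auc(torch.softmax(self(x), dim=1), y)

    def validation_epoch_end(self, outputs):
        self.log("val_auc", self.val_auc.compute())
        self.val_auc.reset()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    # random data is enough to exercise the metric sync path
    ds = TensorDataset(torch.randn(256, 32), torch.randint(0, 4, (256,)))
    loader = DataLoader(ds, batch_size=8)
    trainer = pl.Trainer(gpus=2, accelerator="ddp", max_epochs=3)
    trainer.fit(ReproModel(), loader, loader)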

carmocca added this to the 1.2.x milestone on Feb 22, 2021
@angadkalra
Author

@carmocca The Metrics API does not work with v1.1.1 or v1.1.8, but returning tensors in a dict from the step functions works. When I upgraded to v1.2, nothing works, not even returning tensors in a dict. I didn't change any code and it gives OOM in v1.2. When I downgrade to v1.1.8, it runs.
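
A sketch of the dict-returning pattern that does work here, reusing the multiclass_auroc functional from the top of the issue (names illustrative):

import torch
from pytorch_lightning.metrics.functional.classification import multiclass_auroc

def validation_step(self, batch, batch_idx):
    x, y = batch
    logits = self(x)
    loss = torch.nn.functional.cross_entropy(logits, y)
    # return plain tensors instead of going through the Metrics API
    return {"loss": loss, "preds": torch.softmax(logits, dim=1), "targets": y}

def validation_epoch_end(self, outputs):
    # each DDP rank aggregates its own outputs; sync_dist averages the per-rank value
    preds = torch.cat([o["preds"] for o in outputs])
    targets = torch.cat([o["targets"] for o in outputs])
    self.log("val_auc", multiclass_auroc(preds, targets), sync_dist=True)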

@edenlightning
Contributor

@angadkalra can you add a reproducible example?

@angadkalra
Author

angadkalra commented Feb 23, 2021

> @angadkalra can you add a reproducible example?

This is my code that I'm running locally: https://colab.research.google.com/drive/1jW559Uc4_Y7TSMvKVlhnX1vYe99XDMbD?usp=sharing

If you want to see v1.1.1 code: https://colab.research.google.com/drive/1VaZsG5O9EfdfaZcHjJWYhv8LmqpMNA7x?usp=sharing

Running the v1.1.1 colab, it runs perfectly, and I can hit run repeatedly and it will keep working. The moment you upgrade to v1.2, it breaks and gives weird errors.

EDIT:
It looks like num_classes is required for the AUROC metric in the multiclass case; otherwise an error is thrown.
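
For reference, on the 1.2-era metrics API that looks roughly like this (values illustrative):

from pytorch_lightning.metrics import AUROC

# num_classes must be given explicitly for multiclass inputs,
# otherwise the metric raises an error
val_auroc = AUROC(num_classes=4)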

@SkafteNicki
Member

@angadkalra I tried running your colab notebook with v1.2 and it seems to work fine for me.
I also copied the code to a local script and ran it in a multi-GPU setting; it also completed using v1.2.
Am I missing something?
Does it only happen after some number of epochs?

@angadkalra
Author

@SkafteNicki I think we're good! I got the AUROC metric working locally on multiple GPUs, but it seems like v1.2 uses more GPU memory; I had to reduce my image size. Any idea?

@SkafteNicki
Member

@angadkalra from your notebook it seems you are using 16-bit precision for training, right?
I know there was a problem with it not working as it should, but it should have been fixed by PR #6080, which is included in v1.2.1 (released a few hours ago). Could you try upgrading one more time to see if this fixes your last problem?
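
For reference, the setup being discussed amounts to a Trainer configuration along these lines (values illustrative):

import pytorch_lightning as pl

# 4-GPU DDP with native 16-bit precision; the precision fix referenced above is in v1.2.1
trainer = pl.Trainer(gpus=4, accelerator="ddp", precision=16, max_epochs=50)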

@angadkalra
Author

> @angadkalra from your notebook it seems you are using 16-bit precision for training, right?
> I know there was a problem with it not working as it should, but it should have been fixed by PR #6080, which is included in v1.2.1 (released a few hours ago). Could you try upgrading one more time to see if this fixes your last problem?

Yup you're right, fixed! Thanks so much everyone!

@adipraja

Hi guys, I'd like to re-raise the issue if possible. I have the same problem with the same setup as @kagglesintracking: deadlock when using the latest PyTorch Lightning DDP on 2x RTX 3090.

I'm using pytorch-lightning==1.3.2 and torchmetrics==0.5.0

@DonkeyShot21

DonkeyShot21 commented Sep 17, 2021

I am also facing the same issue with pytorch-lightning 1.3.8 and torchmetrics 0.4.0. GPUs are stuck at 100% utilization.

EDIT: I am using two Quadro RTX 5000 GPUs.

@carmocca
Contributor

Try the latest versions, and if the problem persists, open a new issue.

@DonkeyShot21

I made it work with pytorch-lightning 1.4+ and torchmetrics 0.5+.
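
For anyone hitting this on the newer stack, the custom metric at the top of this issue maps onto torchmetrics roughly as follows (a sketch assuming torchmetrics 0.5.x; releases from 1.0 onward also require a task argument in the functional auroc):

import torch
from torchmetrics import Metric
from torchmetrics.functional import auroc


class AUC(Metric):
    def __init__(self, dist_sync_on_step=False):
        super().__init__(dist_sync_on_step=dist_sync_on_step)
        # "cat" concatenates the per-process lists when states are synced
        self.add_state("preds", default=[], dist_reduce_fx="cat")
        self.add_state("targets", default=[], dist_reduce_fx="cat")

    def update(self, preds: torch.Tensor, targets: torch.Tensor):
        self.preds.append(preds)
        self.targets.append(targets)

    def compute(self):
        # after syncing, list states may already be concatenated tensors
        preds = torch.cat(self.preds) if isinstance(self.preds, list) else self.preds
        targets = torch.cat(self.targets) if isinstance(self.targets, list) else self.targets
        return auroc(preds, targets, num_classes=preds.shape[1])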

@manavkulshrestha

manavkulshrestha commented Aug 23, 2024

I'm having this exact same issue with pytorch-lightning 2.4.0 and torchmetrics 1.4.1. No resolution?

@Wojtechnology

Also having this issue with lightning 2.2.1 and torchmetrics 1.3.1.
