
Training stuck at 0% after few epochs while training with DDP #5865

Closed
HareshKarnan opened this issue Feb 7, 2021 · 45 comments
Labels: bug (Something isn't working), distributed (Generic distributed-related topic), help wanted (Open to be worked on), priority: 0 (High priority task)

Comments

@HareshKarnan

🐛 Bug

I recently updated to pytorch_lightning 1.1.7 and noticed that after a few epochs of training, the training % is stuck at 0% and never progresses. When I switched back to 1.1.4, this strange behavior did not occur. I do not know the root cause of this issue.

  • PyTorch Lightning version: 1.1.7
  • OS (e.g., Linux): Linux (Ubuntu 18)
  • How you installed PyTorch (conda, pip, source): pip install
  • Build command you used (if compiling from source):
  • Python version: 3.6
  • CUDA/cuDNN version: 10.0
  • GPU models and configuration: RTX 2080 x3
  • Any other relevant information:
HareshKarnan added the bug and help wanted labels on Feb 7, 2021
@matyushinleonid

Hi, @HareshKarnan @ndrplz. Folks, may I ask you to run your pipelines with NCCL_ASYNC_ERROR_HANDLING=1?
It seems we have the same problem. I am using Lightning version 1.2.0.dev0, so this issue may carry over to the new release... My error trace:

https://pastebin.com/LNxG2JF6

@HareshKarnan
Author

@ndrplz I don't know if it is related, but the problem here is that training does happen for the first few epochs - in my case, it ran for 13 epochs and then got stuck at 0% by epoch 14.

@HareshKarnan
Author

https://pastebin.com/QhanSNrK

Here is my output. It gets stuck at 0% while training in epoch 16

@matyushinleonid

https://pastebin.com/QhanSNrK

Here is my output. It gets stuck at 0% while training in epoch 16

Run your script with the NCCL_ASYNC_ERROR_HANDLING=1 flag, like NCCL_ASYNC_ERROR_HANDLING=1 python train.py. Without this flag, I do not get any error like yours since, by default, this async error does not raise an exception (at least for me). If we have the same error traces, it will help the community fix the bug.

@HareshKarnan
Author

HareshKarnan commented Feb 9, 2021

I ran the script again as NCCL_ASYNC_ERROR_HANDLING=1 python train.py and got no difference in the output:

https://pastebin.com/jtLyuXTN

This time it got stuck at epoch 11

@genghisun

same problem here, stuck at 0% at epoch 18

@Borda
Member

Borda commented Feb 9, 2021

Mind sharing code, ideally in Colab, to reproduce?

Borda added the priority: 1 label on Feb 9, 2021
@genghisun

In my case, it got stuck at 0% at epoch 18 with 2-GPU DDP.
Then I tried using only 1 GPU, and it has currently trained for 100+ epochs without any problem.

edenlightning added the distributed and priority: 0 labels and removed the priority: 1 label on Feb 9, 2021
@stillwalker1234

stillwalker1234 commented Feb 15, 2021

I have the same issue after updating to 1.1.8; I will try 1.2.0.dev0 to see if it has the same error.

PyTorch 1.7
2x 3090
CUDA 11.2

@stillwalker1234

1.2.0rc1 also has the issue, 1.1.6 does not

@SeanNaren
Contributor

Would you be able to try master? We've recently consolidated the branches back to master!

@tchaton
Contributor

tchaton commented Feb 16, 2021

Hey everyone,

Could it be related to this issue and solved by this PR: #6004?

@HareshKarnan,

I have seen val_loss in your logs. I think it might be related. Mind trying the fix?

Best,
T.C

@stillwalker1234

stillwalker1234 commented Feb 16, 2021

Started a run with master + @tchaton's patch, will see how it goes.

UPDATE:

run stalled at epoch 6 :(

edenlightning assigned tchaton and unassigned SeanNaren on Feb 16, 2021
@Borda
Member

Borda commented Feb 18, 2021

@tchaton any update here?

@talolard

Also having this problem.
It always gets stuck on epoch 2 if I have checkpointing enabled when training with DDP on a single machine.

@talolard

Wanted to add some details.
We ran some different, copy-pasted code and the problem recurred. We had the checkpoint set up to monitor train_loss, and once we took that out everything worked fine.
In other words, I suspect this has something to do with monitor under DDP.

@edenlightning
Contributor

Thanks! Will take a look and try to resolve it soon.

@edenlightning
Contributor

edenlightning commented Mar 1, 2021

@HareshKarnan, @talolard, @stillwalker1234 or @genghisun can any of you please provide a reproducible script/colab?

Do you checkpoint based on a value that is not coming from lightning metrics and can be different on different processes? Probably related to #5604 (comment).

@JonasFrey96

I faced the same issue, and the problem was resolved by changing the number of workers to 0. This is not an acceptable workaround. I assume we have a deadlock when something goes wrong with spawning dataloaders.

@tchaton
Contributor

tchaton commented Mar 3, 2021

Dear @JonasFrey96, @taltalim, @HareshKarnan,

Would it be possible for you to work on a reproducible script using the BoringModel?

Best,
T.C

@jgbos
Contributor

jgbos commented Mar 4, 2021

I'm having the same issue, and it goes away when using the default ModelCheckpoint settings. It seems to be a problem with setting monitor and save_top_k. I have not been able to reproduce the issue with the BoringModel.

edit: just wanted to add that I am logging the loss as self.log("Val/Loss", loss), with monitor="Val/Loss" and save_top_k=1
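
For reference, a minimal sketch of the setup described in this comment (module and field names are assumptions, not the poster's actual code): a per-rank validation loss is logged and used as the checkpoint monitor under DDP, which later comments in this thread point to as the trigger for the hang.

    import torch
    import torch.nn.functional as F
    import pytorch_lightning as pl
    from pytorch_lightning.callbacks import ModelCheckpoint

    class LitModule(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(32, 1)

        def forward(self, x):
            return self.layer(x)

        def training_step(self, batch, batch_idx):
            x, y = batch
            return F.mse_loss(self(x), y)

        def validation_step(self, batch, batch_idx):
            x, y = batch
            loss = F.mse_loss(self(x), y)
            # Logged per process: without syncing, each rank can see a
            # different "Val/Loss", so each rank's ModelCheckpoint may
            # disagree on whether (and when) to save.
            self.log("Val/Loss", loss)

        def configure_optimizers(self):
            return torch.optim.Adam(self.parameters(), lr=1e-3)

    checkpoint_cb = ModelCheckpoint(monitor="Val/Loss", save_top_k=1)
    trainer = pl.Trainer(gpus=2, accelerator="ddp", callbacks=[checkpoint_cb])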

@Anjum48

Anjum48 commented Mar 5, 2021

I've been running into this issue with 1.1.8 & 1.2.1 over the last few days with DDP and 2 GPUs.

Things tried:

  • Upgrade from PyTorch 1.7 to 1.8
  • Upgrade drivers from 450 to 460
  • Upgrade from CUDA 10.2 to 11.2
  • Upgrade NCCL to latest
  • Rebuild entire conda env
  • Remove all loggers (in case of wandb multithreading)
  • os.environ["WANDB_START_METHOD"] = "fork"
  • cv2.setNumThreads(0)
  • DDPPlugin(find_unused_parameters=True)
  • sync batch norm True/False
  • Reduced num_workers (zero is too slow, but went down to 2)
  • Roll back the Linux kernel in Ubuntu (I think I saw a meme about this the other day so I thought I'd give it a try haha)
  • Nothing unusual in dmesg

The epoch seems to be random (it might stop after the first, or after several), but it is always just after validation, at the start of the next epoch (0%). No error messages. Both GPUs are locked at "100%", but the data being sent to GPU 0 (RX in nvtop) is 0 MB/s and the temperatures show that the GPUs are not working hard. 2 CPU cores are locked at 100%.
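
For reference, a minimal sketch (assuming the PL 1.2-era API; not the poster's actual code) of how a few of the mitigations listed above are typically wired up; per the comment, none of them resolved the hang.

    import os
    import cv2
    import pytorch_lightning as pl
    from pytorch_lightning.plugins import DDPPlugin

    # Force wandb to fork instead of its default start method (one of the workarounds tried above).
    os.environ["WANDB_START_METHOD"] = "fork"
    # Stop OpenCV from spawning its own worker threads inside DataLoader workers.
    cv2.setNumThreads(0)

    trainer = pl.Trainer(
        gpus=2,
        accelerator="ddp",
        sync_batchnorm=True,
        plugins=[DDPPlugin(find_unused_parameters=True)],
    )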

@Anjum48

Anjum48 commented Mar 6, 2021

I think I've isolated the issue from the discussion in #5604 (comment).

This issue started when I switched the ModelCheckpoint monitor from AUC (using PL metrics and dist_sync_on_step=True) to val_loss. val_loss was not being synced between the 2 GPUs, so adding this line:

val_loss = torch.mean(self.all_gather(val_loss))

in validation_epoch_end fixed the issue for me. The model has been training successfully overnight and is still running :)
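
For context, a minimal sketch of where that line goes (assuming validation_step returns a dict containing "val_loss"; this is not the poster's full code, which appears further down):

    def validation_epoch_end(self, outputs):
        val_loss = torch.stack([x["val_loss"] for x in outputs]).mean()
        # all_gather returns a [num_gpus, ...] tensor; averaging it gives every
        # rank the same scalar, so ModelCheckpoint sees a consistent monitor value.
        val_loss = torch.mean(self.all_gather(val_loss))
        self.log("val_loss", val_loss)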

@ifsheldon
Contributor

I see the same issue when running my training script. My PL version is 1.2.1. I am still trying out which of the approaches mentioned above works in my case.

@Anjum48

Anjum48 commented Mar 6, 2021

Edit: response to a deleted question ¯\_(ツ)_/¯

self.all_gather will return a tensor of shape [num_gpus, x] where x is the result from a single GPU. If you're only syncing a scalar, then you can just take a mean (or min/max/std) like in my example.

If, for example, x is not a scalar (e.g. you want to calculate IoU or something), you can combine the results using something like:

def gather_and_squash(self, x):
    # The reshape goes from [n_gpus, bs, N, N] to [n_gpus * bs, N, N]
    return torch.reshape(self.all_gather(x), [-1] + list(x.shape)[1:])

At the moment I'm working on a joint segmentation & classification problem and using valid_clf as my monitor. Here's what my validation_epoch_end looks like:

def validation_epoch_end(self, outputs):
    loss_val = torch.stack([x["val_loss"] for x in outputs]).mean()
    loss_seg = torch.stack([x["loss_seg"] for x in outputs]).mean()
    loss_clf = torch.stack([x["loss_clf"] for x in outputs]).mean()
    
    log = {
        "loss/valid_seg": torch.mean(self.all_gather(loss_seg)),
        "loss/valid_clf": torch.mean(self.all_gather(loss_clf)),
    }
           
    # AUC calculation etc...
    y_true = torch.cat([x["y_true"] for x in outputs])
    y_pred = torch.cat([x["y_pred"] for x in outputs])
    # etc ...

    self.log_dict(log)
    self.log_dict(
        {"loss/valid": loss_val, "auc/overall": log["metric"]}, prog_bar=True,
    )

@jgbos
Contributor

jgbos commented Mar 8, 2021

I'm also noticing that there are logger files (tfevents) for each process now. I wonder if #6364 and this issue are related.

edenlightning removed the waiting on author label on Mar 8, 2021
@taltalim

taltalim commented Mar 8, 2021

At the moment I can't seem to reproduce it using the BoringModel. I will look into that in the next few days.

@senarvi
Contributor

senarvi commented Mar 11, 2021

I encountered a similar issue, training hanging at the end of a validation epoch, when a custom metric is being synced between processes. I used cat as dist_reduce_fx. One would think that in the tensors produced by different processes, the first dimension - along which the tensors are going to be concatenated - doesn't have to match. However, gather_all_tensors uses torch.distributed.all_gather to gather the tensors from different processes, and all_gather requires the tensors to be correctly sized. gather_all_tensors assumes that all processes create identically-sized tensors. If the sizes don't match, syncing will hang indefinitely.

This is probably not the issue for most people, but I just wanted to point out that this can cause some confusion too. I guess there's no way to check that all processes produce identically-sized tensors. Maybe the Metric API documentation could be updated to note that all state variables must have identical shape across processes.
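
For illustration, a hedged sketch (not from the comment above) of the pitfall being described: a custom metric using dist_reduce_fx="cat" whose state tensors end up with different lengths on different ranks, which makes the all_gather-based sync hang. The class and state names are made up for the example.

    import torch
    from pytorch_lightning.metrics import Metric  # torchmetrics.Metric in later versions

    class CatPredictions(Metric):
        def __init__(self):
            super().__init__()
            # List state reduced across processes by concatenation.
            self.add_state("preds", default=[], dist_reduce_fx="cat")

        def update(self, preds: torch.Tensor):
            # If ranks see differently sized batches (e.g. an uneven last batch),
            # the stored tensors get different first dimensions on each rank...
            self.preds.append(preds)

        def compute(self):
            # ...and the cross-process sync, which relies on all_gather and
            # assumes identically sized tensors, can then block indefinitely.
            preds = self.preds
            return torch.cat(preds) if isinstance(preds, list) else preds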

@jgbos
Contributor

jgbos commented Mar 12, 2021

@Anjum48 did implementing those reductions in validation_epoch_end fix your issue with training hanging? My feeling is there's an uncaught error in the syncing of tensors like @senarvi found. I can't seem to find the error, though.

@jgbos
Contributor

jgbos commented Mar 12, 2021

I hope this helps: I implemented all the reduction and logging in validation_epoch_end.

It looks like PL attempts to do a model checkpoint before running validation_epoch_end, and the trainer crashes because it was unable to find the monitor metric. I can verify that validation_step_end is called before the crash but not validation_epoch_end. Unfortunately, I'm having trouble finding the code to check this behavior.

@TysonYu

TysonYu commented Mar 16, 2021

I have the same problem when using version==1.2.3

@TysonYu

TysonYu commented Mar 16, 2021

Hi @ifsheldon, I am facing the same problem. May I ask how you solved it?

@jgbos
Contributor

jgbos commented Mar 16, 2021

An initial test with today's master seems to show this issue is fixed for me

@thiyagu145

looks like this issue is fixed in version 1.2.4

@jgbos
Contributor

jgbos commented Mar 19, 2021

@thiyagu145 oh really? I'll double-check; I thought it required some changes not in 1.2.4. But it would be great if 1.2.4 fixes it.

@thiyagu145

yea, training completed without any issues.

@TysonYu

TysonYu commented Mar 20, 2021

Yeah, I have now tried 1.2.4 and there is no issue anymore.

@taltalim

I can confirm 1.2.4 fixes the issue. I'm wondering which PR fixed this - possibly #6410?

@luozhouyang

Same issue.
pytorch-lightning version 1.3.7post0

@JusperLee

Same issue.
pytorch-lightning version 1.3.7post0

You can downgrade to 1.2.4

@yinrong

yinrong commented Oct 7, 2021

I'm stuck too, using lightning=1.4.9.
The progress bar numbers "96% 4260/4435" stay the same forever.

@HareshKarnan
Author

same issue with 1.8.3.post1
