
Training stuck at 0% after few epochs while training with DDP #5865

Closed
HareshKarnan opened this issue Feb 7, 2021 · 45 comments
Labels: bug (Something isn't working), distributed (Generic distributed-related topic), help wanted (Open to be worked on), priority: 0 (High priority task)

Comments

@HareshKarnan

🐛 Bug

I recently updated to pytorch_lightning 1.1.7 and noticed that after a few epochs of training, the training % is stuck at 0% and never progresses. When I switched back to 1.1.4, this strange behavior did not occur. I do not know the root cause of this issue.

  • PyTorch Lightning version: 1.1.7
  • OS (e.g., Linux): Linux (Ubuntu 18)
  • How you installed PyTorch (conda, pip, source): pip install
  • Build command you used (if compiling from source):
  • Python version: 3.6
  • CUDA/cuDNN version: 10.0
  • GPU models and configuration: RTX 2080 x3
  • Any other relevant information:
HareshKarnan added the bug and help wanted labels on Feb 7, 2021
@matyushinleonid

Hi, @HareshKarnan @ndrplz. Folks, may I ask you to run your pipelines with NCCL_ASYNC_ERROR_HANDLING=1?
It seems we have the same problem. I am using Lightning version 1.2.0.dev0, so this issue may carry over to the new release... My error trace:

https://pastebin.com/LNxG2JF6

@HareshKarnan
Author

@ndrplz I don't know if it is related, but the problem here is that training does happen for the first few epochs - in my case, it ran for 13 epochs and then got stuck at 0% by epoch 14.

@HareshKarnan
Author

https://pastebin.com/QhanSNrK

Here is my output. It gets stuck at 0% while training in epoch 16

@matyushinleonid

https://pastebin.com/QhanSNrK

Here is my output. It gets stuck at 0% while training in epoch 16

Run your script with the NCCL_ASYNC_ERROR_HANDLING=1 flag, like NCCL_ASYNC_ERROR_HANDLING=1 python train.py. Without this flag, I do not get any error like yours since, by default, this async error does not raise an exception (at least for me). If we have the same error traces, it will help the community fix the bug.

@HareshKarnan
Author

HareshKarnan commented Feb 9, 2021

I ran the script again as NCCL_ASYNC_ERROR_HANDLING=1 python train.py and got no difference in the output:

https://pastebin.com/jtLyuXTN

This time it got stuck at epoch 11

@genghisun

same problem here, stuck at 0% at epoch 18

@Borda
Member

Borda commented Feb 9, 2021

Mind sharing code, ideally in Colab, to reproduce?

Borda added the priority: 1 label on Feb 9, 2021
@genghisun

In my case, it got stuck at 0% at epoch 18 with 2-GPU DDP.
Then I tried using only 1 GPU, and it has currently trained for 100+ epochs without any problem.

edenlightning added the distributed and priority: 0 labels and removed the priority: 1 label on Feb 9, 2021
@stillwalker1234

stillwalker1234 commented Feb 15, 2021

I have the same issue after updating to 1.1.8; I will try 1.2.0.dev0 to see if it has the same error.

PyTorch 1.7
2x 3090
CUDA 11.2

@stillwalker1234

1.2.0rc1 also has the issue, 1.1.6 does not

@SeanNaren
Contributor

Would you be able to try master? We've recently consolidated the branches back to master!

@tchaton
Contributor

tchaton commented Feb 16, 2021

Hey everyone,

Could it be related to this issue and solved by this PR: #6004?

@HareshKarnan,

I have seen val_loss in your logs. I think it might be related. Mind trying the fix?

Best,
T.C

@stillwalker1234

stillwalker1234 commented Feb 16, 2021

Started a run with master + @tchaton's patch, will see how it goes.

UPDATE:

run stalled at epoch 6 :(

edenlightning assigned tchaton and unassigned SeanNaren on Feb 16, 2021
@Borda
Member

Borda commented Feb 18, 2021

@tchaton any update here?

@talolard

Also having this problem.
It always gets stuck on epoch 2 if I have checkpointing enabled when training with DDP on a single machine.

@talolard

Wanted to add some details.
We ran some different, copy-pasted code and the problem recurred. We had the checkpoint set up to monitor train_loss, and once we took that out everything worked fine.
In other words, I suspect this has something to do with monitor under DDP.

@edenlightning
Contributor

Thanks! Will take a look and try to resolve it soon.

@edenlightning
Contributor

edenlightning commented Mar 1, 2021

@HareshKarnan, @talolard, @stillwalker1234 or @genghisun can any of you please provide a reproducible script/colab?

Do you checkpoint based on a value that is not coming from lightning metrics and can be different on different processes? Probably related to #5604 (comment).

@JonasFrey96

I faced the same issue, and the problem was resolved by changing the number of workers to 0. This is not an acceptable workaround. I assume we have a deadlock when something goes wrong with spawning dataloaders.

@tchaton
Contributor

tchaton commented Mar 3, 2021

Dear @JonasFrey96, @taltalim, @HareshKarnan,

Would it be possible for you to work on a reproducible script using the BoringModel?

Best,
T.C

@jgbos
Contributor

jgbos commented Mar 4, 2021

I'm having the same issue, and it goes away when using the default ModelCheckpoint settings. It seems to be a problem with setting monitor and save_top_k. I have not been able to reproduce the issue with the BoringModel.

edit: just wanted to add that I am logging the loss as self.log("Val/Loss", loss), with monitor="Val/Loss" and save_top_k=1
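
For reference, a minimal sketch of the setup described in this comment (module and field names are assumptions, not the poster's actual code): a per-rank validation loss is logged and used as the checkpoint monitor under DDP, which later comments in this thread point to as the trigger for the hang.

    import torch
    import torch.nn.functional as F
    import pytorch_lightning as pl
    from pytorch_lightning.callbacks import ModelCheckpoint

    class LitModule(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(32, 1)

        def forward(self, x):
            return self.layer(x)

        def training_step(self, batch, batch_idx):
            x, y = batch
            return F.mse_loss(self(x), y)

        def validation_step(self, batch, batch_idx):
            x, y = batch
            loss = F.mse_loss(self(x), y)
            # Logged per process: without syncing, each rank can see a
            # different "Val/Loss", so each rank's ModelCheckpoint may
            # disagree on whether (and when) to save.
            self.log("Val/Loss", loss)

        def configure_optimizers(self):
            return torch.optim.Adam(self.parameters(), lr=1e-3)

    checkpoint_cb = ModelCheckpoint(monitor="Val/Loss", save_top_k=1)
    trainer = pl.Trainer(gpus=2, accelerator="ddp", callbacks=[checkpoint_cb])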

@Anjum48

Anjum48 commented Mar 5, 2021

I've been running into this issue with 1.1.8 & 1.2.1 over the last few days with DDP and 2 GPUs.

Things tried:

  • Upgrade from PyTorch 1.7 to 1.8
  • Upgrade drivers from 450 to 460
  • Upgrade from CUDA 10.2 to 11.2
  • Upgrade NCCL to latest
  • Rebuild entire conda env
  • Remove all loggers (in case of wandb multithreading)
  • os.environ["WANDB_START_METHOD"] = "fork"
  • cv2.setNumThreads(0)
  • DDPPlugin(find_unused_parameters=True)
  • sync batch norm True/False
  • Reduced num_workers (zero is too slow, but went down to 2)
  • Roll back the Linux kernel in Ubuntu (I think I saw a meme about this the other day so I thought I'd give it a try haha)
  • Nothing unusual in dmesg

The epoch seems to be random (it might stop after the first, or after several), but it is always just after validation, at the start of the next epoch (0%). No error messages. Both GPUs are locked at "100%", but the data being sent to GPU 0 (RX in nvtop) is 0 MB/s and the temperatures show that the GPUs are not working hard. 2 CPU cores are locked at 100%.
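
For reference, a minimal sketch (assuming the PL 1.2-era API; not the poster's actual code) of how a few of the mitigations listed above are typically wired up; per the comment, none of them resolved the hang.

    import os
    import cv2
    import pytorch_lightning as pl
    from pytorch_lightning.plugins import DDPPlugin

    # Force wandb to fork instead of its default start method (one of the workarounds tried above).
    os.environ["WANDB_START_METHOD"] = "fork"
    # Stop OpenCV from spawning its own worker threads inside DataLoader workers.
    cv2.setNumThreads(0)

    trainer = pl.Trainer(
        gpus=2,
        accelerator="ddp",
        sync_batchnorm=True,
        plugins=[DDPPlugin(find_unused_parameters=True)],
    )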

@Anjum48

Anjum48 commented Mar 6, 2021

I think I've isolated the issue from the discussion in #5604 (comment).

This issue started when I switched the ModelCheckpoint monitor from AUC (using PL metrics and dist_sync_on_step=True) to val_loss. val_loss was not being synced between the 2 GPUs, so adding this line:

val_loss = torch.mean(self.all_gather(val_loss))

in validation_epoch_end fixed the issue for me. The model has been training successfully overnight and is still running :)
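
For context, a minimal sketch of where that line goes (assuming validation_step returns a dict containing "val_loss"; this is not the poster's full code, which appears further down):

    def validation_epoch_end(self, outputs):
        val_loss = torch.stack([x["val_loss"] for x in outputs]).mean()
        # all_gather returns a [num_gpus, ...] tensor; averaging it gives every
        # rank the same scalar, so ModelCheckpoint sees a consistent monitor value.
        val_loss = torch.mean(self.all_gather(val_loss))
        self.log("val_loss", val_loss)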

@ifsheldon
Contributor

I see the same issue when running my training script. My PL version is 1.2.1. I am still trying out which of the approaches mentioned above works in my case.

@Anjum48

Anjum48 commented Mar 6, 2021

Edit: response to a deleted question ¯\_(ツ)_/¯

self.all_gather will return a tensor of shape [num_gpus, x] where x is the result from a single GPU. If you're only syncing a scalar, then you can just take a mean (or min/max/std) like in my example.

If, for example, x is not a scalar (e.g. you want to calculate IoU or something), you can combine the results using something like:

def gather_and_squash(self, x):
    # The reshape goes from [n_gpus, bs, N, N] to [n_gpus * bs, N, N]
    return torch.reshape(self.all_gather(x), [-1] + list(x.shape)[1:])

At the moment I'm working on a joint segmentation & classification problem and using valid_clf as my monitor. Here's what my validation_epoch_end looks like:

def validation_epoch_end(self, outputs):
    loss_val = torch.stack([x["val_loss"] for x in outputs]).mean()
    loss_seg = torch.stack([x["loss_seg"] for x in outputs]).mean()
    loss_clf = torch.stack([x["loss_clf"] for x in outputs]).mean()
    
    log = {
        "loss/valid_seg": torch.mean(self.all_gather(loss_seg)),
        "loss/valid_clf": torch.mean(self.all_gather(loss_clf)),
    }
           
    # AUC calculation etc...
    y_true = torch.cat([x["y_true"] for x in outputs])
    y_pred = torch.cat([x["y_pred"] for x in outputs])
    # etc ...

    self.log_dict(log)
    self.log_dict(
        {"loss/valid": loss_val, "auc/overall": log["metric"]}, prog_bar=True,
    )

@jgbos
Contributor

jgbos commented Mar 8, 2021

I'm also noticing that there are logger files (tfevents) for each process now. I wonder if #6364 and this issue are related.

edenlightning removed the waiting on author label on Mar 8, 2021
@taltalim

taltalim commented Mar 8, 2021

At the moment I can't seem to reproduce it using the BoringModel. I will look into that in the next few days.

@senarvi
Contributor

senarvi commented Mar 11, 2021

I encountered a similar issue, training hanging at the end of a validation epoch, when a custom metric is being synced between processes. I used cat as dist_reduce_fx. One would think that in the tensors produced by different processes, the first dimension - along which the tensors are going to be concatenated - doesn't have to match. However, gather_all_tensors uses torch.distributed.all_gather to gather the tensors from different processes, and all_gather requires the tensors to be correctly sized. gather_all_tensors assumes that all processes create identically-sized tensors. If the sizes don't match, syncing will hang indefinitely.

This is probably not the issue for most people, but I just wanted to point out that this can cause some confusion too. I guess there's no way to check that all processes produce identically-sized tensors. Maybe the Metric API documentation could be updated to note that all state variables must have identical shape across processes.
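
For illustration, a hedged sketch (not from the comment above) of the pitfall being described: a custom metric using dist_reduce_fx="cat" whose state tensors end up with different lengths on different ranks, which makes the all_gather-based sync hang. The class and state names are made up for the example.

    import torch
    from pytorch_lightning.metrics import Metric  # torchmetrics.Metric in later versions

    class CatPredictions(Metric):
        def __init__(self):
            super().__init__()
            # List state reduced across processes by concatenation.
            self.add_state("preds", default=[], dist_reduce_fx="cat")

        def update(self, preds: torch.Tensor):
            # If ranks see differently sized batches (e.g. an uneven last batch),
            # the stored tensors get different first dimensions on each rank...
            self.preds.append(preds)

        def compute(self):
            # ...and the cross-process sync, which relies on all_gather and
            # assumes identically sized tensors, can then block indefinitely.
            preds = self.preds
            return torch.cat(preds) if isinstance(preds, list) else preds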

@jgbos
Contributor

jgbos commented Mar 12, 2021

@Anjum48 did implementing those reductions in validation_epoch_end fix your issue with training hanging? My feeling is there's an uncaught error in the syncing of tensors like @senarvi found. I can't seem to find the error, though.

@jgbos
Contributor

jgbos commented Mar 12, 2021

I hope this helps: I implemented all the reduction and logging in validation_epoch_end.

It looks like PL attempts to do a model checkpoint before running validation_epoch_end, and the trainer crashes because it was unable to find the monitor metric. I can verify that validation_step_end is called before the crash but not validation_epoch_end. Unfortunately, I'm having trouble finding the code to check this behavior.

@TysonYu

TysonYu commented Mar 16, 2021

I have the same problem when using version==1.2.3

@TysonYu

TysonYu commented Mar 16, 2021

Hi @ifsheldon, I am facing the same problem. May I ask how you solved it?

@jgbos
Contributor

jgbos commented Mar 16, 2021

An initial test with today's master seems to show this issue is fixed for me

@thiyagu145

looks like this issue is fixed in version 1.2.4

@jgbos
Contributor

jgbos commented Mar 19, 2021

@thiyagu145 oh really? I'll double-check; I thought it required some changes not in 1.2.4. But it would be great if 1.2.4 fixes it.

@thiyagu145

yea, training completed without any issues.

@TysonYu

TysonYu commented Mar 20, 2021

Yeah, I have now tried 1.2.4 and there is no issue anymore.

@taltalim

I can confirm 1.2.4 fixes the issue. I'm wondering which PR fixed this - possibly #6410?

@luozhouyang

Same issue.
pytorch-lightning version 1.3.7post0

@JusperLee

Same issue.
pytorch-lightning version 1.3.7post0

You can downgrade to 1.2.4

@yinrong

yinrong commented Oct 7, 2021

I'm stuck too, using lightning=1.4.9.
The progress bar numbers "96% 4260/4435" stay the same forever.

@HareshKarnan
Author

same issue with 1.8.3.post1
