DDP training randomly stopping #11242
Comments
Hey @yoonseok312, would it be possible for you to reproduce this behavior with the BoringModel? Best,
Hey, I'm having the same problem. Were you able to solve it?
Same problem with pl version 1.5.8 and …
Same here. Non-DP/DDP training has no problem whatsoever.
Same problem, any solution?
I've been stuck with this for the last several days. The problem seems related to NCCL communication and, surprisingly, correlates with logging. The clue came from setting …
@dselivanov It could be that we are missing a barrier at the end of validation, I'm not sure. I'm a bit clueless, though, how that would relate to logging, as by default Lightning doesn't do any syncing for self.log.
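If anyone wants to test that missing-barrier hypothesis, here is a minimal hedged sketch (the hook choice and the guard are my assumptions, not an official fix) that forces all ranks to re-synchronize once validation finishes:

```python
import torch.distributed as dist
from pytorch_lightning import LightningModule


class BarrierDebugModel(LightningModule):
    def on_validation_epoch_end(self):
        # Force every rank to wait here; if the random hang disappears,
        # the ranks were most likely leaving validation out of sync.
        if dist.is_available() and dist.is_initialized():
            dist.barrier()
```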
Same issue here. Also, non-DDP training runs without any problems. Removing all logging in validation_epoch_end resolved the issue, as it did for @dselivanov. Using pytorch-lightning==1.6.0 and WandB logging. Conda env can be found here. Pseudocode for validation_epoch_end:

```python
def validation_epoch_end(self, outputs):
    collected = self.all_gather(outputs)
    # Calculate something on the main process (takes approx. 600 s), including some logging with rank_zero_only=True
    if self.trainer.is_global_zero and not self.trainer.sanity_checking:
        calc_something(collected)
        self.log("some stuff", some_value, rank_zero_only=True)
    # With or without the barrier, the issue is the same
    dist.barrier()
```

Using the wandb lib directly for logging also works fine, so replacing all calls like

```python
self.log("some stuff", some_value, rank_zero_only=True)
# with
wandb.log({"some stuff": some_value})
```

resolves the issue.
We're still experiencing this issue. Why isn't it possible to log in the …
Could you try to set …
This is happening to me as well. Training hangs randomly during an epoch; sometimes it resumes after an hour, or I have to exit and continue training from the last ckpt. I'm running on 4 V100 GPUs using DDP.
Same issue here with version 1.6.4.
Same issue here too with 1.6.5. Training hangs at the beginning of a new epoch, stuck at 0%; all GPUs but one show 100% usage and the remaining one is at 0%. Any chance this will be fixed in later releases?
The issue still persists in 1.7.0.
@thiyagu-lily That's unfortunate. Without any further details we can only guess what the problem might be (see the comments above). Unfortunately, so far nobody has been able to provide a reproducible case that we can work with.
Hi @awaelchli …
Hi @thiyagu-lily, could you check whether this also happens if you use another logger or no logger at all?
It works without problems if I use the TensorBoard logger directly.
Hi @justusschock …
I think it's more that something asynchronous with the metrics somehow results in the processes running out of sync. @thiyagu-lily, are you able to produce a minimal example we could debug? Preferably also with random data?
I hit this problem too. I think it is related to validation: I can train correctly when I remove all validation code.
I faced a similar issue and it was not related to PyTorch Lightning; in my case it was a deadlock issue, as explained here: https://pytorch.org/docs/stable/notes/multiprocessing.html#avoiding-and-fighting-deadlocks. You could try amending your dataloader with …
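The exact dataloader change suggested there is cut off; purely as a hedged illustration (the values below are mine, not the commenter's recommendation), these are the DataLoader settings people typically adjust when chasing worker-related deadlocks:

```python
from torch.utils.data import DataLoader, Dataset


def make_debug_loader(dataset: Dataset) -> DataLoader:
    # Illustrative only: knobs commonly toggled while debugging dataloader deadlocks.
    return DataLoader(
        dataset,
        batch_size=32,
        num_workers=0,                      # 0 rules out worker-process deadlocks entirely
        # multiprocessing_context="spawn",  # avoids fork-inherited locks once workers are re-enabled
        pin_memory=True,
    )
```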
Having recently encountered and resolved a very similar bug to this, I'd suggest looking at #10947, for which the root cause was found to be the use of a batch sampler that was incorrectly seeded, resulting in different replicas' dataloaders computing a different number of distributed batches.
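As a hedged illustration of that failure mode (my own sketch, not the code from #10947): the stock DistributedSampler keeps replicas in lockstep as long as every rank constructs it with the same seed and calls set_epoch consistently; a custom batch sampler seeded from rank-local state breaks exactly that guarantee.

```python
import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

dataset = TensorDataset(torch.arange(1000).float())

# num_replicas/rank are normally inferred from the initialized process group;
# they are hard-coded here only so the sketch runs standalone.
sampler = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=True, seed=42)
loader = DataLoader(dataset, batch_size=16, sampler=sampler)

for epoch in range(3):
    # Same epoch -> same permutation on every rank; this is what keeps the
    # per-rank batch counts (and shuffling) consistent across replicas.
    sampler.set_epoch(epoch)
    n_batches = sum(1 for _ in loader)
    print(f"epoch={epoch} batches={n_batches}")
```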
@arlofaria Thanks for your help …
I encountered the same issue and tried the solutions mentioned previously, but they didn't resolve it. Then I found a solution while reading the documentation (link) and implemented it successfully without any issues.

```python
def validation_step(self, batch, batch_idx):
    x, y = batch
    y_hat = self.model(x)
    loss = F.cross_entropy(y_hat, y)
    pred = ...
    return {"loss": loss, "pred": pred}

def validation_step_end(self, batch_parts):
    # predictions from each GPU
    predictions = batch_parts["pred"]
    # losses from each GPU
    losses = batch_parts["loss"]
    gpu_0_prediction = predictions[0]
    gpu_1_prediction = predictions[1]
    # do something with both outputs
    return (losses[0] + losses[1]) / 2

def validation_epoch_end(self, validation_step_outputs):
    for out in validation_step_outputs:
        Something(out)
    self.log("some stuff", some_value)
```
In my situation, I was passing the optional … In essence, you want to guarantee that each replica is getting the same number of batches at every epoch. One way to debug this would be to print …
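The suggested print statement is cut off above; one hedged way to check the same thing in Lightning (hook and trainer attributes as found in recent 1.x/2.x versions; adjust to your version) is to have every rank report its batch counts at the start of each epoch:

```python
from pytorch_lightning import LightningModule


class BatchCountDebugModel(LightningModule):
    def on_train_epoch_start(self):
        # If these numbers differ between ranks, the replicas will eventually
        # block on a collective that one of them never reaches.
        print(
            f"rank={self.global_rank} "
            f"train_batches={self.trainer.num_training_batches} "
            f"val_batches={self.trainer.num_val_batches}"
        )
```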
Hmm, this seems like you may have a different problem than I had. My situation wasn't affected by the … Hope this helps!
@arlofaria I used the Join() context manager (link) to make sure I was getting the same number of batches in each epoch, and indeed all my ranks had the same number of batches. I think you are right, my issue might be different. @kayvane1 suggested that there can be deadlock issues. Since each worker loads data, every worker is accessing my data files. While debugging this problem, I was running 10-15 jobs at once (all jobs being the same), each with 4 workers, so 60 workers in total. I am now starting to wonder if these interfere with each other and cause a deadlock while accessing the data. @kayvane1, if possible, could you let me know how you found out that it was a deadlock issue in your case, and do you know why the deadlock was happening? Does my case sound familiar? Thanks!
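For readers unfamiliar with it, the Join() context manager referenced there is used roughly like this in plain PyTorch DDP (a minimal sketch with a hypothetical model, loader, and optimizer; not the commenter's code):

```python
import torch
import torch.nn.functional as F
from torch.distributed.algorithms.join import Join
from torch.nn.parallel import DistributedDataParallel as DDP


def train_one_epoch(ddp_model: DDP, loader, optimizer):
    # Join lets ranks that exhaust their dataloader early keep "shadowing" the
    # collective operations of ranks that still have batches, instead of hanging.
    with Join([ddp_model]):
        for x, y in loader:
            optimizer.zero_grad()
            loss = F.mse_loss(ddp_model(x), y)
            loss.backward()
            optimizer.step()
```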
Yes, I meant “replica” as in “rank”. Rather than … Hope you can figure this one out; these kinds of bugs are incredibly painful to debug, but you’ll feel so relieved later. The solution is always the last thing you think to try! 😉
This issue, unfortunately, still exists in 1.9.4, with CUDA 11.6 and PyTorch 1.13.1. There is no way to reproduce it; it happens randomly and seems to be a bug related to NCCL. A possible workaround is to add …
PyTorch Lightning 2.0.4 also has this issue. After training the first epoch, the program stopped working, but GPU utilization reached 100% and no errors or warnings were reported.
Having the same issue here. For the same dataset, only single-GPU training works; DDP hangs at the end of the first training epoch, while GPU usage is 100% and GPU power draw drops.
I was having the same issue with Fabric DDP: 100% GPU usage but no training progress. Interestingly, the 100% usage only occurs on the global rank 0 GPU. I figured out the issue was:

```python
def my_reduce_func(x):
    y = fabric.all_reduce(x)
    return y

x = torch.Tensor(fabric.global_rank)
if fabric.global_rank == 0:
    y = my_reduce_func(x)
    print(y)
```

The following version, where all_reduce runs on every rank, works:

```python
def my_reduce_func(x):
    y = fabric.all_reduce(x)
    return y

x = torch.Tensor(fabric.global_rank)
y = my_reduce_func(x)
if fabric.global_rank == 0:
    print(y)
```

For me, the …
@francotheengineer It was recently documented, in the method overview and in the API docs. Please note that while we do our best to make things less error-prone, choosing Fabric's flexibility naturally also means more responsibility for the user to correctly handle certain things that would otherwise be automated by the Lightning Trainer.
@awaelchli My mistake, thanks. Great work on Fabric, I'm loving using it!
Closing the issue due to its age. If you are experiencing issues similar to this one, please open a new ticket with the necessary details. The most common reasons for "DDP randomly stopping" in my experience are an incorrect implementation of custom samplers / batch samplers, incorrectly implemented iterable datasets, and incorrect rank-zero-only guards that lead to race conditions.
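As a hedged illustration of that last point (my own example, not code from this thread): a collective call hidden behind a rank-zero guard is exactly the pattern that leaves the other ranks waiting forever, and the fix is to keep the collective outside the guard:

```python
import torch
import torch.distributed as dist


def log_mean_loss_buggy(loss: torch.Tensor, global_rank: int) -> None:
    # BUG: all_reduce is a collective, but only rank 0 ever reaches it,
    # so rank 0 blocks forever waiting for the other ranks.
    if global_rank == 0:
        dist.all_reduce(loss)
        print(f"mean loss: {(loss / dist.get_world_size()).item():.4f}")


def log_mean_loss_fixed(loss: torch.Tensor, global_rank: int) -> None:
    # Every rank participates in the collective; only rank 0 prints.
    dist.all_reduce(loss)
    if global_rank == 0:
        print(f"mean loss: {(loss / dist.get_world_size()).item():.4f}")
```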
In my case, the issue results from logging synchronization. In my code, I logged something conditionally during DDP training. Since the condition is not always true on every rank, in the scenario where some ranks log but others don't, the training hangs. So I modified the code to log regardless of whether the condition holds, which resolved my issue.
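The snippets from that comment were lost in extraction; the pattern being described presumably looks something like this hedged sketch (metric name, condition, and the helper are placeholders of mine):

```python
from pytorch_lightning import LightningModule


class ConditionalLoggingExample(LightningModule):
    def validation_step(self, batch, batch_idx):
        value, condition = self._compute_metric(batch)  # hypothetical helper

        # Hangs: with synced logging, self.log triggers a cross-rank reduction,
        # and only the ranks where `condition` happens to hold ever reach it.
        # if condition:
        #     self.log("val/special_metric", value, sync_dist=True)

        # Works: every rank issues the same self.log call on every step.
        self.log("val/special_metric", value, sync_dist=True)
```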
🐛 Bug
Edit: it randomly stops in the middle of a training epoch as well.
After validation ends (100%), the training process randomly stops without any error log. The stopping point changes randomly (sometimes after epoch 4 validation, sometimes after epoch 1 validation), and every time this happens, one of the machines shows 0% utilization while the others sit at 100%. Memory is allocated on all GPUs as well.
I have tried adding sync_dist=True in self.log and removed saving model checkpoints by top_k, referencing #5865. Following #9851, I already added seed_everything() as well. I checked that for training and validation, each GPU has the same number of batches. However, the issue persists. Any solution to this problem?
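As a hedged sketch only (the reporter's actual code is not shown here; the model, paths, and Trainer arguments are placeholders, and argument names vary across Lightning versions), the mitigations described above amount to something like:

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

pl.seed_everything(42, workers=True)  # per #9851

# Checkpointing without top-k monitoring, per #5865.
checkpoint_cb = ModelCheckpoint(dirpath="checkpoints/")

trainer = pl.Trainer(gpus=4, strategy="ddp", callbacks=[checkpoint_cb])

# ...and inside the LightningModule, metrics are logged with cross-rank syncing:
# self.log("val_loss", loss, sync_dist=True)
```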
To Reproduce
I was unable to reproduce this using the BoringModel, but as the stopping point is irregular even with the same seed for pl.seed_everything, I believe it is a bug in the DDP process itself.
Expected behavior
The training process should continue after validation.
Environment
How you installed PyTorch (conda, pip, source): pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
torch.__config__.show():
Additional context
Here is my code for the trainer:
cc @justusschock @kaushikb11 @awaelchli @akihironitta @rohitgr7