
DDP training randomly stopping #11242

Closed
yoonseok312 opened this issue Dec 23, 2021 · 41 comments
Labels
bug (Something isn't working) · strategy: ddp (DistributedDataParallel)

Comments

@yoonseok312

yoonseok312 commented Dec 23, 2021

🐛 Bug

Edit: it also randomly stops in the middle of a training epoch.

After validation ends (100%), the training process randomly stops without any error log. The stopping point changes from run to run (sometimes after epoch 4 validation, sometimes after epoch 1 validation), and every time this happens, one of the GPUs shows 0% utilization while the others sit at 100%. Memory stays allocated on all GPUs.

I have tried adding sync_dist=True to self.log and removed saving model checkpoints by top_k, referencing #5865. Following #9851, I also added seed_everything(). I checked that each GPU gets the same number of batches for both training and validation. However, the issue persists.

Is there any solution to this problem?

[screenshots attached]

To Reproduce

I was unable to reproduce this with the BoringModel, but since the stopping point is irregular even with the same seed passed to pl.seed_everything, I believe it is a bug in the DDP process itself.

Expected behavior

The training process should continue after validation.

Environment

  • PyTorch Lightning Version (e.g., 1.5.0): 1.4.9
  • PyTorch Version (e.g., 1.10): 1.9
  • Python version (e.g., 3.9): 3.8
  • OS (e.g., Linux): Linux
  • CUDA/cuDNN version: 11.2
  • GPU models and configuration: Google Cloud Platform A100 x8
  • How you installed PyTorch (conda, pip, source): pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
  • If compiling from source, the output of torch.__config__.show():
  • Any other relevant information:

Additional context

Here is my code for the trainer:

    checkpoint_callback = ModelCheckpoint(
        dirpath=log_dir,
        filename=cfg.exp_name + "-{epoch}-{val_auc:.3f}",
        every_n_epochs=1,
        save_top_k=-1,
    )

    trainer = pl.Trainer(
        callbacks=[
            checkpoint_callback,
            LearningRateMonitor(logging_interval="step"),
        ],
        max_epochs=100,
        accelerator="ddp",
        gpus=str(cfg.gpus),
        logger=pl.loggers.WandbLogger(project="news_recommendation", name=cfg.exp_name),
        val_check_interval=cfg[cfg.experiment_type[cfg.current_stage]].val_check_interval,
        limit_train_batches=1.0,
        deterministic=True,
        num_sanity_val_steps=0,
        resume_from_checkpoint=cfg[cfg.experiment_type[cfg.current_stage]].load_ckpt,
    )

cc @justusschock @kaushikb11 @awaelchli @akihironitta @rohitgr7

@yoonseok312 yoonseok312 added the bug (Something isn't working) label Dec 23, 2021
@yoonseok312 yoonseok312 changed the title from "DDP training randomly stopping after validation" to "DDP training randomly stopping" Dec 23, 2021
@akihironitta akihironitta added the strategy: ddp (DistributedDataParallel) label Dec 26, 2021
@tchaton
Contributor

tchaton commented Jan 4, 2022

Hey @yoonseok312 ,

Would it be possible for you to reproduce this behavior with the BoringModel?

Best,
T.C

@stale

stale bot commented Feb 6, 2022

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@stale stale bot added the won't fix (This will not be worked on) label Feb 6, 2022
@siddharthverma314

Hey, I'm having the same problem. Were you able to solve it?

@stale stale bot removed the won't fix (This will not be worked on) label Feb 13, 2022
@leobxpan

Same problem with PL version 1.5.8 and pl.seed_everything() set.

@Eralien

Eralien commented Mar 3, 2022

Same here. Non-DP/DDP training has no problems whatsoever.

@YuFan-Microsoft

Same problem, any solution?

@dselivanov

dselivanov commented May 16, 2022

I've been stuck on this for the last several days. The problem seems related to NCCL communication and, surprisingly, correlates with logging. The clue came from setting TORCH_DISTRIBUTED_DEBUG=DETAIL, which made training fail with a meaningful error: NCCL was out of sync, and the tracebacks showed that some ranks were still writing logs while others were already doing the forward pass.
Then I profiled line by line, and it turned out that if I remove all self.log() entries from validation_epoch_end, training works fine!
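
A minimal sketch of how to set that flag (the env vars are standard PyTorch; set them before any distributed process group is created, e.g. at the top of the training script):

import os

os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # report mismatched collectives with tracebacks
os.environ["NCCL_DEBUG"] = "INFO"                 # optional: verbose NCCL logging

# ... then construct the Trainer and call fit() as usual; when ranks fall
# out of sync, the failure comes with a traceback instead of a silent hang.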

@awaelchli
Contributor

@dselivanov It could be that we are missing a barrier at the end of validation; I'm not sure. I'm a bit clueless, though, about how that would relate to logging, since by default Lightning doesn't do any syncing for self.log.

@GGGGGGXY

Same problem here.
Here is my stack trace:
[stack trace screenshots attached]

@gustavhartz

gustavhartz commented May 25, 2022

Same issue here. Non-DDP training also runs without any problems. I tried removing all logging in validation_epoch_end, which resolved the issue, as it did for @dselivanov. Using pytorch-lightning==1.6.0 and WandB logging. My conda env can be found here.

Pseudocode for my validation_epoch_end:

    def validation_epoch_end(self, outputs):
        collected = self.all_gather(outputs)

        # Calculate something on the main process, taking approx. 600 s,
        # including some logging with rank_zero_only=True
        if self.trainer.is_global_zero and not self.trainer.sanity_checking:
            calc_something(collected)
            self.log("some stuff", some_value, rank_zero_only=True)
        # With or without the barrier, the issue is the same
        dist.barrier()  # torch.distributed imported as dist

Using the wandb lib directly for logging also works fine, so replacing all calls like

self.log("some stuff",some_value, rank_zero_only=True)
# with
wandb.log({"some stuff": some_value})

resolves the issue

@stale

stale bot commented Jun 28, 2022

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@stale stale bot added the won't fix This will not be worked on label Jun 28, 2022
@aleSuglia

We're still experiencing this issue. Why isn't it possible to log in validation_epoch_end?

@stale stale bot removed the won't fix This will not be worked on label Jun 29, 2022
@awaelchli
Contributor

Could you try setting num_sanity_val_steps=0 and see if it resolves the issue?
Also remove the if self.trainer.is_global_zero guard and the rank_zero_only=True argument from the self.log call. Only guard calc_something(collected) with that check, and avoid any distributed calls inside calc_something.
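
A rough sketch of that suggestion (calc_something and the metric name are placeholders carried over from the pseudocode above):

import pytorch_lightning as pl

class MyModule(pl.LightningModule):
    def validation_epoch_end(self, outputs):
        collected = self.all_gather(outputs)      # collective: must run on every rank

        if self.trainer.is_global_zero and not self.trainer.sanity_checking:
            calc_something(collected)             # rank-zero-only work, no collectives inside

        # log on every rank, without rank_zero_only, so no rank is left
        # waiting on a collective that the others never issue
        self.log("some stuff", float(len(outputs)))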

@angadkalra

This is happening to me as well. Training hangs randomly during an epoch; sometimes it resumes after an hour, or I have to exit and continue training from the last checkpoint. I'm running on 4 V100 GPUs using DDP.

@HareshKarnan

Same issue here with version 1.6.4

@laurentd-lunit

Same issue here too with 1.6.5. Training hangs at the beginning of a new epoch, stuck at 0%; all GPUs but one show 100% usage and the remaining one is at 0%.
The weirdest thing is that it happens only when check_val_every_n_epoch is set to 5; when it is set to 1, it works fine...

Any chance this will be fixed in later releases?

@thiyagu-lily

The issue still persists in 1.7.0.

@awaelchli
Contributor

@thiyagu-lily That's unfortunate. Without any further details we can only guess what the problem might be (see the comments above). Unfortunately, so far nobody has been able to provide a reproducible case that we can work with.
This is essential for us to help, especially in these cases of distributed training, and we would be very thankful if anybody could provide us with this information.

@thiyagu-lily

Hi @awaelchli,
I have been able to narrow down the issue to torchmetrics and val_check_interval.
I call metric.update() in validation_step, and then metric.compute() and metric.reset() in validation_epoch_end. When I reduce val_check_interval to something small, I don't get any deadlocks. I'm using WandbLogger with torchmetrics.
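
For reference, a sketch of roughly that pattern (assuming a recent torchmetrics where Accuracy takes a task argument; the model and metric are illustrative, not my actual code):

import torch
import torchmetrics
import pytorch_lightning as pl

class MyModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.val_acc = torchmetrics.Accuracy(task="binary")

    def validation_step(self, batch, batch_idx):
        x, y = batch
        preds = torch.sigmoid(self(x)).squeeze(-1)
        self.val_acc.update(preds, y)                 # accumulate per-batch metric state

    def validation_epoch_end(self, outputs):
        self.log("val_acc", self.val_acc.compute())   # compute() syncs metric state across ranks
        self.val_acc.reset()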

@justusschock
Member

Hi @thiyagu-lily, could you check whether this also happens if you use another logger, or no logger at all?


@thiyagu-lily

Hi @justusschock,
It happens with any logger! I don't think the issue is caused by the logger.
Could it be an issue with the metrics not being reset?

@justusschock
Member

I think it's more that something asynchronous with the metrics results in the processes running out of sync.

@thiyagu-lily, are you able to produce a minimal example we could debug? Preferably with random data?

@wqdfdj

wqdfdj commented Dec 7, 2022

I hit this problem too. I think it is related to validation: I can train correctly when I remove all the validation code.

@kayvane1

kayvane1 commented Dec 7, 2022

I faced a similar issue, and it was not related to PyTorch Lightning; in my case it was a deadlock, as explained here: https://pytorch.org/docs/stable/notes/multiprocessing.html#avoiding-and-fighting-deadlocks

You could try amending your DataLoader with pin_memory=False and reducing the number of workers.
https://stackoverflow.com/questions/72183733/databricks-notebook-hanging-with-pytorch/72473053#72473053
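
A small sketch of those DataLoader changes (the dataset and values are placeholders):

from torch.utils.data import DataLoader

train_loader = DataLoader(
    train_dataset,        # placeholder for your dataset
    batch_size=32,
    shuffle=True,
    num_workers=2,        # fewer workers than before
    pin_memory=False,     # as suggested above
)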

@stale

stale bot commented Jan 8, 2023

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!

@stale stale bot added the won't fix (This will not be worked on) label Jan 8, 2023
@arlofaria

Having recently encountered and resolved a very similar bug, I'd suggest looking at #10947, for which the root cause was found to be a batch sampler that was incorrectly seeded, resulting in different replicas' dataloaders computing a different number of distributed batches.

@HarmanDotpy

@arlofaria
Hi, thanks for your comments.
I was wondering where the seed needs to be added (or what exactly has to be done to solve the issue).
I was also wondering whether you have an idea about my case. In particular, I observe that reducing num_workers reduces the number of jobs that get stuck, or delays when they get stuck.
Specifically, the chance of my job getting stuck is very low with num_workers=2 and progressively increases with num_workers=4 or 8: with num_workers=4 the jobs get stuck after a couple of hours to 5-6 hours, while num_workers=8 leads to jobs getting stuck within 1-2 hours.

Thanks for your help

@stale stale bot removed the won't fix (This will not be worked on) label Jan 30, 2023
@DaBihy

DaBihy commented Jan 30, 2023

I encountered the same issue and tried the solutions mentioned previously, but they didn't resolve it. Then, I found a solution while reading the documentation (link) and successfully implemented it without any issues.

def validation_step(self, batch, batch_idx):
    x, y = batch
    y_hat = self.model(x)
    loss = F.cross_entropy(y_hat, y)
    pred = ...
    return {"loss": loss, "pred": pred}

def validation_step_end(self, batch_parts):
    # predictions from each GPU
    predictions = batch_parts["pred"]
    # losses from each GPU
    losses = batch_parts["loss"]

    gpu_0_prediction = predictions[0]
    gpu_1_prediction = predictions[1]

    # do something with both outputs
    return (losses[0] + losses[1]) / 2


def validation_epoch_end(self, validation_step_outputs):
    for out in validation_step_outputs:
        do_something(out)
    self.log("some stuff", some_value)

@arlofaria

> I was wondering where the seed needs to be added (or what exactly has to be done to solve the issue).

In my situation, I was passing the optional batch_sampler argument to DataLoader, and that custom sampler was using an RNG that should have been seeded deterministically. The details are implementation-specific, but for example you might seed it in the sampler's __init__ method, or perhaps in the __iter__ method if you derived from DistributedBatchSampler and were calling its .set_epoch().

In essence, you want to guarantee that each replica gets the same number of batches at every epoch. One way to debug this would be to print LightningModule.trainer.num_training_batches and check whether it ever differs between any two replicas. If you do find a difference, you should figure out why; or, to quickly work around such a bug, you could set Trainer(limit_train_batches=N) where N is less than or equal to the smallest number of batches that any replica might produce.
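
For concreteness, a sketch of that per-rank check (standard Lightning hook; the printed numbers should be identical across all ranks):

import pytorch_lightning as pl

class MyModule(pl.LightningModule):
    def on_train_epoch_start(self):
        # every rank prints its batch count at the start of each epoch
        print(
            f"rank={self.trainer.global_rank} "
            f"num_training_batches={self.trainer.num_training_batches}"
        )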

> I was also wondering whether you have an idea about my case. In particular, I observe that reducing num_workers reduces the number of jobs that get stuck, or delays when they get stuck.
> Specifically, the chance of my job getting stuck is very low with num_workers=2 and progressively increases with num_workers=4 or 8: with num_workers=4 the jobs get stuck after a couple of hours to 5-6 hours, while num_workers=8 leads to jobs getting stuck within 1-2 hours.

Hmm, this seems like you may have a different problem than I had. My situation wasn't affected by the num_workers used by the DataLoader, but rather by the number of replicas -- i.e. the DDP world size. I always set num_workers=10 and found that I was most likely to get stuck with 2 replicas (and large datasets). In your case, I think I read somewhere that setting DataLoader(pin_memory=True) might have helped some people.

Hope this helps!

@HarmanDotpy

HarmanDotpy commented Feb 2, 2023

@arlofaria
If I understand correctly, your issue was that each "rank" had a different number of batches to process, and that was causing the hang in your case. I am assuming that by replicas you mean "rank", is that right?

I used the Join() context manager (link) to make sure I had the same number of batches in each epoch, and indeed all my ranks had the same number of batches.

I think you are right, my issue might be different.
After a LOT of effort, I have still failed to solve the issue in my case. The only observation that leads anywhere is that a smaller num_workers makes training fail later, or not at all (i.e. the overall failure rate over a given time span is greatly reduced). I have just one more hypothesis to test.

@kayvane1 suggested that there can be deadlock issues. Since each worker loads data, every worker is accessing my data files. While debugging this problem, I was running 10-15 identical jobs at once, each with 4 workers, so about 60 workers in total. I am now starting to wonder whether these interfere with each other and cause a deadlock while accessing the data.
I will now make a copy of the large dataset I am using, train just one model, run it for many hours, and see if it fails.

@kayvane1, if possible, could you let me know how you found that it was a deadlock issue in your case, why the deadlock was happening, and whether my case sounds familiar?

Thanks!

@arlofaria

Yes, I meant "replica" as in "rank". Rather than Join, I had a call to self.all_gather, which I think is what was ultimately blocking at the end of the epoch, when some ranks had fewer batches and would never reach that synchronization point.

Hope you can figure this one out — these kinds of bugs are incredibly painful to debug, but you’ll feel so relieved later. The solution is always the last thing you think to try! 😉

@ytgui

ytgui commented Mar 7, 2023

This issue, unfortunately, still exists in 1.9.4, with CUDA 11.6 and PyTorch 1.13.1.

There is no way to reproduce it; it happens randomly and seems to be a bug related to NCCL.

A possible workaround is to add self.trainer.strategy.barrier() if the training and validation steps contain complex logic.
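
For example, something along these lines (a sketch; whether it helps will depend on where the ranks drift apart):

import pytorch_lightning as pl

class MyModule(pl.LightningModule):
    def on_validation_epoch_end(self):
        # force all ranks to re-synchronize before training resumes
        self.trainer.strategy.barrier()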

@stale

stale bot commented Apr 13, 2023

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!

@stale stale bot added the won't fix (This will not be worked on) label Apr 13, 2023
@mk-runner

PyTorch Lightning 2.0.4 also has this issue. After training the first epoch, the program stops working, but GPU usage stays at 100% and no errors or warnings are reported.

@stale stale bot removed the won't fix (This will not be worked on) label Jul 16, 2023
@stevehuang52

stevehuang52 commented Jul 28, 2023

> PyTorch Lightning 2.0.4 also has this issue. After training the first epoch, the program stops working, but GPU usage stays at 100% and no errors or warnings are reported.

Having the same issue here... For the same dataset, only single-GPU training works; DDP hangs at the end of the first training epoch, while GPU usage is 100% and GPU power draw drops.

@francotheengineer

francotheengineer commented Aug 25, 2023

> PyTorch Lightning 2.0.4 also has this issue. After training the first epoch, the program stops working, but GPU usage stays at 100% and no errors or warnings are reported.

> Having the same issue here... For the same dataset, only single-GPU training works; DDP hangs at the end of the first training epoch, while GPU usage is 100% and GPU power draw drops.

I was having the same issue with Fabric DDP: 100% GPU usage but no training progress. Interestingly, the 100% usage only occurs on the global rank 0 GPU.

I figured out that the issue was this pattern:

# hanging pattern: all_reduce is a collective, but only rank 0 calls it
def my_reduce_func(x):
    y = fabric.all_reduce(x)
    return y

x = torch.tensor(float(fabric.global_rank))
if fabric.global_rank == 0:
    y = my_reduce_func(x)   # rank 0 waits here forever for the other ranks
    print(y)

fabric.all_reduce(x) cannot be run only on global rank 0. It has to be run on all ranks, and the result can then be printed on rank 0:

def my_reduce_func(x):
    y = fabric.all_reduce(x)
    return y

x = torch.tensor(float(fabric.global_rank))
y = my_reduce_func(x)       # every rank participates in the collective
if fabric.global_rank == 0:
    print(y)

For me, my_reduce_func(x) was involved in gathering accuracy metrics from validation data across all ranks. I don't think it's documented in Lightning Fabric that we cannot use all_reduce/all_gather on only global rank 0, unless I'm mistaken.

@awaelchli
Contributor

awaelchli commented Aug 26, 2023

@francotheengineer It was recently documented, in the method overview and in the API docs. Please note that while we do our best to make things less error-prone, choosing Fabric's flexibility naturally also puts more responsibility on the user to handle correctly certain things that would otherwise be automated by the Lightning Trainer.

@francotheengineer

@awaelchli My mistake, thanks. Great work on Fabric, loving using it!

@awaelchli
Contributor

Closing the issue due to its age. If you are experiencing issues similar to this one, please open a new ticket with the necessary details.

The most common reasons for "ddp randomly stopping", in my experience, are incorrectly implemented custom samplers / batch samplers, incorrectly implemented iterable datasets, and incorrect rank-zero-only guards that lead to race conditions.

@yyou1996

yyou1996 commented Sep 2, 2024

In my case, the issue resulted from logging synchronization.

In my code, I logged something conditionally during DDP training, like:
if condition: self.log(some_values)

Since the condition is not always true on every rank, there are scenarios where some ranks log and some don't, and then the training hangs.

So I modified it to log something whether or not the condition is true, which resolved my issue:

if condition: self.log(meaningful_values)
else: self.log(meaningless_values)
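
A sketch of that fix in context (compute_loss and the condition are placeholders; the point is that every rank logs the same keys at every step):

import torch
import pytorch_lightning as pl

class MyModule(pl.LightningModule):
    def training_step(self, batch, batch_idx):
        loss = self.compute_loss(batch)                # placeholder for the real loss
        if torch.isfinite(loss):                       # stand-in for a rank-dependent condition
            self.log("extra_metric", loss.detach())
        else:
            self.log("extra_metric", torch.zeros(()))  # dummy value keeps all ranks logging
        return loss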
