
DDP training randomly stopping #11242

Closed
yoonseok312 opened this issue Dec 23, 2021 · 41 comments
Labels
bug (Something isn't working) · strategy: ddp (DistributedDataParallel)

Comments

@yoonseok312

yoonseok312 commented Dec 23, 2021

🐛 Bug

Edit: it also randomly stops in the middle of a training epoch.

After validation ends (100%), the training process randomly stops without any error log. The stopping point changes from run to run (sometimes after epoch 4 validation, sometimes after epoch 1 validation), and every time this happens, one of the GPUs shows 0% utilization while the others sit at 100%. Memory stays allocated on all GPUs.

I have tried adding sync_dist=True to self.log and removed saving model checkpoints by top_k, referencing #5865. Following #9851, I also added seed_everything(). I checked that each GPU gets the same number of batches for both training and validation. However, the issue persists.

Is there any solution to this problem?

[screenshots attached]

To Reproduce

I was unable to reproduce this with the BoringModel, but since the stopping point is irregular even with the same seed passed to pl.seed_everything, I believe it is a bug in the DDP process itself.

Expected behavior

The training process should continue after validation.

Environment

  • PyTorch Lightning Version (e.g., 1.5.0): 1.4.9
  • PyTorch Version (e.g., 1.10): 1.9
  • Python version (e.g., 3.9): 3.8
  • OS (e.g., Linux): Linux
  • CUDA/cuDNN version: 11.2
  • GPU models and configuration: Google Cloud Platform A100 x8
  • How you installed PyTorch (conda, pip, source): pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
  • If compiling from source, the output of torch.__config__.show():
  • Any other relevant information:

Additional context

Here is my code for the trainer:

    checkpoint_callback = ModelCheckpoint(
        dirpath=log_dir,
        filename=cfg.exp_name + "-{epoch}-{val_auc:.3f}",
        every_n_epochs=1,
        save_top_k=-1,
    )

    trainer = pl.Trainer(
        callbacks=[
            checkpoint_callback,
            LearningRateMonitor(logging_interval="step"),
        ],
        max_epochs=100,
        accelerator="ddp",
        gpus=str(cfg.gpus),
        logger=pl.loggers.WandbLogger(project="news_recommendation", name=cfg.exp_name),
        val_check_interval=cfg[cfg.experiment_type[cfg.current_stage]].val_check_interval,
        limit_train_batches=1.0,
        deterministic=True,
        num_sanity_val_steps=0,
        resume_from_checkpoint=cfg[cfg.experiment_type[cfg.current_stage]].load_ckpt,
    )

cc @justusschock @kaushikb11 @awaelchli @akihironitta @rohitgr7

@yoonseok312 yoonseok312 added the bug (Something isn't working) label Dec 23, 2021
@yoonseok312 yoonseok312 changed the title from "DDP training randomly stopping after validation" to "DDP training randomly stopping" Dec 23, 2021
@akihironitta akihironitta added the strategy: ddp (DistributedDataParallel) label Dec 26, 2021
@tchaton
Contributor

tchaton commented Jan 4, 2022

Hey @yoonseok312 ,

Would it be possible for you to reproduce this behavior with the BoringModel?

Best,
T.C

@stale

stale bot commented Feb 6, 2022

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@stale stale bot added the won't fix (This will not be worked on) label Feb 6, 2022
@siddharthverma314

Hey, I'm having the same problem. Were you able to solve it?

@stale stale bot removed the won't fix (This will not be worked on) label Feb 13, 2022
@leobxpan

Same problem with PL version 1.5.8 and pl.seed_everything() set.

@Eralien

Eralien commented Mar 3, 2022

Same here. Non-DP/DDP training has no problems whatsoever.

@YuFan-Microsoft

Same problem, any solution?

@dselivanov

dselivanov commented May 16, 2022

I've been stuck on this for the last several days. The problem seems related to NCCL communication and, surprisingly, correlates with logging. The clue came from setting TORCH_DISTRIBUTED_DEBUG=DETAIL, which made training fail with a meaningful error: NCCL was out of sync, and the tracebacks showed that some ranks were still writing logs while others were already doing the forward pass.
Then I profiled line by line, and it turned out that if I remove all self.log() entries from validation_epoch_end, training works fine!
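
A minimal sketch of how to set that flag (the env vars are standard PyTorch; set them before any distributed process group is created, e.g. at the top of the training script):

import os

os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # report mismatched collectives with tracebacks
os.environ["NCCL_DEBUG"] = "INFO"                 # optional: verbose NCCL logging

# ... then construct the Trainer and call fit() as usual; when ranks fall
# out of sync, the failure comes with a traceback instead of a silent hang.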

@awaelchli
Contributor

@dselivanov It could be that we are missing a barrier at the end of validation; I'm not sure. I'm a bit clueless, though, about how that would relate to logging, since by default Lightning doesn't do any syncing for self.log.

@GGGGGGXY

Same problem here.
Here is my stack trace:
[stack trace screenshots attached]

@gustavhartz

gustavhartz commented May 25, 2022

Same issue here. Non-DDP training also runs without any problems. I tried removing all logging in validation_epoch_end, which resolved the issue, as it did for @dselivanov. Using pytorch-lightning==1.6.0 and WandB logging. My conda env can be found here.

Pseudocode for my validation_epoch_end:

    def validation_epoch_end(self, outputs):
        collected = self.all_gather(outputs)

        # Calculate something on the main process, taking approx. 600 s,
        # including some logging with rank_zero_only=True
        if self.trainer.is_global_zero and not self.trainer.sanity_checking:
            calc_something(collected)
            self.log("some stuff", some_value, rank_zero_only=True)
        # With or without the barrier, the issue is the same
        dist.barrier()  # torch.distributed imported as dist

Using the wandb lib directly for logging also works fine, so replacing all calls like

self.log("some stuff",some_value, rank_zero_only=True)
# with
wandb.log({"some stuff": some_value})

resolves the issue

@stale

stale bot commented Jun 28, 2022

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@stale stale bot added the won't fix This will not be worked on label Jun 28, 2022
@aleSuglia

We're still experiencing this issue. Why isn't it possible to log in validation_epoch_end?

@stale stale bot removed the won't fix This will not be worked on label Jun 29, 2022
@awaelchli
Contributor

Could you try setting num_sanity_val_steps=0 and see if it resolves the issue?
Also remove the if self.trainer.is_global_zero guard and the rank_zero_only=True argument from the self.log call. Only guard calc_something(collected) with that check, and avoid any distributed calls inside calc_something.
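
A rough sketch of that suggestion (calc_something and the metric name are placeholders carried over from the pseudocode above):

import pytorch_lightning as pl

class MyModule(pl.LightningModule):
    def validation_epoch_end(self, outputs):
        collected = self.all_gather(outputs)      # collective: must run on every rank

        if self.trainer.is_global_zero and not self.trainer.sanity_checking:
            calc_something(collected)             # rank-zero-only work, no collectives inside

        # log on every rank, without rank_zero_only, so no rank is left
        # waiting on a collective that the others never issue
        self.log("some stuff", float(len(outputs)))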

@angadkalra

This is happening to me as well. Training hangs randomly during an epoch; sometimes it resumes after an hour, or I have to exit and continue training from the last checkpoint. I'm running on 4 V100 GPUs using DDP.

@HareshKarnan

Same issue here with version 1.6.4

@laurentd-lunit

Same issue here too with 1.6.5. Training hangs at the beginning of a new epoch, stuck at 0%; all GPUs but one show 100% usage and the remaining one is at 0%.
The weirdest thing is that it happens only when check_val_every_n_epoch is set to 5; when it is set to 1, it works fine...

Any chance this will be fixed in later releases?

@thiyagu-lily

The issue still persists in 1.7.0.

@awaelchli
Contributor

@thiyagu-lily That's unfortunate. Without any further details we can only guess what the problem might be (see the comments above). Unfortunately, so far nobody has been able to provide a reproducible case that we can work with.
This is essential for us to help, especially in these cases of distributed training, and we would be very thankful if anybody could provide us with this information.

@thiyagu-lily

Hi @awaelchli,
I have been able to narrow down the issue to torchmetrics and val_check_interval.
I call metric.update() in validation_step, and then metric.compute() and metric.reset() in validation_epoch_end. When I reduce val_check_interval to something small, I don't get any deadlocks. I'm using WandbLogger with torchmetrics.
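
For reference, a sketch of roughly that pattern (assuming a recent torchmetrics where Accuracy takes a task argument; the model and metric are illustrative, not my actual code):

import torch
import torchmetrics
import pytorch_lightning as pl

class MyModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.val_acc = torchmetrics.Accuracy(task="binary")

    def validation_step(self, batch, batch_idx):
        x, y = batch
        preds = torch.sigmoid(self(x)).squeeze(-1)
        self.val_acc.update(preds, y)                 # accumulate per-batch metric state

    def validation_epoch_end(self, outputs):
        self.log("val_acc", self.val_acc.compute())   # compute() syncs metric state across ranks
        self.val_acc.reset()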

@justusschock
Member

Hi @thiyagu-lily, could you check whether this also happens if you use another logger, or no logger at all?


@thiyagu-lily

Hi @justusschock,
It happens with any logger! I don't think the issue is caused by the logger.
Could it be an issue with the metrics not being reset?

@justusschock
Member

I think it's more that something asynchronous with the metrics results in the processes running out of sync.

@thiyagu-lily, are you able to produce a minimal example we could debug? Preferably with random data?

@wqdfdj

wqdfdj commented Dec 7, 2022

I hit this problem too. I think it is related to validation: I can train correctly when I remove all the validation code.

@kayvane1

kayvane1 commented Dec 7, 2022

I faced a similar issue, and it was not related to PyTorch Lightning; in my case it was a deadlock, as explained here: https://pytorch.org/docs/stable/notes/multiprocessing.html#avoiding-and-fighting-deadlocks

You could try amending your DataLoader with pin_memory=False and reducing the number of workers.
https://stackoverflow.com/questions/72183733/databricks-notebook-hanging-with-pytorch/72473053#72473053
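
A small sketch of those DataLoader changes (the dataset and values are placeholders):

from torch.utils.data import DataLoader

train_loader = DataLoader(
    train_dataset,        # placeholder for your dataset
    batch_size=32,
    shuffle=True,
    num_workers=2,        # fewer workers than before
    pin_memory=False,     # as suggested above
)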

@stale

stale bot commented Jan 8, 2023

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!

@stale stale bot added the won't fix (This will not be worked on) label Jan 8, 2023
@arlofaria

Having recently encountered and resolved a very similar bug, I'd suggest looking at #10947, for which the root cause was found to be a batch sampler that was incorrectly seeded, resulting in different replicas' dataloaders computing a different number of distributed batches.

@HarmanDotpy

@arlofaria
Hi, thanks for your comments.
I was wondering where the seed needs to be added (or what exactly has to be done to solve the issue).
I was also wondering whether you have an idea about my case. In particular, I observe that reducing num_workers reduces the number of jobs that get stuck, or delays when they get stuck.
Specifically, the chance of my job getting stuck is very low with num_workers=2 and progressively increases with num_workers=4 or 8: with num_workers=4 the jobs get stuck after a couple of hours to 5-6 hours, while num_workers=8 leads to jobs getting stuck within 1-2 hours.

Thanks for your help

@stale stale bot removed the won't fix (This will not be worked on) label Jan 30, 2023
@DaBihy

DaBihy commented Jan 30, 2023

I encountered the same issue and tried the solutions mentioned previously, but they didn't resolve it. Then, I found a solution while reading the documentation (link) and successfully implemented it without any issues.

def validation_step(self, batch, batch_idx):
    x, y = batch
    y_hat = self.model(x)
    loss = F.cross_entropy(y_hat, y)
    pred = ...
    return {"loss": loss, "pred": pred}

def validation_step_end(self, batch_parts):
    # predictions from each GPU
    predictions = batch_parts["pred"]
    # losses from each GPU
    losses = batch_parts["loss"]

    gpu_0_prediction = predictions[0]
    gpu_1_prediction = predictions[1]

    # do something with both outputs
    return (losses[0] + losses[1]) / 2


def validation_epoch_end(self, validation_step_outputs):
    for out in validation_step_outputs:
        do_something(out)
    self.log("some stuff", some_value)

@arlofaria

> I was wondering where the seed needs to be added (or what exactly has to be done to solve the issue).

In my situation, I was passing the optional batch_sampler argument to DataLoader, and that custom sampler was using an RNG that should have been seeded deterministically. The details are implementation-specific, but for example you might seed it in the sampler's __init__ method, or perhaps in the __iter__ method if you derived from DistributedBatchSampler and were calling its .set_epoch().

In essence, you want to guarantee that each replica gets the same number of batches at every epoch. One way to debug this would be to print LightningModule.trainer.num_training_batches and check whether it ever differs between any two replicas. If you do find a difference, you should figure out why; or, to quickly work around such a bug, you could set Trainer(limit_train_batches=N) where N is less than or equal to the smallest number of batches that any replica might produce.
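
For concreteness, a sketch of that per-rank check (standard Lightning hook; the printed numbers should be identical across all ranks):

import pytorch_lightning as pl

class MyModule(pl.LightningModule):
    def on_train_epoch_start(self):
        # every rank prints its batch count at the start of each epoch
        print(
            f"rank={self.trainer.global_rank} "
            f"num_training_batches={self.trainer.num_training_batches}"
        )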

> I was also wondering whether you have an idea about my case. In particular, I observe that reducing num_workers reduces the number of jobs that get stuck, or delays when they get stuck.
> Specifically, the chance of my job getting stuck is very low with num_workers=2 and progressively increases with num_workers=4 or 8: with num_workers=4 the jobs get stuck after a couple of hours to 5-6 hours, while num_workers=8 leads to jobs getting stuck within 1-2 hours.

Hmm, this seems like you may have a different problem than I had. My situation wasn't affected by the num_workers used by the DataLoader, but rather by the number of replicas -- i.e. the DDP world size. I always set num_workers=10 and found that I was most likely to get stuck with 2 replicas (and large datasets). In your case, I think I read somewhere that setting DataLoader(pin_memory=True) might have helped some people.

Hope this helps!

@HarmanDotpy

HarmanDotpy commented Feb 2, 2023

@arlofaria
If I understand correctly, your issue was that each "rank" had a different number of batches to process, and that was causing the hang in your case. I am assuming that by replicas you mean "rank", is that right?

I used the Join() context manager (link) to make sure I had the same number of batches in each epoch, and indeed all my ranks had the same number of batches.

I think you are right, my issue might be different.
After a LOT of effort, I have still failed to solve the issue in my case. The only observation that leads anywhere is that a smaller num_workers makes training fail later, or not at all (i.e. the overall failure rate over a given time span is greatly reduced). I have just one more hypothesis to test.

@kayvane1 suggested that there can be deadlock issues. Since each worker loads data, every worker is accessing my data files. While debugging this problem, I was running 10-15 identical jobs at once, each with 4 workers, so about 60 workers in total. I am now starting to wonder whether these interfere with each other and cause a deadlock while accessing the data.
I will now make a copy of the large dataset I am using, train just one model, run it for many hours, and see if it fails.

@kayvane1, if possible, could you let me know how you found that it was a deadlock issue in your case, why the deadlock was happening, and whether my case sounds familiar?

Thanks!

@arlofaria

Yes, I meant "replica" as in "rank". Rather than Join, I had a call to self.all_gather, which I think is what was ultimately blocking at the end of the epoch, when some ranks had fewer batches and would never reach that synchronization point.

Hope you can figure this one out — these kinds of bugs are incredibly painful to debug, but you’ll feel so relieved later. The solution is always the last thing you think to try! 😉

@ytgui

ytgui commented Mar 7, 2023

This issue, unfortunately, still exists in 1.9.4, with CUDA 11.6 and PyTorch 1.13.1.

There is no way to reproduce it; it happens randomly and seems to be a bug related to NCCL.

A possible workaround is to add self.trainer.strategy.barrier() if the training and validation steps contain complex logic.
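
For example, something along these lines (a sketch; whether it helps will depend on where the ranks drift apart):

import pytorch_lightning as pl

class MyModule(pl.LightningModule):
    def on_validation_epoch_end(self):
        # force all ranks to re-synchronize before training resumes
        self.trainer.strategy.barrier()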

@stale

stale bot commented Apr 13, 2023

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!

@stale stale bot added the won't fix (This will not be worked on) label Apr 13, 2023
@mk-runner

PyTorch Lightning 2.0.4 also has this issue. After training the first epoch, the program stops working, but GPU usage stays at 100% and no errors or warnings are reported.

@stale stale bot removed the won't fix (This will not be worked on) label Jul 16, 2023
@stevehuang52

stevehuang52 commented Jul 28, 2023

> PyTorch Lightning 2.0.4 also has this issue. After training the first epoch, the program stops working, but GPU usage stays at 100% and no errors or warnings are reported.

Having the same issue here... For the same dataset, only single-GPU training works; DDP hangs at the end of the first training epoch, while GPU usage is 100% and GPU power draw drops.

@francotheengineer

francotheengineer commented Aug 25, 2023

> PyTorch Lightning 2.0.4 also has this issue. After training the first epoch, the program stops working, but GPU usage stays at 100% and no errors or warnings are reported.

> Having the same issue here... For the same dataset, only single-GPU training works; DDP hangs at the end of the first training epoch, while GPU usage is 100% and GPU power draw drops.

I was having the same issue with Fabric DDP: 100% GPU usage but no training progress. Interestingly, the 100% usage only occurs on the global rank 0 GPU.

I figured out that the issue was this pattern:

# hanging pattern: all_reduce is a collective, but only rank 0 calls it
def my_reduce_func(x):
    y = fabric.all_reduce(x)
    return y

x = torch.tensor(float(fabric.global_rank))
if fabric.global_rank == 0:
    y = my_reduce_func(x)   # rank 0 waits here forever for the other ranks
    print(y)

fabric.all_reduce(x) cannot be run only on global rank 0. It has to be run on all ranks, and the result can then be printed on rank 0:

def my_reduce_func(x):
    y = fabric.all_reduce(x)
    return y

x = torch.tensor(float(fabric.global_rank))
y = my_reduce_func(x)       # every rank participates in the collective
if fabric.global_rank == 0:
    print(y)

For me, my_reduce_func(x) was involved in gathering accuracy metrics from validation data across all ranks. I don't think it's documented in Lightning Fabric that we cannot use all_reduce/all_gather on only global rank 0, unless I'm mistaken.

@awaelchli
Contributor

awaelchli commented Aug 26, 2023

@francotheengineer It was recently documented, in the method overview and in the API docs. Please note that while we do our best to make things less error-prone, choosing Fabric's flexibility naturally also puts more responsibility on the user to handle correctly certain things that would otherwise be automated by the Lightning Trainer.

@francotheengineer

@awaelchli My mistake, thanks. Great work on Fabric, loving using it!

@awaelchli
Contributor

Closing the issue due to its age. If you are experiencing issues similar to this one, please open a new ticket with the necessary details.

The most common reasons for "ddp randomly stopping", in my experience, are incorrectly implemented custom samplers / batch samplers, incorrectly implemented iterable datasets, and incorrect rank-zero-only guards that lead to race conditions.

@yyou1996

yyou1996 commented Sep 2, 2024

In my case, the issue resulted from logging synchronization.

In my code, I logged something conditionally during DDP training, like:
if condition: self.log(some_values)

Since the condition is not always true on every rank, there are scenarios where some ranks log and some don't, and then the training hangs.

So I modified it to log something whether or not the condition is true, which resolved my issue:

if condition: self.log(meaningful_values)
else: self.log(meaningless_values)
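
A sketch of that fix in context (compute_loss and the condition are placeholders; the point is that every rank logs the same keys at every step):

import torch
import pytorch_lightning as pl

class MyModule(pl.LightningModule):
    def training_step(self, batch, batch_idx):
        loss = self.compute_loss(batch)                # placeholder for the real loss
        if torch.isfinite(loss):                       # stand-in for a rank-dependent condition
            self.log("extra_metric", loss.detach())
        else:
            self.log("extra_metric", torch.zeros(()))  # dummy value keeps all ranks logging
        return loss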
