
PyTorch Lightning 1.4.1 crashes during training #8821

Closed
rishikanthc opened this issue Aug 9, 2021 · 33 comments · Fixed by #9239 or #9349
Labels
bug Something isn't working distributed Generic distributed-related topic priority: 0 High priority task

Comments

@rishikanthc

🐛 Bug

When I start training on 2 GPUs using pytorch-lightning 1.4.1, training crashes after a few epochs. Note that this happens only on 1.4.1:
if I run my code using pytorch-lightning 1.4.0, everything works fine.

I see several slightly different variants of the same error across runs. For brevity I'm attaching just one trace.
Here's the error trace:

Global seed set to 20
Using native 16bit precision.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
Files already downloaded and verified
Files already downloaded and verified
Global seed set to 20
Global seed set to 20
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
Using native 16bit precision.
Global seed set to 20
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All DDP processes registered. Starting ddp with 2 processes
----------------------------------------------------------------------------------------------------

LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]

  | Name     | Type             | Params
----------------------------------------------
0 | resnet18 | ResNet           | 11.2 M
1 | loss     | CrossEntropyLoss | 0
----------------------------------------------
11.2 M    Trainable params
0         Non-trainable params
11.2 M    Total params
44.881    Total estimated model params size (MB)
Global seed set to 20
Global seed set to 20
/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/trainer/data_loading.py:322: UserWarning: The number of training samples (44) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
  rank_zero_warn(
Epoch 4:  47%|█████████████████████                        | 23/49 [00:02<00:02,  9.20it/s, loss=2.51, v_num=17, val_loss=3.260, val_acc=0.239, train_loss=2.760, train_acc=0.296]terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: initialization error
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:1089 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7ff9a6d3fa22 in /home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x10e9e (0x7ff9a6fa0e9e in /home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a7 (0x7ff9a6fa2147 in /home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7ff9a6d295a4 in /home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0xa2822a (0x7ffa4bb4722a in /home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: python() [0x4efd28]
frame #6: python() [0x5fb977]
frame #7: python() [0x5ab432]
<omitting python frames>
frame #9: python() [0x4f34b2]
frame #10: python() [0x5a6eaa]
frame #25: python() [0x50b868]
frame #30: python() [0x59be64]
frame #31: python() [0x5a6f17]
frame #42: python() [0x59c16d]
frame #43: python() [0x5a6f17]
frame #49: python() [0x5a7031]
frame #50: python() [0x69e536]
frame #52: python() [0x5c3cb0]
frame #60: python() [0x5038a2]

Traceback (most recent call last):
  File "resnet18cifar.py", line 177, in <module>
    trainer.fit(model)
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 553, in fit
    self._run(model)
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 918, in _run
    self._dispatch()
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 986, in _dispatch
    self.accelerator.start_training(self)
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 92, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 161, in start_training
    self._results = trainer.run_stage()
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 996, in run_stage
    return self._run_train()
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1058, in _run_train
    self.training_type_plugin.reconciliate_processes(traceback.format_exc())
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 453, in reconciliate_processes
    raise DeadlockDetectedException(f"DeadLock detected from rank: {self.global_rank} \n {trace}")
pytorch_lightning.utilities.exceptions.DeadlockDetectedException: DeadLock detected from rank: 0
 Traceback (most recent call last):
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1045, in _run_train
    self.fit_loop.run()
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 200, in advance
    epoch_output = self.epoch_loop.run(train_dataloader)
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 130, in advance
    batch_output = self.batch_loop.run(batch, self.iteration_count, self._dataloader_idx)
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 101, in run
    super().run(batch, batch_idx, dataloader_idx)
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 148, in advance
    result = self._run_optimization(batch_idx, split_batch, opt_idx, optimizer)
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 202, in _run_optimization
    self._optimizer_step(optimizer, opt_idx, batch_idx, closure)
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 396, in _optimizer_step
    model_ref.optimizer_step(
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 1593, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/core/optimizer.py", line 209, in step
    self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/core/optimizer.py", line 129, in __optimizer_step
    trainer.accelerator.optimizer_step(optimizer, self._optimizer_idx, lambda_closure=closure, **kwargs)
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 292, in optimizer_step
    make_optimizer_step = self.precision_plugin.pre_optimizer_step(
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/plugins/precision/native_amp.py", line 59, in pre_optimizer_step
    result = lambda_closure()
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 236, in _training_step_and_backward_closure
    result = self.training_step_and_backward(split_batch, batch_idx, opt_idx, optimizer, hiddens)
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 547, in training_step_and_backward
    self.backward(result, optimizer, opt_idx)
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 588, in backward
    result.closure_loss = self.trainer.accelerator.backward(result.closure_loss, optimizer, *args, **kwargs)
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 276, in backward
    self.precision_plugin.backward(self.lightning_module, closure_loss, *args, **kwargs)
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 78, in backward
    model.backward(closure_loss, optimizer, *args, **kwargs)
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 1465, in backward
    loss.backward(*args, **kwargs)
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/torch/autograd/__init__.py", line 147, in backward
    Variable._execution_engine.run_backward(
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 11444) is killed by signal: Aborted.

To Reproduce

Here's my code: a simple script that trains ResNet-18 on CIFAR using 2 GPUs with DDP.

Expected behavior

It's supposed to train for 100 epochs and finish without crashing.

Environment

* CUDA:
	- GPU:
		- RTX A5000
		- RTX A5000
	- available:         True
	- version:           11.1
* Packages:
	- numpy:             1.21.1
	- pyTorch_debug:     False
	- pyTorch_version:   1.9.0+cu111
	- pytorch-lightning: 1.4.1
	- tqdm:              4.62.0
* System:
	- OS:                Linux
	- architecture:
		- 64bit
		- ELF
	- processor:         x86_64
	- python:            3.8.10
	- version:           #27~20.04.1-Ubuntu SMP Tue Jul 13 17:41:23 UTC 2021

Additional context

The error happens irrespective of whether I use DP or DDP.

@rishikanthc rishikanthc added bug Something isn't working help wanted Open to be worked on labels Aug 9, 2021
@MohammedAljahdali

This issue happened to me too. I thought it was related to how the dataloaders are handled, and I decided to postpone using multi-GPU on my project, so I didn't investigate it further.

@dhkim0225
Contributor

dhkim0225 commented Aug 12, 2021

I have the same issue, and I managed to reproduce it with the following code, which is only slightly changed from @rishikanthc's.
Interestingly, when I train for just 1 epoch, the problem doesn't occur.
PL 1.4.2 has the same issue.

import torch
import math
from torchvision.datasets import CIFAR100
from torchmetrics.functional import accuracy
import torchvision.transforms as transforms
from torch.utils.data import random_split, DataLoader
from torch import nn
from torchsummary import summary
import pytorch_lightning as pl
import wandb
from pytorch_lightning.loggers import CSVLogger, WandbLogger
from pytorch_lightning.plugins import DDPPlugin
from torchvision.models import resnet18 as res18


def weights_init(m):
    if isinstance(m, nn.Conv2d):
        if m.weight is not None: nn.init.xavier_normal_(m.weight.data)
        if m.bias is not None: nn.init.xavier_normal_(m.bias.data)

class resnet(pl.LightningModule):
    def __init__(self, nclasses=10, bs=256, lr=3e-4, epochs=100, workers=2):
        super().__init__()
        self.resnet18 = res18(pretrained=True)
        self.loss = nn.CrossEntropyLoss()
        self.lr = lr
        self.bs = bs
        self.eps = epochs
        self.workers = workers
        self.ngpus = torch.cuda.device_count()
        self.save_hyperparameters()
    
    def forward(self, x):
        out = self.resnet18(x)

        return out 
    
    def prepare_data(self):
        CIFAR100(root='/tmp', train=True, download=True)
        CIFAR100(root='/tmp', train=False, download=True)
    
    def setup(self, stage):
        cifar10_mean, cifar10_std = (0.4914, 0.4822, 0.4465), (0.247, 0.243, 0.261)
        cifar100_mean, cifar100_std = (0.5070, 0.4865, 0.4409), (0.2673, 0.2564, 0.2761)

        transforms_train = transforms.Compose([
            transforms.Pad(4),
            transforms.RandomHorizontalFlip(),
            transforms.RandomCrop(32),
            transforms.ToTensor(),
            transforms.Normalize(cifar100_mean, cifar100_std)
        ])
        transforms_test = transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize(cifar100_mean, cifar100_std)
        ])

        trainset = CIFAR100(root='/tmp', train=True, download=False, transform = transforms_train)
        self.trainset, self.valset = torch.utils.data.random_split(trainset, [45000, 5000])
        self.testset = CIFAR100(root='/tmp', train=False, download=False, transform = transforms_test)
        self.t_steps = math.ceil(len(self.trainset)/self.ngpus / self.bs)
    
    def train_dataloader(self):
        return DataLoader(self.trainset, batch_size=self.bs, shuffle=True, num_workers=self.workers)
    
    def val_dataloader(self):
        return DataLoader(self.valset, batch_size = self.bs, shuffle=False, num_workers=self.workers)
    
    def test_dataloader(self):
        return DataLoader(self.testset, batch_size = self.bs, shuffle=False, num_workers=self.workers)
    
    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.parameters(), lr=self.lr, weight_decay=1e-5)
        schedulers = [
            {
                'scheduler': torch.optim.lr_scheduler.OneCycleLR(optimizer, cycle_momentum=True, pct_start=0.45,
                anneal_strategy='cos', max_lr=self.lr, epochs=self.eps, steps_per_epoch=self.t_steps, three_phase=True),
                'interval': 'step',
            }
        ]

        return [optimizer], schedulers
    
    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = self.loss(y_hat, y)
        _, preds = torch.max(y_hat, 1)
        acc = accuracy(y, preds)

        self.log('train_loss', loss, prog_bar=True, on_step=False, on_epoch=True, logger=True)
        self.log('train_acc', acc, prog_bar=True, on_step=False, on_epoch=True, logger=True)

        return loss
    
    def training_epoch_end(self, outputs):
        loss = torch.tensor([x['loss'] for x in outputs]).mean()

    def validation_step(self, batch, batch_idx):
        x, y = batch
        out = self(x)
        loss = self.loss(out, y)
        _, preds = torch.max(out, 1)
        acc = accuracy(y, preds)

        self.log('val_loss', loss, prog_bar=True, on_step=False, on_epoch=True, logger=True)
        self.log('val_acc', acc, prog_bar=True, on_step=False, on_epoch=True, logger=True)

        return loss
    
    def validation_epoch_end(self, outputs):
        loss = torch.tensor(outputs).mean()
    
    def test_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        _, preds = torch.max(y_hat, 1)
        acc = accuracy(y, preds)

        self.log('test_acc', acc, prog_bar = True, on_step=False, on_epoch = True, logger = True)

        return acc

if __name__ == '__main__':
    pl.seed_everything(20)
    num_gpus = torch.cuda.device_count()
    nepochs = 100
    model = resnet(nclasses=100, bs=512, lr=1, epochs=nepochs, workers=8)

    lr_monitor = pl.callbacks.LearningRateMonitor(logging_interval='step', log_momentum=True)
    chkpt = pl.callbacks.ModelCheckpoint(monitor='val_loss')

    es = pl.callbacks.EarlyStopping(monitor='val_loss', patience=10)
    wandb_logger = WandbLogger(project='pruning-cifar100')
    csv_logger = CSVLogger(save_dir = './csv_logs', name='resnet18-cifar100')
    trainer = pl.Trainer(
        gpus=num_gpus, accelerator='ddp', auto_select_gpus=True, precision=16, max_epochs=nepochs,
        default_root_dir='./checkpoints/resnet18',#logger=[wandb_logger, csv_logger],
        callbacks = [lr_monitor], plugins=DDPPlugin(find_unused_parameters=False))

    trainer.fit(model)
    trainer.test(model, ckpt_path='best')

Reproduction environment:

(cu102) dh@dh-desktop:~/Downloads $ python collect_env_details.py
* CUDA:
	- GPU:
		- GeForce GTX 1080 Ti
	- available:         True
	- version:           10.2
* Packages:
	- numpy:             1.21.1
	- pyTorch_debug:     False
	- pyTorch_version:   1.9.0+cu102
	- pytorch-lightning: 1.4.2
	- tqdm:              4.61.2
* System:
	- OS:                Linux
	- architecture:
		- 64bit
		- ELF
	- processor:         x86_64
	- python:            3.9.5
	- version:           #27~20.04.1-Ubuntu SMP Tue Jul 13 17:41:23 UTC 2021

@InCogNiTo124
Contributor

Maybe it is connected with this?

pytorch/pytorch#8976

@dhkim0225
Contributor

@InCogNiTo124
The code works well with PL 1.4.0. I think it's not a PyTorch bug.

@stonelazy

stonelazy commented Aug 12, 2021

@InCogNiTo124
The code works well with PL 1.4.0. I think it's not a PyTorch bug.

I hope you meant it is a PyTorch bug?

@rishikanthc
Author

Yeah, it runs successfully for a few epochs. For me it never crossed 10 epochs, which is what is weird.

@stonelazy

For me, on 1.4.1 it never crossed 5 hours of training, and I had been trying for 3-4 days continuously. Then I downgraded to 1.4.0 and it has now successfully completed 15 hours. FYI.

@dhkim0225
Contributor

@stonelazy I think this is a PL issue.

In my case the training phase has no problem; the error occurs when the test phase starts.

@HMJiangGatech

Same thing happened to me with multi-GPU DDP and multiple dataloader workers. Downgrading to 1.4.0 resolved the problem.

@Tau-J

Tau-J commented Aug 18, 2021

Same thing happened to me when using multi-GPU DDP with multiple workers after I upgraded PL to v1.4.2. Sadly, even after downgrading to 1.4.0, the bug still exists. :(

@yoichi-yamakawa

yoichi-yamakawa commented Aug 20, 2021

This problem could be caused by self.log when using DDP training.
When all the processes call this method, I think the synchronization induces a deadlock.
I faced a similar case, but I seem to have solved it by changing the code as below.

self.log("my-log-name", value)

self.log("my-log-name", value, rank_zero_only=True)

The rank_zero_only feature was added by this PR: #7966
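In context, this is roughly how I call it now inside my LightningModule (just a sketch; compute_loss is a placeholder for whatever the step actually does):

def training_step(self, batch, batch_idx):
    loss = self.compute_loss(batch)
    # only rank 0 records this entry, which is what seemed to avoid the deadlock for me
    self.log("train_loss", loss, rank_zero_only=True)
    return loss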
My environment is below:

* CUDA:
        - GPU:
                - Tesla V100-SXM2-16GB
                - Tesla V100-SXM2-16GB
                - Tesla V100-SXM2-16GB
                - Tesla V100-SXM2-16GB
        - available:         True
        - version:           10.2
* Packages:
        - numpy:             1.19.5
        - pyTorch_debug:     False
        - pyTorch_version:   1.7.1
        - pytorch-lightning: 1.4.2
        - tqdm:              4.56.0
* System:
        - OS:                Linux
        - architecture:
                - 64bit
                - 
        - processor:         x86_64
        - python:            3.8.6
        - version:           #64-Ubuntu SMP Wed Dec 9 08:16:25 UTC 2020  

@nik-sm

nik-sm commented Aug 21, 2021

Is this related to #4471?

@yoichi-yamakawa

yoichi-yamakawa commented Aug 25, 2021

I'm not sure, but considering their stack traces, they seem related.
In fact, rank_zero_only=True works for other people as well as for me.
I don't know whether this is the essential solution or not.

@Gateway2745

Gateway2745 commented Aug 25, 2021

Faced the exact same issue. The model crashed during validation. In addition to what @yoichi-yamakawa mentioned above, setting num_workers=0 for validation also avoids the issue, but of course that is not ideal.
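For reference, a minimal sketch of that workaround, assuming a module that builds its own validation loader (self.valset and self.bs are placeholders):

from torch.utils.data import DataLoader

def val_dataloader(self):
    # num_workers=0 loads batches in the main process; slower, but it avoided the crash for me
    return DataLoader(self.valset, batch_size=self.bs, shuffle=False, num_workers=0)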

@carmocca
Contributor

carmocca commented Aug 31, 2021

@tchaton we might want to bump the priority on this; it seems like many users are experiencing it.

@carmocca carmocca added distributed Generic distributed-related topic priority: 0 High priority task and removed help wanted Open to be worked on labels Aug 31, 2021
@whoknowsb

This problem could be caused by self.log when using DDP training. [...] I seem to have solved it by changing the code as below:

self.log("my-log-name", value, rank_zero_only=True)

I was trying the solution proposed here.
I noticed that training ran much longer without encountering any problem, but in the end I still got a system crash twice (the system literally froze).
This may be related to why someone is facing the issue even on TPUs: #9197 (comment)

@thepurpleowl
Contributor

thepurpleowl commented Aug 31, 2021

I think rank_zero_only=True might not be the solution. It was introduced in release 1.4.0, and I tested with 1.4.4, 1.4.1, 1.4.0, and 1.3.8; I am getting the same error in all four.

I am following all the instructions in the documentation, so I'm not sure where it's going wrong. I have my skeleton code in this discussion #9197; if anybody finds any obvious coding mistakes, let me know.
PS: The same code gives no error on a single GPU with PL 1.4.4.

@tchaton tchaton self-assigned this Aug 31, 2021
@tchaton
Contributor

tchaton commented Aug 31, 2021

Hey everyone,

I can confirm I could reproduce the error and I will start investigating. Thanks for your patience and we apologise for the inconvenience.

Best,
T.C

@carmocca
Contributor

carmocca commented Aug 31, 2021

Here's the minimal reproduction code:

import torch
from torch.utils.data import DataLoader, Dataset

import pytorch_lightning as pl
from pytorch_lightning import LightningModule


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log('train_loss', torch.tensor(1), on_epoch=True)
        return {"loss": loss}

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)

    def train_dataloader(self):
        return DataLoader(RandomDataset(32, 64), batch_size=2, num_workers=1)


if __name__ == '__main__':
    model = BoringModel()
    trainer = pl.Trainer(
        gpus=1,
        accelerator='ddp',
        limit_train_batches=1,
        max_epochs=100,
        checkpoint_callback=False,
        logger=False,
    )
    trainer.fit(model)
$ CUDA_LAUNCH_BLOCKING=1 python bug.py
...
terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: initialization error
Exception raised from insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:1089 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f726617fa22 in /home/carlos/venv/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x10e9e (0x7f72663e0e9e in /home/carlos/venv/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a7 (0x7f72663e2147 in /home/carlos/venv/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7f72661695a4 in /home/carlos/venv/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0xa2822a (0x7f730af8722a in /home/carlos/venv/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: /home/carlos/venv/bin/python() [0x4ef828]
frame #6: /home/carlos/venv/bin/python() [0x5fb497]
frame #7: PyTraceBack_Here + 0x6db (0x54242b in /home/carlos/venv/bin/python)
frame #8: _PyEval_EvalFrameDefault + 0x3aec (0x56d32c in /home/carlos/venv/bin/python)
frame #9: /home/carlos/venv/bin/python() [0x50a23e]
frame #10: _PyEval_EvalFrameDefault + 0x5757 (0x56ef97 in /home/carlos/venv/bin/python)
frame #11: _PyFunction_Vectorcall + 0x1b6 (0x5f5e56 in /home/carlos/venv/bin/python)
frame #12: _PyEval_EvalFrameDefault + 0x5757 (0x56ef97 in /home/carlos/venv/bin/python)
frame #13: _PyEval_EvalCodeWithName + 0x26a (0x56822a in /home/carlos/venv/bin/python)
frame #14: _PyFunction_Vectorcall + 0x393 (0x5f6033 in /home/carlos/venv/bin/python)
frame #15: _PyObject_FastCallDict + 0x48 (0x5f5808 in /home/carlos/venv/bin/python)
frame #16: _PyObject_Call_Prepend + 0x61 (0x5f5a21 in /home/carlos/venv/bin/python)
frame #17: /home/carlos/venv/bin/python() [0x59b60b]
frame #18: _PyObject_MakeTpCall + 0x296 (0x5f3446 in /home/carlos/venv/bin/python)
frame #19: _PyEval_EvalFrameDefault + 0x598a (0x56f1ca in /home/carlos/venv/bin/python)
frame #20: _PyFunction_Vectorcall + 0x1b6 (0x5f5e56 in /home/carlos/venv/bin/python)
frame #21: _PyEval_EvalFrameDefault + 0x71e (0x569f5e in /home/carlos/venv/bin/python)
frame #22: _PyEval_EvalCodeWithName + 0x26a (0x56822a in /home/carlos/venv/bin/python)
frame #23: _PyFunction_Vectorcall + 0x393 (0x5f6033 in /home/carlos/venv/bin/python)
frame #24: _PyEval_EvalFrameDefault + 0x71e (0x569f5e in /home/carlos/venv/bin/python)
frame #25: _PyFunction_Vectorcall + 0x1b6 (0x5f5e56 in /home/carlos/venv/bin/python)
frame #26: _PyEval_EvalFrameDefault + 0x5757 (0x56ef97 in /home/carlos/venv/bin/python)
frame #27: _PyFunction_Vectorcall + 0x1b6 (0x5f5e56 in /home/carlos/venv/bin/python)
frame #28: /home/carlos/venv/bin/python() [0x50a33c]
frame #29: PyObject_Call + 0x1f7 (0x5f2b87 in /home/carlos/venv/bin/python)
frame #30: _PyEval_EvalFrameDefault + 0x1f70 (0x56b7b0 in /home/carlos/venv/bin/python)
frame #31: _PyFunction_Vectorcall + 0x1b6 (0x5f5e56 in /home/carlos/venv/bin/python)
frame #32: _PyEval_EvalFrameDefault + 0x8f6 (0x56a136 in /home/carlos/venv/bin/python)
frame #33: _PyFunction_Vectorcall + 0x1b6 (0x5f5e56 in /home/carlos/venv/bin/python)
frame #34: _PyEval_EvalFrameDefault + 0x8f6 (0x56a136 in /home/carlos/venv/bin/python)
frame #35: _PyFunction_Vectorcall + 0x1b6 (0x5f5e56 in /home/carlos/venv/bin/python)
frame #36: /home/carlos/venv/bin/python() [0x50a33c]
frame #37: PyObject_Call + 0x1f7 (0x5f2b87 in /home/carlos/venv/bin/python)
frame #38: /home/carlos/venv/bin/python() [0x654fbc]
frame #39: /home/carlos/venv/bin/python() [0x674aa8]
frame #40: <unknown function> + 0x9609 (0x7f730d495609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #41: clone + 0x43 (0x7f730d5d1293 in /lib/x86_64-linux-gnu/libc.so.6)

Traceback (most recent call last):
  File "/home/carlos/venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 990, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/usr/lib/python3.8/multiprocessing/queues.py", line 107, in get
    if not self._poll(timeout):
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 424, in _poll
    r = wait([self], timeout)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 931, in wait
    ready = selector.select(timeout)
  File "/usr/lib/python3.8/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
  File "/home/carlos/venv/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 3397436) is killed by signal: Aborted. 

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/carlos/pytorch-lightning/kk.py", line 49, in <module>
    trainer.fit(model)
  File "/home/carlos/pytorch-lightning/pytorch_lightning/trainer/trainer.py", line 547, in fit
    self._call_and_handle_interrupt(self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule)
  File "/home/carlos/pytorch-lightning/pytorch_lightning/trainer/trainer.py", line 502, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/carlos/pytorch-lightning/pytorch_lightning/trainer/trainer.py", line 577, in _fit_impl
    self._run(model)
  File "/home/carlos/pytorch-lightning/pytorch_lightning/trainer/trainer.py", line 1001, in _run
    self._dispatch()
  File "/home/carlos/pytorch-lightning/pytorch_lightning/trainer/trainer.py", line 1072, in _dispatch
    self.accelerator.start_training(self)
  File "/home/carlos/pytorch-lightning/pytorch_lightning/accelerators/accelerator.py", line 91, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/home/carlos/pytorch-lightning/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 170, in start_training
    self._results = trainer.run_stage()
  File "/home/carlos/pytorch-lightning/pytorch_lightning/trainer/trainer.py", line 1082, in run_stage
    return self._run_train()
  File "/home/carlos/pytorch-lightning/pytorch_lightning/trainer/trainer.py", line 1123, in _run_train
    self.fit_loop.run()
  File "/home/carlos/pytorch-lightning/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/home/carlos/pytorch-lightning/pytorch_lightning/loops/fit_loop.py", line 206, in advance
    epoch_output = self.epoch_loop.run(data_fetcher)
  File "/home/carlos/pytorch-lightning/pytorch_lightning/loops/base.py", line 106, in run
    self.on_run_start(*args, **kwargs)
  File "/home/carlos/pytorch-lightning/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 107, in on_run_start
    self.dataloader_iter = _prepare_dataloader_iter(dataloader_iter, self.batch_idx + 1)
  File "/home/carlos/pytorch-lightning/pytorch_lightning/loops/utilities.py", line 169, in _prepare_dataloader_iter
    dataloader_iter = enumerate(data_fetcher, batch_idx)
  File "/home/carlos/pytorch-lightning/pytorch_lightning/utilities/fetching.py", line 200, in __iter__
    self.prefetching(self.prefetch_batches)
  File "/home/carlos/pytorch-lightning/pytorch_lightning/utilities/fetching.py", line 256, in prefetching
    self._fetch_next_batch()
  File "/home/carlos/pytorch-lightning/pytorch_lightning/utilities/fetching.py", line 298, in _fetch_next_batch
    batch = next(self.dataloader_iter)
  File "/home/carlos/pytorch-lightning/pytorch_lightning/trainer/supporters.py", line 569, in __next__
    return self.request_next_batch(self.loader_iters)
  File "/home/carlos/pytorch-lightning/pytorch_lightning/trainer/supporters.py", line 597, in request_next_batch
    return apply_to_collection(loader_iters, Iterator, next_fn)
  File "/home/carlos/pytorch-lightning/pytorch_lightning/utilities/apply_func.py", line 93, in apply_to_collection
    return function(data, *args, **kwargs)
  File "/home/carlos/pytorch-lightning/pytorch_lightning/trainer/supporters.py", line 584, in next_fn
    batch = next(iterator)
  File "/home/carlos/venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/home/carlos/venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1186, in _next_data
    idx, data = self._get_data()
  File "/home/carlos/venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1152, in _get_data
    success, data = self._try_get_data()
  File "/home/carlos/venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1003, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 3397436) exited unexpectedly

Running on master with the deadlock detection removed.

Current findings:

  • needs num_workers > 0
  • only for ddp (spawn works)
  • returning a loss has impact
  • epoch=True breaks with any reduce_fx
  • step=True, sync_dist=True breaks with reduce_fx != "mean"
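To make those findings concrete, a rough sketch of the self.log calls inside training_step that they point at (assuming loss is the step loss; the reduce_fx spelling is illustrative and may differ between versions):

self.log('train_loss', loss, on_epoch=True)                                       # breaks: epoch aggregation, any reduce_fx
self.log('train_loss', loss, on_step=True, sync_dist=True, reduce_fx=torch.max)   # breaks: sync_dist with a non-mean reduction
self.log('train_loss', loss, on_step=True, on_epoch=False)                        # appears unaffected: default mean reduction, no epoch aggregation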

@tchaton
Contributor

tchaton commented Aug 31, 2021

Hey everyone,

After a long day of debugging with @carmocca, we finally found the source of the problem. It should be fixed on master and in the next weekly release.

Best,
T.C

@InCogNiTo124
Contributor

Good job! Can't wait to try the fix as soon as possible.

@whoknowsb

Can’t wait to see what was the annoying problem! 😂😭

@InCogNiTo124
Contributor

Can’t wait to see what was the annoying problem!

copy vs deepcopy, somehow

@tchaton
Contributor

tchaton commented Sep 1, 2021

Hey @InCogNiTo124,

Mystery to me. But my guess is that PyTorch is doing quite a lot of work on deepcopy compared to copy: https://github.com/pytorch/pytorch/blob/83e28a7d281c91a6d1a12b86bd5fb212dd424a85/torch/_tensor.py#L80

And copy might not be as strict as deepcopy.
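For what it's worth, a tiny sketch of why deepcopy is the pickier of the two; this is generic PyTorch tensor behaviour rather than the exact PL code path:

import copy

import torch

leaf = torch.randn(2, requires_grad=True)
non_leaf = leaf * 2  # produced by an autograd op, so not a graph leaf

copy.deepcopy(leaf)      # fine: graph leaves support the deepcopy protocol
copy.deepcopy(non_leaf)  # RuntimeError: Only Tensors created explicitly by the user
                         # (graph leaves) support the deepcopy protocol at the moment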

Best,
T.C

@dhkim0225
Contributor

Works well with PL >= 1.4.5! Thanks!!

@mees
Contributor

mees commented Sep 3, 2021

The fix introduced in PL 1.4.5 crashes my DDP trainings completely, so they don't run a single training step. However, I also experienced the original bug reported in the 1.4.x versions. @tchaton

  File "/home/meeso/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 552, in fit
    self._run(model)
  File "/home/meeso/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 917, in _run
    self._dispatch()
  File "/home/meeso/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 985, in _dispatch
    self.accelerator.start_training(self)
  File "/home/meeso/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 92, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/home/meeso/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 161, in start_training
    self._results = trainer.run_stage()
  File "/home/meeso/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 995, in run_stage
    return self._run_train()
  File "/home/meeso/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1044, in _run_train
    self.fit_loop.run()
  File "/home/meeso/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/home/meeso/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 200, in advance
    epoch_output = self.epoch_loop.run(train_dataloader)
  File "/home/meeso/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/home/meeso/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 130, in advance
    batch_output = self.batch_loop.run(batch, self.iteration_count, self._dataloader_idx)
  File "/home/meeso/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 100, in run
    super().run(batch, batch_idx, dataloader_idx)
  File "/home/meeso/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/home/meeso/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 149, in advance
    self.batch_outputs[opt_idx].append(deepcopy(result.training_step_output))
  File "/home/meeso/miniconda3/envs/calvin/lib/python3.8/copy.py", line 172, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/home/meeso/miniconda3/envs/calvin/lib/python3.8/copy.py", line 270, in _reconstruct
    state = deepcopy(state, memo)
  File "/home/meeso/miniconda3/envs/calvin/lib/python3.8/copy.py", line 146, in deepcopy
    y = copier(x, memo)
  File "/home/meeso/miniconda3/envs/calvin/lib/python3.8/copy.py", line 230, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/home/meeso/miniconda3/envs/calvin/lib/python3.8/copy.py", line 146, in deepcopy
    y = copier(x, memo)
  File "/home/meeso/miniconda3/envs/calvin/lib/python3.8/copy.py", line 230, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/home/meeso/miniconda3/envs/calvin/lib/python3.8/copy.py", line 146, in deepcopy
    y = copier(x, memo)
  File "/home/meeso/miniconda3/envs/calvin/lib/python3.8/copy.py", line 230, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/home/meeso/miniconda3/envs/calvin/lib/python3.8/copy.py", line 146, in deepcopy
    y = copier(x, memo)
  File "/home/meeso/miniconda3/envs/calvin/lib/python3.8/copy.py", line 205, in _deepcopy_list
    append(deepcopy(a, memo))
  File "/home/meeso/miniconda3/envs/calvin/lib/python3.8/copy.py", line 172, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/home/meeso/miniconda3/envs/calvin/lib/python3.8/copy.py", line 270, in _reconstruct
    state = deepcopy(state, memo)
  File "/home/meeso/miniconda3/envs/calvin/lib/python3.8/copy.py", line 146, in deepcopy
    y = copier(x, memo)
  File "/home/meeso/miniconda3/envs/calvin/lib/python3.8/copy.py", line 230, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/home/meeso/miniconda3/envs/calvin/lib/python3.8/copy.py", line 172, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/home/meeso/miniconda3/envs/calvin/lib/python3.8/copy.py", line 270, in _reconstruct
    state = deepcopy(state, memo)
  File "/home/meeso/miniconda3/envs/calvin/lib/python3.8/copy.py", line 146, in deepcopy
    y = copier(x, memo)
  File "/home/meeso/miniconda3/envs/calvin/lib/python3.8/copy.py", line 230, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/home/meeso/miniconda3/envs/calvin/lib/python3.8/copy.py", line 153, in deepcopy
    y = copier(memo)
  File "/home/meeso/miniconda3/envs/calvin/lib/python3.8/site-packages/torch/_tensor.py", line 55, in __deepcopy__
    raise RuntimeError("Only Tensors created explicitly by the user "
RuntimeError: Only Tensors created explicitly by the user (graph leaves) support the deepcopy protocol at the moment

@tchaton tchaton reopened this Sep 3, 2021
@nik-sm

nik-sm commented Sep 3, 2021

Related to this issue: is there already an integration test for DDP, or would it be possible to add one?
At the moment I'm using version 1.3.8 where I don't experience this issue, and it seems like it has been present for several releases (which would indicate that either there's a test missing, or the test failed to detect the issue).

@carmocca
Contributor

carmocca commented Sep 4, 2021

which would indicate that either there's a test missing, or the test failed to detect the issue

We have DDP tests! However, testing this is not so easy, as it requires num_workers>0, which slows the test a lot and can hang with pytest.

@mees
Contributor

mees commented Sep 4, 2021

So I have found why PL 1.4.5 breaks my training with the error "Only Tensors created explicitly by the user (graph leaves) support the deepcopy" @tchaton @carmocca. I was returning a dictionary from training_step that contained some PyTorch distributions, which I use for plotting in a callback. Since these distributions hold non-leaf tensors, the newly introduced deepcopy fails: https://github.com/PyTorchLightning/pytorch-lightning/pull/9239/files#diff-eba04421e2e60c9d7b55f56ca57c9a319a4e3de38e3eb96c931324ecacbb73e0

Basically, I had something like this:

from torch.distributions import Independent, Normal

def training_step(self, batch, batch_idx):
    x, y, z = batch
    out, mean, std = self.encoder(x)
    loss = self.loss(out, x)
    encoders_dict = {"one": Independent(Normal(mean, std), 1)}
    return {"loss": loss, "encoders_dict": encoders_dict}

Returning only the loss tensor works as a workaround.
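A minimal sketch of that workaround, assuming the callback only needs detached parameters (the attribute name here is illustrative):

from torch.distributions import Independent, Normal

def training_step(self, batch, batch_idx):
    x, y, z = batch
    out, mean, std = self.encoder(x)
    loss = self.loss(out, x)
    # keep the distribution out of the returned dict so the loop never tries to deepcopy it,
    # and detach the parameters so nothing stored here holds on to the autograd graph
    self.last_encoder_dist = Independent(Normal(mean.detach(), std.detach()), 1)
    return {"loss": loss}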

@carmocca
Contributor

carmocca commented Sep 7, 2021

@mees in that case, the encoders_dict should be getting detached automatically. Did you try adding the .detach() yourself?

encoders_dict = {"one": Independent(Normal(mean, std)), 1).detach()}
return {"loss": loss, "encoders_dict": encoders_dict}

@popfido
Contributor

popfido commented Sep 24, 2021

I hit the same error, with the same traceback as @mees, in pytorch-lightning 1.4.8, but this time only with

return {"loss": loss}

so the problem might not be because of encoders_dict?

@carmocca
Contributor

@popfido I can check it if you share a repro script, but this should be fixed in master.

@popfido
Contributor

popfido commented Sep 29, 2021

@popfido I can check it if you share a repro script, but this should be fixed in master.

Hi @carmocca,

This happens when I call pl.LightningModule.log in the training step and pass a Metric object to record, like self.log('xxx', MetricEntity). But when I call it like self.log('xxx', MetricEntity.compute()), it works (see the sketch below).

With:
pytorch 1.8.0
pytorch-lightning 1.4.8
torchmetrics 0.5.1

The metric I used is AUROC.
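A minimal sketch of the two call patterns, assuming a torchmetrics AUROC attribute on the module (the module and metric setup here are illustrative):

import torch
import pytorch_lightning as pl
from torchmetrics import AUROC


class MetricLoggingModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)
        self.auroc = AUROC(num_classes=2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self.layer(x)
        loss = torch.nn.functional.cross_entropy(logits, y)
        self.auroc.update(torch.softmax(logits, dim=-1), y)
        # passing the Metric object itself crashed for me:
        # self.log('train_auroc', self.auroc)
        # passing the computed value works:
        self.log('train_auroc', self.auroc.compute())
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)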
