
Apex with multiple optimizers error "element 0 of tensors does not require grad and does not have grad_fn" #5642

Closed
awaelchli opened this issue Jan 24, 2021 · 4 comments
Labels: 3rd party, bug, help wanted

awaelchli (Contributor) commented Jan 24, 2021

🐛 Bug

 File "repro apex.py", line 51, in <module>
    trainer.fit(model)
  File "/home/aw18f408/repositories/pytorch-lightning/pytorch_lightning/trainer/trainer.py", line 481, in fit
    results = self.accelerator_backend.train()
  File "/home/aw18f408/repositories/pytorch-lightning/pytorch_lightning/accelerators/gpu_accelerator.py", line 67, in train
    results = self.train_or_test()
  File "/home/aw18f408/repositories/pytorch-lightning/pytorch_lightning/accelerators/accelerator.py", line 68, in train_or_test
    results = self.trainer.train()
  File "/home/aw18f408/repositories/pytorch-lightning/pytorch_lightning/trainer/trainer.py", line 532, in train
    self.train_loop.run_training_epoch()
  File "/home/aw18f408/repositories/pytorch-lightning/pytorch_lightning/trainer/training_loop.py", line 572, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
  File "/home/aw18f408/repositories/pytorch-lightning/pytorch_lightning/trainer/training_loop.py", line 729, in run_training_batch
    self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
  File "/home/aw18f408/repositories/pytorch-lightning/pytorch_lightning/trainer/training_loop.py", line 505, in optimizer_step
    model_ref.optimizer_step(
  File "/home/aw18f408/repositories/pytorch-lightning/pytorch_lightning/core/lightning.py", line 1263, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/home/aw18f408/repositories/pytorch-lightning/pytorch_lightning/core/optimizer.py", line 278, in step
    self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
  File "/home/aw18f408/repositories/pytorch-lightning/pytorch_lightning/core/optimizer.py", line 133, in __optimizer_step
    trainer.precision_connector.backend.optimizer_step(trainer, optimizer, closure)
  File "/home/aw18f408/repositories/pytorch-lightning/pytorch_lightning/plugins/apex.py", line 138, in optimizer_step
    closure()
  File "/home/aw18f408/repositories/pytorch-lightning/pytorch_lightning/trainer/training_loop.py", line 719, in train_step_and_backward_closure
    result = self.training_step_and_backward(
  File "/home/aw18f408/repositories/pytorch-lightning/pytorch_lightning/trainer/training_loop.py", line 827, in training_step_and_backward
    self.backward(result, optimizer, opt_idx)
  File "/home/aw18f408/repositories/pytorch-lightning/pytorch_lightning/trainer/training_loop.py", line 847, in backward
    result.closure_loss = self.trainer.accelerator_backend.backward(
  File "/home/aw18f408/repositories/pytorch-lightning/pytorch_lightning/accelerators/accelerator.py", line 97, in backward
    closure_loss = self.trainer.precision_connector.backend.backward(
  File "/home/aw18f408/repositories/pytorch-lightning/pytorch_lightning/plugins/apex.py", line 53, in backward
    model.backward(closure_loss, optimizer, opt_idx)
  File "/home/aw18f408/repositories/pytorch-lightning/pytorch_lightning/core/lightning.py", line 1155, in backward
    loss.backward(*args, **kwargs)
  File "/home/aw18f408/.conda/envs/lightning/lib/python3.8/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/aw18f408/.conda/envs/lightning/lib/python3.8/site-packages/torch/autograd/__init__.py", line 130, in backward
    Variable._execution_engine.run_backward(
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
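
The error itself is plain PyTorch: backward() is being called on a tensor that has no autograd graph. A minimal standalone illustration, independent of Lightning and Apex:

import torch

loss = torch.randn(3).mean()  # requires_grad is False, so the result has no grad_fn
loss.backward()
# RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn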

To Reproduce

import torch
from torch import optim
from torch.utils.data import Dataset, DataLoader

from pytorch_lightning import LightningModule, Trainer


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class AMPModel(LightningModule):

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx, optimizer_idx):
        output = self(batch)
        loss = output.mean()
        return {"loss": loss}

    def train_dataloader(self):
        return DataLoader(RandomDataset(32, 64))

    def configure_optimizers(self):
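        # Note: both optimizers are built from the same self.parameters(),
        # which is what triggers the toggle/untoggle problem described below.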
        optimizer1 = torch.optim.Adam(self.parameters(), lr=0.01)
        optimizer2 = optim.SGD(self.parameters(), lr=0.01)
        return [optimizer1, optimizer2]


if __name__ == "__main__":
    model = AMPModel()
    trainer = Trainer(
        max_epochs=1,
        precision=16,
        amp_backend='apex',
        gpus=1,
    )
    trainer.fit(model)

Expected behavior

No crash

Environment

  • CUDA:
    - GPU:
      - GeForce RTX 2080 Ti
      - GeForce RTX 2080 Ti
    - available: True
    - version: 11.0
  • Packages:
    - numpy: 1.19.5
    - pyTorch_debug: False
    - pyTorch_version: 1.7.1
    - pytorch-lightning: 1.2.0dev
    - tqdm: 4.56.0
  • System:
    - OS: Linux
    - architecture: 64bit, ELF
    - processor: x86_64
    - python: 3.8.3
    - version: #1 SMP Thu Apr 9 13:49:54 UTC 2020

Additional context

discovered in #5507, in the test tests/models/test_amp::test_amp_with_apex

@awaelchli added the bug, help wanted, and 3rd party labels on Jan 24, 2021
tchaton (Contributor) commented Jan 24, 2021

Hey @awaelchli,

A fix is already in review. Blocked by Will 😁

Lightning toggles the current optimizer (setting requires_grad=True on its parameters) and untoggles the other one (setting requires_grad=False).

Since self.parameters() is used for both optimizers, your model ends up with no trainable parameters and the loss doesn't have a grad_fn.

The fix will restore the pre-toggle state.
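
A minimal sketch of that failure mode (this is not Lightning's actual toggling code, just an illustration of what happens when the "untoggle" step disables parameters that both optimizers share):

import torch

layer = torch.nn.Linear(32, 2)
params = list(layer.parameters())

# Both optimizers hold the exact same parameter objects, as in the repro above.
opt_current = torch.optim.Adam(params, lr=0.01)
opt_other = torch.optim.SGD(params, lr=0.01)

# Naively "untoggling" the other optimizer disables its parameters ...
for group in opt_other.param_groups:
    for p in group["params"]:
        p.requires_grad = False

# ... but those are the same objects the current optimizer trains, so the
# whole model is now frozen and the loss below has no grad_fn.
loss = layer(torch.randn(4, 32)).mean()
print(loss.requires_grad)  # False
loss.backward()  # RuntimeError: element 0 of tensors does not require grad ...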

Closing as a duplicate issue.

Best,
T.C

awaelchli (Contributor, Author) commented

Thanks, I was searching but couldn't find a related issue. Would be great if you could link it :)

awaelchli (Contributor, Author) commented

Fixed by the linked PR. Thanks @tchaton

dave-epstein commented

Hi, I'm still getting this issue training a model with multiple optimizers. I get the issue when using FSDP or DeepSpeed. I'm really stumped on the bug, and help would be appreciated. Thanks!
