
Optimizers are broken with auto_lr_find=True since 1.1.4 #6285

Closed
indigoviolet opened this issue Mar 2, 2021 · 4 comments · Fixed by #6372
Labels
bug (Something isn't working) · help wanted (Open to be worked on) · priority: 0 (High priority task) · tuner

Comments

@indigoviolet

🐛 Bug

It seems that #5244 (which went out with 1.1.4) introduced a bad interaction with auto_lr_find=True.

Specifically, lightning_optimizers are now cached on the Trainer. When auto_lr_find=True updates the learning rate, the optimizers returned from configure_optimizers change, so the cached lightning_optimizers need to be refreshed -- but this no longer happens, because the optimizers are not re-wrapped in the general case.
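
A quick way to see the mismatch after tuning (a minimal sketch; it is the same identity check used in the repro script in the comments below, and lightning_optimizers / optimizers are attributes of the 1.2.x Trainer):

# After trainer.tune(...), the cached wrapper should still point at the
# optimizer that training will actually step; with this bug it does not.
assert trainer.lightning_optimizers[0].optimizer is trainer.optimizers[0]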

The outcome for me is that training simply doesn't converge, because we end up updating the wrong optimizer.

Please reproduce using the BoringModel

https://colab.research.google.com/drive/1PJGOBSUdl5_-U9O-fvo83V1On6_siwAC?usp=sharing

To Reproduce

See the Colab notebook linked above.

Expected behavior

Training should work!
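
Concretely, the expected flow is something like the sketch below (an illustrative example of the usual auto_lr_find pattern, not the Colab code; LitModel and RandomDataset here are made up for illustration):

import torch
from torch.utils.data import Dataset, DataLoader
import pytorch_lightning as pl


class RandomDataset(Dataset):
    def __init__(self, size, num_samples):
        self.data = torch.randn(num_samples, size)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        return self.data[index]


class LitModel(pl.LightningModule):
    def __init__(self, lr=1e-3):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)
        self.lr = lr  # the lr finder looks for `lr` (or `learning_rate`) by default

    def training_step(self, batch, batch_idx):
        out = self.layer(batch)
        return torch.nn.functional.mse_loss(out, torch.ones_like(out))

    def configure_optimizers(self):
        # built from the (possibly tuner-updated) attribute
        return torch.optim.SGD(self.layer.parameters(), lr=self.lr)


if __name__ == "__main__":
    model = LitModel()
    train = DataLoader(RandomDataset(32, 6400), batch_size=32)
    trainer = pl.Trainer(max_epochs=1, auto_lr_find=True)
    trainer.tune(model, train)  # runs the lr finder and overwrites model.lr
    # fit() should then step the optimizer built from the updated model.lr,
    # i.e. the one returned by the fresh configure_optimizers call
    trainer.fit(model, train)

Under the bug, the cached LightningOptimizer still wraps an optimizer created before the tuner updated the lr, so training steps a different optimizer than the one configure_optimizers just returned.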

Environment

  • CUDA:
    • GPU:
      • Tesla T4
    • available: True
    • version: 10.1
  • Packages:
    • numpy: 1.19.5
    • pyTorch_debug: False
    • pyTorch_version: 1.7.1+cu101
    • pytorch-lightning: 1.2.1
    • tqdm: 4.41.1
  • System:
    • OS: Linux
    • architecture:
      • 64bit
    • processor: x86_64
    • python: 3.7.10
    • version: #1 SMP Thu Jul 23 08:00:38 PDT 2020

Additional context

  1. This was a pretty frustrating bug to track down: it broke training on my model in a seemingly unrelated way, and I had to git bisect both my repo and pytorch-lightning's repo to find it.

  2. It's scary to me that the bug seems to have gone unnoticed for so many versions -- does no one use auto_lr_find=True? Are there no test cases checking this combination?

@indigoviolet added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Mar 2, 2021
@indigoviolet (Author)

FYI @ananthsub, since you might remember the original code.

@awaelchli (Contributor)

Thanks for reporting this.

import torch
from torch.utils.data import Dataset, DataLoader

import pytorch_lightning as pl
from pytorch_lightning import LightningModule

def check_optimizer(trainer, when):
    # The cached LightningOptimizer must wrap the optimizer the Trainer actually uses.
    assert trainer.lightning_optimizers[0].optimizer is trainer.optimizers[0], when


class RandomDataset(Dataset):
    def __init__(self, size, num_samples):
        self.len = num_samples
        self.data = torch.randn(num_samples, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)
        self.lr = 0.  # attribute the lr finder looks for and overwrites

    def forward(self, x):
        return self.layer(x)

    def loss(self, batch, prediction):
        return torch.nn.functional.mse_loss(prediction, torch.ones_like(prediction))

    def training_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        self.log("loss", loss)
        return loss

    def training_step_end(self, training_step_outputs):
        return training_step_outputs

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.layer.parameters(), lr=0.1)
        return optimizer


class OptimizerSpy(pl.callbacks.Callback):
    def on_fit_start(self, trainer, *args, **kwargs):
        # UNCOMMENT TO FIX BUG
        # trainer._lightning_optimizers = None
        check_optimizer(trainer, "on_fit_start")


if __name__ == "__main__":
    model = BoringModel()
    num_samples = 10000
    train = RandomDataset(32, num_samples)
    train = DataLoader(train, batch_size=32)

    trainer = pl.Trainer(
        max_epochs=1,
        auto_lr_find=True,
        callbacks=[OptimizerSpy()]
    )
    trainer.tune(model, train)
    check_optimizer(trainer, "after tune")
    trainer.fit(model, train)
    check_optimizer(trainer, "after fit")

Minimal repro based on the code provided by @indigoviolet.
The commented-out line in the on_fit_start callback shows what we need to fix.
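
Until a proper fix lands, a possible stopgap (mirroring that commented line; note that _lightning_optimizers is a private Trainer attribute in 1.2.x, so this is a workaround, not a public API, and the class name below is just illustrative) is a tiny callback that drops the cache so the optimizers get re-wrapped:

import pytorch_lightning as pl


class ResetLightningOptimizerCache(pl.callbacks.Callback):
    # Stopgap for this issue: clear the private cache so the LightningOptimizers
    # are re-created from the optimizers produced after the lr finder ran.
    def on_fit_start(self, trainer, *args, **kwargs):
        trainer._lightning_optimizers = None  # private attribute; remove once the fix is merged

In the script above this plays the same role as uncommenting the line in OptimizerSpy.on_fit_start.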

@awaelchli (Contributor)

@indigoviolet I propose a fix here: #6372. Let me know if that solves your issue.

@indigoviolet (Author)

Thanks for the fix, @awaelchli!
