
Training will continue... but it does not #5717
Closed
riklopfer opened this issue Jan 30, 2021 · 6 comments
Labels: bug (Something isn't working), priority: 0 (High priority task)
Milestone: 1.3

Comments

@riklopfer

❓ Questions and Help

Related to #2644

I tried to set min_steps so that the model will continue training after the warmup + patience. Unfortunately, it does not appear to do that.

I see a bunch of log messages like this,

Epoch 17:  11%|█▏        | 4/35 [00:02<00:16,  1.86it/s, Trainer was signaled to stop but required minimum epochs (1) or minimum steps (288) has not been met. Training will continue...sion=0.542, train_recall=0.0718, train_f1=0.127]
INFO:lightning:Trainer was signaled to stop but required minimum epochs (1) or minimum steps (288) has not been met. Training will continue...

I can see this behavior in the CSV logs as well. Warm-up happens for the first 5 epochs. After that point, training runs only one step per epoch.

val_loss,val_accuracy,val_precision,val_recall,val_f1,epoch,step
0.5087231397628784,0.0,0.0,0.0,0.0,0,32
0.36191996932029724,0.0,0.0,0.0,0.0,1,65
0.29924529790878296,0.0,0.0,0.0,0.0,2,98
0.2752218246459961,0.0,0.0,0.0,0.0,3,131
0.26732462644577026,0.0,0.0,0.0,0.0,4,164
0.2639540731906891,0.0,0.0,0.0,0.0,5,197
0.263753205537796,0.0,0.0,0.0,0.0,6,198
0.26352185010910034,0.0,0.0,0.0,0.0,7,199
0.2633569538593292,0.0,0.0,0.0,0.0,8,200
0.26324737071990967,0.0,0.0,0.0,0.0,9,201

Code

    args.min_steps = wu_steps + steps_per_epoch * patience
    early_stop_callback = EarlyStopping(
        monitor='val_f1',
        min_delta=0.00,
        patience=5,
        verbose=True,
        mode='max'
    )
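
For context, here is a minimal sketch of how these pieces are wired into the Trainer. The wu_steps, steps_per_epoch, and patience values are placeholders standing in for my real script; the other Trainer arguments are illustrative only.

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import EarlyStopping

# Placeholder values; in the real script these come from the data and warmup schedule.
steps_per_epoch = 33
wu_steps = 5 * steps_per_epoch   # 5 warmup epochs
patience = 5

early_stop_callback = EarlyStopping(
    monitor='val_f1',
    min_delta=0.00,
    patience=patience,
    verbose=True,
    mode='max',
)

trainer = Trainer(
    # The intent: keep training for at least warmup + one patience window,
    # even if EarlyStopping signals a stop earlier.
    min_steps=wu_steps + steps_per_epoch * patience,
    max_epochs=100,
    callbacks=[early_stop_callback],
)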

What's your environment?

  • OS: Linux
  • Packaging: conda + pip
  • Version: pytorch-lightning==1.1.4
@riklopfer riklopfer added the question Further information is requested label Jan 30, 2021
@Borda Borda added the bug Something isn't working label Feb 4, 2021
@carmocca
Contributor

carmocca commented Feb 8, 2021

Hi! It would be great if you could provide a reproduction script.

You can use the following Colab link with the BoringModel and post it here.

@carmocca carmocca added waiting on author Waiting on user action, correction, or update and removed question Further information is requested labels Feb 8, 2021
@edenlightning
Contributor

Please feel free to reopen with a reproducible example!

@plarrenie

Hi! (This is the first time I've contributed to an issue on GitHub; I hope I've followed the rules correctly.)

I ran into the same issue. I tried to illustrate what happens with the BoringModel mentioned above.

Using a random dataset:

import os

import torch
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl

tmpdir = os.getcwd()


# A simple random dataset (pl_bolts also ships a RandomDataset, but we define our own here)
class RandomDataset(Dataset):
    def __init__(self, size, num_samples):
        self.len = num_samples
        self.data = torch.randn(num_samples, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len

num_samples = 10000

train = RandomDataset(32, num_samples)
train = DataLoader(train, batch_size=32)

val = RandomDataset(32, num_samples)
val = DataLoader(val, batch_size=32)

test = RandomDataset(32, num_samples)
test = DataLoader(test, batch_size=32)

And adding a dummy (constant) metric in validation_step to early-stop on:

# Model

import torch
from pytorch_lightning import LightningModule

class BoringModel(LightningModule):

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def loss(self, batch, prediction):
        # An arbitrary loss to have a loss that updates the model weights during `Trainer.fit` calls
        return torch.nn.functional.mse_loss(prediction, torch.ones_like(prediction))

    def training_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        return {"loss": loss}

    def training_step_end(self, training_step_outputs):
        return training_step_outputs

    def training_epoch_end(self, outputs) -> None:
        torch.stack([x["loss"] for x in outputs]).mean()

    def validation_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        self.log('early_stop_on', 1)  # Log a constant so the default monitor ('early_stop_on') never improves, forcing early stopping

        return {"x": loss}

    def validation_epoch_end(self, outputs) -> None:
        torch.stack([x['x'] for x in outputs]).mean()

    def test_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        self.log('fake_test_acc', loss)
        return {"y": loss}

    def test_epoch_end(self, outputs) -> None:
        torch.stack([x["y"] for x in outputs]).mean()

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.layer.parameters(), lr=0.1)
        lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)
        return [optimizer], [lr_scheduler]

"""
## Define the test
"""

def test_x(tmpdir):
    # init model
    model = BoringModel()
    early_stop = pl.callbacks.EarlyStopping()
    # Initialize a trainer
    trainer = pl.Trainer(
        min_epochs=9,
        max_epochs=10,
        progress_bar_refresh_rate=20,
        callbacks=[early_stop]
    )

    trainer.fit(model, train, val)

    # trainer.test(test_dataloaders=test)

"""
Run Test
"""

test_x(tmpdir)

We can observe that the model stops training from the fact that epoch speed grows dramatically (as shown in the picture). Moreover, for some reason, tqdm shows that the model stops in the middle of the last epoch (step 340/626).

(Attached image: epoch_time plot)
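
To put numbers on it, here is a minimal sketch (reusing the model, dataloaders, and trainer settings from the script above) that inspects the trainer's counters after fit() returns; the expected step count is my own rough estimate, not something measured.

# Rebuild the same setup as in test_x and check how far training actually went.
model = BoringModel()
early_stop = pl.callbacks.EarlyStopping()
trainer = pl.Trainer(min_epochs=9, max_epochs=10, callbacks=[early_stop])
trainer.fit(model, train, val)

# With 10000 samples and batch_size=32 there are 313 training batches per epoch,
# so if min_epochs=9 were respected we would expect roughly 9 * 313 global steps.
print(f"stopped at epoch {trainer.current_epoch}, global step {trainer.global_step}")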

  • CUDA:
    • GPU:
      • Tesla T4
    • available: True
    • version: 10.1
  • Packages:
    • numpy: 1.19.5
    • pyTorch_debug: False
    • pyTorch_version: 1.8.1+cu101
    • pytorch-lightning: 1.2.7
    • tqdm: 4.41.1
  • System:
    • OS: Linux
    • architecture:
      • 64bit
    • processor: x86_64
    • python: 3.7.10
    • version: #1 SMP Thu Jul 23 08:00:38 PDT 2020

I hope this helps to solve the issue.
Kind regards,
PL

@carmocca carmocca reopened this Apr 14, 2021
@carmocca carmocca added with code and removed waiting on author Waiting on user action, correction, or update labels Apr 14, 2021
@carmocca carmocca added this to the 1.3 milestone Apr 14, 2021
@carmocca carmocca added the priority: 0 High priority task label Apr 14, 2021
@awaelchli
Contributor

Did this by any chance fix it for you? #6705
pip install -U pytorch-lightning
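
A quick way to confirm which version actually gets picked up after upgrading (a minimal check; the issue is tracked under the 1.3 milestone above):

import pytorch_lightning

# Should report the freshly installed version rather than the one the
# reproduction above was run with (1.2.7).
print(pytorch_lightning.__version__)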

@plarrenie

@awaelchli I tested on the BoringModel and on my personal application, and it seems the issue is now fixed (both computation time and the logs confirm that the model keeps training after the "stop" signal is raised).

Thanks !

@awaelchli
Contributor

Happy to hear that. Thanks for confirming.
