
Training will continue... but it does not #5717
Closed
riklopfer opened this issue Jan 30, 2021 · 6 comments
Labels: bug (Something isn't working), priority: 0 (High priority task)
Milestone: 1.3

Comments

@riklopfer

❓ Questions and Help

Related to #2644

I tried to set min_steps so that the model will continue training after the warmup + patience. Unfortunately, it does not appear to do that.

I see a bunch of log messages like this,

Epoch 17:  11%|█▏        | 4/35 [00:02<00:16,  1.86it/s, Trainer was signaled to stop but required minimum epochs (1) or minimum steps (288) has not been met. Training will continue...sion=0.542, train_recall=0.0718, train_f1=0.127]
INFO:lightning:Trainer was signaled to stop but required minimum epochs (1) or minimum steps (288) has not been met. Training will continue...

I can see this behavior in the CSV logs as well. Warm-up happens for the first 5 epochs. After that point, training runs only one step per epoch.

val_loss,val_accuracy,val_precision,val_recall,val_f1,epoch,step
0.5087231397628784,0.0,0.0,0.0,0.0,0,32
0.36191996932029724,0.0,0.0,0.0,0.0,1,65
0.29924529790878296,0.0,0.0,0.0,0.0,2,98
0.2752218246459961,0.0,0.0,0.0,0.0,3,131
0.26732462644577026,0.0,0.0,0.0,0.0,4,164
0.2639540731906891,0.0,0.0,0.0,0.0,5,197
0.263753205537796,0.0,0.0,0.0,0.0,6,198
0.26352185010910034,0.0,0.0,0.0,0.0,7,199
0.2633569538593292,0.0,0.0,0.0,0.0,8,200
0.26324737071990967,0.0,0.0,0.0,0.0,9,201

Code

    args.min_steps = wu_steps + steps_per_epoch * patience
    early_stop_callback = EarlyStopping(
        monitor='val_f1',
        min_delta=0.00,
        patience=5,
        verbose=True,
        mode='max'
    )
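
For context, here is a minimal sketch of how these pieces are wired into the Trainer. The wu_steps, steps_per_epoch, and patience values are placeholders standing in for my real script; the other Trainer arguments are illustrative only.

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import EarlyStopping

# Placeholder values; in the real script these come from the data and warmup schedule.
steps_per_epoch = 33
wu_steps = 5 * steps_per_epoch   # 5 warmup epochs
patience = 5

early_stop_callback = EarlyStopping(
    monitor='val_f1',
    min_delta=0.00,
    patience=patience,
    verbose=True,
    mode='max',
)

trainer = Trainer(
    # The intent: keep training for at least warmup + one patience window,
    # even if EarlyStopping signals a stop earlier.
    min_steps=wu_steps + steps_per_epoch * patience,
    max_epochs=100,
    callbacks=[early_stop_callback],
)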

What's your environment?

  • OS: Linux
  • Packaging: conda + pip
  • Version: pytorch-lightning==1.1.4
@riklopfer riklopfer added the question Further information is requested label Jan 30, 2021
@Borda Borda added the bug Something isn't working label Feb 4, 2021
@carmocca
Contributor

carmocca commented Feb 8, 2021

Hi! It would be great if you could provide a reproduction script.

You can use the following Colab link with the BoringModel and post it here.

@carmocca carmocca added waiting on author Waiting on user action, correction, or update and removed question Further information is requested labels Feb 8, 2021
@edenlightning
Contributor

Please feel free to reopen with a reproducible example!

@plarrenie

Hi! (This is the first time I've contributed to an issue on GitHub; I hope I've followed the rules correctly.)

I ran into the same issue. I tried to illustrate what happens with the BoringModel mentioned above.

Using a random dataset:

import os

import torch
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl

tmpdir = os.getcwd()


# A simple random dataset (pl_bolts also ships a RandomDataset, but we define our own here)
class RandomDataset(Dataset):
    def __init__(self, size, num_samples):
        self.len = num_samples
        self.data = torch.randn(num_samples, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len

num_samples = 10000

train = RandomDataset(32, num_samples)
train = DataLoader(train, batch_size=32)

val = RandomDataset(32, num_samples)
val = DataLoader(val, batch_size=32)

test = RandomDataset(32, num_samples)
test = DataLoader(test, batch_size=32)

And adding a dummy (constant) metric in validation_step to early-stop on:

# Model

import torch
from pytorch_lightning import LightningModule

class BoringModel(LightningModule):

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def loss(self, batch, prediction):
        # An arbitrary loss to have a loss that updates the model weights during `Trainer.fit` calls
        return torch.nn.functional.mse_loss(prediction, torch.ones_like(prediction))

    def training_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        return {"loss": loss}

    def training_step_end(self, training_step_outputs):
        return training_step_outputs

    def training_epoch_end(self, outputs) -> None:
        torch.stack([x["loss"] for x in outputs]).mean()

    def validation_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        self.log('early_stop_on', 1)  # Log a constant so the default monitor ('early_stop_on') never improves, forcing early stopping

        return {"x": loss}

    def validation_epoch_end(self, outputs) -> None:
        torch.stack([x['x'] for x in outputs]).mean()

    def test_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        self.log('fake_test_acc', loss)
        return {"y": loss}

    def test_epoch_end(self, outputs) -> None:
        torch.stack([x["y"] for x in outputs]).mean()

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.layer.parameters(), lr=0.1)
        lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)
        return [optimizer], [lr_scheduler]

"""
## Define the test
"""

def test_x(tmpdir):
    # init model
    model = BoringModel()
    early_stop = pl.callbacks.EarlyStopping()
    # Initialize a trainer
    trainer = pl.Trainer(
        min_epochs=9,
        max_epochs=10,
        progress_bar_refresh_rate=20,
        callbacks=[early_stop]
    )

    trainer.fit(model, train, val)

    # trainer.test(test_dataloaders=test)

"""
Run Test
"""

test_x(tmpdir)

We can observe that the model stops training from the fact that epoch speed grows dramatically (as shown in the picture). Moreover, for some reason, tqdm shows that the model stops in the middle of the last epoch (step 340/626).

(Attached image: epoch_time plot)
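
To put numbers on it, here is a minimal sketch (reusing the model, dataloaders, and trainer settings from the script above) that inspects the trainer's counters after fit() returns; the expected step count is my own rough estimate, not something measured.

# Rebuild the same setup as in test_x and check how far training actually went.
model = BoringModel()
early_stop = pl.callbacks.EarlyStopping()
trainer = pl.Trainer(min_epochs=9, max_epochs=10, callbacks=[early_stop])
trainer.fit(model, train, val)

# With 10000 samples and batch_size=32 there are 313 training batches per epoch,
# so if min_epochs=9 were respected we would expect roughly 9 * 313 global steps.
print(f"stopped at epoch {trainer.current_epoch}, global step {trainer.global_step}")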

  • CUDA:
    • GPU:
      • Tesla T4
    • available: True
    • version: 10.1
  • Packages:
    • numpy: 1.19.5
    • pyTorch_debug: False
    • pyTorch_version: 1.8.1+cu101
    • pytorch-lightning: 1.2.7
    • tqdm: 4.41.1
  • System:
    • OS: Linux
    • architecture:
      • 64bit
    • processor: x86_64
    • python: 3.7.10
    • version: #1 SMP Thu Jul 23 08:00:38 PDT 2020

I hope this helps to solve the issue.
Kind regards,
PL

@carmocca carmocca reopened this Apr 14, 2021
@carmocca carmocca added with code and removed waiting on author Waiting on user action, correction, or update labels Apr 14, 2021
@carmocca carmocca added this to the 1.3 milestone Apr 14, 2021
@carmocca carmocca added the priority: 0 High priority task label Apr 14, 2021
@awaelchli
Contributor

Did this by any chance fix it for you? #6705
pip install -U pytorch-lightning
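
A quick way to confirm which version actually gets picked up after upgrading (a minimal check; the issue is tracked under the 1.3 milestone above):

import pytorch_lightning

# Should report the freshly installed version rather than the one the
# reproduction above was run with (1.2.7).
print(pytorch_lightning.__version__)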

@plarrenie

@awaelchli I tested on the BoringModel and on my personal application, and it seems the issue is now fixed (both computation time and the logs confirm that the model keeps training after the "stop" signal is raised).

Thanks !

@awaelchli
Contributor

Happy to hear that. Thanks for confirming.
