
WandB dropping items when logging LR or val_loss with accumulate_grad_batches > 1 #5469

Closed
tadejsv opened this issue Jan 11, 2021 · 9 comments
Labels
bug (Something isn't working) · help wanted (Open to be worked on) · logger (Related to the Loggers) · priority: 1 (Medium priority task) · won't fix (This will not be worked on)

Comments

tadejsv (Contributor) commented Jan 11, 2021

🐛 Bug

As you can see in the BoringModel, I get the following warnings from the WandB logger:

wandb: WARNING Step must only increase in log calls.  Step 49 < 98; dropping {'lr-SGD': 0.1}.
wandb: WARNING Step must only increase in log calls.  Step 99 < 198; dropping {'lr-SGD': 0.1}.
wandb: WARNING Step must only increase in log calls.  Step 149 < 199; dropping {'lr-SGD': 0.1}.
wandb: WARNING Step must only increase in log calls.  Step 149 < 298; dropping {'lr-SGD': 0.1}.
wandb: WARNING Step must only increase in log calls.  Step 156 < 299; dropping {'val_loss': 3.9880808209860966e-14, 'epoch': 0}.

This occurs when I add the following to the basic BoringModel:

  • log train loss with self.log()
  • add WandB logger
  • add accumulate_grad_batches > 1
  • either add the LearningRateMonitor callback or log the validation loss with self.log() (or both, as in the colab)

If any of these things is removed, the error doesn't occur.

The end result is that the LR metrics are not being logged at all. Worse, the validation loss (and any other validation metrics there would be) does not get logged either, making the logger useless.
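
For reference, here is a minimal sketch of that setup (illustrative only, not the exact colab; the model and dataset are stand-ins for the BoringModel):

import torch
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl
from pytorch_lightning.callbacks import LearningRateMonitor
from pytorch_lightning.loggers import WandbLogger


class RandomDataset(Dataset):
    def __init__(self, size=32, length=256):
        self.data = torch.randn(length, size)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


class BoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        loss = self.layer(batch).sum()
        self.log("train_loss", loss)      # 1) log train loss with self.log()
        return loss

    def validation_step(self, batch, batch_idx):
        loss = self.layer(batch).sum()
        self.log("val_loss", loss)        # 4) log validation loss with self.log()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


trainer = pl.Trainer(
    max_epochs=2,
    logger=WandbLogger(project="repro"),     # 2) WandB logger (project name is illustrative)
    accumulate_grad_batches=2,               # 3) gradient accumulation > 1
    callbacks=[LearningRateMonitor()],       # 4) LearningRateMonitor callback
)
trainer.fit(
    BoringModel(),
    DataLoader(RandomDataset(), batch_size=8),
    DataLoader(RandomDataset(), batch_size=8),
)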

tadejsv added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Jan 11, 2021
tadejsv (Contributor, Author) commented Jan 11, 2021

Also, if it helps: this occurs in every version down to (and including) 1.1.0, but it does not occur in 1.0.8.

tchaton added the priority: 1 (Medium priority task) and logger (Related to the Loggers) labels on Jan 12, 2021
borisdayma (Contributor) commented:

Do you have any call to wandb.log() in your validation loop? If so, you can try adding commit=False.
It seems that for some reason the wrong step is being used in the validation loop. However, self.log() is supposed to handle the step correctly (this could be an issue with gradient accumulation too).

You will be able to bypass these issues with #5194 by using sync_step=False.
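
For illustration, a rough sketch of the commit=False suggestion, assuming you call wandb directly somewhere in your validation code (plain self.log() calls don't need this):

import wandb

run = wandb.init(project="repro")          # illustrative project name
# Buffer metrics without advancing wandb's internal step counter; they are
# committed together with the next wandb.log() call that does commit (the default).
wandb.log({"val_loss_direct": 0.123}, commit=False)
wandb.log({"train_loss": 0.456})           # this call commits both entries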


siddk commented Jan 18, 2021

Any update on when #5194 (the sync_step=False functionality) will be merged?

Additionally, I'm not sure that sync_step=False will entirely solve the gradient accumulation problem. Specifically, this line just means that the logger will try logging at trainer.global_step, which is still out of sync with validation?

For example, here's a use case: I set max_steps=100 in my trainer and val_check_interval=50 (validate every 50 steps). Gradient accumulation is set with accumulate_grad_batches=2.

global_step for my Trainer will reflect a range from 0-200, while for validation, it'll show (50, 100), even though it should map directly to (100, 200).
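
To spell out the numbers (a rough illustration, assuming the trainer processes accumulate_grad_batches batches per optimizer step):

accumulate_grad_batches = 2
max_steps = 100              # optimizer steps
val_check_interval = 50      # validation every 50 optimizer steps

for opt_step in range(val_check_interval, max_steps + 1, val_check_interval):
    batch_index = opt_step * accumulate_grad_batches
    # validation metrics land at steps 50 and 100, while the batch-level
    # counter has already reached 100 and 200
    print(f"validation at optimizer step {opt_step} -> batch index {batch_index}")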

Is there a way to fix this more cleanly? Pinging @SeanNaren because I've discussed this problem with him via the Lightning Slack.

tadej-redstone commented:

Any update on this?

borisdayma (Contributor) commented:

The wandb workaround (#5194) has been merged and will be available in PL v1.2.

tadej-redstone commented:

Thanks, I can confirm this is the case in the BoringModel notebook.

stale bot commented Feb 28, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

stale bot added the won't fix (This will not be worked on) label on Feb 28, 2021
stale bot closed this as completed on Mar 8, 2021
borisdayma (Contributor) commented:

This was fixed with #5931.
You can now just log at any time with self.log('my_metric', my_value) and you won't have any dropped values.
Just choose your x-axis appropriately in the UI (either global_step or the auto-incremented step).
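
If you prefer to pin the x-axis in code rather than in the UI, a possible sketch (this assumes a wandb client new enough to provide define_metric, and that Lightning logs a 'trainer/global_step' key, which may vary by version):

from pytorch_lightning.loggers import WandbLogger

wandb_logger = WandbLogger(project="my-project")   # illustrative project name
# Ask wandb to plot every metric against the trainer's global step instead of
# wandb's auto-incremented internal step.
wandb_logger.experiment.define_metric("*", step_metric="trainer/global_step")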

Thiggel commented Nov 5, 2024

I am still running into this problem, but weirdly enough only at accumulate_grad_batches > 8.

Here is a minimal example:

import os
import lightning.pytorch as L
from lightning.pytorch.loggers import WandbLogger
import torch
from torch.utils.data import DataLoader, Dataset
from torch.optim import Adam


class SimpleDataset(Dataset):
    def __init__(self, size=640, input_dim=10, output_dim=1):
        self.X = torch.randn(size, input_dim)
        self.y = torch.randn(size, output_dim)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]


class SimpleModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(10, 1)

    def forward(self, x):
        return self.linear(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = torch.nn.functional.mse_loss(y_hat, y)
        self.log(
            "train_loss",
            loss,
            on_step=True,
            on_epoch=True,
            prog_bar=True,
            logger=True,
        )
        print(f"Step: {self.global_step}")
        return loss

    def configure_optimizers(self):
        return Adam(self.parameters(), lr=1e-3)


# - if the number of grad. acc. steps is < 16, it properly logs all the
#   train_loss_step values, but if it is = 16, it only logs one value, although
#   it should log e.g. 40 values for 16 grad. acc. steps in this script
# - for 32 grad. acc. steps, it logs no train_loss_step at all
GRAD_ACCUMULATION_STEPS = 32

# this is irrespective of batch size, the problem persists at higher batch sizes too
BATCH_SIZE = 1

# Create the dataset and dataloader
dataset = SimpleDataset()
dataloader = DataLoader(dataset, batch_size=BATCH_SIZE)

# Create the Lightning Module and Trainer
os.environ["WANDB_API_KEY"] = "YOUR_WANDB_API_KEY"
wandb_logger = WandbLogger(project="simple-example", entity="YOUR_WANDB_USERNAME")
model = SimpleModel()
trainer = L.Trainer(
    max_epochs=2, logger=wandb_logger, accumulate_grad_batches=GRAD_ACCUMULATION_STEPS
)

# Train the model
trainer.fit(model, dataloader)

In the example, if I set GRAD_ACCUMULATION_STEPS to between 1 and 4, it creates a plot of train_loss_step in WandB, but if I set it to 8, it only logs a single train_loss_step value for the entire training (even though it should log 160 values). If I set it to 16, it logs no values whatsoever.
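
For reference, the arithmetic behind the expected counts in this script (a rough sketch that assumes one logged train_loss_step point per optimizer step and ignores the Trainer's log_every_n_steps setting):

dataset_size, batch_size, max_epochs = 640, 1, 2
total_batches = dataset_size // batch_size * max_epochs   # 1280 batches over the run

for grad_acc in (4, 8, 16, 32):
    optimizer_steps = total_batches // grad_acc
    print(f"accumulate_grad_batches={grad_acc}: {optimizer_steps} optimizer steps")
# e.g. accumulate_grad_batches=8 gives 160 optimizer steps, matching the
# "should log 160 values" expectation above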
