
WandB dropping items when logging LR or val_loss with accumulate_grad_batches > 1 #5469

Closed
tadejsv opened this issue Jan 11, 2021 · 9 comments
Labels
bug (Something isn't working) · help wanted (Open to be worked on) · logger (Related to the Loggers) · priority: 1 (Medium priority task) · won't fix (This will not be worked on)

Comments

tadejsv (Contributor) commented Jan 11, 2021

🐛 Bug

As you can see in the BoringModel, I get the following warnings from the WandB logger:

wandb: WARNING Step must only increase in log calls.  Step 49 < 98; dropping {'lr-SGD': 0.1}.
wandb: WARNING Step must only increase in log calls.  Step 99 < 198; dropping {'lr-SGD': 0.1}.
wandb: WARNING Step must only increase in log calls.  Step 149 < 199; dropping {'lr-SGD': 0.1}.
wandb: WARNING Step must only increase in log calls.  Step 149 < 298; dropping {'lr-SGD': 0.1}.
wandb: WARNING Step must only increase in log calls.  Step 156 < 299; dropping {'val_loss': 3.9880808209860966e-14, 'epoch': 0}.

This occurs when I add the following to the basic BoringModel:

  • log train loss with self.log()
  • add WandB logger
  • add accumulate_grad_batches > 1
  • either add the LearningRateMonitor callback or log the validation loss with self.log() (or both, as in the colab)

If any of these things is removed, the error doesn't occur.

The end result is that the LR metrics are not being logged at all. Worse, the validation loss (and any other validation metrics there would be) does not get logged either, making the logger useless.
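
For reference, here is a minimal sketch of that setup (illustrative only, not the exact colab; the model and dataset are stand-ins for the BoringModel):

import torch
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl
from pytorch_lightning.callbacks import LearningRateMonitor
from pytorch_lightning.loggers import WandbLogger


class RandomDataset(Dataset):
    def __init__(self, size=32, length=256):
        self.data = torch.randn(length, size)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


class BoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        loss = self.layer(batch).sum()
        self.log("train_loss", loss)      # 1) log train loss with self.log()
        return loss

    def validation_step(self, batch, batch_idx):
        loss = self.layer(batch).sum()
        self.log("val_loss", loss)        # 4) log validation loss with self.log()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


trainer = pl.Trainer(
    max_epochs=2,
    logger=WandbLogger(project="repro"),     # 2) WandB logger (project name is illustrative)
    accumulate_grad_batches=2,               # 3) gradient accumulation > 1
    callbacks=[LearningRateMonitor()],       # 4) LearningRateMonitor callback
)
trainer.fit(
    BoringModel(),
    DataLoader(RandomDataset(), batch_size=8),
    DataLoader(RandomDataset(), batch_size=8),
)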

tadejsv added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Jan 11, 2021
tadejsv (Contributor, Author) commented Jan 11, 2021

Also, if it helps: this occurs in every version down to (and including) 1.1.0, but it does not occur in 1.0.8.

tchaton added the priority: 1 (Medium priority task) and logger (Related to the Loggers) labels on Jan 12, 2021
borisdayma (Contributor) commented:

Do you have any call to wandb.log() in your validation loop? If so, you can try adding commit=False.
It seems that for some reason the wrong step is being used in the validation loop. However, self.log() is supposed to handle the step correctly (this could be an issue with gradient accumulation too).

You will be able to bypass these issues with #5194 by using sync_step=False.
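
For illustration, a rough sketch of the commit=False suggestion, assuming you call wandb directly somewhere in your validation code (plain self.log() calls don't need this):

import wandb

run = wandb.init(project="repro")          # illustrative project name
# Buffer metrics without advancing wandb's internal step counter; they are
# committed together with the next wandb.log() call that does commit (the default).
wandb.log({"val_loss_direct": 0.123}, commit=False)
wandb.log({"train_loss": 0.456})           # this call commits both entries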


siddk commented Jan 18, 2021

Any update on when #5194 (the sync_step=False functionality) will be merged?

Additionally, I'm not sure that sync_step=False will entirely solve the gradient accumulation problem. Specifically, this line just means that the logger will try logging at trainer.global_step, which is still out of sync with validation?

For example, here's a use case: I set max_steps=100 in my trainer and val_check_interval=50 (validate every 50 steps). Gradient accumulation is set with accumulate_grad_batches=2.

global_step for my Trainer will reflect a range from 0-200, while for validation, it'll show (50, 100), even though it should map directly to (100, 200).
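
To spell out the numbers (a rough illustration, assuming the trainer processes accumulate_grad_batches batches per optimizer step):

accumulate_grad_batches = 2
max_steps = 100              # optimizer steps
val_check_interval = 50      # validation every 50 optimizer steps

for opt_step in range(val_check_interval, max_steps + 1, val_check_interval):
    batch_index = opt_step * accumulate_grad_batches
    # validation metrics land at steps 50 and 100, while the batch-level
    # counter has already reached 100 and 200
    print(f"validation at optimizer step {opt_step} -> batch index {batch_index}")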

Is there a way to fix this more cleanly? Pinging @SeanNaren because I've discussed this problem with him via the Lightning Slack.

tadej-redstone commented:

Any update on this?

borisdayma (Contributor) commented:

The wandb workaround (#5194) has been merged and will be available in PL v1.2.

tadej-redstone commented:

Thanks, I can confirm this is the case in the BoringModel notebook.

stale bot commented Feb 28, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

stale bot added the won't fix (This will not be worked on) label on Feb 28, 2021
stale bot closed this as completed on Mar 8, 2021
borisdayma (Contributor) commented:

This was fixed with #5931.
You can now just log at any time with self.log('my_metric', my_value) and you won't have any dropped values.
Just choose your x-axis appropriately in the UI (either global_step or the auto-incremented step).
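
If you prefer to pin the x-axis in code rather than in the UI, a possible sketch (this assumes a wandb client new enough to provide define_metric, and that Lightning logs a 'trainer/global_step' key, which may vary by version):

from pytorch_lightning.loggers import WandbLogger

wandb_logger = WandbLogger(project="my-project")   # illustrative project name
# Ask wandb to plot every metric against the trainer's global step instead of
# wandb's auto-incremented internal step.
wandb_logger.experiment.define_metric("*", step_metric="trainer/global_step")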

Thiggel commented Nov 5, 2024

I am still running into this problem, but weirdly enough only at accumulate_grad_batches > 8.

Here is a minimal example:

import os
import lightning.pytorch as L
from lightning.pytorch.loggers import WandbLogger
import torch
from torch.utils.data import DataLoader, Dataset
from torch.optim import Adam


class SimpleDataset(Dataset):
    def __init__(self, size=640, input_dim=10, output_dim=1):
        self.X = torch.randn(size, input_dim)
        self.y = torch.randn(size, output_dim)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]


class SimpleModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(10, 1)

    def forward(self, x):
        return self.linear(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = torch.nn.functional.mse_loss(y_hat, y)
        self.log(
            "train_loss",
            loss,
            on_step=True,
            on_epoch=True,
            prog_bar=True,
            logger=True,
        )
        print(f"Step: {self.global_step}")
        return loss

    def configure_optimizers(self):
        return Adam(self.parameters(), lr=1e-3)


# - if the number of grad. acc. steps is < 16, it properly logs all the
#   train_loss_step values, but if it is = 16, it only logs one value, although
#   it should log e.g. 40 values for 16 grad. acc. steps in this script
# - for 32 grad. acc. steps, it logs no train_loss_step at all
GRAD_ACCUMULATION_STEPS = 32

# this is irrespective of batch size, the problem persists at higher batch sizes too
BATCH_SIZE = 1

# Create the dataset and dataloader
dataset = SimpleDataset()
dataloader = DataLoader(dataset, batch_size=BATCH_SIZE)

# Create the Lightning Module and Trainer
os.environ["WANDB_API_KEY"] = "YOUR_WANDB_API_KEY"
wandb_logger = WandbLogger(project="simple-example", entity="YOUR_WANDB_USERNAME")
model = SimpleModel()
trainer = L.Trainer(
    max_epochs=2, logger=wandb_logger, accumulate_grad_batches=GRAD_ACCUMULATION_STEPS
)

# Train the model
trainer.fit(model, dataloader)

In the example, if I set GRAD_ACCUMULATION_STEPS to between 1 and 4, it creates a plot of train_loss_step in WandB, but if I set it to 8, it only logs a single train_loss_step value for the entire training (even though it should log 160 values). If I set it to 16, it logs no values whatsoever.
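
For reference, the arithmetic behind the expected counts in this script (a rough sketch that assumes one logged train_loss_step point per optimizer step and ignores the Trainer's log_every_n_steps setting):

dataset_size, batch_size, max_epochs = 640, 1, 2
total_batches = dataset_size // batch_size * max_epochs   # 1280 batches over the run

for grad_acc in (4, 8, 16, 32):
    optimizer_steps = total_batches // grad_acc
    print(f"accumulate_grad_batches={grad_acc}: {optimizer_steps} optimizer steps")
# e.g. accumulate_grad_batches=8 gives 160 optimizer steps, matching the
# "should log 160 values" expectation above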
