Logging validation metrics with val_check_interval #5070

Closed
tchang1997 opened this issue Dec 10, 2020 · 5 comments · Fixed by #5351
Labels
question Further information is requested

Comments

@tchang1997

tchang1997 commented Dec 10, 2020

❓ Questions and Help

What is your question?

My training epoch is very long, so I'd like to run mini-validation loops in the middle of the epoch, which I can do easily by setting val_check_interval=N. However, I'd also like to log the metrics produced by those mini-validation loops.

Is there a canonical way to log validation metrics within a training epoch while using val_check_interval in the Trainer?

Related issues:

#4980 -- However, I'm not trying to log step-wise validation metrics; I'm trying to log aggregate validation epoch metrics.
#4985 -- Similar, but I'm using the WandbLogger directly while this issue is more general.
#4652 -- Similar issue, but this is an older version of pytorch-lightning.

Code

Below is my code for the methods that involve logging. Unfortunately, I can't post the full repo, but here's the relevant configuration (loaded from YAML) for the Trainer and Logger objects; I've also replicated the setup in a Colab linked in a comment below.

trainer:
    gradient_clip_val: 0.1
    accumulate_grad_batches: 8
    num_sanity_val_steps: 2
    gpus: 1
    max_epochs: 1
    val_check_interval: 2000
    limit_val_batches: 100
    move_metrics_to_cpu: True
logging:
    save_dir: "sentcomp_logs/"
    project: "sent-comp"
    offline: False
    name: "bart_full_run"

The following is the actual logging code. For context, I am training a Seq2Seq model from HuggingFace.

    # THIS WORKS FINE!
    def training_step(self, batch: BatchEncoding, batch_idx: int) -> torch.Tensor:
        outputs = self(**batch)
        loss = outputs[0]
        self.log("train_loss", loss, prog_bar=True, sync_dist=True)
        return loss

    # DOES NOT WORK
    def validation_step(self, batch: BatchEncoding, batch_idx: int) -> Dict[str, torch.Tensor]:
        outputs = self(**batch)
        val_loss = outputs[0]
        predictions, targets = self.decode(batch)   # returns lists of strings
        results = self.score(predictions, targets)  # returns a dictionary
        self.log("val_loss", val_loss, sync_dist=True)
        self.log_dict(results, sync_dist=True)
        return results

Expected: logs to my wandb project.
Actual: a warning: wandb: WARNING Step must only increase in log calls.

It seems like the validation loop's global step is out of sync with my logging calls, as I get messages like

Step 249 < 2000; dropping { <my metrics> }
Step 499 < 4000; dropping { <my metrics> }
Step 749 < 6000; dropping { <my metrics> }

The larger step numbers are definitely consistent with the val_check_interval provided (2000 * n). However, I'm unsure why the validation logging ends up attempting to log at steps in increments of roughly 250. Maybe I need to somehow sync the validation logging step to the training logging step? Or is pytorch-lightning supposed to take care of this internally?
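
My best guess at where the ~250 comes from (an assumption on my part, not verified against the Lightning internals): with accumulate_grad_batches=8, the optimizer-level global_step only advances once every 8 batches, so the validation run triggered at batch 2000 presumably tries to log at global_step 249 while the training-side logging has already pushed wandb's step to 2000.

batches_seen = 2000                                         # first val_check_interval boundary
accumulate_grad_batches = 8
global_step = batches_seen // accumulate_grad_batches - 1   # 249, matching the warning above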

Is this a bug, or am I doing something wrong/silly?

What have you tried?

Removing the log_dict call from validation steps and instead logging an aggregated dictionary in validation_epoch_end:

def validation_epoch_end(self, batches: List[Dict[str, torch.Tensor]]) -> None:
    metrics = {}
    for metric_key in batches[0]:  # aggregate each metric across the validation batches
        metrics[metric_key] = torch.stack([batch[metric_key] for batch in batches]).mean()
    self.log_dict(metrics, sync_dist=True)

This does not change the overall behavior.

What's your environment?

  • OS: Ubuntu 18.04
  • Packaging: pip
  • Version: pytorch-lightning 1.1.0 (Master), pytorch 1.7.0, cuda 11.0
@tchang1997 added the question (Further information is requested) label on Dec 10, 2020
@github-actions
Contributor

Hi! Thanks for your contribution, great first issue!

@tchang1997
Author

tchang1997 commented Dec 11, 2020

Update:

Replicated in the BoringColab with a dummy dataset: https://colab.research.google.com/drive/1_o9L1kOmr8xtB-mS0y3bWWa2o1STh2bA

EDIT: I've forked pytorch-lightning locally and made some edits to trainer.py, training_loop.py, evaluation_loop.py, and logger_connector.py so that, when logging, the logger is aware of being called from within the training loop, even when an evaluation loop is run from inside training (which happens when val_check_interval is < 1 or < num_batches).

There is some existing architecture for making the logger aware of where we are in the model lifecycle, but there isn't logic for hooking the EvaluationLoop object up to the logger when it's called from within training. That makes sense for 99% of use-cases, but when monitoring validation metrics during a long training epoch, the WandbLogger previously threw a bunch of warnings and dropped the logs. Basically, the only change is a keyword argument that gets propagated down from the evaluation loop call to the logger connector.

When I replace the pytorch-lightning installation with my own via pip install git+https://github.com/tchainzzz/pytorch-lightning.git@master --upgrade, I get the expected behavior outlined above in the Colab. To make sure the logging was functional, I inspected the generated logfile and performed online runs here, testing both a run with val_check_interval set to < 1 epoch and one with val_check_interval left at its default.

If this looks good, I'm happy to make a PR. Here's a diff of my changes. Local CPU-only tests are passing (except for 2 onnxruntime dependency issues -- I can't get it installed). Granted, this feels like a bit of a niche/janky feature to me (having the eval loop be aware that it's "inside" the training loop feels intuitively odd), but it's a QOL update for using val_check_interval that suits my needs perfectly.

TL;DR: I made a fork where I fixed this issue. It doesn't seem to break old stuff. Happy to make a PR after discussion.

@borisdayma
Contributor

Hi @tchainzzz

Within your validation loop (probably at "epoch_end") you could try calling wandb.log({'val_loss': loss}, commit=False) directly, which would just use the last logged step.
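
For example (a rough sketch with illustrative names, assuming the WandbLogger has already initialized the wandb run):

import torch
import wandb

def validation_epoch_end(self, outputs):
    # commit=False attaches these values to the last committed wandb step
    # instead of advancing the step counter, so nothing gets dropped.
    val_loss = torch.stack([o["val_loss"] for o in outputs]).mean()
    wandb.log({"val_loss": val_loss.item()}, commit=False)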

@tchang1997
Author

That's a good point -- I hadn't thought of that. I've been working off of a local fork of the repo, which works. Thanks for the tip!

Seems like you're trying to address this and similar problems with a feature request -- I'll close this and see if any of my changes might be useful for that proposal.

@rohitgr7 linked a pull request on Jan 6, 2021 that will close this issue
@borisdayma
Contributor

This was fixed with #5931.
You can now just log at any time with self.log('my_metric', my_value) and you won't have any dropped values.
Just choose your x-axis appropriately in the UI (whether global_step or just the auto-incremented step).
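
For example, if you want global_step available as an x-axis for your validation charts, one option (just a sketch, not required) is to log it alongside your metrics:

def validation_step(self, batch, batch_idx):
    outputs = self(**batch)
    val_loss = outputs[0]
    self.log("val_loss", val_loss, sync_dist=True)
    # Log the trainer's global step as a regular metric so it can be
    # selected as the x-axis for validation charts in the W&B UI.
    self.log("global_step", float(self.trainer.global_step))
    return val_loss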
