Logging validation metrics with val_check_interval
#5070
Comments
Hi! Thanks for your contribution! Great first issue!
Update: Replicated in the BoringColab with a dummy dataset: https://colab.research.google.com/drive/1_o9L1kOmr8xtB-mS0y3bWWa2o1STh2bA

EDIT: I've forked pytorch-lightning locally and made some edits to the logging code. There is some architecture for logger awareness of where we are in the model lifecycle, but there isn't logic for hooking up the validation logging step. When replacing my pytorch-lightning installation with my own fork, the behavior is fixed. If this looks good, I'm happy to make a PR. Here's a diff of my changes. Local CPU-only tests are passing (except for two failures that appear unrelated).

TL;DR: Made a fork where I fixed this issue. Doesn't seem to break old stuff. Happy to make a PR after discussion.
Hi @tchainzzz, within your validation loop (probably at `validation_epoch_end`) you could try to call the underlying logger directly.
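(The snippet accompanying this suggestion was lost; a minimal sketch of what that direct call might look like, assuming `WandbLogger`, where `self.logger.experiment` is the underlying `wandb` run. Pinning `step` to the trainer's global step keeps wandb's monotonic-step check satisfied. Names and metric keys are illustrative, not from the thread.)

```python
import torch

# Hypothetical sketch: bypass self.log_dict and log straight to the wandb
# run, using the trainer's global step so logged steps only ever increase.
def validation_epoch_end(self, outputs):
    avg_loss = torch.stack([o["val_loss"] for o in outputs]).mean()
    self.logger.experiment.log(
        {"val_loss": avg_loss.item()},
        step=self.trainer.global_step,
    )
```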
That's a good point -- I hadn't thought of that. I've been working off of a local fork of the repo, which works. Thanks for the tip! It seems like you're trying to address this and similar problems with a feature request, so I'll close this and see if any of my changes might be useful for that proposal.
This was fixed with #5931.
❓ Questions and Help
What is your question?
My training epoch is very long, so I'd like to run mini-validation loops in the middle of the epoch, which I can do easily by setting `val_check_interval=N`. However, I'd also like to log the values output by those mini-validation loops. Is there a canonical way to log validation metrics within a training epoch while using `val_check_interval` in the `Trainer`?

Related issues:
- #4980 -- However, I'm not trying to log step-wise validation metrics; I'm trying to log aggregate validation-epoch metrics.
- #4985 -- Similar, but I'm using the `WandbLogger` directly, while this issue is more general.
- #4652 -- A similar issue, but against an older version of pytorch-lightning.
Code
Below is my code for the methods that involve logging. Unfortunately, I can't post the full repo; here's the relevant configuration (loaded from YAML) for the `Trainer` and `Logger` objects, which I've replicated in a Colab in the comment below. The following is the actual logging code. For context, I am training a Seq2Seq model from HuggingFace.
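(The original code blocks did not survive here; the following is a minimal sketch of the kind of setup described, assuming a `WandbLogger` and a HuggingFace Seq2Seq model. The module name, project name, model name, and metric keys are placeholders, not from the original post.)

```python
import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger
from transformers import AutoModelForSeq2SeqLM

# Hypothetical stand-in for the LightningModule described above.
# training_step / configure_optimizers omitted for brevity.
class Seq2SeqModule(pl.LightningModule):
    def __init__(self, model_name="t5-small"):
        super().__init__()
        self.model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    def validation_step(self, batch, batch_idx):
        outputs = self.model(**batch)
        # Per-step validation logging -- the call that triggers the
        # wandb step warning described below.
        self.log_dict({"val_loss": outputs.loss})
        return {"val_loss": outputs.loss}

logger = WandbLogger(project="my-project")  # placeholder project name
trainer = pl.Trainer(
    logger=logger,
    val_check_interval=2000,  # mini-validation loop every 2000 training batches
)
```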
Expected: Logs to my `wandb` project.

Actual: A warning: `wandb: WARNING Step must only increase in log calls`.

It seems like the global step of the validator is out of sync with my logging calls.
The global step numbers are definitely consistent with the `val_check_interval` provided (`2000 * n`). However, I am unsure why the validation logging step results in an attempt to log to steps in increments of 250. Maybe I need to somehow sync the validation logging step to the training logging step? Or is pytorch-lightning supposed to take care of this internally?

Is this a bug, or am I doing something wrong/silly?
What have you tried?
Removing the `log_dict` call from validation steps and instead logging an aggregated dictionary in `validation_epoch_end`:
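(The original snippet was lost; a minimal sketch of the workaround described, with `val_loss` as a placeholder key, might look like this.)

```python
import torch

# Hypothetical reconstruction of the workaround: no self.log_dict in
# validation_step; aggregate once at the end of the (mini-)validation loop.
def validation_step(self, batch, batch_idx):
    outputs = self.model(**batch)
    return {"val_loss": outputs.loss}

def validation_epoch_end(self, outputs):
    avg_loss = torch.stack([o["val_loss"] for o in outputs]).mean()
    self.log_dict({"val_loss": avg_loss})
```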
This does not change the overall behavior.
What's your environment?