
best_model_path does not retrieve the path to the best monitor checkpoint file #12485

Closed
ShaneTian opened this issue Mar 28, 2022 · 8 comments
Labels
callback: model checkpoint won't fix This will not be worked on

Comments

@ShaneTian

ShaneTian commented Mar 28, 2022

🐛 Bug

If there is more than one ModelCheckpoint and the first one in the callback list does NOT include a monitor, self.checkpoint_callback.best_model_path will be wrong (it will not point to the best-monitor checkpoint).
e.g.

import pytorch_lightning as pl

callbacks = []
# Saves a checkpoint at the end of every validation epoch (no monitor).
val_ckpt_callback = pl.callbacks.ModelCheckpoint(
    filename="val_end-{epoch}-{step}-{val_loss:.4f}-{val_ppl:.4f}",
    save_top_k=-1,
    every_n_epochs=1,
)
callbacks.append(val_ckpt_callback)
# Tracks the single best checkpoint according to the monitored metric.
monitor_ckpt_callback = pl.callbacks.ModelCheckpoint(
    filename="monitor-{epoch}-{step}-{" + my_monitor + ":.4f}",
    monitor=my_monitor,
    save_top_k=1,
)
callbacks.append(monitor_ckpt_callback)
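The selection problem can be simulated without running training. Below is a minimal pure-Python sketch (the class is made up for illustration, not actual Lightning internals) of why ckpt_path="best" resolves against the wrong callback here:

```python
# Simulation (NOT Lightning code): the Trainer's `checkpoint_callback`
# property returns the FIRST ModelCheckpoint in the callback list, so
# ckpt_path="best" resolves against that callback even when a later
# callback is the one tracking a monitored metric.

class FakeModelCheckpoint:
    def __init__(self, monitor=None, best_model_path=""):
        self.monitor = monitor
        self.best_model_path = best_model_path

callbacks = [
    FakeModelCheckpoint(monitor=None, best_model_path="val_end-epoch=9.ckpt"),
    FakeModelCheckpoint(monitor="val_loss", best_model_path="monitor-epoch=3.ckpt"),
]

# "best" resolution: take best_model_path from the first checkpoint callback.
resolved = next(cb for cb in callbacks
                if isinstance(cb, FakeModelCheckpoint)).best_model_path
print(resolved)  # "val_end-epoch=9.ckpt", not the monitored best
```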

Related code:
https://github.com/PyTorchLightning/pytorch-lightning/blob/b2e98d61661fca80b87e1e2b49cd301d29667ce5/pytorch_lightning/trainer/trainer.py#L2342-L2353

To Reproduce

Expected behavior

Always save the best-monitor model checkpoint.

Environment

  • CUDA:
    - GPU:
    - A100-SXM4-40GB
    - A100-SXM4-40GB
    - A100-SXM4-40GB
    - A100-SXM4-40GB
    - A100-SXM4-40GB
    - A100-SXM4-40GB
    - A100-SXM4-40GB
    - A100-SXM4-40GB
    - available: True
    - version: 11.3
  • Packages:
    - numpy: 1.20.1
    - pyTorch_debug: False
    - pyTorch_version: 1.10.2+cu113
    - pytorch-lightning: 1.5.10
    - tqdm: 4.63.0
  • System:
    - OS: Linux
    - architecture: 64bit
    - processor: x86_64
    - python: 3.7.10
    - version: #1 SMP Fri Mar 19 10:07:22 CST 2021

cc @carmocca @awaelchli @ninginthecloud @jjenniferdai @rohitgr7

@ShaneTian ShaneTian added the needs triage Waiting to be triaged by maintainers label Mar 28, 2022
@rohitgr7
Contributor

I didn't get you here. Can you explain more about what the actual issue is? If there is no monitor, best_model_path should be set to nothing.

@ShaneTian
Author

I mean, when I have multiple ModelCheckpoints and one of them has a monitor, if I want to get the best-monitor checkpoint via trainer.test(..., ckpt_path="best"), I have to put the monitor ModelCheckpoint first in the callback list, right?

@rohitgr7
Contributor

Well, yeah, that's true, or you can pass the checkpoint path directly:

trainer.test(..., ckpt_path=checkpoint_callback2.best_model_path)

ckpt_path='best' selects the first checkpoint callback and extracts the best model path from it. It was kept like this as a quick, handy feature for users since, in the majority of cases, there's only one checkpoint callback.

Maybe we could extend it a little: if 'best' is selected while multiple model checkpoint callbacks are configured, we could raise a warning or error.

cc @carmocca wdyt?

@carmocca
Contributor

I agree with raising a warning in this case.

@carmocca carmocca added callback: model checkpoint and removed needs triage Waiting to be triaged by maintainers labels Mar 28, 2022
@carmocca
Contributor

I mean, when I have multiple ModelCheckpoints and one of them has a monitor, if I want to get the best-monitor checkpoint via trainer.test(..., ckpt_path="best"), I have to put the monitor ModelCheckpoint first in the callback list, right?

ModelCheckpoints without a monitor (aka monitor=None) still set the best_model_path attribute so that passing ckpt_path='best' still works for them:
https://github.com/PyTorchLightning/pytorch-lightning/blob/fdcc09cf95c3211e1267f78ad91d515dc4809ef9/pytorch_lightning/callbacks/model_checkpoint.py#L653

So we wouldn't be able to filter by the instances that have an actual monitor.

@rohitgr7
Contributor

rohitgr7 commented Mar 28, 2022

if the monitor is None, why do we need to save the best_model_path?

anyway, my idea was to still raise a warning just to let users know that there are multiple checkpoint callbacks and best is used from the first one.

@carmocca
Contributor

carmocca commented Mar 28, 2022

if the monitor is None, why do we need to save the best_model_path?

As I said in my previous message: "so that passing ckpt_path='best' still works for them"

People want to be able to pass ckpt_path='best' regardless of their monitor config. In this case, it would equal the last checkpoint saved.
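For illustration, here is a simplified model of that behavior (assumed semantics, not the actual ModelCheckpoint implementation): with monitor=None there is no metric to compare, so every save overwrites best_model_path and 'best' degenerates to 'last saved'.

```python
class NoMonitorCheckpoint:
    """Simplified sketch: with monitor=None, each save simply records
    the latest file, so best_model_path tracks the last checkpoint."""
    def __init__(self):
        self.monitor = None
        self.best_model_path = ""

    def save(self, filepath):
        # No monitored metric to compare against: just record the file.
        self.best_model_path = filepath

cb = NoMonitorCheckpoint()
for epoch in range(3):
    cb.save(f"epoch={epoch}.ckpt")
print(cb.best_model_path)  # "epoch=2.ckpt": the last checkpoint saved
```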

This behavior could be changed if we have #11912

@stale

stale bot commented Apr 27, 2022

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@stale stale bot added the won't fix This will not be worked on label Apr 27, 2022
@stale stale bot closed this as completed Jun 6, 2022