
best_model_path does not retrieve the path to the best monitor checkpoint file #12485

Closed
ShaneTian opened this issue Mar 28, 2022 · 8 comments
Labels
callback: model checkpoint won't fix This will not be worked on

Comments

@ShaneTian

ShaneTian commented Mar 28, 2022

🐛 Bug

If there is more than one ModelCheckpoint and the first one in the callback list does NOT include a monitor, self.checkpoint_callback.best_model_path will be wrong (it will not point to the best-monitor checkpoint).
e.g.

import pytorch_lightning as pl

callbacks = []
# Saves a checkpoint at the end of every validation epoch (no monitor).
val_ckpt_callback = pl.callbacks.ModelCheckpoint(
    filename="val_end-{epoch}-{step}-{val_loss:.4f}-{val_ppl:.4f}",
    save_top_k=-1,
    every_n_epochs=1,
)
callbacks.append(val_ckpt_callback)
# Tracks the single best checkpoint according to the monitored metric.
monitor_ckpt_callback = pl.callbacks.ModelCheckpoint(
    filename="monitor-{epoch}-{step}-{" + my_monitor + ":.4f}",
    monitor=my_monitor,
    save_top_k=1,
)
callbacks.append(monitor_ckpt_callback)
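The selection problem can be simulated without running training. Below is a minimal pure-Python sketch (the class is made up for illustration, not actual Lightning internals) of why ckpt_path="best" resolves against the wrong callback here:

```python
# Simulation (NOT Lightning code): the Trainer's `checkpoint_callback`
# property returns the FIRST ModelCheckpoint in the callback list, so
# ckpt_path="best" resolves against that callback even when a later
# callback is the one tracking a monitored metric.

class FakeModelCheckpoint:
    def __init__(self, monitor=None, best_model_path=""):
        self.monitor = monitor
        self.best_model_path = best_model_path

callbacks = [
    FakeModelCheckpoint(monitor=None, best_model_path="val_end-epoch=9.ckpt"),
    FakeModelCheckpoint(monitor="val_loss", best_model_path="monitor-epoch=3.ckpt"),
]

# "best" resolution: take best_model_path from the first checkpoint callback.
resolved = next(cb for cb in callbacks
                if isinstance(cb, FakeModelCheckpoint)).best_model_path
print(resolved)  # "val_end-epoch=9.ckpt", not the monitored best
```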

Related code:
https://github.com/PyTorchLightning/pytorch-lightning/blob/b2e98d61661fca80b87e1e2b49cd301d29667ce5/pytorch_lightning/trainer/trainer.py#L2342-L2353

To Reproduce

Expected behavior

Always save the best-monitor model checkpoint.

Environment

  • CUDA:
    - GPU:
    - A100-SXM4-40GB
    - A100-SXM4-40GB
    - A100-SXM4-40GB
    - A100-SXM4-40GB
    - A100-SXM4-40GB
    - A100-SXM4-40GB
    - A100-SXM4-40GB
    - A100-SXM4-40GB
    - available: True
    - version: 11.3
  • Packages:
    - numpy: 1.20.1
    - pyTorch_debug: False
    - pyTorch_version: 1.10.2+cu113
    - pytorch-lightning: 1.5.10
    - tqdm: 4.63.0
  • System:
    - OS: Linux
    - architecture: 64bit
    - processor: x86_64
    - python: 3.7.10
    - version: #1 SMP Fri Mar 19 10:07:22 CST 2021

cc @carmocca @awaelchli @ninginthecloud @jjenniferdai @rohitgr7

@ShaneTian ShaneTian added the needs triage Waiting to be triaged by maintainers label Mar 28, 2022
@rohitgr7
Contributor

I didn't get you here. Can you explain more about what the actual issue is? If there is no monitor, best_model_path should be set to nothing.

@ShaneTian
Author

I mean, when I have multiple ModelCheckpoints and one of them has a monitor, if I want to get the best-monitor checkpoint via trainer.test(..., ckpt_path="best"), I have to put the monitor ModelCheckpoint first in the callback list, right?

@rohitgr7
Contributor

Well, yeah, that's true, or you can pass the checkpoint path directly:

trainer.test(..., ckpt_path=checkpoint_callback2.best_model_path)

ckpt_path='best' selects the first checkpoint callback and extracts the best model path from it. It was kept like this as a quick, handy feature for users since, in the majority of cases, there's only one checkpoint callback.

Maybe we could extend it a little: if 'best' is selected while multiple model checkpoint callbacks are configured, we could raise a warning or error.

cc @carmocca wdyt?

@carmocca
Contributor

I agree with raising a warning in this case.

@carmocca carmocca added callback: model checkpoint and removed needs triage Waiting to be triaged by maintainers labels Mar 28, 2022
@carmocca
Contributor

I mean, when I have multiple ModelCheckpoints and one of them has a monitor, if I want to get the best-monitor checkpoint via trainer.test(..., ckpt_path="best"), I have to put the monitor ModelCheckpoint first in the callback list, right?

ModelCheckpoints without a monitor (aka monitor=None) still set the best_model_path attribute so that passing ckpt_path='best' still works for them:
https://github.com/PyTorchLightning/pytorch-lightning/blob/fdcc09cf95c3211e1267f78ad91d515dc4809ef9/pytorch_lightning/callbacks/model_checkpoint.py#L653

So we wouldn't be able to filter by the instances that have an actual monitor.

@rohitgr7
Contributor

rohitgr7 commented Mar 28, 2022

if the monitor is None, why do we need to save the best_model_path?

anyway, my idea was to still raise a warning just to let users know that there are multiple checkpoint callbacks and best is used from the first one.

@carmocca
Contributor

carmocca commented Mar 28, 2022

if the monitor is None, why do we need to save the best_model_path?

As I said in my previous message: "so that passing ckpt_path='best' still works for them"

People want to be able to pass ckpt_path='best' regardless of their monitor config. In this case, it would equal the last checkpoint saved.
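For illustration, here is a simplified model of that behavior (assumed semantics, not the actual ModelCheckpoint implementation): with monitor=None there is no metric to compare, so every save overwrites best_model_path and 'best' degenerates to 'last saved'.

```python
class NoMonitorCheckpoint:
    """Simplified sketch: with monitor=None, each save simply records
    the latest file, so best_model_path tracks the last checkpoint."""
    def __init__(self):
        self.monitor = None
        self.best_model_path = ""

    def save(self, filepath):
        # No monitored metric to compare against: just record the file.
        self.best_model_path = filepath

cb = NoMonitorCheckpoint()
for epoch in range(3):
    cb.save(f"epoch={epoch}.ckpt")
print(cb.best_model_path)  # "epoch=2.ckpt": the last checkpoint saved
```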

This behavior could be changed if we have #11912

@stale

stale bot commented Apr 27, 2022

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@stale stale bot added the won't fix This will not be worked on label Apr 27, 2022
@stale stale bot closed this as completed Jun 6, 2022