Clarify the model checkpoint arguments #4335

Open
ananthsub opened this issue Oct 24, 2020 · 16 comments

@ananthsub
Contributor

ananthsub commented Oct 24, 2020

🐛 Proposals

This is not so much a bug report as an RFC to clarify the ModelCheckpoint callback arguments:

  • save_last: to me, this means that whenever we save a checkpoint, we save a checkpoint with filename "last.ckpt". This provides a pre-determined checkpoint name, which is very helpful for resuming from failures. Importantly, it should not determine when checkpoints are saved. Currently it's easy to confuse this parameter to mean "save the checkpoint after the last epoch," which I think should be split out as a separate argument. This distinction would also clarify the typing and validation: there's no need for it to be an Optional[bool]: either we save a checkpoint as "last.ckpt" or not. So it could be a regular bool.

  • There's an inefficiency right now where we generate the checkpoint dict twice if save_last=True. For techniques like ZeRO that deal with sharded optimizer states, each checkpoint dict creation triggers communications across all ranks. Instead, we should gather the checkpoint dict once, and then save to different file paths accordingly (cc @justusschock @awaelchli @akihironitta @rohitgr7 @carmocca @ninginthecloud @jjenniferdai @SeanNaren, @blefaudeux)

  • save_top_k: since monitor is None by default, this should force save_top_k to be -1. The counterargument is that this can cause storage concerns. But I think this is easily correctable on the user-side: configure save_top_k + monitor

  • period: we should rename this to every_n_epochs. This opens up extensions for checkpointing after every_n_steps during training and checkpointing after a specified time interval. With those extensions in mind, period is ambiguous. Another request here is to change the default filename from "{epoch}" to "{epoch}-{global_step}" to better support mid-epoch checkpointing (see the usage sketch below)

cc @awaelchli @carmocca
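
To make the proposal concrete, a hypothetical usage under the clarified semantics might look like the sketch below (every_n_epochs, the always-bool save_last, and the new default filename are proposed names, not the released API):

from pytorch_lightning.callbacks import ModelCheckpoint

# Hypothetical: argument names follow the proposal above, not the current release.
checkpoint_callback = ModelCheckpoint(
    dirpath="checkpoints/",
    filename="{epoch}-{global_step}",  # proposed default, supports mid-epoch saves
    save_last=True,                    # always also write/update "last.ckpt"
    every_n_epochs=1,                  # proposed rename of `period`
    monitor="val_loss",
    save_top_k=3,
)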

@ananthsub added the "bug" and "help wanted" labels on Oct 24, 2020
@awaelchli
Contributor

save_last: I agree, I believe this was the original intention when this feature was added. Definitely useful to have.
period: Regarding naming, we have a PR here: #3807 which needs to be finished, but implements exactly this :)

@carmocca
Contributor

carmocca commented Oct 24, 2020

  • save_last: What about having a symlink_to_last: bool = False argument so there is always a symbolic link to the last saved epoch? It'd also avoid the inefficiency you talk about in your second point.
  • period: I think we should keep the name period and have a mechanism to either track epochs or steps.

Also if we allow tracking epochs or steps, save_last should not be exclusive to epochs
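
For local filesystems, the symlink idea could be as simple as the sketch below (illustrative only; remote fsspec-backed storage generally lacks symlink support, as noted later in the thread):

import os

def update_last_symlink(ckpt_path: str, link_path: str) -> None:
    # Point `last.ckpt` (link_path) at the newest checkpoint, replacing any previous link.
    if os.path.lexists(link_path):
        os.remove(link_path)
    os.symlink(os.path.basename(ckpt_path), link_path)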

@ananthsub
Contributor Author

save_last: What about having a symlink_to_last: bool = False argument so there is always a symbolic link to the last saved epoch? It'd also avoid the inefficiency you talk about in your second point.

I didn't see symlink support in the fsspec API, so this could be challenging without that. But yeah, that's the general idea: maybe we call it save_as_last? save_most_recent? Open to name ideas here haha

period: I think we should keep the name period and have a mechanism to either track epochs or steps.

Users might want to do both: e.g. save a checkpoint every 10,000 steps and at each epoch

@carmocca
Contributor

carmocca commented Oct 24, 2020

Users might want to do both: e.g. save a checkpoint every 10,000 steps and at each epoch

Yes, but I would support that by allowing multiple ModelCheckpoint callbacks.

ModelCheckpoint has become quite complex lately, so we should evaluate splitting it some time in the future.
Any further changes we make should line up with a thought-out future API.
One possibility is separating the basic saving functionality from tracking the TopK of anything. It could be something similar to:

from pathlib import Path
from typing import Optional, Union

class ModelCheckpoint:
    verbose: bool = False
    save_weights_only: bool = False
    period: int = 1
    dirpath: Optional[Union[str, Path]] = None
    filename: Optional[str] = None
    symlink_last: bool = False  # what you propose to be save_last
    # This is missing a mechanism to track either epochs or steps

class TopKModelCheckpoint(Checkpoint):
    # Notice these are not optional anymore
    monitor: str
    save_top_k: int = 1
    mode: str = "auto"
    save_on_end: bool = False  # what we currently call save_last

Just an idea. Open to modifications of course 😄

Edits:

  • (22-12-2020): Remove prefix

@ydcjeff added the "checkpointing" and "feature" labels and removed the "bug" label on Oct 25, 2020
@edenlightning added this to the 1.1 milestone on Oct 27, 2020
@hadim
Contributor

hadim commented Nov 6, 2020

I also agree the current ModelCheckpoint is confusing. I really like the proposal in the first message; it would clarify a lot of things (epochs/steps, save_last, etc.).

I don't think symlinks should be used, as you never really know which filesystem you are saving the checkpoint to. I would let the user deal with storage usage.

Ideally, a single ModelCheckpoint class would be best (IMO) instead of two, but I guess it's a matter of taste at this point.

@kazhang
Contributor

kazhang commented Nov 6, 2020

I have a custom checkpoint callback that inherits from the ModelCheckpoint callback to support every_n_steps: it saves everything (save_top_k=-1 and monitor=None) and skips the save when not on an every_n_steps boundary ((global_step + 1) % every_n_steps != 0).
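
A minimal sketch of that kind of subclass, assuming the 1.0-era Callback hooks and the parent's save_checkpoint(trainer, pl_module) method (every_n_steps is a custom argument, not a built-in one):

from pytorch_lightning.callbacks import ModelCheckpoint

class EveryNStepsCheckpoint(ModelCheckpoint):
    def __init__(self, every_n_steps: int, **kwargs):
        # Save every checkpoint we write; filtering happens on the step boundary below.
        super().__init__(save_top_k=-1, monitor=None, **kwargs)
        self.every_n_steps = every_n_steps

    def on_train_batch_end(self, trainer, pl_module, *args, **kwargs):
        # Skip unless we are exactly on an every_n_steps boundary.
        if (trainer.global_step + 1) % self.every_n_steps != 0:
            return
        self.save_checkpoint(trainer, pl_module)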

At the same time, I also want to store the last checkpoint (save_last=True) so that we can resume training after a crash.

However, this exception prevents me from doing so:

if self.save_last:
    raise MisconfigurationException(
        'ModelCheckpoint(save_last=True, monitor=None) is not a valid configuration.'
        ' You can save the last checkpoint with ModelCheckpoint(save_top_k=None, monitor=None)'
    )

I'm wondering if we could relax that constraint by giving a warning instead?

@carmocca
Contributor

carmocca commented Nov 6, 2020

I'm wondering if we could relax that constraint by giving a warning instead?

Yes! This was already discussed on Slack. Feel free to open a PR.

Ananth S Oct 15th at 6:23 AM
Why does this ModelCheckpoint instantiation raise a misconfiguration error?
Isn't saving the last checkpoint independent of the monitor/topK?

Adrian  23 days ago
@carmocca do you remember?

carmocca  22 days ago
Because monitor=None already saves the last ckpt (but not named last.ckpt) so why would you want to save last again (but named last.ckpt) (edited) 

carmocca  22 days ago
Maybe we should have used the phrase "is a redundant configuration" instead of valid

carmocca  22 days ago
save_last only makes sense when something is being monitored

Ananth S  22 days ago
i see. we are using torchelastic for long training runs. using save_last gives us a checkpoint with name last.ckpt , which is a deterministic path name to find and resume training from

Ananth S  22 days ago
what we really want is "save the most recent checkpoint as last.ckpt" (edited) 

Adrian  22 days ago
That is the original meaning of save_last. To have just that name. It should be a copy of the same file epoch=...ckpt

Adrian  22 days ago
I definitely agree this is useful to have

carmocca  21 days ago
Feel free to send a PR changing "raise MisconfigurationException" to a warning. Also changing "is not a valid" for "is a redundant"

@rohitgr7
Contributor

rohitgr7 commented Nov 22, 2020

save_last only makes sense when something is being monitored

why is that??
IMO save_last should have nothing to do with monitor or save_top_k. Why do we have this warning??

with

ModelCheckpoint(save_last=True, monitor=None, save_top_k=None|0|-1)

it raises this warning

but I think, with

ModelCheckpoint(save_last=True, monitor=anything, save_top_k=anything)

it should not raise any warning or exception w.r.t. save_last

@carmocca
Contributor

why is that??

From my last comment:

Because monitor=None already saves the last ckpt (but not named last.ckpt) so why would you want to save last again (but named last.ckpt)

@rohitgr7
Contributor

Because monitor=None already saves the last ckpt (but not named last.ckpt)

oh ok. Then I suggest it should be reverted back to an exception, with an additional condition of self.save_last and self.save_top_k == -1, since saving the same checkpoint twice doesn't make sense. If one needs to access the last checkpoint, the path is available in best_k_models with the epoch key. Also, save_last should be independent of the monitor, I think.

@awaelchli
Contributor

The original feature of save_last was to just have a copy of epoch=x.ckpt saved as last.ckpt. This would happen regardless of other settings. See the original PR: #1908 and the original feature request: #1658.
The benefit of having this file is being able to load/resume the checkpoint programmatically through last.ckpt without needing to know what the last epoch was. This is also what @ananthsub describes in the original message.
I hope we can keep this feature, as I find it very useful.
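
For example, a resume script can then point at a fixed path (illustrative sketch; resume_from_checkpoint was the Trainer argument at the time, since replaced by the ckpt_path argument to Trainer.fit):

import os
from pytorch_lightning import Trainer

ckpt_dir = "lightning_logs/version_0/checkpoints"  # illustrative path
last_ckpt = os.path.join(ckpt_dir, "last.ckpt")

# Resume from the deterministic "last.ckpt" if it exists, otherwise start fresh.
trainer = Trainer(resume_from_checkpoint=last_ckpt if os.path.isfile(last_ckpt) else None)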

@carmocca
Contributor

I believe we all get confused because save_last can mean two different things.

  1. A copy of the last checkpoint saved (regardless of anything else). This would ideally be a symlink.
  2. A checkpoint of the model just before training is over.

where save_last corresponds to (2). Note that (1) and (2) are the same thing in some circumstances.

In #4335 (comment) I propose having (1) and (2) as symlink_to_last and save_on_end respectively for clarity.

Also, save_last should be independent of the monitor, I think.

It should if we are talking about (1). If we are talking about (2), it is redundant because if monitor is None we will save every validation run.

@Borda modified the milestones: 1.1, 1.1.x on Nov 30, 2020
@Borda modified the milestones: 1.1.x, 1.2 on Dec 30, 2020
@edenlightning modified the milestones: 1.2, 1.3 on Feb 8, 2021
@carmocca self-assigned this on Feb 22, 2021
@edenlightning
Contributor

@carmocca what is left TODO here?

@carmocca
Contributor

carmocca commented Jul 1, 2021

To implement the split described in #4335 (comment), which would clarify the arguments, simplify the code, and facilitate custom ModelCheckpoint extensions.

@jzazo

jzazo commented Feb 7, 2023

Can we revive this discussion? I was also confused about this. To me, save_last = True should mean: save the last checkpoint when training finishes. Whether training stops via early stopping or finishes normally, save the last checkpoint. This would also add the latest tag to loggers.

monitor, mode and every_n_epochs / every_n_steps should work together for checkpointing every specified frequency. This would add the best tag to loggers.

Have a symlink to the latest checkpointed model for failure-recovery purposes (which would rely on the every_n_epochs / every_n_steps setting). This would also add the latest tag to loggers, which would link to the same checkpoint as the one save_last = True would produce if set.

Also, right now, the filename setting only works if save_last = False; otherwise everything is called last.ckpt (unless the rename attribute is overridden).

With the above, the symlink can be named last.ckpt, the rest of the checkpoints can use whatever filename is specified, and all settings would work together.

Would this behavior work with everything that has been said in this discussion?

@donglihe-hub

donglihe-hub commented Jan 7, 2024

ModelCheckpoint has become quite complex lately, so we should evaluate splitting it some time in the future.
Any further changes we do should line up with a thought out future API.

Hi @carmocca. Your proposed save_on_end argument would be really helpful to me, since saving a model to disk is much more time-consuming than saving to memory. To address the issue I had to write my own ModelCheckpoint. I'm wondering whether Lightning will provide the functionality of save_on_end in the future?
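
In the meantime, something along these lines seems workable (a minimal sketch, assuming the standard Callback.on_train_end hook and trainer.save_checkpoint; the dirpath/filename arguments are illustrative):

import os
from pytorch_lightning import Callback

class SaveOnTrainEnd(Callback):
    def __init__(self, dirpath: str, filename: str = "final.ckpt"):
        self.dirpath = dirpath
        self.filename = filename

    def on_train_end(self, trainer, pl_module):
        # Write a single checkpoint only once training has finished.
        trainer.save_checkpoint(os.path.join(self.dirpath, self.filename))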
