Clarify the model checkpoint arguments #4335

Open
ananthsub opened this issue Oct 24, 2020 · 16 comments

@ananthsub
Contributor

ananthsub commented Oct 24, 2020

🐛 Proposals

This is not so much a bug report as an RFC to clarify the ModelCheckpoint callback arguments:

  • save_last: to me, this means that whenever we save a checkpoint, we save a checkpoint with filename "last.ckpt". This provides a pre-determined checkpoint name, which is very helpful for resuming from failures. Importantly, it should not determine when checkpoints are saved. Currently it's easy to confuse this parameter to mean "save the checkpoint after the last epoch," which I think should be split out as a separate argument. This distinction would also clarify the typing and validation: there's no need for it to be an Optional[bool]: either we save a checkpoint as "last.ckpt" or not. So it could be a regular bool.

  • There's an inefficiency right now where we generate the checkpoint dict twice if save_last=True. For techniques like ZeRO that deal with sharded optimizer states, each checkpoint dict creation triggers communications across all ranks. Instead, we should gather the checkpoint dict once, and then save to different file paths accordingly (cc @justusschock @awaelchli @akihironitta @rohitgr7 @carmocca @ninginthecloud @jjenniferdai @SeanNaren, @blefaudeux)

  • save_top_k: since monitor is None by default, this should force save_top_k to be -1. The counterargument is that this can cause storage concerns. But I think this is easily correctable on the user-side: configure save_top_k + monitor

  • period: we should rename this to every_n_epochs. This opens up extensions for checkpointing after every_n_steps during training and checkpointing after a specified time interval. With those extensions in mind, period is ambiguous. Another request here is to change the default filename from "{epoch}" to "{epoch}-{global_step}" to better support mid-epoch checkpointing (see the usage sketch below)

cc @awaelchli @carmocca
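
To make the proposal concrete, a hypothetical usage under the clarified semantics might look like the sketch below (every_n_epochs, the always-bool save_last, and the new default filename are proposed names, not the released API):

from pytorch_lightning.callbacks import ModelCheckpoint

# Hypothetical: argument names follow the proposal above, not the current release.
checkpoint_callback = ModelCheckpoint(
    dirpath="checkpoints/",
    filename="{epoch}-{global_step}",  # proposed default, supports mid-epoch saves
    save_last=True,                    # always also write/update "last.ckpt"
    every_n_epochs=1,                  # proposed rename of `period`
    monitor="val_loss",
    save_top_k=3,
)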

@ananthsub added the "bug" and "help wanted" labels on Oct 24, 2020
@awaelchli
Contributor

save_last: I agree, I believe this was the original intention when this feature was added. Definitely useful to have.
period: Regarding naming, we have a PR here: #3807 which needs to be finished, but implements exactly this :)

@carmocca
Contributor

carmocca commented Oct 24, 2020

  • save_last: What about having a symlink_to_last: bool = False argument so there is always a symbolic link to the last saved epoch? It'd also avoid the inefficiency you talk about in your second point.
  • period: I think we should keep the name period and have a mechanism to either track epochs or steps.

Also if we allow tracking epochs or steps, save_last should not be exclusive to epochs
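
For local filesystems, the symlink idea could be as simple as the sketch below (illustrative only; remote fsspec-backed storage generally lacks symlink support, as noted later in the thread):

import os

def update_last_symlink(ckpt_path: str, link_path: str) -> None:
    # Point `last.ckpt` (link_path) at the newest checkpoint, replacing any previous link.
    if os.path.lexists(link_path):
        os.remove(link_path)
    os.symlink(os.path.basename(ckpt_path), link_path)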

@ananthsub
Contributor Author

save_last: What about having a symlink_to_last: bool = False argument so there is always a symbolic link to the last saved epoch? It'd also avoid the inefficiency you talk about in your second point.

I didn't see symlink support in the fsspec API, so this could be challenging without that. But yeah, that's the general idea: maybe we call it save_as_last? save_most_recent? Open to name ideas here haha

period: I think we should keep the name period and have a mechanism to either track epochs or steps.

Users might want to do both: e.g. save a checkpoint every 10,000 steps and at each epoch

@carmocca
Contributor

carmocca commented Oct 24, 2020

Users might want to do both: e.g. save a checkpoint every 10,000 steps and at each epoch

Yes, but I would support that by allowing multiple ModelCheckpoint callbacks.

ModelCheckpoint has become quite complex lately, so we should evaluate splitting it some time in the future.
Any further changes we make should line up with a thought-out future API.
One possibility is separating the basic saving functionality from tracking the TopK of anything. It could be something similar to:

from pathlib import Path
from typing import Optional, Union

class ModelCheckpoint:
    verbose: bool = False
    save_weights_only: bool = False
    period: int = 1
    dirpath: Optional[Union[str, Path]] = None
    filename: Optional[str] = None
    symlink_last: bool = False  # what you propose to be save_last
    # This is missing a mechanism to track either epochs or steps

class TopKModelCheckpoint(Checkpoint):
    # Notice these are not optional anymore
    monitor: str
    save_top_k: int = 1
    mode: str = "auto"
    save_on_end: bool = False  # what we currently call save_last

Just an idea. Open to modifications of course 😄

Edits:

  • (22-12-2020): Remove prefix

@ydcjeff added the "checkpointing" and "feature" labels and removed the "bug" label on Oct 25, 2020
@edenlightning added this to the 1.1 milestone on Oct 27, 2020
@hadim
Contributor

hadim commented Nov 6, 2020

I also agree the current ModelCheckpoint is confusing. I really like the proposal in the first message; it would clarify a lot of things (epochs/steps, save_last, etc.).

I don't think symlinks should be used, as you never really know which filesystem you are saving the checkpoint to. I would let the user deal with storage usage.

Ideally, a single ModelCheckpoint class would be best (IMO) instead of two, but I guess it's a matter of taste at this point.

@kazhang
Contributor

kazhang commented Nov 6, 2020

I have a custom checkpoint callback that inherits from the ModelCheckpoint callback to support every_n_steps: it saves everything (save_top_k=-1 and monitor=None) and skips the save when not on an every_n_steps boundary ((global_step + 1) % every_n_steps != 0).
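
A minimal sketch of that kind of subclass, assuming the 1.0-era Callback hooks and the parent's save_checkpoint(trainer, pl_module) method (every_n_steps is a custom argument, not a built-in one):

from pytorch_lightning.callbacks import ModelCheckpoint

class EveryNStepsCheckpoint(ModelCheckpoint):
    def __init__(self, every_n_steps: int, **kwargs):
        # Save every checkpoint we write; filtering happens on the step boundary below.
        super().__init__(save_top_k=-1, monitor=None, **kwargs)
        self.every_n_steps = every_n_steps

    def on_train_batch_end(self, trainer, pl_module, *args, **kwargs):
        # Skip unless we are exactly on an every_n_steps boundary.
        if (trainer.global_step + 1) % self.every_n_steps != 0:
            return
        self.save_checkpoint(trainer, pl_module)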

At the same time, I also want to store the last checkpoint (save_last=True) so that we can resume training after a crash.

However, this exception prevents me from doing so:

if self.save_last:
    raise MisconfigurationException(
        'ModelCheckpoint(save_last=True, monitor=None) is not a valid configuration.'
        ' You can save the last checkpoint with ModelCheckpoint(save_top_k=None, monitor=None)'
    )

I'm wondering if we could relax that constraint by giving a warning instead?

@carmocca
Contributor

carmocca commented Nov 6, 2020

I'm wondering if we could relax that constraint by giving a warning instead?

Yes! This was already discussed on Slack. Feel free to open a PR.

Ananth S Oct 15th at 6:23 AM
Why does this ModelCheckpoint instantiation raise a misconfiguration error?
Isn't saving the last checkpoint independent of the monitor/topK?

Adrian  23 days ago
@carmocca do you remember?

carmocca  22 days ago
Because monitor=None already saves the last ckpt (but not named last.ckpt) so why would you want to save last again (but named last.ckpt) (edited) 

carmocca  22 days ago
Maybe we should have used the phrase "is a redundant configuration" instead of valid

carmocca  22 days ago
save_last only makes sense when something is being monitored

Ananth S  22 days ago
i see. we are using torchelastic for long training runs. using save_last gives us a checkpoint with name last.ckpt , which is a deterministic path name to find and resume training from

Ananth S  22 days ago
what we really want is "save the most recent checkpoint as last.ckpt" (edited) 

Adrian  22 days ago
That is the original meaning of save_last. To have just that name. It should be a copy of the same file epoch=...ckpt

Adrian  22 days ago
I definitely agree this is useful to have

carmocca  21 days ago
Feel free to send a PR changing "raise MisconfigurationException" to a warning. Also changing "is not a valid" for "is a redundant"

@rohitgr7
Contributor

rohitgr7 commented Nov 22, 2020

save_last only makes sense when something is being monitored

why is that??
IMO save_last should have nothing to do with monitor or save_top_k. Why do we have this warning??

with

ModelCheckpoint(save_last=True, monitor=None, save_top_k=None|0|-1)

it raises this warning

but I think, with

ModelCheckpoint(save_last=True, monitor=anything, save_top_k=anything)

it should not raise any warning or exception w.r.t. save_last

@carmocca
Contributor

why is that??

From my last comment:

Because monitor=None already saves the last ckpt (but not named last.ckpt) so why would you want to save last again (but named last.ckpt)

@rohitgr7
Contributor

Because monitor=None already saves the last ckpt (but not named last.ckpt)

oh ok. Then I suggest it should be reverted back to an exception, with an additional condition of self.save_last and self.save_top_k == -1, since saving the same checkpoint twice doesn't make sense. If one needs to access the last checkpoint, the path is available in best_k_models with the epoch key. Also, save_last should be independent of the monitor, I think.

@awaelchli
Contributor

The original feature of save_last was to just have a copy of epoch=x.ckpt saved as last.ckpt. This would happen regardless of other settings. See the original PR: #1908 and the original feature request: #1658.
The benefit of having this file is being able to load/resume the checkpoint programmatically through last.ckpt without needing to know what the last epoch was. This is also what @ananthsub describes in the original message.
I hope we can keep this feature, as I find it very useful.
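
For example, a resume script can then point at a fixed path (illustrative sketch; resume_from_checkpoint was the Trainer argument at the time, since replaced by the ckpt_path argument to Trainer.fit):

import os
from pytorch_lightning import Trainer

ckpt_dir = "lightning_logs/version_0/checkpoints"  # illustrative path
last_ckpt = os.path.join(ckpt_dir, "last.ckpt")

# Resume from the deterministic "last.ckpt" if it exists, otherwise start fresh.
trainer = Trainer(resume_from_checkpoint=last_ckpt if os.path.isfile(last_ckpt) else None)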

@carmocca
Contributor

I believe we all get confused because save_last can mean two different things.

  1. A copy of the last checkpoint saved (regardless of anything else). This would ideally be a symlink.
  2. A checkpoint of the model just before training is over.

where save_last corresponds to (2). Note that (1) and (2) are the same thing in some circumstances.

In #4335 (comment) I propose having (1) and (2) as symlink_to_last and save_on_end respectively for clarity.

Also, save_last should be independent of the monitor, I think.

It should if we are talking about (1). If we are talking about (2), it is redundant because if monitor is None we will save every validation run.

@Borda modified the milestones: 1.1, 1.1.x on Nov 30, 2020
@Borda modified the milestones: 1.1.x, 1.2 on Dec 30, 2020
@edenlightning modified the milestones: 1.2, 1.3 on Feb 8, 2021
@carmocca self-assigned this on Feb 22, 2021
@edenlightning
Contributor

@carmocca what is left TODO here?

@carmocca
Contributor

carmocca commented Jul 1, 2021

To implement the split described in #4335 (comment), which would clarify the arguments, simplify the code, and facilitate custom ModelCheckpoint extensions.

@jzazo

jzazo commented Feb 7, 2023

Can we revive this discussion? I was also confused about this. To me, save_last = True should mean: save the last checkpoint when training finishes. Whether training stops via early stopping or finishes normally, save the last checkpoint. This would also add the latest tag to loggers.

monitor, mode and every_n_epochs / every_n_steps should work together for checkpointing every specified frequency. This would add the best tag to loggers.

Have a symlink to the latest checkpointed model for failure-recovery purposes (which would rely on the every_n_epochs / every_n_steps setting). This would also add the latest tag to loggers, which would link to the same checkpoint as the one save_last = True would produce if set.

Also, right now, the filename setting only works if save_last = False; otherwise everything is called last.ckpt (unless the rename attribute is overridden).

With the above, the symlink can be named last.ckpt, the rest of the checkpoints can use whatever filename is specified, and all settings would work together.

Would this behavior work with everything that has been said in this discussion?

@donglihe-hub

donglihe-hub commented Jan 7, 2024

ModelCheckpoint has become quite complex lately, so we should evaluate splitting it some time in the future.
Any further changes we do should line up with a thought out future API.

Hi @carmocca. Your proposed save_on_end argument would be really helpful to me, since saving a model to disk is much more time-consuming than saving to memory. To address the issue I had to write my own ModelCheckpoint. I'm wondering whether Lightning will provide the functionality of save_on_end in the future?
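
In the meantime, something along these lines seems workable (a minimal sketch, assuming the standard Callback.on_train_end hook and trainer.save_checkpoint; the dirpath/filename arguments are illustrative):

import os
from pytorch_lightning import Callback

class SaveOnTrainEnd(Callback):
    def __init__(self, dirpath: str, filename: str = "final.ckpt"):
        self.dirpath = dirpath
        self.filename = filename

    def on_train_end(self, trainer, pl_module):
        # Write a single checkpoint only once training has finished.
        trainer.save_checkpoint(os.path.join(self.dirpath, self.filename))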
