Clarify the model checkpoint arguments #4335
Comments
`save_last`: I agree, I believe this was the original intention when this feature was added. Definitely useful to have.

Also, if we allow tracking epochs or steps, …

I didn't see symlink support in the fsspec API, so this could be challenging without that. But yeah, that's the general idea. Maybe we call it …
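When the filesystem doesn't support symlinks (the fsspec concern above), a plain-copy fallback is one way out. A minimal sketch, assuming a local filesystem; `link_last` is a hypothetical helper name, not part of Lightning:

```python
import os
import shutil


def link_last(ckpt_path: str, last_path: str = "last.ckpt") -> None:
    """Point `last_path` at the most recent checkpoint.

    Uses a symlink where the filesystem supports it, and falls back to a
    plain copy otherwise (e.g. some network or object-store mounts).
    """
    # Remove a stale link/file from a previous save, if any.
    if os.path.lexists(last_path):
        os.remove(last_path)
    try:
        os.symlink(os.path.abspath(ckpt_path), last_path)
    except (OSError, NotImplementedError):
        # Filesystem without symlink support: duplicate the bytes instead.
        shutil.copyfile(ckpt_path, last_path)
```

The copy fallback doubles storage for one checkpoint, but keeps the "predictable name to resume from" property either way.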
Users might want to do both: e.g. save a checkpoint every 10,000 steps and at each epoch.

Yes, but I would support that by allowing multiple `ModelCheckpoint` callbacks:
```python
class ModelCheckpoint:
    verbose: bool = False
    save_weights_only: bool = False
    period: int = 1
    dirpath: Optional[Union[str, Path]] = None
    filename: Optional[str] = None
    symlink_last: bool = False  # what you propose to be save_last
    # This is missing a mechanism to track either epochs or steps


class TopKModelCheckpoint(Checkpoint):
    # Notice these are not optional anymore
    monitor: str
    save_top_k: int = 1
    mode: str = "auto"
    save_on_end: bool = False  # what we currently call save_last
```

Just an idea. Open to modifications of course 😄
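The two-callback split above could be fleshed out as plain dataclasses. This is only an illustration of the proposal, not the actual Lightning API; the class names and the `should_save` helper are hypothetical:

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional, Union


@dataclass
class PeriodicCheckpoint:
    """Saves unconditionally every `period` epochs (hypothetical sketch)."""
    dirpath: Optional[Union[str, Path]] = None
    filename: Optional[str] = None
    period: int = 1
    save_weights_only: bool = False
    symlink_last: bool = False  # proposed replacement for save_last

    def should_save(self, epoch: int) -> bool:
        # No metric involved: purely time/epoch driven.
        return epoch % self.period == 0


@dataclass
class TopKCheckpoint:
    """Saves only when the monitored metric ranks in the top k."""
    monitor: str  # no longer Optional in the proposal
    save_top_k: int = 1
    mode: str = "min"
    save_on_end: bool = False  # proposed replacement for save_last
```

The "every 10,000 steps and at each epoch" use case would then be covered by registering both callbacks side by side, e.g. `Trainer(callbacks=[PeriodicCheckpoint(period=1), TopKCheckpoint(monitor="val_loss")])`.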
I also agree the current … I don't think a symlink should be used, as you never really know on which filesystem you are saving the checkpoint. I would let the user deal with storage usage. Ideally, a single …
I have a custom checkpoint callback that inherits from the `ModelCheckpoint` callback. At the same time, I also want to store the last checkpoint (…). However, this exception prevents me from doing so.

I'm wondering if we could relax that constraint by giving a warning instead?
Yes! This was already discussed in Slack. Feel free to open a PR.
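The relaxed check might look like this; `validate_save_last` is a hypothetical helper, not Lightning's actual validation code:

```python
import warnings


def validate_save_last(save_last: bool, save_top_k: int) -> None:
    """Hypothetical relaxed validation: warn instead of raising.

    With save_top_k == -1 every checkpoint is already kept, so
    save_last only adds a predictable 'last.ckpt' alias -- redundant,
    but not an error worth blocking custom subclasses over.
    """
    if save_last and save_top_k == -1:
        warnings.warn(
            "save_last=True with save_top_k=-1: 'last.ckpt' will duplicate "
            "the most recent checkpoint.",
            UserWarning,
        )
```

A warning keeps the combination visible in logs while letting subclasses that intentionally want both behaviors proceed.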
Why is that? With `ModelCheckpoint(save_last=True, monitor=None, save_top_k=None|0|-1)` it raises this warning, but I think with `ModelCheckpoint(save_last=True, monitor=anything, save_top_k=anything)` it should not raise any warning or exception w.r.t. …
From my last comment:
Oh, ok. Then I suggest it should be reverted back to an exception, with an additional condition of …
The original feature of `save_last` was to just have a copy of `epoch=x.ckpt` saved as `last.ckpt`. This would happen regardless of other settings. See the original PR: #1908 and the original feature request: #1658.
I believe we all get confused because … where … In #4335 (comment) I propose having (1) and (2) as …

It should, if we are talking about (1). If we are talking about (2), it is redundant, because if `monitor` is …
@carmocca what is left TODO here?
To implement the split described in #4335 (comment), which would clarify the arguments, simplify the code, and facilitate custom …
Can we revive this discussion? I was also confused about this. To me, … have a symlink to the latest checkpointed model for failure-recovery purposes (which will rely on the …). Also, right now, … With the above, the symlink can be named … Would this behavior work with everything that has been said in this discussion?
Hi @carmocca. Your proposed `save_on_end` argument would be really helpful to me, since saving a model to disk is much more time-consuming than saving to memory. To address this I had to write my own `ModelCheckpoint`. I'm wondering: will Lightning provide the functionality of `save_on_end` in the future?
🐛 Proposals

This is not so much a bug report as an RFC to clarify the `ModelCheckpoint` callback arguments:

- `save_last`: to me, this means that whenever we save a checkpoint, we save a checkpoint with filename `"last.ckpt"`. This provides a pre-determined checkpoint name, which is very helpful for resuming from failures. Importantly, it should not determine when checkpoints are saved. Currently it's easy to confuse this parameter to mean "save the checkpoint after the last epoch," which I think should be split out as a separate argument. This distinction would also clarify the typing and validation: there's no need for it to be an `Optional[bool]`; either we save a checkpoint as `"last.ckpt"` or not, so it could be a regular `bool`. There's an inefficiency right now where we generate the checkpoint dict twice if `save_last=True`. For techniques like ZeRO that deal with sharded optimizer states, each checkpoint dict creation triggers communications across all ranks. Instead, we should gather the checkpoint dict once, and then save to different file paths accordingly (cc @justusschock @awaelchli @akihironitta @rohitgr7 @carmocca @ninginthecloud @jjenniferdai @SeanNaren @blefaudeux).
- `save_top_k`: since `monitor` is `None` by default, this should force `save_top_k` to be -1. The counterargument is that this can cause storage concerns. But I think this is easily correctable on the user side: configure `save_top_k` + `monitor`.
- `period`: we should rename this to `every_n_epochs`. This opens up extensions for checkpointing after `every_n_steps` during training and checkpointing after a specified time interval. With those extensions in mind, `period` is ambiguous.

Another request here is to change the default filename from `"{epoch}"` to `"{epoch}-{global_step}"` to better support mid-epoch checkpointing.

cc @awaelchli @carmocca
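The "gather the checkpoint dict once" point could be sketched like this, with `pickle` standing in for `torch.save`'s serialization; `save_checkpoint` is a hypothetical helper, not the actual implementation:

```python
import pickle
from typing import Any, Dict, List


def save_checkpoint(state: Dict[str, Any], paths: List[str]) -> None:
    """Serialize the checkpoint dict once, write the bytes to every path.

    e.g. paths = ["epoch=3.ckpt", "last.ckpt"]. Building the state dict
    only once matters when creating it triggers cross-rank communication,
    as with ZeRO-sharded optimizer states; writing the same serialized
    blob twice is cheap by comparison.
    """
    blob = pickle.dumps(state)  # single serialization pass
    for path in paths:
        with open(path, "wb") as f:
            f.write(blob)
```

This also guarantees the `last.ckpt` copy is byte-identical to the named checkpoint, which a second independent save does not.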