-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add version suffix to last.ckpt #5030
Comments
@awaelchli answering #5000 (comment) here How do you propose we fix this then? |
@carmocca Before I make a suggestion could you help me understand the context a bit more. Is this issue a) about damage control for users that accidentally write to the same directory, or b) is this about a particular use case where a user is doing this on purpose? I am asking this because of two reasons. 1) I notice in your description you didn't append epoch numbers to the filenames and the primary reason why the save_last option is exists is for sequential checkpoints numbered by some metric like epoch or step. 2) last.ckpt is always a copy of the last epoch, i.e. epoch_N.ckpt is identical to last.ckpt. Therefore it would be save to overwrite last.ckpt in a subsequent run. Note that situation a) cannot be solved by appending version affixes to the files. Consider the case where user changes epoch size between runs so they would get a mix of files and you wouldn't be able to tell if epoch=x_step=y.ckpt or epoch=x_step=z.ckpt belonged to the first or second run. One would only be able to tell by inspecting the individual checkpoints for hyperparameters. |
One may consider this option:
|
It is about a)
Sorry, I didn't for brevity
That is not the case if you are monitoring something. For example, if you run an experiment for 10 epochs and your best performance is
This is impossible already in master, version suffixes do not imply that all |
if However, this would be inconsistent with what was discussed in #5000. So should we just start versions at 1 and forget about renaming files?
|
This is correct behaviour. last.ckpt as the name suggests should point to the last saved checkpoint in a run, in this case it would be epoch 10 (epoch=9.ckpt). Again, last.ckpt is there for convenience if the user expects their run to be interrupted (for example manually) they can easily restore and continue training with last.ckpt without having to look up the exact filename. Having a file best.ckpt is a separate feature one could look into :) |
Exactly, but you might not have saved on your tenth epoch because your model wasnt in the top-k
So |
These edge cases are terrible. Looks like the versioning is necessary then. |
No strong opinion here. I am fine with both #5030 (comment) renaming or not, 0 or 1 as the first prefix version. |
@carmocca is this done? |
No |
🚀 Feature
If you use
ModelCheckpoint(save_last=True)
and you run an experiment twice in the same directory, then this set of checkpoints is generated:the idea is to add a version also to
last.ckpt
if it would get overwritten:Motivation
Avoid overwriting existing checkpoints
Alternatives
Modifying the default
CHECKPOINT_NAME_LAST
to avoid the conflictAdditional context
Discussed in #5000 (comment)
To be done after #5008 is merged.
cc @Borda @carmocca @awaelchli @ninginthecloud @jjenniferdai @rohitgr7
The text was updated successfully, but these errors were encountered: