[RFC] Create a ModelCheckpointBase callback #6504
Comments
Related to this issue, I want to implement the following logic: when the best checkpoint is saved, also save that checkpoint's validation predictions, in a separate file rather than in the checkpoint's binary file. Right now, I can't know who calls
Previous discussion: #4335 (comment). Should we close that one in favor of this one? @ananthsub
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team! |
I have a different take on this now: extending the framework relies on the class hierarchy to determine the intent of a callback. The Trainer uses this in particular to determine which callbacks are enabled by default. This is used for:
I think for these, we ought to create an empty base class. This way, users don't have to worry about changes to the concrete implementations also offered by the framework. As an example:

    class BaseModelCheckpoint(Callback):
        pass

    class ModelCheckpoint(BaseModelCheckpoint):
        # existing code today
        ...

    class MyCustomModelCheckpoint(BaseModelCheckpoint):
        # user-defined checkpointing logic
        ...

Essentially, someone should be able to extend BaseModelCheckpoint, which is completely empty, instead of ModelCheckpoint, and customize it however they see fit without worrying about keeping their code in sync with the framework's changes to ModelCheckpoint.
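To make the "intent via class hierarchy" point concrete, here is a minimal sketch of how a Trainer-like component could decide whether to add a default checkpoint callback. Everything in it is an assumption for illustration: the classes are local stand-ins rather than the framework's own, and `configure_checkpoint_callbacks` and `enable_checkpointing` are hypothetical names, not the Trainer's actual internals.

```python
class Callback:
    """Stand-in for the framework's Callback class (assumption)."""


class BaseModelCheckpoint(Callback):
    """Empty marker class: it carries intent, not behavior."""


class ModelCheckpoint(BaseModelCheckpoint):
    """Stand-in for the framework's concrete implementation."""


class MyCustomModelCheckpoint(BaseModelCheckpoint):
    """User-defined checkpointing that shares no code with ModelCheckpoint."""


def configure_checkpoint_callbacks(callbacks, enable_checkpointing=True):
    """Hypothetical helper: add a default checkpoint callback only if the
    user has not supplied anything that derives from the marker class."""
    callbacks = list(callbacks)
    has_checkpoint = any(isinstance(cb, BaseModelCheckpoint) for cb in callbacks)
    if enable_checkpointing and not has_checkpoint:
        callbacks.append(ModelCheckpoint())
    return callbacks


# A custom callback is recognized through the marker class alone,
# so no default ModelCheckpoint is appended.
custom_only = configure_checkpoint_callbacks([MyCustomModelCheckpoint()])

# With no checkpoint callback supplied, the default one is added.
with_default = configure_checkpoint_callbacks([])
```

The check only looks at `BaseModelCheckpoint`, so changes inside `ModelCheckpoint` would not affect how the custom callback is detected.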
What would be the internal advantage of offering this empty base class?
custom checkpoint callbacks ( |
I am in favor of this. I think it's becoming increasingly important that
We couldn't do this in the past because we didn't support multiple instances of the same callback. Related: #4335, where I propose a split. We would need to revisit it because that discussion is old.
🚀 Feature
Create a ModelCheckpointBase callback, and have the existing checkpoint callback extend it
Motivation
The model checkpoint callback is growing in complexity. Features that have been recently added or will soon be proposed:
https://github.com/PyTorchLightning/pytorch-lightning/blob/680e83adab38c2d680b138bdc39d48fc35c0cb58/pytorch_lightning/trainer/training_loop.py#L152-L163
The decision was made in #6146 to keep these triggers mutually exclusive, at least based on the phase they run in. Why? It's very hard to get the state management right. For instance, the monitor might be added for something that's available only during validation, but the checkpoint callback is configured to run during training too, and crashes when it tries to look up the monitor key in the available metrics for tracking. Tracking top-K models and scores is another huge pain. Supporting multiple monitor metrics on top of this is another beast.
cc @Borda @carmocca @awaelchli @ninginthecloud @jjenniferdai @rohitgr7
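The monitor failure mode described above can be illustrated with a small sketch. It uses plain dictionaries as stand-ins for the metrics available in each phase; `lookup_monitor`, `train_metrics`, `val_metrics`, and the metric names are hypothetical and do not reflect the callback's real code.

```python
def lookup_monitor(monitor, metrics):
    """Hypothetical helper: return the monitored value, or raise an error
    that names the metrics which are actually available."""
    if monitor not in metrics:
        raise KeyError(
            f"monitor {monitor!r} not found; available metrics: {sorted(metrics)}"
        )
    return metrics[monitor]


# Assumed example: val_loss is only logged during validation.
train_metrics = {"train_loss": 0.9}
val_metrics = {"train_loss": 0.9, "val_loss": 1.1}

# Works when the callback is triggered after validation.
best_candidate = lookup_monitor("val_loss", val_metrics)

# Fails when the same callback is also triggered during training.
try:
    lookup_monitor("val_loss", train_metrics)
    failed_in_training = False
except KeyError:
    failed_in_training = True
```

The same configuration is valid in one phase and invalid in the other, which is why getting the state management right across triggers is hard.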
Pitch
Move the existing logic for the following into a base class:
And have thin wrappers on top which extend this class and implement callback hook(s) for when to save the checkpoint.
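A rough sketch of what this split could look like, under stated assumptions: the shared logic is reduced to a single hypothetical `_save_checkpoint` method that only records what it would save, the subclass names are made up, and `Callback` and the trainer object are local stand-ins. The hook names and the `(trainer, pl_module)` arguments follow the usual callback hook shape, but this is not the framework's implementation.

```python
from types import SimpleNamespace


class Callback:
    """Stand-in for the framework's Callback class (assumption)."""


class ModelCheckpointBase(Callback):
    """Holds the shared save logic but implements no trigger hook."""

    def __init__(self):
        # Records (reason, global_step) pairs instead of writing files.
        self.saved = []

    def _save_checkpoint(self, trainer, reason):
        # Hypothetical shared logic; a real version would write a file.
        self.saved.append((reason, trainer.global_step))


class TrainEpochCheckpoint(ModelCheckpointBase):
    """Thin wrapper: saves only at the end of a training epoch."""

    def on_train_epoch_end(self, trainer, pl_module):
        self._save_checkpoint(trainer, "train_epoch_end")


class ValidationCheckpoint(ModelCheckpointBase):
    """Thin wrapper: saves only at the end of validation."""

    def on_validation_end(self, trainer, pl_module):
        self._save_checkpoint(trainer, "validation_end")


# Stand-in trainer exposing only the attribute the sketch needs.
trainer = SimpleNamespace(global_step=100)

train_cb = TrainEpochCheckpoint()
val_cb = ValidationCheckpoint()
train_cb.on_train_epoch_end(trainer, pl_module=None)
val_cb.on_validation_end(trainer, pl_module=None)
```

Each wrapper owns exactly one trigger, so the triggers stay mutually exclusive by construction and each instance tracks its own state.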
Alternatives
The checkpoint callback gets bigger and bigger as we add more features to it.
Additional context