Add callback method for on_save_checkpoint #2401

jeremyjordan · 2020-06-28T15:53:50Z

🚀 Feature

We should allow Callback objects to optionally persist state that can be reloaded from checkpoints.

Motivation

We already manually save the state for the early stopping and model checkpoint callbacks. This refactor would eliminate callback-specific code in the Trainer and extend the same capability to user-written callbacks.

Pitch

This callback method would just return a state_dict which the Trainer could store. The one thing I am unsure about is how we should reinitialize the state for other callbacks. If we can expect that the exact same callbacks will be passed to the Trainer, then it should be trivial. Alternatively, we could expect that you only pass in a single instance of each callback class (e.g. callbacks=[CustomerLogger(), EarlyStopping(), ModelCheckpoint()] and not callbacks=[CustomerLogger(params_a), CustomerLogger(params_b), EarlyStopping(), ModelCheckpoint()]) and just keep a mapping of callback class to state dicts. However, if the user passes multiple callback instances of the same class, I'm not sure how we would want to handle that.

I would recommend that we document the following constraints:

All objects in the dictionary must be pickle-able.
You cannot persist multiple instances of the same callback class.
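
To make the pitch concrete, here is a minimal sketch of how the mapping of callback class to state dict could work under the two constraints above. Everything in it is an assumption for illustration: the `Callback` base class, the `on_load_checkpoint` hook, `CountingCallback`, and the helpers `collect_callback_states` and `restore_callback_states` are hypothetical names, not the Trainer's actual code.

```python
import pickle
from typing import Any, Dict, List


class Callback:
    """Hypothetical base class; the hook names are illustrative only."""

    def on_save_checkpoint(self) -> Dict[str, Any]:
        # Stateless callbacks return an empty dict and are skipped.
        return {}

    def on_load_checkpoint(self, state: Dict[str, Any]) -> None:
        pass


class CountingCallback(Callback):
    """Example user callback with a single piece of state to persist."""

    def __init__(self) -> None:
        self.batches_seen = 0

    def on_save_checkpoint(self) -> Dict[str, Any]:
        return {"batches_seen": self.batches_seen}

    def on_load_checkpoint(self, state: Dict[str, Any]) -> None:
        self.batches_seen = state["batches_seen"]


def _key(callback: Callback) -> str:
    # Key by qualified class name so the stored mapping holds only strings and dicts.
    cls = type(callback)
    return f"{cls.__module__}.{cls.__qualname__}"


def collect_callback_states(callbacks: List[Callback]) -> Dict[str, Dict[str, Any]]:
    """Build the class-to-state mapping, enforcing both documented constraints."""
    states: Dict[str, Dict[str, Any]] = {}
    seen = set()
    for callback in callbacks:
        key = _key(callback)
        if key in seen:
            raise ValueError(f"Cannot persist multiple instances of {key}")
        seen.add(key)
        state = callback.on_save_checkpoint()
        if state:
            pickle.dumps(state)  # raises if the state is not pickle-able
            states[key] = state
    return states


def restore_callback_states(
    callbacks: List[Callback], states: Dict[str, Dict[str, Any]]
) -> None:
    """Hand each callback the state saved under its class, if any."""
    for callback in callbacks:
        key = _key(callback)
        if key in states:
            callback.on_load_checkpoint(states[key])
```

With this layout, a freshly constructed `CountingCallback` passed to `restore_callback_states` picks up the `batches_seen` value saved by the earlier instance, and passing two `CountingCallback` instances to `collect_callback_states` raises a `ValueError`.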

awaelchli · 2020-06-28T17:05:55Z

I think it is reasonable to assume there is only one instance of these special callbacks (and we should raise an error otherwise, e.g. see the progress bar callback). Note that currently the logger is not a callback.
Also, the documentation for callbacks should probably let the user know that the order of the list passed to the Trainer is not preserved (e.g. the Trainer should reorder it so that the early stopping callback comes after the checkpoint callback, right?)
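
As a rough sketch of both suggestions (raise on duplicate special callbacks, and reorder so early stopping comes after checkpoint), something like the following could work. The classes here are empty stand-ins rather than the real ones, `prepare_callbacks` is a hypothetical helper, and placing user callbacks ahead of the two special ones is an assumption made only for this example.

```python
from typing import List


class Callback:
    """Stand-in base class for illustration."""


class EarlyStopping(Callback):
    """Stand-in for the early stopping callback."""


class ModelCheckpoint(Callback):
    """Stand-in for the model checkpoint callback."""


class CustomLogger(Callback):
    """Stand-in for a user-written callback."""


# Special callbacks in the order they should run; assumed, not the Trainer's actual rule.
SPECIAL_ORDER = (ModelCheckpoint, EarlyStopping)


def prepare_callbacks(callbacks: List[Callback]) -> List[Callback]:
    """Reject duplicate special callbacks, then move them to the end in SPECIAL_ORDER."""
    for cls in SPECIAL_ORDER:
        count = sum(isinstance(cb, cls) for cb in callbacks)
        if count > 1:
            raise ValueError(f"Expected at most one {cls.__name__}, got {count}")

    def rank(cb: Callback) -> int:
        for position, cls in enumerate(SPECIAL_ORDER, start=1):
            if isinstance(cb, cls):
                return position
        return 0  # user callbacks keep their relative order at the front

    # sorted() is stable, so callbacks with equal rank stay in their input order.
    return sorted(callbacks, key=rank)
```

Because the result can differ from the input order, this is the kind of behavior the callback documentation would need to spell out.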

jeremyjordan · 2020-06-28T19:22:45Z

Cool, I'll put together a draft PR once we merge #2391 :)

edenlightning · 2020-08-03T22:24:50Z

also #2631

jeremyjordan added the "feature" (Is an improvement or enhancement) and "help wanted" (Open to be worked on) labels Jun 28, 2020

jeremyjordan mentioned this issue Jun 28, 2020

fixes for early stopping and checkpoint callbacks #1504

Merged

10 tasks

awaelchli mentioned this issue Jun 29, 2020

Will load_from_checkpoint load Huggingface models as well? #2404

Closed

jeremyjordan mentioned this issue Jul 4, 2020

callback method for on_save_checkpoint #2501

Merged

10 tasks

awaelchli closed this as completed in #2501 Aug 28, 2020

ORippler mentioned this issue Dec 11, 2020

ModelCheckpoint fails at garbage collecting checkpoint passed to Trainer.resume_from_checkpoint #5090

Closed
