[checkpoint] Resolve 2 different checkpoint loading paths across `fit` vs `validate`/`test`/`predict` #9405

ananthsub · 2021-09-09T15:05:47Z

Proposed refactoring or deprecation

Consolidate checkpoint loading code across fit and validate/test/predict

Motivation

We have 2 different code paths for checkpoint loading:

Trainer.fit and the constructor argument resume_from_checkpoint
ckpt_path passed to Trainer.validate/test/predict

Offering multiple code paths here risks divergence. Lightning must ensure a consistent experience for checkpoint loading across these different entry points.

Background

These are the paths today.

Trainer.fit:

Trainer constructor is initialized with resume_from_checkpoint: https://github.com/PyTorchLightning/pytorch-lightning/blob/a079d7fccc0a9be25b40296f2a348c4b4f40c8cf/pytorch_lightning/trainer/trainer.py#L143
Which initializes the checkpoint connector with resume_from_checkpoint: https://github.com/PyTorchLightning/pytorch-lightning/blob/a079d7fccc0a9be25b40296f2a348c4b4f40c8cf/pytorch_lightning/trainer/trainer.py#L383
Which sets this field in the connector: https://github.com/PyTorchLightning/pytorch-lightning/blob/a079d7fccc0a9be25b40296f2a348c4b4f40c8cf/pytorch_lightning/trainer/connectors/checkpoint_connector.py#L38
In _run we call Trainer._restore_modules_and_callbacks: https://github.com/PyTorchLightning/pytorch-lightning/blob/a079d7fccc0a9be25b40296f2a348c4b4f40c8cf/pytorch_lightning/trainer/trainer.py#L960-L962
Which explicitly checks if we're fitting, and if so, first loads the checkpoint, before restoring the model

Trainer.validate/test/predict:

ckpt_path is an argument to the function
There's a common helper to validate the checkpoint path: https://github.com/PyTorchLightning/pytorch-lightning/blob/a079d7fccc0a9be25b40296f2a348c4b4f40c8cf/pytorch_lightning/trainer/trainer.py#L1200
A property is set on the trainer for the corresponding path: https://github.com/PyTorchLightning/pytorch-lightning/blob/a079d7fccc0a9be25b40296f2a348c4b4f40c8cf/pytorch_lightning/trainer/trainer.py#L671-L673
We use an umbrella property to work across validate/test/predict: https://github.com/PyTorchLightning/pytorch-lightning/blob/6e124e7207f6459cb43f540cfb5a1c6cc9b00f7a/pytorch_lightning/trainer/properties.py#L626-L633
Trainer._run checks whether this property is set to decide whether to load the checkpoint:
-- https://github.com/PyTorchLightning/pytorch-lightning/blob/a079d7fccc0a9be25b40296f2a348c4b4f40c8cf/pytorch_lightning/trainer/trainer.py#L950-L951
-- https://github.com/PyTorchLightning/pytorch-lightning/blob/a079d7fccc0a9be25b40296f2a348c4b4f40c8cf/pytorch_lightning/trainer/trainer.py#L1006-L1007

Notably:

trainer.fit calls restore_model whereas trainer.validate/test/predict calls restore_model_weights in the checkpoint connector. https://github.com/PyTorchLightning/pytorch-lightning/blob/a079d7fccc0a9be25b40296f2a348c4b4f40c8cf/pytorch_lightning/trainer/connectors/checkpoint_connector.py#L116-L150
Loading the checkpoint weights happens before the accelerator setup (https://github.com/PyTorchLightning/pytorch-lightning/blob/a079d7fccc0a9be25b40296f2a348c4b4f40c8cf/pytorch_lightning/trainer/trainer.py#L950-L951), while restoring the model happens after it (https://github.com/PyTorchLightning/pytorch-lightning/blob/a079d7fccc0a9be25b40296f2a348c4b4f40c8cf/pytorch_lightning/trainer/trainer.py#L960-L962).

Pitch

Unify in CheckpointConnector
CheckpointConnector.restore_model is a superset of CheckpointConnector.restore_model_weights which suggests we don't need both.
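
To make the superset relationship concrete, here is a minimal sketch with a toy connector. `ToyCheckpointConnector`, its step names, and the assumption that the extra work is the datamodule state are all hypothetical and for illustration only; this is not Lightning's actual implementation.

```python
from typing import Any, Dict, List


class ToyCheckpointConnector:
    """Hypothetical stand-in for the checkpoint connector; not Lightning code."""

    def __init__(self) -> None:
        self.steps: List[str] = []  # records which restore steps ran

    def _load_state_dict(self, checkpoint: Dict[str, Any]) -> None:
        self.steps.append("model_state_dict")

    def restore_model_weights(self, checkpoint: Dict[str, Any]) -> None:
        # Weights-only path: a single step.
        self._load_state_dict(checkpoint)

    def restore_model(self, checkpoint: Dict[str, Any]) -> None:
        # Full path: the same weights step plus extra state (assumed here to
        # be the datamodule state, purely for illustration).
        self.steps.append("datamodule_state")
        self._load_state_dict(checkpoint)


weights_only = ToyCheckpointConnector()
weights_only.restore_model_weights({"state_dict": {}})

full = ToyCheckpointConnector()
full.restore_model({"state_dict": {}})

# Every step of the weights-only path also runs in the full path.
print(set(weights_only.steps) <= set(full.steps))
```

If the real methods relate in this way, keeping only the broader one removes a place where the two paths could drift apart.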
Unify in trainer.py

The overall sequence in _run can be shared:

Load & validate the checkpoint dict through the training type plugin
Load the datamodule state
Load the LightningModule state through the training type plugin
Load the callback states
Load the loop states
If fitting, load the precision, optimizer, and LR scheduler states
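
The sequence above could be sketched roughly as follows. This is a toy under stated assumptions: `restore_sequence`, `fake_loader`, and the step names are hypothetical, the `state_dict` check stands in for whatever validation the training type plugin would perform, and each step name stands in for a real restore call.

```python
from typing import Any, Callable, Dict, List, Optional


def restore_sequence(
    ckpt_path: Optional[str],
    fitting: bool,
    load_checkpoint: Callable[[str], Dict[str, Any]],
) -> List[str]:
    """Return the names of the restore steps, in the order they would run."""
    if ckpt_path is None:
        return []  # nothing to resume from

    # Step 1: load and validate the checkpoint dict.
    checkpoint = load_checkpoint(ckpt_path)
    if "state_dict" not in checkpoint:
        # Minimal validation for the sketch only.
        raise KeyError(f"{ckpt_path!r} has no 'state_dict' entry")

    # Steps shared by fit, validate, test, and predict.
    steps = ["datamodule", "lightning_module", "callbacks", "loops"]
    if fitting:
        # Training-only state is skipped for validate/test/predict.
        steps += ["precision", "optimizers", "lr_schedulers"]
    return steps


def fake_loader(path: str) -> Dict[str, Any]:
    # Hypothetical loader that pretends every path holds a valid checkpoint.
    return {"state_dict": {}}


print(restore_sequence("last.ckpt", fitting=False, load_checkpoint=fake_loader))
print(restore_sequence("last.ckpt", fitting=True, load_checkpoint=fake_loader))
```

The only branch is the `fitting` flag, so all four entry points would share the same ordering relative to the rest of the setup.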

We have 5 different properties exposed for the checkpoint path to resume from (excluding HPC stuff):

trainer.resume_from_checkpoint
trainer.validated_ckpt_path: https://github.com/PyTorchLightning/pytorch-lightning/blob/a079d7fccc0a9be25b40296f2a348c4b4f40c8cf/pytorch_lightning/trainer/trainer.py#L671-L674
trainer.tested_ckpt_path: https://github.com/PyTorchLightning/pytorch-lightning/blob/a079d7fccc0a9be25b40296f2a348c4b4f40c8cf/pytorch_lightning/trainer/trainer.py#L760-L762
trainer.predicted_ckpt_path: https://github.com/PyTorchLightning/pytorch-lightning/blob/a079d7fccc0a9be25b40296f2a348c4b4f40c8cf/pytorch_lightning/trainer/trainer.py#L843-L845
An umbrella property across these 3: trainer._ckpt_path: https://github.com/PyTorchLightning/pytorch-lightning/blob/6e124e7207f6459cb43f540cfb5a1c6cc9b00f7a/pytorch_lightning/trainer/properties.py#L626-L633
-- This is also inconsistent: the attributes initialized in validate/test/predict are public, while _ckpt_path is private. Why?

It's unclear what the lifecycle of these properties should be. Do successive calls to validate/test/predict end up relying on this?

Proposal:

Pass the checkpoint path to resume from as an argument to Trainer._run to avoid our dependency on these properties
Deprecate these properties
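
A minimal sketch of this proposal, assuming a toy class: `ToyTrainer`, its `loaded` list, and the warning text are hypothetical, and only the idea of passing the path into `_run` and deprecating the property comes from the proposal.

```python
import warnings
from typing import List, Optional


class ToyTrainer:
    """Hypothetical trainer; only illustrates passing ckpt_path into _run."""

    def __init__(self) -> None:
        self._last_ckpt_path: Optional[str] = None
        self.loaded: List[str] = []  # paths that _run actually loaded

    @property
    def tested_ckpt_path(self) -> Optional[str]:
        # Kept for backwards compatibility, but flagged as deprecated.
        warnings.warn(
            "`tested_ckpt_path` is deprecated (toy message)",
            DeprecationWarning,
            stacklevel=2,
        )
        return self._last_ckpt_path

    def test(self, ckpt_path: Optional[str] = None) -> None:
        self._last_ckpt_path = ckpt_path  # only feeds the deprecated property
        self._run(ckpt_path=ckpt_path)

    def _run(self, ckpt_path: Optional[str] = None) -> None:
        # The path arrives as an argument, so a value left over from an
        # earlier call cannot leak into this one.
        if ckpt_path is not None:
            self.loaded.append(ckpt_path)


trainer = ToyTrainer()
trainer.test(ckpt_path="a.ckpt")
trainer.test()  # no checkpoint requested, so nothing is loaded
print(trainer.loaded)
```

With the path as an explicit argument, the lifecycle question about the properties goes away: each call to `_run` sees only what its caller passed.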

Unify in the Trainer API
Proposal: deprecate resume_from_checkpoint from the Trainer constructor, and add a new argument ckpt_path to Trainer.fit. This provides API consistency with validate/test/predict.
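
One way the deprecation could look, as a hedged sketch: `SketchTrainer`, the warning message, and the rule that an explicit `ckpt_path` takes precedence over the deprecated constructor value are assumptions made for illustration, not decisions recorded in this issue.

```python
import warnings
from typing import Optional


class SketchTrainer:
    """Hypothetical trainer showing the proposed fit(ckpt_path=...) API."""

    def __init__(self, resume_from_checkpoint: Optional[str] = None) -> None:
        if resume_from_checkpoint is not None:
            warnings.warn(
                "`resume_from_checkpoint` is deprecated; "
                "pass `ckpt_path` to `fit` instead (toy message)",
                DeprecationWarning,
                stacklevel=2,
            )
        self._resume_from_checkpoint = resume_from_checkpoint

    def fit(self, ckpt_path: Optional[str] = None) -> Optional[str]:
        # Assumed precedence: the explicit argument wins, and the deprecated
        # constructor value is only a fallback.
        if ckpt_path is not None:
            return ckpt_path
        return self._resume_from_checkpoint

    def test(self, ckpt_path: Optional[str] = None) -> Optional[str]:
        # Same argument name as fit, which is the consistency being proposed.
        return ckpt_path


trainer = SketchTrainer()
print(trainer.fit(ckpt_path="last.ckpt"))
print(trainer.test(ckpt_path="last.ckpt"))
```

Both methods return the path they would resume from, which keeps the sketch checkable without any real checkpoint file.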

ananthsub · 2021-09-09T17:33:25Z

paraphrasing @awaelchli

Removing the trainer argument will greatly impact usability and configurability through the CLI. To that you will counter-argue that Trainer should not be a configuration system. Still, sad to see that we are just taking it away.

But how does the CLI work today for passing ckpt_path to validate, test, or predict? Why is this any different from that?

Related to that I was hoping we could take this one up some day: #5339

I commented there with what could be viewed as an extension to the trainer argument discussion here

carmocca · 2021-09-14T12:33:38Z

> but how does the CLI work today for passing ckpt_path to validate, test, or predict? why is this any different from that?

In the CLI you specify the fit checkpoint with

python script.py fit --trainer.resume_from_checkpoint=...

and the test checkpoint with:

python script.py test ckpt_path=...

This PR would make it consistent.

I personally think there is a lot of value in doing this. If this isn't pursued, I'd suggest renaming resume_from_checkpoint to fit_ckpt_path to address the confusion.

ananthsub added the feature, refactor, and checkpointing labels Sep 9, 2021

ananthsub self-assigned this Sep 9, 2021

This was referenced Sep 9, 2021

Feature incompatibilities with HPC/Slurm saving & loading #9407

Closed

Resuming should allow to differentiate what to resume (steps/opti/weights) #5339

Open

tchaton added the "let's do it!" label Sep 10, 2021

daniellepintz mentioned this issue Sep 14, 2021

Deprecate resume_from_checkpoint from Trainer constructor in favor of adding ckpt_path to fit() #9501

Closed

jjenniferdai self-assigned this Sep 21, 2021

jjenniferdai mentioned this issue Sep 24, 2021

[see #10061 instead] Unify checkpoint load paths #9693

Closed

12 tasks

awaelchli mentioned this issue Oct 3, 2021

Expose strict argument to model loading in trainer.tune #9798

Closed

jjenniferdai mentioned this issue Oct 21, 2021

Unify checkpoint load paths [redo #9693] #10061

Merged

12 tasks

tchaton closed this as completed in #10061 Oct 25, 2021

ananthsub mentioned this issue Nov 10, 2021

[RFC] Move Trainer's loop-affecting arguments to fit, validate, test, and predict #10444

Closed

blisc mentioned this issue Nov 16, 2021

Provide a Default Parameter for Fit's Checkpoint Restore Path #10573

Closed

ananthsub mentioned this issue Dec 22, 2021

Trainer.test() in combination with resume_from_checkpoint is broken #5091

Closed
