Resuming should allow to differentiate what to resume (steps/opti/weights) #5339
Comments
One way this can be done is by updating the error to just a warning and making the relevant changes below. cc: @PyTorchLightning/core-contributors, thoughts?
Hey @rohitgr7, I am not sure resuming and changing the optimizer / scheduler is the best option. Possibly, we could extend LightningModule's configure_optimizers to take current_epoch:

```python
def configure_optimizers(self, current_epoch: int = None):
    ...
    return [optimizers], [lr_schedulers], should_update  # should_update: bool
```

Internally, we inspect the model function to see whether the user requested the current_epoch argument. Example:

```python
def configure_optimizers(self, current_epoch: int = None):
    milestones = [0, 100, ...]
    optimizers = None
    lr_schedulers = None
    if current_epoch == milestones[0]:
        # init optimizers / schedulers
        ...
    elif current_epoch == milestones[-1]:
        # init new optimizers / schedulers
        ...
    return [optimizers], [lr_schedulers], current_epoch in milestones
```

What are your thoughts?
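For illustration only, a sketch of how a user might write such a hypothetical current_epoch-aware configure_optimizers. The current_epoch argument and the third return value are part of the proposal above and not part of the existing Lightning API; the layer, milestones, and optimizer choices are placeholders:

```python
import torch
import pytorch_lightning as pl


class MyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)

    # Hypothetical signature: per the proposal, Lightning would inspect this
    # method and pass current_epoch only if the argument is declared.
    def configure_optimizers(self, current_epoch: int = 0):
        milestones = [0, 100]
        if current_epoch < milestones[-1]:
            optimizer = torch.optim.SGD(self.parameters(), lr=0.1, momentum=0.9)
        else:
            optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
        # Third value would tell the trainer whether to re-create optimizers at this epoch.
        return [optimizer], [scheduler], current_epoch in milestones
```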
@tchaton yeah this is a good suggestion, but what if someone sets
I brought this up because the current options (save_weights_only or not) are a little too restrictive. I have found a hack for me for now, but I think there is a more widespread use case for being able to change the optimizers: SGD, for example, has certain properties that Adam does not, and one might want to exploit those at different stages of training.
I think that's too complicated, and its complexity will grow too large if we wanted to do it on occasions other than epoch boundaries. In general, I think we should let users save any combination of {training, optimizer, model} state and resume training with whichever combination is provided.
Is there any progress on how/when such a combinatorial choice for users will be implemented?
No current progress/plans that I know of. Implementing this will require attention.
@ananthsub what do you think about this? Could be included in our plans for better fault-tolerant training.
@carmocca, related but not addressing the exact same matter: what about exposing the strict flag of torch.nn.Module.load_state_dict?
You can already set the strict flag in load_from_checkpoint, which is passed to load_state_dict.
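For example, a minimal sketch of that existing API (MyModel and the checkpoint path are placeholder names):

```python
# load_from_checkpoint forwards `strict` to torch.nn.Module.load_state_dict,
# so missing/unexpected keys in the checkpoint are tolerated when strict=False.
model = MyModel.load_from_checkpoint("path/to/checkpoint.ckpt", strict=False)
```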
@carmocca, thanks for the comment.
@carmocca, does the thumbs up mean you agree that exposing it would be a good move?
I agree there should be a way to set this. The relevant piece of code is here: One idea would be to have a property on the model for it. Another option would be to save whether to load strictly or not inside the checkpoint itself and do:

```python
model.load_state_dict(checkpoint['state_dict'], strict=checkpoint.get('strict', True))
```

Any thoughts?
@carmocca, somehow it seems like my comment disappeared. I'll write it again.
I personally don't like flags that only work if others are active, so I'd rather avoid this solution.
Are you talking about a hook that encapsulates the
@carmocca,
@awaelchli I think this can be viewed as an extension to #9405. One half-baked idea is to specify a dataclass describing which parts of the checkpoint should be loaded; trainer.fit/validate/test/predict would then accept this dataclass instead of the checkpoint path.
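A minimal sketch of what such a (purely hypothetical) dataclass might look like; the field names and the Trainer integration are illustrative and not an existing Lightning API:

```python
from dataclasses import dataclass


@dataclass
class CheckpointLoadSpec:
    """Describes which parts of a checkpoint to restore when resuming."""
    path: str
    load_weights: bool = True
    load_optimizer_states: bool = False  # skip to allow switching optimizers
    load_lr_schedulers: bool = False
    load_loop_progress: bool = True      # keep epoch / global-step counters


# Hypothetical usage, assuming Trainer.fit accepted such an object:
# trainer.fit(model, datamodule=dm, ckpt=CheckpointLoadSpec("last.ckpt"))
```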
I am having this problem. I set save_weights_only=True, and now if I try to do resume_from_checkpoint='/path/file.ckpt' it raises the following error: 'Trying to restore training state but checkpoint contains only the model. This is probably due to ModelCheckpoint.save_weights_only being set to True.'
load_from_checkpoint can be set
@carmocca I'm training models with PTL using
One hacky way to do this currently would be to override the
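One manual workaround along these lines (a sketch under assumptions: a LightningModule subclass MyModel constructed with matching hyperparameters, and a standard full Lightning checkpoint; it is not necessarily the override the comment above had in mind) is to restore only the weights and read the loop counters from the checkpoint dict directly:

```python
import torch

ckpt = torch.load("path/to/full.ckpt", map_location="cpu")

# Restore only the model weights; the optimizer/scheduler states stored in the
# checkpoint are deliberately ignored so training can continue with a different optimizer.
model = MyModel()
model.load_state_dict(ckpt["state_dict"])

# Standard Lightning checkpoints also carry the loop counters, which can at least
# be read out for bookkeeping, even though a fresh Trainer will restart them at 0.
resumed_epoch = ckpt.get("epoch")
resumed_global_step = ckpt.get("global_step")
```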
Here's a model doing linear warmup for 5 hours, but then the cosine annealing base_lr is too high, so it diverges. I wish I could have played with that base_lr rather than retraining from scratch.
Here's a model that warms down (from LR 10 to 3.2) for 10 hours and then does cosine annealing with base_lr 1.0 and some cosine cycle schedule.
Here is a model that does warmup, but then the cosine cycle appears too big.
Any update on this? When using the OneCycleLR scheduler it is not possible to resume training, since the number of steps is exceeded. It would be great to be able to restart the scheduler but keep the epoch and step info. As a workaround I am just loading the weights, following the approach in #16014 (comment).
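A sketch of that weights-only workaround (MyModel, the path, and the step counts are placeholders; the model's configure_optimizers is assumed to build a fresh OneCycleLR sized for the remaining steps):

```python
import pytorch_lightning as pl

# MyModel is a placeholder LightningModule defined elsewhere.
# Restore only the weights/hyperparameters; no trainer state is loaded.
model = MyModel.load_from_checkpoint("path/to/checkpoint.ckpt")

# configure_optimizers runs again, so a brand-new OneCycleLR is created with a
# total_steps budget for the remaining run instead of the exhausted old one.
trainer = pl.Trainer(max_steps=10_000)
trainer.fit(model)  # no ckpt_path passed, so epoch/global-step counters restart at zero
```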
Any updates? I have the same problem: I cannot change the optimizer/LR on a model when I resume training.
Any updates on this? Splitting up optimizer state / LR schedule / global step / weights is necessary for serious training setups where, e.g., the schedule is tuned on the fly.
Currently it is possible to resume either the full training state (epoch / global step / optimizer / scheduler state / weights) or only the weights.
I would like to be able to switch the optimizer at some point, i.e. skip restoring the optimizer/scheduler but still load the epoch/global step. At the moment I only see a way to do this with hacks. Is there another way? Could this be a feature, i.e. specifying in the Trainer init exactly what to restore and what not to restore?