Can't reload from checkpoint when using SWA #11665
Comments
Hi! Can I take this issue? |
hey @BttMA can you update the reproducible colab link? currently it points to the one in the repo, which doesn't have any of your updated code. |
Hello :) @rohitgr7 |
hey @BttMA! I tried updating your example with the snippet below, and it worked fine... so I guess I am unable to reproduce your issue.
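(The exact snippet attached to this comment was not preserved in the thread; a plausible stand-in, assuming it simply enabled the SWA callback on the example model, might look like this. The max_epochs and swa_lrs values are illustrative:)

```python
# Hypothetical reconstruction, not the original attachment: enable SWA by
# passing the StochasticWeightAveraging callback to the Trainer.
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import StochasticWeightAveraging

trainer = Trainer(
    max_epochs=50,
    callbacks=[StochasticWeightAveraging(swa_lrs=1e-2)],
)
```
|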
Sorry, me neither. I couldn't reproduce the bug for you using the BoringModel. Is there any other way to share it? |
share the notebook/script that is failing. You can attach it here too, in case someone else wants to look at it. |
it has personal/confidential data :/ I can't share it with everybody :o |
maybe you can mimic the data: for starters, return random tensors of the same shape as your original data, and use a small model (something like the sketch below).
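(A minimal illustration of that suggestion; the tensor shape and class count are placeholders for whatever the real data looks like:)

```python
# A fake dataset that mimics the real data with random tensors of the same
# shape, so the failing script can be shared without any confidential content.
import torch
from torch.utils.data import DataLoader, Dataset

class FakeData(Dataset):
    def __init__(self, length=64, shape=(3, 32, 32), num_classes=10):
        self.length, self.shape, self.num_classes = length, shape, num_classes

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        x = torch.randn(*self.shape)                       # stand-in input sample
        y = torch.randint(0, self.num_classes, ()).long()  # stand-in label
        return x, y

loader = DataLoader(FakeData(), batch_size=8)
```
|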
hi @rohitgr7 :) PS: maybe it has something to do with "UserWarning: SWA is currently only supported every epoch." or the "Swapping scheduler ..." message |
yes! that might be the case.. We need to save and load the states for this callback to enable proper resuming. |
What callback? Sorry, didn't get you. Maybe SWA does not support many epochs 🤔?? Even in the docs it is not very clear about the epoch and the SWA. We have to dig deep into this! 😜 |
I am talking about StochasticWeightAveraging. The warning isn't reliable and should be improved; I only got to know what it means by looking at the code.
Ideally it means the warning applies if you are configuring your scheduler with a step interval:

```python
return dict(
    lr_scheduler=dict(scheduler=scheduler, interval='step'),
    optimizer=optimizer,
)
```

Also check out the default parameters: by default it starts when epoch = 0.8 * max_epochs. |
now I see! For example, if I have 100 epochs then the SWA callback will be activated at the 80th epoch (since 80 = 0.8 * 100). In my case, the SWA callback is just skipped because of the 80th epoch. |
In your example, during the first run, it switched to SWALR at the 40th epoch and saved the checkpoint at the 50th epoch with the SWALR state_dict. But when you reloaded the checkpoint, the trainer loaded the schedulers with LambdaLR configured, so LambdaLR is trying to load the state_dict of SWALR, which is causing this error.
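(A minimal plain-PyTorch sketch of that mismatch, outside Lightning; the model, learning rates, and lambda are arbitrary:)

```python
# Reproduce the scheduler state_dict mismatch described above: a checkpoint
# written while SWALR was active, loaded back into a freshly built LambdaLR.
import torch
from torch.optim.lr_scheduler import LambdaLR
from torch.optim.swa_utils import SWALR

model = torch.nn.Linear(2, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# First run: by checkpoint time the SWA callback has swapped in SWALR,
# so the checkpoint holds SWALR's state.
swa_scheduler = SWALR(optimizer, swa_lr=0.05)
checkpoint = {"lr_scheduler": swa_scheduler.state_dict()}

# Resume: the trainer rebuilds the scheduler from configure_optimizers
# (a fresh LambdaLR) and feeds it the saved SWALR state.
scheduler = LambdaLR(optimizer, lr_lambda=lambda epoch: 0.95 ** epoch)
scheduler.load_state_dict(checkpoint["lr_scheduler"])  # fails, e.g. KeyError: 'lr_lambdas'
```
|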
OH!! sure, yes yes!!
Now it makes sense :) Thanks a lot! Can you suggest anything for me to fix this, please? |
I'm not sure if this will work. I am not super familiar with every detail of SWA, but I don't think that replacing the scheduler is all that's required to perform SWA; there's a lot more happening inside the callback. For the fix, I think we need to create states for this callback that can be stored in and reloaded from the checkpoint while resuming training. We need to investigate everything that's required to make this work.
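(A rough sketch of the general pattern, not the actual fix: persisting callback state through the checkpoint, assuming the Callback.state_dict/load_state_dict hooks of newer Lightning versions. The attribute names are made up, not SWA's real internals:)

```python
# Sketch: a callback whose state survives checkpointing, so that resuming
# can restore whatever bookkeeping the callback had accumulated.
from pytorch_lightning.callbacks import Callback

class StatefulSWA(Callback):
    def __init__(self):
        self._swa_started = False  # hypothetical: has the scheduler swap happened?
        self._n_averaged = 0       # hypothetical: how many weight snapshots were averaged

    def state_dict(self):
        # Whatever is returned here gets written into the checkpoint.
        return {"swa_started": self._swa_started, "n_averaged": self._n_averaged}

    def load_state_dict(self, state_dict):
        # Called with the saved dict when training resumes from the checkpoint.
        self._swa_started = state_dict["swa_started"]
        self._n_averaged = state_dict["n_averaged"]
```
|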
Actually, I was going to suggest that, but I don't know what held me back 😅 thanks a lot! |
This is correct. Saving and loading is not implemented.
This is done by the callback automatically. |
🐛 Bug
My model worked just fine until I tried some optimisation using SWA.
The problem is not even clear to understand:
To Reproduce
https://colab.research.google.com/github/PytorchLightning/pytorch-lightning/blob/master/pl_examples/bug_report/bug_report_model.ipynb
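(The link above points at the blank template, as noted in the comments; here is a hedged sketch of the failing flow described in this thread: train with SWA past swa_epoch_start, checkpoint, then resume. It assumes a Lightning version where Trainer.fit accepts ckpt_path; the model, data, and hyperparameters are placeholders.)

```python
# Sketch of the reproduction: the first run switches to SWALR at epoch
# 0.8 * 50 = 40 and checkpoints after that; resuming then fails while the
# freshly configured LambdaLR tries to load SWALR's saved state.
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset
from pytorch_lightning.callbacks import StochasticWeightAveraging

class TinyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        (x,) = batch
        return self.layer(x).sum()

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.parameters(), lr=0.1)
        scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda e: 0.95 ** e)
        return dict(optimizer=optimizer, lr_scheduler=dict(scheduler=scheduler))

loader = DataLoader(TensorDataset(torch.randn(64, 32)), batch_size=8)

# First run: checkpoint is written after SWALR has replaced LambdaLR.
trainer = pl.Trainer(max_epochs=50, callbacks=[StochasticWeightAveraging(swa_lrs=1e-2)])
trainer.fit(TinyModel(), loader)

# Resume from that checkpoint: errors while restoring the scheduler state.
trainer = pl.Trainer(max_epochs=60, callbacks=[StochasticWeightAveraging(swa_lrs=1e-2)])
trainer.fit(TinyModel(), loader, ckpt_path="path/to/epoch=49.ckpt")  # placeholder path
```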
Expected behavior
Training resumes from the checkpoint with SWA enabled.
Environment
cc @tchaton @rohitgr7 @akihironitta @carmocca