restoring optimizer states (with DeepSpeed plugin used) #242
Comments
Hi there! We'll be working on adding a utility to help save/restore checkpoints in the coming month, so it should hopefully be easier to do this when it's there :-)
Closed with #255 🎉
I noticed the current save_state function won't save the epoch/step count. Is there any workaround to save it? @muellerzr
@seanbenhur once #262 is solved, this will be saved indirectly through the scheduler. Otherwise, it's up to the user to remember what epoch it's on, make note of it, etc.
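As a rough sketch of that manual workaround (the `progress.json` file and the helper names below are just placeholders, not part of Accelerate's API):

```python
import json
import os

def save_training_state(accelerator, output_dir, epoch, step):
    # Let Accelerate save the model/optimizer/scheduler states.
    accelerator.save_state(output_dir)
    # save_state does not record progress, so note it down ourselves.
    if accelerator.is_main_process:
        with open(os.path.join(output_dir, "progress.json"), "w") as f:
            json.dump({"epoch": epoch, "step": step}, f)

def load_training_state(accelerator, output_dir):
    accelerator.load_state(output_dir)
    with open(os.path.join(output_dir, "progress.json")) as f:
        progress = json.load(f)
    return progress["epoch"], progress["step"]
```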
Got it, thanks!
Accelerate is a great library! Thanks for the amazing work!
I was able to save the optimizer/scheduler states using the Accelerate library, but when restoring them I get a CUDA out-of-memory error, so I suspect the optimizer states are not being saved properly. I can restore the states without error by setting
ckpt_states = torch.load(state_path, map_location='cpu')
but I'm not sure if that's correct. Could you provide some tips or suggestions? (I'm implementing a feature to fully restore training, but ran into this problem.) Thanks.
I suspect that saving optimizer states with DeepSpeed works differently. I saw how the HF Trainer handles it (this, this, and this), but I'm not sure how to adapt that code into mine.
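If I understand correctly, DeepSpeed has its own checkpoint API (`save_checkpoint` / `load_checkpoint`) that handles the per-rank partitioned optimizer states. A minimal sketch of that idea, assuming the object prepared by Accelerate exposes the DeepSpeed engine (which I haven't verified for the plugin):

```python
def deepspeed_checkpoint_roundtrip(model_engine, ckpt_dir, tag):
    # DeepSpeed's checkpoint API writes the model weights plus each
    # rank's partition of the optimizer state under ckpt_dir/tag.
    model_engine.save_checkpoint(ckpt_dir, tag=tag)

    # Every rank must call load_checkpoint so that each process reloads
    # its own optimizer-state partition.
    load_path, client_state = model_engine.load_checkpoint(ckpt_dir, tag=tag)
    return load_path, client_state
```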
My checkpoint saving function looks roughly like this:
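(Sketched here; the dict keys, file name, and use of `accelerator.unwrap_model` are just my own conventions, nothing required by Accelerate.)

```python
import os

def save_checkpoint(accelerator, model, optimizer, scheduler, output_dir, epoch, step):
    # Make sure all processes have finished their current step.
    accelerator.wait_for_everyone()
    # Unwrap so the plain model state dict is saved, not the wrapped one.
    unwrapped_model = accelerator.unwrap_model(model)
    state = {
        "model": unwrapped_model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict(),
        "epoch": epoch,
        "step": step,
    }
    # accelerator.save only writes from the main process.
    accelerator.save(state, os.path.join(output_dir, "states.pt"))
```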
My restore function looks roughly like this:
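(Again a sketch, assuming the same checkpoint layout as the save function above, with the `map_location='cpu'` workaround applied.)

```python
import torch

def restore_checkpoint(accelerator, model, optimizer, scheduler, state_path):
    # Load onto CPU first so the checkpoint doesn't allocate a second copy
    # of every tensor on the GPU (this is what avoided the OOM for me).
    ckpt_states = torch.load(state_path, map_location="cpu")
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.load_state_dict(ckpt_states["model"])
    optimizer.load_state_dict(ckpt_states["optimizer"])
    scheduler.load_state_dict(ckpt_states["scheduler"])
    return ckpt_states["epoch"], ckpt_states["step"]
```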