Fail to save last checkpoint #1613
Comments
Hi @Nero10578, can you verify for me which checkpoint step numbers it did save on, as well as the total number of steps in the training? Thanks!
It saved on these checkpoints: There should be 1578 steps in total, as can be seen here:
I've tried running it again to see if it's a fluke, and no, it's still failing to save at the end. I've tried with a super short test dataset and it saves fine otherwise. Is there something wrong with my set number of steps? It says this when resuming:
It looks like the reason it didn't save the last step is that it is saving every 395 steps, so the next step it would save at is 1580, but your last step is 1578. Let me see if there is a good way to work around that.
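The arithmetic behind that diagnosis can be sketched in a few lines. This is only an illustration using the numbers from this thread (a save interval of 395 and the 1578 total steps reported above); `checkpoint_steps` and its `always_save_last` flag are hypothetical names made up for this sketch, and it does not show how Axolotl schedules saves or what #1615 changed.

```python
def checkpoint_steps(total_steps: int, save_every: int,
                     always_save_last: bool = False) -> list[int]:
    """Steps at which a plain fixed-interval saver would write a checkpoint.

    Hypothetical helper for illustration only, not Axolotl's implementation.
    """
    steps = list(range(save_every, total_steps + 1, save_every))
    # Assumed workaround: also save at the final step if the interval
    # does not land on it exactly.
    if always_save_last and (not steps or steps[-1] != total_steps):
        steps.append(total_steps)
    return steps


total_steps = 1578  # total reported in this thread
save_every = 395    # save interval identified in this thread

print(checkpoint_steps(total_steps, save_every))
# [395, 790, 1185]  (the next multiple, 1580, is past the end of training)

print(checkpoint_steps(total_steps, save_every, always_save_last=True))
# [395, 790, 1185, 1578]
```

Whenever the total step count is not an exact multiple of the interval, the last interval boundary falls beyond the end of the run, so a saver that only fires on boundaries never writes the final step.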
@Nero10578 Fixed in #1615
Awesome fix! Thank you for all your work on this! So essentially this was just a problem with odd saving steps? Explains why it only happens sometimes.
Expected Behavior
Expected behavior is to save the last checkpoint like the previous intermediate checkpoints. It has now failed to save the final checkpoint multiple times. I am running this on Ubuntu WSL2 in Windows 11.
Current behaviour
At the end of a training run, it will not save the last checkpoint.
Nothing seems to go wrong, as shown.
Steps to reproduce
Just run any training run, either SFT or DPO. Both that I've tried failed to save the last checkpoint. Not sure if there is something wrong in my config yaml for the training or a bug in Axolotl.
I've tried both enabling and disabling wandb, since that caused this issue a few months ago as well. This time it made no difference.
Config yaml
Possible solution
No response
Which Operating Systems are you using?
Python Version
3.11.9
axolotl branch-commit
2147cf6
Acknowledgements