Unable to save checkpoints per epoch on a single GPU #646
Comments
Hi @aenaliph
I'll create patches for both issues ASAP, but to get you unblocked you can already add
Thanks for the pointers @mreso. I am indeed training with peft. Following the quickstart notebook, the peft config is passed to the model in Step 4: Prepare model for PEFT. I did try earlier with
@aenaliph I was able to reproduce the error in the notebook and prepared a fix for that as well. To get you unblocked, you can set train_config.run_validation back to False. The checkpoint saving during training should then be skipped, but the save in step 6 will still succeed.
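For anyone following along, the workaround above could look roughly like the sketch below. It is only an illustration: `TrainConfig` is a hypothetical stand-in for the notebook's `train_config` object, and `saves_during_training` is a made-up helper that mirrors the behaviour described in this thread (in-loop checkpoint saving is skipped when validation is off). Only the `run_validation` name comes from the thread itself.

```python
from dataclasses import dataclass


# Hypothetical stand-in for the notebook's train_config object.
# Only the run_validation field name is taken from this thread.
@dataclass
class TrainConfig:
    run_validation: bool = True


def saves_during_training(cfg: TrainConfig) -> bool:
    # Assumption based on the comment above: the in-loop checkpoint save
    # only happens when validation is enabled.
    return cfg.run_validation


train_config = TrainConfig()

# Workaround: switch validation back off so the failing in-loop save is
# skipped, then save the model manually afterwards (step 6 of the notebook).
train_config.run_validation = False

print(saves_during_training(train_config))
```

With validation off there is no per-epoch checkpoint, so this only unblocks the final save until the fix lands.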
@mreso Yes, I am able to save the model as in step 6, just not with validation on. I will await the fix. Thanks for looking into it.
Closing this as #650 got merged. Feel free to reopen if the issues persist with the fixes.
I am able to finetune the llama3.1 base model on a custom dataset with the alpaca format by following the quickstart peft finetuning notebook. However, I am unable to save the model after each epoch. I have installed llama-recipes from source.
Below is my train config and the error trace. Any pointers? Could it be related to this fix? #629
I am on torch 2.4 and using an NVIDIA GeForce RTX 4090
At the end of each epoch I would like the validation dataset to be run and the checkpoint to be saved.