fix issues to be compatible with latest peft #359
Conversation
Your pull request isn't working. It crashed when it tried to save a checkpoint; I was training on 8x RTX 3090.
@lksysML if you're training in 8bit, there's a separate bug with
Yeah, I ran into that issue yesterday and already fixed it. I think today's run has something to do with peft; I rolled back to an old version and it didn't crash. Also, debug_mode doesn't work: I set it to True and it continued with the whole training set instead of the 1024 examples it is supposed to use.
Hello, this PR works fine for me. For debug mode you have to specify
I did test it and it does work. Continuing from the ckpt of the above run via:
results in it being correctly loaded. In the above run, the eval loss at the end was 1.70. In the run here using the above command (https://wandb.ai/smangrul/huggingface/runs/18ux1bhz?workspace=user-smangrul), the eval loss starts at 1.65 and ends at 1.48.
Er... the logic of the code there is:
the For reference,
The loading of
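For readers trying to follow the loading logic being discussed, here is a minimal sketch of how LoRA adapter weights are typically re-loaded before resuming training. The function name, file names, and fallback order are assumptions for illustration, not necessarily the exact code in this repo or PR:

```python
import os

import torch
from peft import set_peft_model_state_dict


def maybe_resume_from_checkpoint(model, resume_from_checkpoint):
    """Load previously saved LoRA weights into `model` if a checkpoint exists."""
    # Newer peft versions save only the adapter weights (adapter_model.bin);
    # older checkpoints may instead contain a full pytorch_model.bin.
    adapter_file = os.path.join(resume_from_checkpoint, "adapter_model.bin")
    full_file = os.path.join(resume_from_checkpoint, "pytorch_model.bin")
    checkpoint_name = adapter_file if os.path.exists(adapter_file) else full_file

    if os.path.exists(checkpoint_name):
        adapter_weights = torch.load(checkpoint_name, map_location="cpu")
        # Copies the saved LoRA tensors into the peft-wrapped model in place.
        set_peft_model_state_dict(model, adapter_weights)
    else:
        print(f"No checkpoint found at {resume_from_checkpoint}, training from scratch")
    return model
```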
I don't think this is true. There was a very clear, marked difference in quality between the old, working version of resume_from_checkpoint and the current version of resuming from just the adapter. It wasn't just a coding mistake.
I know it works. My point was that this is only half of the required solution. I was able to get this solution working myself without the PR; the results are just bad. #154 (comment)
This gist (https://gist.github.com/pacman100/8e7a6eedabf34e1a88dd74a96c3b619f) should exhibit the behaviour that you are looking for. But it doesn't make much sense to me; could you provide a concrete example of how the existing code with the previous peft was being used, and how this PR fails to do that, with some concrete metrics?
Both of the above points are a weird way of continuing training.
I appreciate that. I'll play around with it and see how everything works out. The idea was basically to introduce the new data as a new "epoch", which helped include the relevant data without needing to fully retrain the adapter every night (which takes about 7 hours for me). That way I can delay spinning up a full adapter retrain for a longer period of time while still showing noticeable improvements day to day. It also fixed a major problem where training would sometimes fail in the middle of the night, and I'd need to use the old adapter until I could fix the problem. I would increase the epoch count from (let's say) 5 to 10, so the learning rate decay would still apply but would not reach 0.
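To make the scheduler point concrete, here is a small, self-contained illustration (the step counts, learning rate, and choice of a linear schedule are made-up assumptions) of why rebuilding the schedule for 10 epochs instead of 5 leaves a non-zero learning rate when resuming after 5 epochs:

```python
import torch
from transformers import get_linear_schedule_with_warmup

# Hypothetical numbers purely for illustration.
steps_per_epoch = 100
completed_steps = 5 * steps_per_epoch  # the first run already covered 5 epochs
base_lr = 3e-4

optimizer = torch.optim.AdamW([torch.nn.Parameter(torch.zeros(1))], lr=base_lr)

# Building the schedule for 10 epochs means the resumed run starts halfway
# through the decay; building it for only 5 would resume at (roughly) lr = 0.
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=10 * steps_per_epoch
)
for _ in range(completed_steps):
    scheduler.step()

print(optimizer.param_groups[0]["lr"])  # ~1.7e-4 rather than ~0
```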
I don't know if this is the RIGHT way, but this simple modification at L275 produces a

```diff
- model.save_pretrained(output_dir)
+ model.save_pretrained(output_dir, state_dict=old_state_dict())
```
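For context on where `old_state_dict` comes from: the fine-tune script (at least in the versions I have looked at) monkey-patches `model.state_dict` so that saving only serialises the LoRA weights. A rough, paraphrased sketch of that surrounding code, not necessarily identical to what is in this repo:

```python
from peft import get_peft_model_state_dict


def patch_state_dict_for_lora_saving(model):
    """Make model.state_dict() return only the LoRA weights when saving."""
    old_state_dict = model.state_dict
    model.state_dict = (
        lambda self, *_, **__: get_peft_model_state_dict(self, old_state_dict())
    ).__get__(model, type(model))
    # Returning the original bound method lets a caller pass the *full* state
    # dict explicitly, e.g. model.save_pretrained(out_dir, state_dict=old_state_dict()),
    # which is what the one-line change suggested above relies on.
    return old_state_dict
```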
What does this PR do?
Adds a `debug_mode` arg to quickly test out the fine-tune script on a tiny subset of the dataset.
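A minimal sketch of what such a flag can look like. The JSON data format and the 1024-example cutoff mirror what is discussed above; the function name and everything else here are assumptions for illustration:

```python
from datasets import load_dataset


def load_train_data(data_path, debug_mode=False, debug_size=1024):
    """Load the instruction dataset, optionally keeping only a small slice."""
    # Assumes the training data is a local JSON file of instruction examples.
    data = load_dataset("json", data_files=data_path)
    if debug_mode:
        # Keep a fixed, small subset so a full pass through the script
        # (training, eval, checkpoint saving) can be smoke-tested quickly.
        n = min(debug_size, len(data["train"]))
        data["train"] = data["train"].select(range(n))
    return data
```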