Save / load from checkpoint TP #269
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
Thanks for addressing the specified issue. Wouldn't it be better to have a test verifying that the issue is solved?
I have a few comments, although I am not able to grasp the rationale behind all of the changes, since they come on top of code I am not familiar with.
optimum/neuron/distributed/base.py (outdated):
is_zero_1_optimizer = optimizer.__class__.__name__ == "NeuronAcceleratedOptimizer" and isinstance(optimizer.optimizer, NeuronZero1Optimizer)
is_zero_1_optimizer = is_zero_1_optimizer or isinstance(optimizer, NeuronZero1Optimizer)
Isn't the check in the first line equivalent to isinstance(optimizer, NeuronAcceleratedOptimizer)? Anyway, since in the second line you also accept just NeuronZero1Optimizer, maybe it could be simplified to:
is_zero_1_optimizer = isinstance(optimizer, NeuronZero1Optimizer)
No, because you can have:
- A NeuronZero1Optimizer, which we do not support yet.
- A NeuronAcceleratedOptimizer, which is a wrapper around the optimizer that is always applied when using accelerate (I think we will always fall in that case, but I wanted to cover all the cases here). This wrapper contains an optimizer attribute, and we check if it's a NeuronZero1Optimizer.
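For illustration, here is a minimal, self-contained sketch of the two cases being distinguished. Only the class names come from this thread; the stand-in structure and the is_zero_1 helper are hypothetical, not the repository's actual code:

```python
# Sketch of the two optimizer shapes discussed above. The simplified
# class bodies below are assumptions made for illustration only.

class NeuronZero1Optimizer:
    """Stand-in for the ZeRO-1 optimizer."""


class NeuronAcceleratedOptimizer:
    """Stand-in for the accelerate wrapper; it holds the real optimizer."""

    def __init__(self, optimizer):
        self.optimizer = optimizer


def is_zero_1(optimizer) -> bool:
    # The wrapper applied by accelerate: a bare
    # isinstance(optimizer, NeuronZero1Optimizer) check would miss this case,
    # so the wrapped `optimizer` attribute is inspected instead.
    if isinstance(optimizer, NeuronAcceleratedOptimizer):
        return isinstance(optimizer.optimizer, NeuronZero1Optimizer)
    # A bare NeuronZero1Optimizer passed directly.
    return isinstance(optimizer, NeuronZero1Optimizer)


assert is_zero_1(NeuronZero1Optimizer())
assert is_zero_1(NeuronAcceleratedOptimizer(NeuronZero1Optimizer()))
assert not is_zero_1(NeuronAcceleratedOptimizer(object()))
```

This also shows why the reviewer's one-line simplification would not cover both cases: it would return False for a ZeRO-1 optimizer hidden inside the wrapper.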
Thanks for the update and for providing a test. However, I think this test is not picked up by the CI: is this intended?
CI is failing: FAILED tests/distributed/test_training.py::test_tp_save_and_resume_from_checkpoint - RuntimeError: You need to log in the Hugging Face Hub otherwise you will not be able to push anything.
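For context, the error means the test tries to push to the Hugging Face Hub without credentials. A minimal sketch of logging in programmatically before such a test, assuming the CI job exposes a token through an HF_TOKEN environment variable (the variable name and CI setup are assumptions, not this repository's actual configuration):

```python
# Sketch: authenticate with the Hugging Face Hub in CI so tests that push
# can run. Assumes a token is provided via the HF_TOKEN environment variable.
import os

from huggingface_hub import login

token = os.environ.get("HF_TOKEN")
if token:
    login(token=token)
```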
Thank you for fixing this!
As per title.
Fixes: #249