Fix FP8 checkpoint resumption with onnx export flag #2907
Conversation
Branch updated from 901e5cb to 1a0685f.
@j316chuck : Thinking a bit more about it, I think we should apply it when we are creating a checkpoint instead of at the time we are requesting precision.
@dskhudia I think it's a bit cleaner to cast when requesting precision, since the checkpoint save/load code is scattered throughout checkpoint.py and I would have to add this context manager in multiple places (10+ lines vs 1 line). Wdyt?
@j316chuck : Sounds good. I looked at the definition of this context manager and it doesn't do much besides setting a global variable. I was worried about this slowing down training, but it probably doesn't. Feel free to merge.
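For context on the approach settled on in this exchange, here is a minimal sketch of applying the flag once, at the point where precision is requested, rather than around every checkpoint save/load path. It assumes Transformer Engine's fp8_autocast and onnx_export context managers and an FP8-capable GPU; the helper name below is illustrative, not Composer's actual precision code.

```python
from contextlib import contextmanager

import transformer_engine.pytorch as te


@contextmanager
def _fp8_precision_context():
    """Illustrative FP8 precision context (not Composer's actual code)."""
    # fp8_autocast runs TE module forwards in FP8; onnx_export only flips a
    # global flag so the extra FP8 state is exposed as tensors that sharded
    # checkpointing can serialize, rather than as pickled bytes.
    with te.fp8_autocast(enabled=True):
        with te.onnx_export(True):
            yield


# Usage sketch: wrap the forward/loss computation where precision is
# requested, so any state_dict taken under this context sees tensor state.
# with _fp8_precision_context():
#     out = model(batch)
```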
What does this PR do?
Before this PR, checkpoint resumption with FP8 did not work because checkpoints serialized the extra FP8 buffers as bytes, which is incompatible with torch's
load_sharded_optimizers
After this PR, the extra FP8 buffers are serialized as tensors by enabling ONNX export mode, so FP8 checkpoints can be resumed.
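For illustration, here is a rough way to see the bytes-versus-tensor difference described above. This is a sketch, not code from the PR; it assumes Transformer Engine is installed with a CUDA device available, and the exact type of the non-ONNX extra state depends on the TE version.

```python
import transformer_engine.pytorch as te

layer = te.Linear(16, 16)

plain_state = layer.state_dict()
with te.onnx_export(True):
    onnx_state = layer.state_dict()

for name in plain_state:
    if name.endswith('_extra_state'):
        # Without the flag, the FP8 metadata is a serialized blob (e.g. a
        # BytesIO / byte buffer), which sharded checkpoint loading such as
        # load_sharded_optimizers cannot handle.
        print(name, type(plain_state[name]))
        # Under onnx_export(True), the same entry is exported as a tensor.
        print(name, type(onnx_state[name]))
```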
Tests
Torch 2.1
- mpt-125m-sharded-regression-fp8-ignore-td0frt 🔴
- mpt-125m-sharded-regression-fp8-ignore-6DD62n ✅

Torch 2.3
- mpt-125m-sharded-regression-fp8-ignore-tAGCla ✅
- mpt-125m-sharded-regression-fp8-ignore-8bZirC ✅