Fix FP8 checkpoint resumption with onnx export flag #2907
Conversation
Branch updated from 901e5cb to 1a0685f.
@j316chuck : Thinking a bit more about it, I think we should apply it when we are creating a checkpoint instead of at the time we are requesting precision.
@dskhudia I think it's a bit cleaner to cast when requesting precision, since the checkpoint save/load code is scattered throughout checkpoint.py and I would have to add this context manager in multiple places (10+ lines vs 1 line). Wdyt?
@j316chuck : Sounds good. I looked at the definition of this context manager and it doesn't do much besides setting a global variable. I was worried about this slowing down training, but it probably doesn't. Feel free to merge.
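For context on the approach settled on in this exchange, here is a minimal sketch of applying the flag once, at the point where precision is requested, rather than around every checkpoint save/load path. It assumes Transformer Engine's fp8_autocast and onnx_export context managers and an FP8-capable GPU; the helper name below is illustrative, not Composer's actual precision code.

```python
from contextlib import contextmanager

import transformer_engine.pytorch as te


@contextmanager
def _fp8_precision_context():
    """Illustrative FP8 precision context (not Composer's actual code)."""
    # fp8_autocast runs TE module forwards in FP8; onnx_export only flips a
    # global flag so the extra FP8 state is exposed as tensors that sharded
    # checkpointing can serialize, rather than as pickled bytes.
    with te.fp8_autocast(enabled=True):
        with te.onnx_export(True):
            yield


# Usage sketch: wrap the forward/loss computation where precision is
# requested, so any state_dict taken under this context sees tensor state.
# with _fp8_precision_context():
#     out = model(batch)
```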
What does this PR do?
Before this PR, checkpoint resumption with FP8 did not work because checkpoints serialized the extra FP8 buffers as bytes, which is incompatible with torch's
load_sharded_optimizers
After this PR, the extra FP8 buffers are serialized as tensors by enabling ONNX export mode, so FP8 checkpoints can be resumed.
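For illustration, here is a rough way to see the bytes-versus-tensor difference described above. This is a sketch, not code from the PR; it assumes Transformer Engine is installed with a CUDA device available, and the exact type of the non-ONNX extra state depends on the TE version.

```python
import transformer_engine.pytorch as te

layer = te.Linear(16, 16)

plain_state = layer.state_dict()
with te.onnx_export(True):
    onnx_state = layer.state_dict()

for name in plain_state:
    if name.endswith('_extra_state'):
        # Without the flag, the FP8 metadata is a serialized blob (e.g. a
        # BytesIO / byte buffer), which sharded checkpoint loading such as
        # load_sharded_optimizers cannot handle.
        print(name, type(plain_state[name]))
        # Under onnx_export(True), the same entry is exported as a tensor.
        print(name, type(onnx_state[name]))
```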
Tests
Torch 2.1
- mpt-125m-sharded-regression-fp8-ignore-td0frt 🔴
- mpt-125m-sharded-regression-fp8-ignore-6DD62n ✅

Torch 2.3
- mpt-125m-sharded-regression-fp8-ignore-tAGCla ✅
- mpt-125m-sharded-regression-fp8-ignore-8bZirC ✅