Fix training artifacts for 2GB+ models and MSELoss
#22414
Training `Blocks` are built from a shared global base model, and `onnx.save` destroys any external data held by the model it saves. As a result, any loss block (e.g. `MSELoss`) that builds more than one sub-`Block` fails validation due to missing external data. Saving a deep copy of the global model circumvents this. Fixes microsoft#22411
@microsoft-github-policy-service agree company="RWS"
/azp run Big Models, Linux Android Emulator QNN CI Pipeline, Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline
/azp run Linux OpenVINO CI Pipeline, Linux QNN CI Pipeline, MacOS CI Pipeline, ONNX Runtime Web CI Pipeline, Windows ARM64 QNN CI Pipeline,
/azp run Windows CPU CI Pipeline, Windows GPU CUDA CI Pipeline, Windows GPU DML CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows GPU TensorRT CI Pipeline, Windows x64 QNN CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline
Azure Pipelines successfully started running 6 pipeline(s).
Azure Pipelines successfully started running 5 pipeline(s).
Azure Pipelines successfully started running 9 pipeline(s).
+1
Tks @snnn and @baijumeswani
I think this will be included in the upcoming 1.20 release.
tks @baijumeswani
Description
`generate_artifacts` fails when creating training artifacts for a model using external data and `MSELoss`.
The use of a global base model when creating new training `Blocks` and `onnx.save` destroying any external data means any loss block (e.g. `MSELoss`) that builds more than one sub-`Block` will fail validation due to missing external data and raise an exception.
Fix
Saving using a deep copy of the global model circumvents this at the cost of holding 2x the model size in memory.
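A minimal sketch of why the copy helps, using a toy dictionary in place of a real `ModelProto`. The names `destructive_save` and `save_copy` are hypothetical and are not part of `onnx` or onnxruntime; `destructive_save` only mimics the in-place stripping of tensor payloads described above.

```python
import copy

# Toy stand-in for a model whose large tensors get moved to external storage.
# This is NOT the onnx API; it only imitates the destructive behavior.
def destructive_save(model: dict, store: dict) -> None:
    """Hypothetical save: moves each tensor payload into `store` and
    leaves only a reference behind, mutating `model` in place."""
    for name, tensor in model["tensors"].items():
        store[name] = tensor["data"]
        tensor["data"] = None
        tensor["location"] = "external"

def save_copy(model: dict, store: dict) -> None:
    """Save a deep copy so the caller's model keeps its payloads.
    The copy temporarily doubles the memory held by the model."""
    destructive_save(copy.deepcopy(model), store)

base_model = {"tensors": {"w": {"data": [1.0, 2.0, 3.0], "location": "inline"}}}

# First sub-block saves via a copy: the shared base model is untouched.
store_a = {}
save_copy(base_model, store_a)
assert base_model["tensors"]["w"]["data"] == [1.0, 2.0, 3.0]

# A second sub-block can therefore still see the full payload.
store_b = {}
save_copy(base_model, store_b)
assert store_b["w"] == [1.0, 2.0, 3.0]
```

Calling `destructive_save(base_model, ...)` directly would leave the shared model without its payloads for the second sub-block, which is the failure mode this PR addresses.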
Other Implementations
An alternative approach using less memory would load the on-disk external data before it is deleted in `Block::__del__` and insert the appropriate fields into the global `ModelProto`.
This seems a bit brittle due to the coupling to the specific way external data is destructively accessed in `onnx.save`. If there exists a non-modifying save in the `onnx` repo it would be ideal to use that in `Block::__call__` instead.
Motivation and Context
Fixes `generate_artifacts` bug reported in #22411
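For comparison, the lower-memory alternative mentioned under Other Implementations could look roughly like the toy sketch below. It is not what this PR implements, and `destructive_save` and `restore_external` are hypothetical helpers operating on a plain dictionary rather than a `ModelProto`. It also shows the coupling concern: the restore step has to know exactly how the save step rewrote the model.

```python
# Toy stand-in for a destructive save (same assumption as in the Fix sketch).
def destructive_save(model: dict, store: dict) -> None:
    """Hypothetical save: moves payloads into `store`, mutating `model`."""
    for name, tensor in model["tensors"].items():
        store[name] = tensor["data"]
        tensor["data"] = None
        tensor["location"] = "external"

def restore_external(model: dict, store: dict) -> None:
    """Put payloads back into the shared model before `store` is discarded.
    Relies on knowing how destructive_save marked the moved tensors."""
    for name, tensor in model["tensors"].items():
        if tensor["location"] == "external":
            tensor["data"] = store[name]
            tensor["location"] = "inline"

base_model = {"tensors": {"w": {"data": [1.0, 2.0], "location": "inline"}}}

store = {}
destructive_save(base_model, store)   # shared model is now stripped
restore_external(base_model, store)   # reload before the store is deleted
store.clear()                         # stands in for deleting the on-disk data

assert base_model["tensors"]["w"]["data"] == [1.0, 2.0]
```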