
[Training] >2GB Model Offline Artifacts fail with MSE Loss #22411

Closed
jkbeavers opened this issue Oct 11, 2024 · 1 comment
Labels
training issues related to ONNX Runtime training; typically submitted using template

jkbeavers commented Oct 11, 2024

Describe the issue

Error Description

ONNX Runtime fails to create training artifacts when using a model with external data together with any loss block that builds more than one sub-Block (custom, MSE, BCEWithLogits, L1).
The failure surfaces as a confusing message claiming the first tensor stored in the proto cannot be found in temp.onnx.data, e.g.:
Data of TensorProto ( tensor name: ...) should be stored in temp.onnx.data, but it doesn't exist or is not accessible.

Bug Description

The recent patch to support >2GB models was never tested with any loss block other than CrossEntropyLoss, which happens not to hit this issue.

The bug stems from three interacting parts:

  • onnxruntime's base training Block uses a global copy of the original ModelProto when creating new nodes for the training graph; this proto is shared between subsequently created Blocks.
  • In __call__, Block saves the model proto after building the new node and adding it to the graph; the saved file is then used to check the model's validity.
  • onnx.save destructively modifies the external-data information in a ModelProto when it calls set_external_data.

When a second Block is created as part of the loss block, the tensors in the global ModelProto no longer hold external data and no external data file is created at temp.onnx.data, so the validity check finds no tensor data. For MSELoss this happens during the Sub block's validity check (if the "target" InputLike block doesn't already exist).
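The interaction above can be sketched without onnx at all. The snippet below is a hypothetical, dependency-free model of the pattern (ToyModel and save_with_external_data are stand-ins, not real onnx or onnxruntime APIs): a destructive "save" strips inline tensor data from a proto that is shared between blocks, so the second save finds nothing left, while saving a deep copy leaves the shared proto intact.

```python
import copy

# Hypothetical stand-in for onnx.ModelProto: tensors hold inline data
# until a "save with external data" moves it out of the proto.
class ToyModel:
    def __init__(self):
        self.tensor_data = {"fc1.weight": b"\x01\x02"}

def save_with_external_data(model):
    # Destructive, like onnx.save calling set_external_data: raw data
    # is stripped from the proto as it is written out.
    model.tensor_data = {name: None for name in model.tensor_data}

# Buggy pattern: every Block validates by saving the *shared* proto.
shared = ToyModel()
save_with_external_data(shared)        # first Block's validity check
# Second Block: the shared proto has already lost its tensor data.
print(shared.tensor_data["fc1.weight"] is not None)   # False

# Fixed pattern: save a deep copy, leaving the shared proto intact.
shared = ToyModel()
save_with_external_data(copy.deepcopy(shared))
print(shared.tensor_data["fc1.weight"] is not None)   # True
```

The second print reflects the fix adopted below: the destructive save only ever sees a copy.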

To reproduce

From a local build directory:

  1. Edit test_generate_artifacts_external_data_separate_files in orttraining_test_ort_apis_onnxblock.py by changing CrossEntropyLoss to MSELoss
  2. Run pytest orttraining_test_ort_apis_onnxblock.py -k test_generate_artifacts_external_data_separate_files

See error:
onnx.onnx_cpp2py_export.checker.ValidationError: Data of TensorProto ( tensor name: fc1.weight) should be stored in onnxruntime/build/Linux/RelWithDebInfo/temp.onnx.data, but it doesn't exist or is not accessible.

Urgency

This blocks the generation of ORT training artifacts for models >2GB with many custom loss functions, as well as with all of the built-in losses besides CrossEntropyLoss.

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.19.2

PyTorch Version

2.4.1

Execution Provider

Default CPU

Execution Provider Library Version

No response

@jkbeavers jkbeavers added the training issues related to ONNX Runtime training; typically submitted using template label Oct 11, 2024
jkbeavers pushed a commit to jkbeavers/onnxruntime that referenced this issue Oct 11, 2024
The use of a global base model when creating new training `Blocks`,
combined with `onnx.save` destroying any external data, meant that any
loss block (e.g. `MSELoss`) that builds more than one sub-`Block` would
fail validation due to missing external data.

Saving using a deep copy of the global model circumvents this.

Fixes microsoft#22411
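A minimal sketch of that fix (the names below are hypothetical, not the actual onnxruntime code): deep-copy the shared proto before handing it to the destructive save, so only the copy loses its inline data.

```python
import copy

def save_for_validation(shared_model, destructive_save):
    # Hypothetical helper mirroring the fix: the destructive
    # external-data save only ever sees a deep copy, never the
    # shared base model.
    model_copy = copy.deepcopy(shared_model)
    destructive_save(model_copy)
    return model_copy

# Toy demonstration with a dict standing in for a ModelProto.
shared = {"fc1.weight": b"\x01\x02"}

def strip_inline_data(model):
    for name in model:
        model[name] = None   # mimics set_external_data's side effect

saved = save_for_validation(shared, strip_inline_data)
print(shared["fc1.weight"])   # b'\x01\x02' -- shared proto untouched
print(saved["fc1.weight"])    # None -- only the copy was modified
```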
Fixed in a5e85a
