
[Training] >2GB Model Offline Artifacts fail with MSE Loss #22411

Closed
jkbeavers opened this issue Oct 11, 2024 · 1 comment
Labels
training issues related to ONNX Runtime training; typically submitted using template

jkbeavers commented Oct 11, 2024

Describe the issue

Error Description

ONNX Runtime fails to create training artifacts when using a model with external data together with any loss block that builds more than one sub-Block (custom, MSE, BCEWithLogits, L1).
The failure surfaces as a confusing message claiming the first tensor stored in the proto cannot be found in temp.onnx.data, e.g.:
Data of TensorProto ( tensor name: ...) should be stored in temp.onnx.data, but it doesn't exist or is not accessible.

Bug Description

The recent patch to support >2GB models was never tested with any loss block other than CrossEntropyLoss, which happens not to hit this issue.

The bug stems from three interacting parts:

  • onnxruntime's base training Block uses a global copy of the original ModelProto when creating new nodes for the training graph; this proto is shared between subsequently created Blocks.
  • In __call__, Block saves the model proto after building the new node and adding it to the graph; the saved file is then used to check the model's validity.
  • onnx.save destructively modifies the external-data information in a ModelProto when it calls set_external_data.

When a second Block is created as part of the loss block, the tensors in the global ModelProto no longer hold external data and no external data file is created at temp.onnx.data, so the validity check finds no tensor data. For MSELoss this happens during the Sub block's validity check (if the "target" InputLike block doesn't already exist).
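The interaction above can be sketched without onnx at all. The snippet below is a hypothetical, dependency-free model of the pattern (ToyModel and save_with_external_data are stand-ins, not real onnx or onnxruntime APIs): a destructive "save" strips inline tensor data from a proto that is shared between blocks, so the second save finds nothing left, while saving a deep copy leaves the shared proto intact.

```python
import copy

# Hypothetical stand-in for onnx.ModelProto: tensors hold inline data
# until a "save with external data" moves it out of the proto.
class ToyModel:
    def __init__(self):
        self.tensor_data = {"fc1.weight": b"\x01\x02"}

def save_with_external_data(model):
    # Destructive, like onnx.save calling set_external_data: raw data
    # is stripped from the proto as it is written out.
    model.tensor_data = {name: None for name in model.tensor_data}

# Buggy pattern: every Block validates by saving the *shared* proto.
shared = ToyModel()
save_with_external_data(shared)        # first Block's validity check
# Second Block: the shared proto has already lost its tensor data.
print(shared.tensor_data["fc1.weight"] is not None)   # False

# Fixed pattern: save a deep copy, leaving the shared proto intact.
shared = ToyModel()
save_with_external_data(copy.deepcopy(shared))
print(shared.tensor_data["fc1.weight"] is not None)   # True
```

The second print reflects the fix adopted below: the destructive save only ever sees a copy.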

To reproduce

From a local build directory:

  1. Edit test_generate_artifacts_external_data_separate_files in orttraining_test_ort_apis_onnxblock.py by changing CrossEntropyLoss to MSELoss
  2. Run pytest orttraining_test_ort_apis_onnxblock.py -k test_generate_artifacts_external_data_separate_files

See error:
onnx.onnx_cpp2py_export.checker.ValidationError: Data of TensorProto ( tensor name: fc1.weight) should be stored in onnxruntime/build/Linux/RelWithDebInfo/temp.onnx.data, but it doesn't exist or is not accessible.

Urgency

This blocks the generation of ORT training artifacts for models >2GB with many custom loss functions, as well as with all of the built-in losses besides CrossEntropyLoss.

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.19.2

PyTorch Version

2.4.1

Execution Provider

Default CPU

Execution Provider Library Version

No response

@jkbeavers jkbeavers added the training issues related to ONNX Runtime training; typically submitted using template label Oct 11, 2024
jkbeavers pushed a commit to jkbeavers/onnxruntime that referenced this issue Oct 11, 2024
The use of a global base model when creating new training `Blocks`,
combined with `onnx.save` destroying any external data, meant that any
loss block (e.g. `MSELoss`) that builds more than one sub-`Block` would
fail validation due to missing external data.

Saving using a deep copy of the global model circumvents this.

Fixes microsoft#22411
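A minimal sketch of that fix (the names below are hypothetical, not the actual onnxruntime code): deep-copy the shared proto before handing it to the destructive save, so only the copy loses its inline data.

```python
import copy

def save_for_validation(shared_model, destructive_save):
    # Hypothetical helper mirroring the fix: the destructive
    # external-data save only ever sees a deep copy, never the
    # shared base model.
    model_copy = copy.deepcopy(shared_model)
    destructive_save(model_copy)
    return model_copy

# Toy demonstration with a dict standing in for a ModelProto.
shared = {"fc1.weight": b"\x01\x02"}

def strip_inline_data(model):
    for name in model:
        model[name] = None   # mimics set_external_data's side effect

saved = save_for_validation(shared, strip_inline_data)
print(shared["fc1.weight"])   # b'\x01\x02' -- shared proto untouched
print(saved["fc1.weight"])    # None -- only the copy was modified
```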
Fixed in a5e85a
