Skip to content

Latest commit

 

History

History
127 lines (107 loc) · 6.3 KB

DebuggingNumericalError.md

File metadata and controls

127 lines (107 loc) · 6.3 KB

Debugging Numerical Error

Use utils/RunONNXModel.py python script to debug numerical errors, when onnx-mlir-compiled inference executable produces numerical results that are inconsistent with those produced by the training framework. This python script will run the model through onnx-mlir and a reference backend, and compare the intermediate results produced by these two backends layer by layer.

Prerequisite

  • Set ONNX_MLIR_HOME environment variable to be the path to the HOME directory for onnx-mlir. The HOME directory for onnx-mlir refers to the parent folder containing the bin, lib, etc sub-folders in which ONNX-MLIR executables and libraries can be found.

Reference backend

Outputs by onnx-mlir can be verified by using a reference ONNX backend or reference inputs and outputs in protobuf.

  • To verify using a reference backend, install onnxruntime by running pip install onnxruntime. To use a different testing backend, simply replace code importing onnxruntime to some other ONNX-compliant backend.
  • To verify using reference outputs, use --verify=ref --ref_folder=ref_folder where ref_folder is the path to a folder containing protobuf files for inputs and outputs. This guideline is a how-to for creating protobuf files from numpy arrays.

Usage

utils/RunONNXModel.py supports the following command-line options:

$ python ../utils/RunONNXModel.py  --help
usage: RunONNXModel.py [-h]
                       [--print_input]
                       [--print_output]
                       [--save_onnx PATH]
                       [--save_so PATH | --load_so PATH]
                       [--save_data PATH]
                       [--data_folder DATA_FOLDER | --shape_info SHAPE_INFO]
                       [--compile_args COMPILE_ARGS]
                       [--verify {onnxruntime,ref}]
                       [--verify_all_ops]
                       [--compile_using_input_shape]
                       [--rtol RTOL]
                       [--atol ATOL]
                       model_path

positional arguments:
  model_path            Path to the ONNX model

optional arguments:
  -h, --help            show this help message and exit
  --print_input         Print out inputs
  --print_output        Print out inference outputs produced by onnx-mlir
  --save_onnx PATH      File path to save the onnx model
  --save_so PATH        File path to save the generated shared library of the
                        model
  --load_so PATH        File path to load a generated shared library for
                        inference, and the ONNX model will not be re-compiled
  --save_data PATH      Path to a folder to save the inputs and outputs in
                        protobuf
  --data_folder DATA_FOLDER
                        Path to a folder containing inputs and outputs stored
                        in protobuf. If --verify=ref, inputs and outputs are
                        reference data for verification
  --shape_info SHAPE_INFO
                        Shape for each dynamic input of the model, e.g.
                        0:1x10x20,1:7x5x3. Used to generate random inputs for
                        the model if --data_folder is not set
  --compile_args COMPILE_ARGS
                        Arguments passed directly to onnx-mlir command. See
                        bin/onnx-mlir --help
  --verify {onnxruntime,ref}
                        Verify the output by using onnxruntime or reference
                        inputs/outputs. By default, no verification
  --verify_all_ops      Verify all operation outputs when using onnxruntime.
  --compile_using_input_shape
                        Compile the model by using the shape info getting from
                        the inputs in data folder. Must set --data_folder
  --rtol RTOL           Relative tolerance for verification
  --atol ATOL           Absolute tolerance for verification

Debugging the Code Generated for an Operator.

If you know, or suspect, that a particular ONNX MLIR operator produces an incorrect result, and want to narrow down the problem, we provide a couple of useful Krnl operators that allow printing (at runtime) the value of a tensor, or a value that has a primitive data type.

To print out the value of a tensor at a particular program point, inject the following code (where X is the tensor to be printed):

create.krnl.printTensor("Tensor X: ", X);

Note: currently the content of the tensor is printed only when the tensor rank is less than four.

To print a message followed by one value, inject the following code (where val is the value to be printed and valType is its type):

create.krnl.printf("inputElem: ", val, valType);

Finding memory errors

If you know, or suspect, that an onnx-mlir-compiled inference executable suffers from memory allocation related issues, the valgrind framework or mtrace memory tool can be used to facilitate debugging. These tools trace memory allocation/free-related APIs, and can detect memory issues, such as memory leaks.

However if the problems relating to memory access, especially buffer overrun problems, are notoriously difficult to debug because run-time errors occur outside of the point containing the problem. The "Electric Fence library" can be used for debugging these problems. It helps you detect two common programming problems: software that overruns the boundaries of a malloc() memory allocation, and software that touches a memory allocation that has been released by free(). Unlike other memory debuggers, Electric Fence will detect read accesses as well as writes, and it will pinpoint the exact instruction that causes an error.

Since the Electric Fence library is not officially supported by RedHat, you need to download, build and install the source code by yourself on yours. After installing it, link this library by using the "-lefence" option when generating inference executables. Then simply execute it, which will cause a runtime error and stop at the place causing memory access problems. You can identify the place with a debugger or debugging print functions described in the previous section.