[MobileFormer] Converted model outputs values mismatch with original ones. #105

Closed
kevinz8866 opened this issue Jan 10, 2023 · 11 comments
Labels: OP:Transpose, Parameter replacement, Training, Transformer

Comments

@kevinz8866

kevinz8866 commented Jan 10, 2023

Issue Type

Others

onnx2tf version number

1.4.2

onnx version number

1.12.0

tensorflow version number

2.10.1

Download URL for ONNX

9e: https://drive.google.com/file/d/1vGzO9MZGX-yGz6ATm4yHVJMASZACuy2t/view?usp=share_link
9t: https://drive.google.com/file/d/1Be-a7Pmo6auyAXHChAhC1qtzkbr5EytJ/view?usp=share_link
12e: https://drive.google.com/file/d/1NR-0Fm5q_ludb5MTWgDPzRzqDFKOsCkw/view?usp=share_link
12t: https://drive.google.com/file/d/1ReuXt4gpbq6O7mpUGudaBKjb6wI43ka6/view?usp=share_link

Parameter Replacement JSON

9e: https://drive.google.com/file/d/1av1dL5sWfjlghf2-jtmws-CR-Tcv_WIP/view?usp=share_link
9t: https://drive.google.com/file/d/18qB_VOW4Xp9PvCN4G-eRRnWhh9o1fF3F/view?usp=share_link
12e: https://drive.google.com/file/d/1T0nTkb5Szf9XjaNVVBTzzu5C9dgOURLd/view?usp=share_link
12t: https://drive.google.com/file/d/1JhH8K64PDJong4qkZbzyyJDKRExE7wlJ/view?usp=share_link

Description

Hi Master PINTO,

Following up on issue #103, the output values of the converted model differ from those of the original model. I exported the PyTorch model in both eval and training mode, with both opset 9 and opset 12. In the names of the models and parameter replacement files, the number is the opset version, e stands for eval, and t stands for training. The mismatch occurs in all four cases. Please see this notebook https://drive.google.com/file/d/1mErgTLFiYGTgUnMRDcNp3HkOkSWa96aL/view?usp=share_link to reproduce the issue.

Also, since you map every ONNX operation to a basic TF operation, is there no way to restore those weights as trainable parameters? Thank you so much for continuing to dig into this. If you need the original PyTorch model or anything else I can assist with, let me know!

@PINTO0309 PINTO0309 added the Bug bug label Jan 10, 2023
@kevinz8866
Author

By the way, these models also fail to save as a Keras model when I run tf.keras.models.save_model, in addition to failing to save as H5. It is the same error you found in #103.

@kevinz8866
Author

If you want to use MobileFormer's PyTorch repo to debug, I have a fork that fixes all the import issues in the official repo. https://github.com/kevinz8866/MobileFormer

@PINTO0309
Owner

PINTO0309 commented Jan 10, 2023

Although experimental, I am adding a validation function to investigate which operations of the model transformation produce errors in the output. I would eventually like to modify the tool itself to automatically check which tensors have large errors.

https://github.com/PINTO0309/onnx2tf/releases/tag/1.5.0

  • [Experimental] Added the ability to validate the model final output tensor in ONNX and TensorFlow.
    https://numpy.org/doc/stable/reference/generated/numpy.allclose.html#numpy-allclose

    numpy.allclose(a, b, rtol=0.0, atol=1e-04, equal_nan=True)
    
      -coto, --check_onnx_tf_outputs_elementwise_close
        Returns true if the two arrays, the output of onnx and the output of TF,
        are elementwise close within an acceptable range.
    
      -cotor CHECK_ONNX_TF_OUTPUTS_ELEMENTWISE_CLOSE_RTOL,\
        --check_onnx_tf_outputs_elementwise_close_rtol CHECK_ONNX_TF_OUTPUTS_ELEMENTWISE_CLOSE_RTOL
        The relative tolerance parameter.
        Default: 0.0
    
      -cotoa CHECK_ONNX_TF_OUTPUTS_ELEMENTWISE_CLOSE_ATOL,\
        --check_onnx_tf_outputs_elementwise_close_atol CHECK_ONNX_TF_OUTPUTS_ELEMENTWISE_CLOSE_ATOL
        The absolute tolerance parameter.
        Default: 1e-4
    
  • This option in combination with the --output_names_to_interrupt_model_conversion, -onimc option can be used to investigate which operations at which locations in the model cause errors in the output.

  • Since ONNX assumes NCHW and TensorFlow assumes NHWC output, a simple comparison of output tensors will not match values in most cases. Therefore, the tool automatically tries to match the final output tensor of TensorFlow to the shape of the output tensor of ONNX with a brute force check. If the shape still does not match or there is no exact matching value combination, Unmatched is assumed.

  • e.g.

    onnx2tf -i xxx.onnx -coto -onimc keypoints descriptors scores scores_map
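
The brute-force check described above can be pictured roughly as follows. This is only a sketch of the idea, not the tool's actual implementation: the helper name `match_by_transpose` and the random test data are assumptions made for illustration, while the tolerances are the defaults listed above.

```python
import itertools

import numpy as np


def match_by_transpose(onnx_out, tf_out, rtol=0.0, atol=1e-4):
    """Return the first axis permutation of tf_out that is element-wise
    close to onnx_out, or None if no permutation matches (Unmatched)."""
    if onnx_out.ndim != tf_out.ndim:
        return None
    for perm in itertools.permutations(range(tf_out.ndim)):
        candidate = np.transpose(tf_out, perm)
        if candidate.shape != onnx_out.shape:
            continue  # this permutation cannot reproduce the ONNX shape
        if np.allclose(onnx_out, candidate, rtol=rtol, atol=atol, equal_nan=True):
            return perm
    return None


# Hypothetical data: an NCHW tensor and the same values laid out as NHWC.
rng = np.random.default_rng(0)
nchw = rng.standard_normal((1, 3, 4, 5)).astype(np.float32)
nhwc = np.transpose(nchw, (0, 2, 3, 1))

perm = match_by_transpose(nchw, nhwc)
print(perm)  # permutation that maps the NHWC tensor back to NCHW
```

When several axes have the same size, more than one permutation yields the right shape, which is why the values have to be compared and a shape comparison alone is not enough.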
    

image

@PINTO0309
Owner

PINTO0309 commented Jan 11, 2023

  • OK - Add_116 - onnx::MatMul_556
    image
onnx2tf -i mobileformer.onnx -prf replace_kevinz8866.json -onimc onnx::MatMul_556 -cotoa 1e-1

image

  • OK - MatMul_117 - onnx::Add_560
    image
onnx2tf -i mobileformer.onnx -prf replace_kevinz8866.json -onimc onnx::Add_560 -cotoa 1e-1

image

  • OK - Reshape_119 - onnx::Transpose_569
onnx2tf -i mobileformer.onnx -prf replace_kevinz8866.json -onimc onnx::Transpose_569 -cotoa 1e-1

image

  • NG - MatMul_123 - onnx::Mul_580
onnx2tf -i mobileformer.onnx -prf replace_kevinz8866.json -onimc onnx::Mul_580 -cotoa 1e-1

image
image

  • OK - Reshape_122 - onnx::MatMul_579
    image
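
The tensor-by-tensor check above can be scripted. The sketch below only builds the command lines, using the same flags as the commands in this comment; the helper name `build_check_commands` is hypothetical, and running the commands assumes onnx2tf is installed.

```python
import shlex


def build_check_commands(onnx_path, replace_json, tensor_names, atol=1e-1):
    """Build one onnx2tf command per candidate tensor name, interrupting
    the conversion at that tensor so its output can be compared."""
    commands = []
    for name in tensor_names:
        commands.append([
            "onnx2tf",
            "-i", onnx_path,
            "-prf", replace_json,
            "-onimc", name,
            "-cotoa", str(atol),
        ])
    return commands


candidates = [
    "onnx::MatMul_556",
    "onnx::Add_560",
    "onnx::Transpose_569",
    "onnx::Mul_580",
]
for cmd in build_check_commands("mobileformer.onnx", "replace_kevinz8866.json", candidates):
    # Print the command; pass cmd to subprocess.run(cmd, check=True) to execute it.
    print(shlex.join(cmd))
```

Checking the tensors in graph order makes the first NG result point at the operation where the outputs start to diverge.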

Thus, the problem lies in the processing of MatMul_123, the point where the two branches of the model merge.

It seems that this Transpose interferes with the tool's automatic conversion and confuses the dimension transposition.
image

INFO: onnx_op_type: Reshape onnx_op_name: Reshape_119
INFO:  input_name.1: onnx::Reshape_561 shape: [4, 1, 128] dtype: float32
INFO:  input_name.2: onnx::Reshape_2804 shape: [4] dtype: <class 'numpy.int64'>
INFO:  output_name.1: onnx::Transpose_569 shape: [4, 1, 4, 32] dtype: float32
INFO: tf_op_type: reshape
INFO:  input.1.tensor: name: tf.compat.v1.transpose_6/transpose:0 shape: (4, 1, 128) dtype: <dtype: 'float32'> 
INFO:  input.2.shape: val: [4, 1, 4, -1] 
INFO:  output.1.output: name: tf.reshape_3/Reshape:0 shape: (4, 1, 4, 32) dtype: <dtype: 'float32'> 

INFO: onnx_op_type: Transpose onnx_op_name: Transpose_127
INFO:  input_name.1: onnx::MatMul_579 shape: [1, 4, 32, 4] dtype: float32
INFO:  output_name.1: onnx::MatMul_584 shape: [1, 4, 4, 32] dtype: float32
INFO: tf_op_type: transpose_v2
INFO:  input.1.a: name: tf.reshape_2/Reshape:0 shape: (1, 4, 32, 4) dtype: <dtype: 'float32'> 
INFO:  input.2.perm: val: [0, 1, 3, 2]
INFO:  output.1.output: name: tf.compat.v1.transpose_7/transpose:0 shape: (1, 4, 4, 32) dtype: <dtype: 'float32'> 

INFO: onnx_op_type: Transpose onnx_op_name: Transpose_120
INFO:  input_name.1: onnx::Transpose_569 shape: [4, 1, 4, 32] dtype: float32
INFO:  output_name.1: onnx::MatMul_570 shape: [1, 4, 4, 32] dtype: float32
INFO: tf_op_type: transpose_v2
INFO:  input.1.a: name: tf.reshape_3/Reshape:0 shape: (4, 1, 4, 32) dtype: <dtype: 'float32'> 
INFO:  input.2.perm: val: [1, 0, 2, 3]   <<================================================= Here
INFO:  output.1.output: name: tf.compat.v1.transpose_8/transpose:0 shape: (1, 4, 4, 32) dtype: <dtype: 'float32'> 

INFO: onnx_op_type: MatMul onnx_op_name: MatMul_123
INFO:  input_name.1: onnx::MatMul_570 shape: [1, 4, 4, 32] dtype: float32
INFO:  input_name.2: onnx::MatMul_579 shape: [1, 4, 32, 4] dtype: float32
INFO:  output_name.1: onnx::Mul_580 shape: [1, 4, 4, 4] dtype: float32
INFO: tf_op_type: matmul
INFO:  input.1.a: name: tf.compat.v1.transpose_8/transpose:0 shape: (1, 4, 4, 32) dtype: <dtype: 'float32'> 
INFO:  input.2.b: name: tf.reshape_2/Reshape:0 shape: (1, 4, 32, 4) dtype: <dtype: 'float32'> 
INFO:  input.3.output_type: name: float32 shape: () 
INFO:  output.1.output: name: tf.linalg.matmul_4/MatMul:0 shape: (1, 4, 4, 4) dtype: <dtype: 'float32'> 
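
The log shows why this kind of error is easy to miss: the input of Transpose_120 has shape (4, 1, 4, 32), so more than one permutation produces the expected output shape (1, 4, 4, 32). A small NumPy sketch with dummy data (the array contents are an assumption, the shapes and perm values come from this thread) compares the logged perm [1, 0, 2, 3] with [1, 2, 0, 3]:

```python
import numpy as np

# Dummy tensor with the input shape of Transpose_120 from the log.
x = np.arange(4 * 1 * 4 * 32, dtype=np.float32).reshape(4, 1, 4, 32)

auto_perm = np.transpose(x, (1, 0, 2, 3))   # perm chosen by the automatic conversion
other_perm = np.transpose(x, (1, 2, 0, 3))  # alternative perm with the same output shape

print(auto_perm.shape, other_perm.shape)      # both are (1, 4, 4, 32)
print(np.array_equal(auto_perm, other_perm))  # the values differ
```

Because both results have the same shape, the following MatMul runs without any shape error and only the output values reveal the problem.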

Therefore, add an entry to the JSON that overrides the Transpose whose transposition is being confused. Internally, the tool automatically attempts to convert the perm attribute of Transpose from NCHW to NHWC. As a result, if the model itself already contains a transpose that this conversion does not expect, the tool may generate a wrong transpose. The following JSON forces the perm attribute of Transpose, which fixes the tool's behavior and disables the automatic NHWC conversion of perm.

  • replace_kevinz8866.json

    {
      "format_version": 1,
      "operations": [
        {
          "op_name": "Reshape_247",
          "param_target": "outputs",
          "param_name": "onnx::Add_753",
          "post_process_transpose_perm": [0,2,3,1]
        },
        {
          "op_name": "Reshape_418",
          "param_target": "outputs",
          "param_name": "onnx::Add_1015",
          "post_process_transpose_perm": [0,2,3,1]
        },
        {
          "op_name": "Reshape_588",
          "param_target": "outputs",
          "param_name": "onnx::Add_1275",
          "post_process_transpose_perm": [0,2,3,1]
        },
        {
          "op_name": "Reshape_759",
          "param_target": "outputs",
          "param_name": "onnx::Add_1537",
          "post_process_transpose_perm": [0,2,3,1]
        },
        {
          "op_name": "Reshape_929",
          "param_target": "outputs",
          "param_name": "onnx::Add_1797",
          "post_process_transpose_perm": [0,2,3,1]
        },
        {
          "op_name": "Reshape_1098",
          "param_target": "outputs",
          "param_name": "onnx::Add_2056",
          "post_process_transpose_perm": [0,2,3,1]
        },
        {
          "op_name": "Reshape_1269",
          "param_target": "outputs",
          "param_name": "onnx::Add_2318",
          "post_process_transpose_perm": [0,2,3,1]
        },
        {
          "op_name": "Reshape_1439",
          "param_target": "outputs",
          "param_name": "onnx::Add_2578",
          "post_process_transpose_perm": [0,2,3,1]
        },
        {
          "op_name": "Transpose_120",
          "param_target": "attributes",
          "param_name": "perm",
          "values": [1,2,0,3]
        }
      ]
    }
  • OK - MatMul_123 - onnx::Mul_580

onnx2tf -i mobileformer.onnx -prf replace_kevinz8866.json -onimc onnx::Mul_580 -cotoa 1e-1

image

image

From a bird's eye view, the model appears to contain multiple Reshape -> Transpose -> MatMul structures, so the same steps have to be repeated for each of them to correct the tool's behavior. It is a bit tedious.

@PINTO0309 PINTO0309 added Parameter replacement Use Parameter replacement OP:Transpose OP:Transpose and removed Bug bug labels Jan 11, 2023
@PINTO0309
Owner

PINTO0309 commented Jan 11, 2023

Also, since you map every ONNX operation to a basic TF operation, is there no way to restore those weights as trainable parameters? Thank you so much for continuing to dig into this. If you need the original PyTorch model or anything else I can assist with, let me know!

The tool is specialized for conversion to inference-only models, and building trainable models with it is very hard. Rather than restricting functionality, the tool optimizes the model structure as far as it can. Thus, operations that are needed only during training and not during inference, such as BatchNormalization and Dropout, are intentionally decomposed and fused during optimization and disappear from the model.

If you still need a trainable model, it can be rebuilt by hand as follows.

  1. First, define the model structure as a Functional model in Keras. This means that the model structure is defined in Python code.
  2. Extract only the weights from the converted model. The tool cannot extract weights yet, but I will try to add that feature in the future. For now, weights can be extracted using Netron: clicking the floppy disk icon saves a binary file in np.ndarray format to storage.
    image
  • e.g.
    import numpy as np
    print(np.load('tensor'))
    
    input_weights = np.load('tensor')
    
    [[[[ 3.85814160e-02 -5.91791682e-02  4.93610911e-02]
       [-1.54751437e-02 -2.65599549e-01  2.85035670e-01]
       [-2.80502457e-02 -2.88973451e-01  2.98822671e-01]]
    
      [[ 4.93560582e-02 -6.98780492e-02  6.08931743e-02]
       [-1.96552109e-02 -4.20744419e-01  4.15196180e-01]
       [-1.72943212e-02 -4.09793824e-01  4.53572363e-01]]
    
      [[-2.43881089e-03 -2.27344222e-02  3.65414075e-04]
       [ 1.85506195e-02 -1.68236122e-01  1.95923150e-01]
       [-5.82838431e-03 -1.79103076e-01  1.61797389e-01]]]
    
    
     [[[ 1.88791680e+00  4.31701469e+00  4.70997620e+00]
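  3. Load the weights extracted in step 2 into the Keras model as initializers. If you need to set the bias, use bias_initializer.
    import tensorflow as tf
    
    # input_weights: kernel loaded with np.load in step 2, laid out as
    # (kernel_h, kernel_w, in_channels // groups, filters).
    # strides, dilations, group and input_tensor are placeholders for the
    # values of the layer being rebuilt.
    tf.keras.layers.Conv2D(
        filters=input_weights.shape[-1],
        kernel_size=input_weights.shape[:2],
        strides=strides,
        padding='valid',
        dilation_rate=dilations,
        groups=group,
        use_bias=False,
        kernel_initializer=tf.keras.initializers.Constant(input_weights),
        name='dummy_conv2d',
    )(input_tensor)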
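
The shape bookkeeping that the Conv2D call relies on can be checked with NumPy alone before touching Keras. In this sketch the kernel is random data and the file is written to a temporary directory; both are stand-ins for the file saved from Netron. The layout (kernel_h, kernel_w, in_channels, filters) is the one Keras uses for Conv2D kernels.

```python
import os
import tempfile

import numpy as np

# Stand-in for a kernel exported from Netron: 3x3 kernel, 3 input channels, 16 filters.
kernel = np.random.default_rng(0).standard_normal((3, 3, 3, 16)).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tensor")
    # Write through a file object so that NumPy does not append ".npy" to the name.
    with open(path, "wb") as f:
        np.save(f, kernel)
    input_weights = np.load(path)

filters = input_weights.shape[-1]         # last axis holds the output channels
kernel_size = input_weights.shape[:2]     # first two axes hold the spatial size
print(filters, kernel_size)
```

If the printed values do not match the layer being rebuilt, the saved tensor probably uses a different layout and needs a transpose before it is passed to the initializer.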

@PINTO0309
Owner

PINTO0309 commented Jan 11, 2023

Added the ability for the tool to automatically identify operations with large accuracy errors. It is enabled with the --check_onnx_tf_outputs_elementwise_close_full option.

Kazam_screencast_00100_.mp4
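
As a rough picture of what a per-operation check does, the sketch below ranks named intermediate tensors by their maximum absolute error. It is not the tool's implementation: the function name `rank_by_max_abs_error` and the two small dictionaries are made up for illustration, and the tensors are assumed to already share the same layout.

```python
import numpy as np


def rank_by_max_abs_error(onnx_tensors, tf_tensors, atol=1e-4):
    """Return (name, max_abs_error, ok) tuples sorted with the worst first.
    Both arguments map tensor names to arrays of identical shape."""
    report = []
    for name, ref in onnx_tensors.items():
        err = float(np.max(np.abs(ref - tf_tensors[name])))
        report.append((name, err, err <= atol))
    return sorted(report, key=lambda item: item[1], reverse=True)


# Hypothetical outputs: tensor "a" agrees, tensor "b" is clearly off.
ref = {"a": np.zeros(4), "b": np.ones(4)}
out = {"a": np.zeros(4) + 1e-6, "b": np.ones(4) * 1.5}

for name, err, ok in rank_by_max_abs_error(ref, out):
    print(name, err, "OK" if ok else "NG")
```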

@PINTO0309
Owner

By the way, these models also fail to save as a Keras model when I run tf.keras.models.save_model, in addition to failing to save as H5. It is the same error you found in #103.

It is not possible to save the entire structure of the model to an h5 file, but I have added the ability to extract only the weights and save them as files in hdf5 format.

-ow, --output_weights
  Output weights in hdf5 format.

image
image

@PINTO0309
Owner

PINTO0309 commented Jan 12, 2023

The final outputs are now nearly identical.

  • MobileFormer-e9.onnx
    https://drive.google.com/file/d/1vGzO9MZGX-yGz6ATm4yHVJMASZACuy2t/view?usp=share_link
  • MobileFormer-e9.tflite
    model_float32.tflite.tar.gz
  • replace_kevinz8866.json
    {
      "format_version": 1,
      "operations": [
        {
          "op_name": "Reshape_247",
          "param_target": "outputs",
          "param_name": "onnx::Add_753",
          "post_process_transpose_perm": [0,2,3,1]
        },
        {
          "op_name": "Reshape_418",
          "param_target": "outputs",
          "param_name": "onnx::Add_1015",
          "post_process_transpose_perm": [0,2,3,1]
        },
        {
          "op_name": "Reshape_588",
          "param_target": "outputs",
          "param_name": "onnx::Add_1275",
          "post_process_transpose_perm": [0,2,3,1]
        },
        {
          "op_name": "Reshape_759",
          "param_target": "outputs",
          "param_name": "onnx::Add_1537",
          "post_process_transpose_perm": [0,2,3,1]
        },
        {
          "op_name": "Reshape_929",
          "param_target": "outputs",
          "param_name": "onnx::Add_1797",
          "post_process_transpose_perm": [0,2,3,1]
        },
        {
          "op_name": "Reshape_1098",
          "param_target": "outputs",
          "param_name": "onnx::Add_2056",
          "post_process_transpose_perm": [0,2,3,1]
        },
        {
          "op_name": "Reshape_1269",
          "param_target": "outputs",
          "param_name": "onnx::Add_2318",
          "post_process_transpose_perm": [0,2,3,1]
        },
        {
          "op_name": "Reshape_1439",
          "param_target": "outputs",
          "param_name": "onnx::Add_2578",
          "post_process_transpose_perm": [0,2,3,1]
        },
        {
          "op_name": "Transpose_120",
          "param_target": "attributes",
          "param_name": "perm",
          "values": [1,2,0,3]
        },
        {
          "op_name": "Softmax_126",
          "param_target": "attributes",
          "param_name": "axis",
          "values": 3
        },
        {
          "op_name": "Transpose_129",
          "param_target": "attributes",
          "param_name": "perm",
          "values": [2,0,1,3]
        },
        {
          "op_name": "Transpose_291",
          "param_target": "attributes",
          "param_name": "perm",
          "values": [1,2,0,3]
        },
        {
          "op_name": "Softmax_297",
          "param_target": "attributes",
          "param_name": "axis",
          "values": 3
        },
        {
          "op_name": "Transpose_300",
          "param_target": "attributes",
          "param_name": "perm",
          "values": [2,0,1,3]
        },
        {
          "op_name": "Transpose_462",
          "param_target": "attributes",
          "param_name": "perm",
          "values": [1,2,0,3]
        },
        {
          "op_name": "Softmax_468",
          "param_target": "attributes",
          "param_name": "axis",
          "values": 3
        },
        {
          "op_name": "Transpose_471",
          "param_target": "attributes",
          "param_name": "perm",
          "values": [2,0,1,3]
        },
        {
          "op_name": "Transpose_632",
          "param_target": "attributes",
          "param_name": "perm",
          "values": [1,2,0,3]
        },
        {
          "op_name": "Softmax_638",
          "param_target": "attributes",
          "param_name": "axis",
          "values": 3
        },
        {
          "op_name": "Transpose_641",
          "param_target": "attributes",
          "param_name": "perm",
          "values": [2,0,1,3]
        },
        {
          "op_name": "Transpose_803",
          "param_target": "attributes",
          "param_name": "perm",
          "values": [1,2,0,3]
        },
        {
          "op_name": "Softmax_809",
          "param_target": "attributes",
          "param_name": "axis",
          "values": 3
        },
        {
          "op_name": "Transpose_812",
          "param_target": "attributes",
          "param_name": "perm",
          "values": [2,0,1,3]
        },
        {
          "op_name": "Transpose_973",
          "param_target": "attributes",
          "param_name": "perm",
          "values": [1,2,0,3]
        },
        {
          "op_name": "Softmax_979",
          "param_target": "attributes",
          "param_name": "axis",
          "values": 3
        },
        {
          "op_name": "Transpose_982",
          "param_target": "attributes",
          "param_name": "perm",
          "values": [2,0,1,3]
        },
        {
          "op_name": "Transpose_1142",
          "param_target": "attributes",
          "param_name": "perm",
          "values": [1,2,0,3]
        },
        {
          "op_name": "Softmax_1148",
          "param_target": "attributes",
          "param_name": "axis",
          "values": 3
        },
        {
          "op_name": "Transpose_1151",
          "param_target": "attributes",
          "param_name": "perm",
          "values": [2,0,1,3]
        },
        {
          "op_name": "Transpose_1313",
          "param_target": "attributes",
          "param_name": "perm",
          "values": [1,2,0,3]
        },
        {
          "op_name": "Softmax_1319",
          "param_target": "attributes",
          "param_name": "axis",
          "values": 3
        },
        {
          "op_name": "Transpose_1322",
          "param_target": "attributes",
          "param_name": "perm",
          "values": [2,0,1,3]
        },
        {
          "op_name": "Transpose_1444",
          "param_target": "attributes",
          "param_name": "perm",
          "values": [1,2,0,3]
        },
        {
          "op_name": "Softmax_1449",
          "param_target": "attributes",
          "param_name": "axis",
          "values": 3
        },
        {
          "op_name": "Transpose_1452",
          "param_target": "attributes",
          "param_name": "perm",
          "values": [2,0,1,3]
        }
      ]
    }
$ onnx2tf -i mobileformer.onnx -prf replace_kevinz8866.json -coto

or

$ onnx2tf -i mobileformer.onnx -prf replace_kevinz8866.json -rerf -coto

image

A characteristic of Transformer models is that they contain several blocks that the tool cannot convert correctly, as shown in the figure below. Therefore, I checked the structure in Netron and applied the same behavioral changes to every block with the same structure; most of the JSON is copied and pasted.
image
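
Since the entries repeat the same Transpose / Softmax / Transpose pattern for every block, the JSON can also be generated instead of copied by hand. This is a minimal sketch: the helper name `attention_block_ops` is hypothetical, and only the first three blocks from the JSON above are listed; the remaining node numbers would have to be read from Netron in the same way.

```python
import json


def attention_block_ops(transpose_in, softmax, transpose_out):
    """Build the three replacement entries used for one block."""
    return [
        {
            "op_name": f"Transpose_{transpose_in}",
            "param_target": "attributes",
            "param_name": "perm",
            "values": [1, 2, 0, 3],
        },
        {
            "op_name": f"Softmax_{softmax}",
            "param_target": "attributes",
            "param_name": "axis",
            "values": 3,
        },
        {
            "op_name": f"Transpose_{transpose_out}",
            "param_target": "attributes",
            "param_name": "perm",
            "values": [2, 0, 1, 3],
        },
    ]


# Node numbers of the first three blocks, taken from the JSON above.
blocks = [(120, 126, 129), (291, 297, 300), (462, 468, 471)]
operations = [op for block in blocks for op in attention_block_ops(*block)]

doc = {"format_version": 1, "operations": operations}
print(json.dumps(doc, indent=2))
```

The Reshape entries with post_process_transpose_perm follow a different pattern and would need their own helper.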

@PINTO0309
Owner

Because of the unpredictable transpositions in the tool's automatic transformations, I have implemented a number of enhancements to identify where the errors occur.

I believe that the MobileFormer models other than e9 can be converted with almost the same accuracy by changing the behavior of the tool based on the same criteria.

I will close this issue for now. If you successfully convert a model other than e9, I think many researchers and engineers would be pleased if you could submit a pull request with a sample JSON file here.
https://github.com/PINTO0309/onnx2tf/tree/main/json_samples
image

@PINTO0309 PINTO0309 changed the title [MobileFormer]Converted model outputs values mismatch with original ones. [MobileFormer] Converted model outputs values mismatch with original ones. Jan 12, 2023
@PINTO0309
Owner

PINTO0309 commented Jan 23, 2023

@kevinz8866 I fixed a fatal bug, so converted models can now be output as Keras (.h5) files. However, they are output with trainable=False.
https://github.com/PINTO0309/onnx2tf/releases/tag/1.5.30

@kevinz8866
Author

Hi PINTO,

Thank you so much for all these updates. Sorry I was not able to get back to you sooner; I was traveling and a bit busy with the Lunar New Year. I will try to replicate your results, and yes, I will post some JSON files if I convert more models from ONNX to Keras. Thank you so much for making this package. I will keep using it and let you know if there are any issues!
