TensorRT compatible retinanet #4395

Closed
julienripoche opened this issue Sep 13, 2021 · 10 comments

@julienripoche
Contributor

julienripoche commented Sep 13, 2021

🚀 The feature

The possibility to compile the ONNX-exported RetinaNet model with TensorRT.

Motivation, pitch

I'm working with the torchvision RetinaNet implementation and have production constraints on inference time. I think it would be great if the ONNX export of RetinaNet could be further compiled with TensorRT.

Alternatives

No response

Additional context

Actually, I already managed to make it work.
I exported the RetinaNet model to ONNX with opset_version=11, then compiled it with TensorRT 8.0.1.
To do that I bypassed two preprocessing steps in the GeneralizedRCNNTransform call:

  • [resize](https://github.com/pytorch/vision/blob/c359d8d56242997e6209b71524d7a6199ea333b2/torchvision/models/detection/transform.py#L112), as it contains a Floor operator not compatible with TensorRT
[09/08/2021-13:14:04] [E] [TRT] ModelImporter.cpp:725: ERROR: ModelImporter.cpp:179 In function parseGraph:
[6] Invalid Node - Resize_43
[graph.cpp::computeInputExecutionUses::519] Error Code 9: Internal Error (Floor_30: IUnaryLayer cannot be used to compute a shape tensor)
  • [batch_images](https://github.com/pytorch/vision/blob/c359d8d56242997e6209b71524d7a6199ea333b2/torchvision/models/detection/transform.py#L118), as it contains a Pad operator not compatible with TensorRT
[09/08/2021-13:12:27] [E] [TRT] ModelImporter.cpp:725: ERROR: builtin_op_importers.cpp:2984 In function importPad:
[8] Assertion failed: inputs.at(1).is_weights() && "The input pads is required to be an initializer."
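
For illustration, a minimal sketch of this kind of bypass, monkey-patching the two offending steps with identities (constructor arguments and method signatures may differ across torchvision versions; resizing then has to happen outside the model):

import torch
import torchvision

# Sketch: replace resize and batch_images with identities so the exported
# graph no longer contains the Floor/Pad operators TensorRT rejects.
model = torchvision.models.detection.retinanet_resnet50_fpn(pretrained=True)
model.eval()

# resize is called as resize(image, target) and returns (image, target)
model.transform.resize = lambda image, target: (image, target)

# batch_images normally pads images to a common size; with equal-sized,
# already-resized inputs, stacking them is enough
model.transform.batch_images = lambda images, size_divisible=32: torch.stack(images)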

I also changed the dtype of the two torch.arange calls, in shifts_x and shifts_y of the AnchorGenerator, from torch.float32 to torch.int32, as the current version of TensorRT only supports INT32 for the Range operator with dynamic inputs:

[09/08/2021-14:58:35] [E] [TRT] ModelImporter.cpp:725: ERROR: builtin_op_importers.cpp:3170 In function importRange:
[8] Assertion failed: inputs.at(0).isInt32() && "For range operator with dynamic inputs, this version of TensorRT only supports INT32!"

And finally I bypassed the postprocessing of RetinaNet:

  • [postprocess_detections](https://github.com/pytorch/vision/blob/c359d8d56242997e6209b71524d7a6199ea333b2/torchvision/models/detection/retinanet.py#L550), as it contains a where operation (exported as NonZero) not compatible with TensorRT
[09/08/2021-15:14:12] [I] [TRT] No importer registered for op: NonZero. Attempting to import as plugin.
[09/08/2021-15:14:12] [I] [TRT] Searching for plugin: NonZero, plugin_version: 1, plugin_namespace: 
[09/08/2021-15:14:12] [E] [TRT] 3: getPluginCreator could not find plugin: NonZero version: 1
[09/08/2021-15:14:12] [E] [TRT] ModelImporter.cpp:720: While parsing node number 729 [NonZero -> "2086"]:
[09/08/2021-15:14:12] [E] [TRT] ModelImporter.cpp:721: --- Begin node ---
[09/08/2021-15:14:12] [E] [TRT] ModelImporter.cpp:722: input: "2085"
output: "2086"
name: "NonZero_729"
op_type: "NonZero"

[09/08/2021-15:14:12] [E] [TRT] ModelImporter.cpp:723: --- End node ---
[09/08/2021-15:14:12] [E] [TRT] ModelImporter.cpp:725: ERROR: builtin_op_importers.cpp:4643 In function importFallbackPluginImporter:
[8] Assertion failed: creator && "Plugin not found, are the plugin name, version, and namespace correct?"
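
The same identity trick covers the postprocessing (a minimal sketch, reusing the model from the sketch above; box decoding and NMS then have to run outside TensorRT, and the anchors can be returned alongside the head outputs if needed for that):

# Sketch: make postprocess_detections and transform.postprocess pass-throughs so
# the exported graph ends at the raw head outputs instead of the NonZero/where logic.
model.postprocess_detections = lambda head_outputs, anchors, image_sizes: head_outputs
model.transform.postprocess = lambda detections, image_sizes, original_sizes: detections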

In my case it is fine to do the preprocessing and postprocessing outside of the RetinaNet call.
So my request is really only about the AnchorGenerator, i.e. changing the dtype of the torch.arange operations from torch.float32 to torch.int32.
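
For illustration, the lines in question live in AnchorGenerator.grid_anchors (torchvision/models/detection/anchor_utils.py); the proposed change would look roughly like the sketch below, where grid_width, grid_height, stride_width, stride_height and device are locals of that method (not an exact diff):

# current: float32 arange exports as an ONNX Range over floats, rejected by TensorRT 8
shifts_x = torch.arange(0, grid_width, dtype=torch.float32, device=device) * stride_width
shifts_y = torch.arange(0, grid_height, dtype=torch.float32, device=device) * stride_height

# proposed: int32 arange keeps the Range node in INT32, which TensorRT accepts
shifts_x = torch.arange(0, grid_width, dtype=torch.int32, device=device) * stride_width
shifts_y = torch.arange(0, grid_height, dtype=torch.int32, device=device) * stride_height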

cc @datumbox

@datumbox
Contributor

datumbox commented Sep 14, 2021

@julienripoche Thanks for the proposal.

The preprocessing and post-processing steps are quite important and we can't bypass them in the general case. Nevertheless, since the proposed modification happens only in the AnchorGenerator and only internally, it might be possible, provided it does not have any other side effects that affect the accuracy or behaviour of the model. In other words, if the change from float32 to int32 does not break any tests and does not affect the accuracy of the pre-trained models, then I'm happy to review and discuss a PR that investigates changing the behaviour.

@julienripoche
Contributor Author

@datumbox Thanks for your consideration :)

I made the small modification to anchor_utils.py and submitted the PR.

@datumbox
Contributor

datumbox commented Sep 14, 2021

Thanks @julienripoche, I'll have a look.

Note that it might take a while to fully investigate its effects, as we will need to a) ensure that all existing pre-trained models maintain the same level of accuracy, b) verify that no internal FB code breaks, and c) confirm there are no other unintended consequences.

datumbox added a commit that referenced this issue Sep 16, 2021
…_utils.py (#4395) (#4409)

Co-authored-by: Julien RIPOCHE <ripoche@magic-lemp.com>
Co-authored-by: Vasilis Vryniotis <datumbox@users.noreply.github.com>
facebook-github-bot pushed a commit that referenced this issue Sep 30, 2021
…in anchor_utils.py (#4395) (#4409)

Summary:

Reviewed By: datumbox

Differential Revision: D31268024

fbshipit-source-id: 0294ad05fc94bdf5a6d3eba50d85813d568e8fbe

Co-authored-by: Julien RIPOCHE <ripoche@magic-lemp.com>
Co-authored-by: Vasilis Vryniotis <datumbox@users.noreply.github.com>
@montmejat

Hello @julienripoche, would you mind explaining how you bypassed these steps, if you remember? I'm very interested in knowing how you did it, as I'm trying to achieve the same goal 😄 With these bypasses, were you able to achieve some interesting performance gains?

@julienripoche
Contributor Author

Hi @aurelien-m, of course ;)

Basically, what I did was replace some parts of the code with the identity.
Here is the code I used to achieve that.

import torch

# Load retinanet
pth_path = "/path/to/retinanet.pth"
retinanet = torch.load(pth_path, map_location="cpu")
retinanet.eval()

# Image sizes
original_image_size = (677, 511)

# Normalize hack
normalize_tmp = retinanet.transform.normalize
retinanet_normalize = lambda x: normalize_tmp(x)
retinanet.transform.normalize = lambda x: x

# Resize hack
resize_tmp = retinanet.transform.resize
retinanet_resize = lambda x: resize_tmp(x, None)[0]
retinanet.transform.resize = lambda x, y: (x, y)

# Batch images hack
# /!\ torchvision version dependent ???
# retinanet.transform.batch_images = lambda x, size_divisible: x[0].unsqueeze(0)
retinanet.transform.batch_images = lambda x: x[0].unsqueeze(0)

# Generate dummy input
def preprocess_image(img):
    result = retinanet_resize(retinanet_normalize(img)[0]).unsqueeze(0)
    return result
dummy_input = torch.randn(1, 3, original_image_size[0], original_image_size[1])
dummy_input = preprocess_image(dummy_input)
image_size = tuple(dummy_input.shape[2:])
print(dummy_input.shape)

# Postprocess detections hack
postprocess_detections_tmp = retinanet.postprocess_detections
retinanet_postprocess_detections = lambda x: postprocess_detections_tmp(x["split_head_outputs"], x["split_anchors"], [image_size])
retinanet.postprocess_detections = lambda x, y, z: {"split_head_outputs": x, "split_anchors": y}

# Postprocess hack
postprocess_tmp = retinanet.transform.postprocess
retinanet_postprocess = lambda x: postprocess_tmp(x, [image_size], [original_image_size])
retinanet.transform.postprocess = lambda x, y, z: x

# ONNX export
onnx_path = "/path/to/retinanet.onnx"
torch.onnx.export(
    retinanet,
    dummy_input,
    onnx_path,
    verbose=False,
    opset_version=11,
    input_names = ["images"],
)

The resulting ONNX should contain almost only the network itself, plus some anchor handling.
This ONNX should be compilable by TensorRT.
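
For completeness, a minimal sketch of building an engine from that file with the TensorRT 8 Python API (paths are placeholders; trtexec with --onnx and --fp16 achieves the same from the command line):

import tensorrt as trt

# Sketch: parse the exported ONNX and build an (optionally float16) engine.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("/path/to/retinanet.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse the ONNX file")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # optional: float16 build

engine_bytes = builder.build_serialized_network(network, config)
with open("/path/to/retinanet.engine", "wb") as f:
    f.write(engine_bytes)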

That said, maybe a simpler way to achieve this would have been to simply replace the forward method with a "simpler" one.
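
As a rough sketch of that alternative, a hypothetical wrapper module could keep only the backbone and head, so normalization, anchors, decoding and NMS all live outside the exported graph:

import torch
from torch import nn

class RetinaNetCore(nn.Module):
    """Hypothetical wrapper: run only the backbone and detection head."""

    def __init__(self, retinanet):
        super().__init__()
        self.backbone = retinanet.backbone
        self.head = retinanet.head

    def forward(self, images):
        # the FPN backbone returns an OrderedDict of feature maps
        features = list(self.backbone(images).values())
        # the head returns {"cls_logits": ..., "bbox_regression": ...}
        outputs = self.head(features)
        return outputs["cls_logits"], outputs["bbox_regression"]

# usage sketch: export the wrapper instead of the patched model
# core = RetinaNetCore(retinanet)
# torch.onnx.export(core, dummy_input, "/path/to/retinanet_core.onnx", opset_version=11,
#                   input_names=["images"], output_names=["cls_logits", "bbox_regression"])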

About the performance gain, I don't remember exactly.
Re-running an old comparison, I can tell you that with the model compiled in float16, plus the external preprocess and postprocess, it is around 2 times faster than the original model (i.e. without the bypasses) exported to ONNX.

Hope it helps :)

@montmejat

@julienripoche Thanks man, it really helped me out!

However, if I understand correctly, this only works for a batch of size 1? Do you know if I can make it work for a larger batch size? I was able to convert it to TensorRT and run inference, but I'm getting slower inference than with the original PyTorch model.

@montmejat

Never mind, I found out how to run inference on a batch of multiple images. I used:

retinanet.transform.batch_images = lambda x, size_divisible: torch.stack(x)

It seems to work from what I can see; I'll do some more testing and look at the performance I get with a larger batch size. Thanks again!
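
In case it helps others, a sketch of exporting with a dynamic batch axis (reusing the retinanet and dummy_input from the script above): on the TensorRT side an optimization profile with min/opt/max shapes for the "images" input is then needed, and it is worth checking that the traced anchor logic really holds beyond the traced batch size.

# Sketch: mark the batch dimension of the input as dynamic in the ONNX export.
torch.onnx.export(
    retinanet,
    dummy_input,                      # shape (1, 3, H, W); batch axis made dynamic below
    "/path/to/retinanet_dynamic.onnx",
    opset_version=11,
    input_names=["images"],
    dynamic_axes={"images": {0: "batch"}},
)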

@Michelvl92

Is there any info on the improvement of the inference time/latency with exported TensorRT in comparison with ONNX or PyTorch?

@montmejat

Is there any info on the improvement of the inference time/latency with exported TensorRT in comparison with ONNX or PyTorch?

On my side, I was able to achieve a 2x to 3x speedup depending on the hardware, going from PyTorch to TensorRT (I don't have the exact numbers anymore, sorry!).

@ChaosAIVision

Hi, I used your code to convert a PyTorch RetinaNet to TensorRT and it was successful! But I have a problem: the output is not in the RetinaNet detection format. I need your help :(
Inference output: [-22.813814 -20.028933 -23.652468 ... -51.884644 -52.954933 -55.01 ]
