[TOPI, Relay] A new NMS op variant for ONNX NMS / TF Combined NMS #7796

masahi · 2021-04-05T06:00:32Z

This PR adds a new variant of NMS that better supports NMS in ONNX and combined NMS in TF than the existing NMS operator in TOPI/Relay. For now, I'm calling this variant of NMS "All class NMS".

https://github.com/onnx/onnx/blob/master/docs/Operators.md#NonMaxSuppression
https://www.tensorflow.org/api_docs/python/tf/image/combined_non_max_suppression

The biggest difference between our NMS and "All class NMS" is that in our NMS, a single box is associated with a single class, while in the latter a single box has scores for all classes, and NMS is performed for each class separately. The max_out_size parameter is also applied per class, rather than to all boxes as in our implementation.

Until now, we've been supporting this variant of NMS via complicated encodings using our implementation of NMS. It kind of "works" in practice, but there are many problems with it:

The ONNX NMS converter ([ONNX] NMS in ONNX #6839) is extremely complicated, and performance is bad because it runs a small NMS repeatedly inside a Relay while loop. It also easily introduces the "zero box problem", because "All class NMS" encoded via one-class NMS is more likely to result in zero detections. We needed to add an ad hoc patch like [Vulkan] Workaround for zero size allocation #7691 to work around this problem.
[Frontend][TensorFlow] Support CombinedNonMaxSuppression #7520 has a bug in max_out_size handling. Since in our NMS max_out_size is applied to all boxes, we cannot translate "All class NMS" into a single call to our NMS.

For these reasons, I decided it is better to introduce a new variant of NMS to overcome these pains. This breaks our general "one operator to support all frameworks" philosophy, but the two variants of NMS are so different it doesn't make sense to call them the same op.

The result is significant: using the new NMS, I got a speedup of about 1 second on mlperf SSD resnet34, running on vk + amd. This is an extreme case: the existing approach, which calls get_valid_counts and non_maximum_suppression 80 times in a while loop, is extremely slow on vk + amd for some reason, taking literally 1 second. Now it takes only 5.7 milliseconds.

Implementation details

The new NMS implementation consists of the following steps:

Sort scores and return both sorted scores and indices. The existing NMS only uses sorted indices, while I also use the sorted scores for the binary search in the next step.
Do binary search on sorted scores to find the index of the box whose score is just below score_threshold. This gives what we call valid_count[i] in the existing NMS, computed by get_valid_counts.
Do NMS, parallelized over batch * class. The inner loop uses the same NMS IR as the existing one.
After the previous step, we end up with indices of size (batch * num_class, num_boxes) and a tensor num_detections of size (batch * num_class,) holding the number of surviving boxes per row. We need to copy num_detections[i] indices from each row into one linear output. This is a perfect application of exclusive scan: doing an exclusive scan on num_detections gives the offset each row should write to.
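
The four steps above can be sketched in plain NumPy. This is only an illustrative reference, not the TIR kernel from this PR: the function names, the [y1, x1, y2, x2] box layout, the "strictly above score_threshold" convention, and the [batch_id, class_id, box_index] output rows are assumptions made for the example.

```python
import numpy as np


def iou(a, b):
    """IoU of two boxes, assumed to be in [y1, x1, y2, x2] corner format."""
    ih = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iw = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ih * iw
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def all_class_nms_reference(boxes, scores, max_out_per_class, iou_thresh, score_thresh):
    """boxes: (batch, num_boxes, 4); scores: (batch, num_class, num_boxes)."""
    batch, num_class, num_boxes = scores.shape
    num_rows = batch * num_class
    rows = scores.reshape(num_rows, num_boxes)

    # Step 1: sort each row by descending score, keeping both scores and indices.
    order = np.argsort(-rows, axis=1, kind="stable")
    sorted_scores = np.take_along_axis(rows, order, axis=1)

    kept = np.zeros((num_rows, num_boxes), dtype=np.int64)
    num_detections = np.zeros(num_rows, dtype=np.int64)
    for r in range(num_rows):  # rows are independent, so this loop can run in parallel
        # Step 2: binary search for the number of scores strictly above the threshold.
        valid_count = np.searchsorted(-sorted_scores[r], -score_thresh, side="left")
        # Step 3: greedy NMS over the valid prefix, capped per class.
        b = r // num_class
        for j in range(valid_count):
            if num_detections[r] >= max_out_per_class:
                break
            cand = order[r, j]
            survivors = kept[r, : num_detections[r]]
            if all(iou(boxes[b, cand], boxes[b, k]) <= iou_thresh for k in survivors):
                kept[r, num_detections[r]] = cand
                num_detections[r] += 1

    # Step 4: an exclusive scan of num_detections gives each row's write offset.
    row_offsets = np.cumsum(num_detections) - num_detections
    out = np.zeros((int(num_detections.sum()), 3), dtype=np.int64)
    for r in range(num_rows):
        n = num_detections[r]
        dst = slice(row_offsets[r], row_offsets[r] + n)
        out[dst, 0] = r // num_class  # batch_id
        out[dst, 1] = r % num_class   # class_id
        out[dst, 2] = kept[r, :n]     # box index
    return out, num_detections, row_offsets
```

Because every row knows its own write offset after the scan, the final copy needs no synchronization between rows, which is what makes it a good fit for a GPU.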

The efficiency of the new implementation comes from:

The get_valid_counts call is replaced with binary search.
Per class independent NMS are done in parallel across different blocks on GPU. This alone gives a num_class x speedup over the existing encoding in the ONNX / TF frontends.
The ONNX NMS frontend is now trivial and there is no triple nested loop etc. It seems overhead on the host side is very large on vk + amd.

mbrookhart · 2021-04-05T16:31:58Z

Exciting! I won't review too much because it's still in draft, but the overall structure looks good to me. I agree that that relay-level loop was a problem since we couldn't parallelize it.

I'm seeing some removed attributes from the normal NMS op, I didn't add those as part of the ONNX NMS PR, I imagine they're used by other frontends?

tqchen · 2021-04-05T17:13:09Z

awesome. it would be great to also discuss the naming of the API(by checking with the naming in the existing libraries e.g. TF, Pytorch). e.g. shall we call it combined_non_max_suppression as per TF's naming convention?

masahi · 2021-04-05T20:40:35Z

> I'm seeing some removed attributes from the normal NMS op, I didn't add those as part of the ONNX NMS PR, I imagine they're used by other frontends?

@mbrookhart These attributes are now runtime arguments and they are not used anymore, see

tvm/src/relay/op/vision/nms.cc

Lines 106 to 113 in 8131364

    
    auto attrs = make_object<NonMaximumSuppressionAttrs>();
    attrs->force_suppress = force_suppress;
    attrs->top_k = top_k;
    attrs->coord_start = coord_start;
    attrs->score_index = score_index;
    attrs->id_index = id_index;
    attrs->return_indices = return_indices;
    attrs->invalid_to_bottom = invalid_to_bottom;

To avoid confusion I removed them.

masahi · 2021-04-05T20:46:46Z

> awesome. it would be great to also discuss the naming of the API(by checking with the naming in the existing libraries e.g. TF, Pytorch). e.g. shall we call it combined_non_max_suppression as per TF's naming convention?

combined_non_max_suppression was also the name I started with, but I didn't particularly like it because it doesn't tell what exactly is "combined".

Moreover, in TF the other NMS op is the standard, single class NMS, so for them combined or not is really single vs multiclass distinction. In contrast, our existing NMS also does multiclass NMS where a single box is associated with a single class. This variant of multiclass NMS has a different notion of "multiclass" and can support PyTorch and MXNet. So I don't think combined_nms is a good choice for us.

all_class_non_maximum_suppression is also close to what TensorRT calls it (allClassNMS): https://github.com/NVIDIA/TensorRT/blob/master/plugin/common/kernels/allClassNMS.cu. I don't have a better alternative than this.

tqchen · 2021-04-06T11:41:52Z

Thanks @masahi , can you do a quick round of search of pytorch, onnx, tensorrt and others and summarize the naming rationale? The main thing is to write down why we pick the name, I am less attached to a particular name.

masahi · 2021-04-07T11:33:32Z

ok I see roughly three categories of NMS used in frameworks:

1. MXNet and TVM: https://mxnet.apache.org/versions/1.7.0/api/python/docs/api/ndarray/contrib/index.html#mxnet.ndarray.contrib.box_nms This is what MXNet calls box_nms. This API is highly non-standard, but unfortunately it is what we inherited for topi/relay. It supports multiclass NMS, but a single box is associated with only one class, unlike the third category below.
2. PT torchvision.ops.nms and TF non_max_suppression: this is the standard, single class NMS.
https://pytorch.org/vision/stable/ops.html#torchvision.ops.nms
https://www.tensorflow.org/api_docs/python/tf/image/non_max_suppression
3. ONNX NonMaxSuppression, TF combined_non_max_suppression, and TensorRT batchedNMSPlugin (the implementation calls it allClassNMS): this is the variant of multiclass NMS where a single box can be selected multiple times, once for each of several classes.
https://github.com/onnx/onnx/blob/master/docs/Operators.md#NonMaxSuppression
https://www.tensorflow.org/api_docs/python/tf/image/combined_non_max_suppression
https://github.com/NVIDIA/TensorRT/tree/master/plugin/batchedNMSPlugin, https://github.com/NVIDIA/TensorRT/blob/master/plugin/common/kernels/allClassNMS.cu

So the bottom line is, I think it is reasonable to say that NMS, without adjectives, should refer to the single class variant (category 2 above), but there isn't a consensus on what to call the third category one which this PR is about.

I think all_class_non_maximum_suppression or per_class_non_maximum_suppression are the most descriptive of what it does. Either one is fine for me and I'm open to other suggestions @mbrookhart @jwfromm

electriclilies · 2021-04-07T21:09:23Z

I think that the name per_class_non_maximum_suppression implies that there is one class label per bounding box, but the ONNX version allows a box to be selected by multiple classes, so I prefer all_class_non_maximum_suppression over per_class_non_maximum_suppression.

masahi · 2021-04-07T21:28:04Z

By per_class I meant we do NMS for each class separately (unlike our existing NMS), but if this is already confusing then yes, all_class_non_maximum_suppression might be better.

mbrookhart · 2021-04-07T22:03:42Z

I think that sounds good.

trevor-m · 2021-04-08T23:50:46Z

Really nice! Just did first pass through the code and overall it looks good.

A couple of thoughts/questions:

TF Combined NMS allows for both a) using same boxes for all classes and b) using different boxes for each class.
For a) the shape of boxes input is (batch_size, 1, num_boxes, 4).
For b) the shape of boxes input is (batch_size, num_classes, num_anchors, 4) .
Currently, your implementation only supports a) and has a box input shape of (batch_size, num_anchors, 4).
I think all of the uses of combinedNMS I have seen in TF have only used a), but perhaps in the future we may want to also add support for b).
 
This is regarding the output format. One common thing after getting the selected indices is using those to gather the coordinates and scores of the selected boxes.
Since the selected indices are in the format [batch_id, class_id, index] we can actually use them directly to index into the scores tensors which has shape (batch, num_classes, num_boxes) to get the selected scores.
But for the selected boxes, we need to slice out the batch_id and index and concat to get them into the format [batch_id, index] before we can index into boxes tensor to get selected boxes.
Is my understanding here correct? I'm wondering if we can improve this somehow.
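
To make the indexing question concrete, here is a small NumPy sketch of the two gathers described above. The helper name gather_selected is hypothetical, and the shapes assume the shared-boxes layout (batch, num_boxes, 4); a frontend would need the equivalent gather ops in Relay.

```python
import numpy as np


def gather_selected(boxes, scores, selected):
    """boxes: (batch, num_boxes, 4); scores: (batch, num_class, num_boxes);
    selected: (num_detections, 3) rows of [batch_id, class_id, box_index]."""
    batch_id, class_id, box_index = selected[:, 0], selected[:, 1], selected[:, 2]
    # Scores: all three columns index directly into the scores tensor.
    selected_scores = scores[batch_id, class_id, box_index]
    # Boxes: the class_id column has to be dropped, leaving [batch_id, box_index].
    selected_boxes = boxes[batch_id, box_index]
    return selected_scores, selected_boxes
```

The extra slice-and-concat step for boxes disappears only if boxes also carries a class dimension, which is case b) above.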

masahi · 2021-04-09T03:03:11Z

Thanks @trevor-m

Yes, we should be able to support such a variant easily. The only place we touch box coords is here (i is the batch or row index, j and k are indices into the sorted scores):

tvm/python/tvm/topi/vision/nms_util.py

Lines 132 to 141 in 6d314de

    
    def calc_overlap(i, j, k):
        offset_j = sorted_indices[i, j] * 4
        offset_k = sorted_indices[i, k] * 4
        batch_id = i // num_class
        base_bbox_idx = batch_id * num_anchors * 4
        return calculate_overlap(
            boxes,
            base_bbox_idx + offset_j,
            base_bbox_idx + offset_k,
        )

We can do arbitrary indirect indexing into boxes. We might need to update the calculate_overlap function, but that would be easy as well.
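
As a rough illustration of what that change might look like, the sketch below computes the flat base offset for both box layouts. The function name box_base_offset and the per_class_boxes flag are hypothetical and not part of this PR; only the shared-boxes branch mirrors the snippet above.

```python
import numpy as np


def box_base_offset(i, num_class, num_anchors, per_class_boxes):
    """Flat offset of the first coordinate of the box set used by row i,
    where i indexes the flattened (batch * num_class) rows."""
    if per_class_boxes:
        # Layout (batch, num_class, num_anchors, 4): every row has its own boxes.
        return i * num_anchors * 4
    # Layout (batch, num_anchors, 4): all classes in a batch share one box set.
    batch_id = i // num_class
    return batch_id * num_anchors * 4


# Usage: read the 4 coordinates of anchor 2 for batch 1, class 0.
num_batch, num_class, num_anchors = 2, 3, 4
shared = np.arange(num_batch * num_anchors * 4).reshape(num_batch, num_anchors, 4)
row = 1 * num_class + 0
start = box_base_offset(row, num_class, num_anchors, per_class_boxes=False) + 2 * 4
coords = shared.ravel()[start : start + 4]  # same values as shared[1, 2]
```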

Yes, the new NMS will return indices, and it would be the job of the frontend importer to do the gather. For TF, in particular, the need to return per batch output would be a bit annoying. We can discuss what output format would be most convenient for TF, and add a ret_type attribute to specify whether we return indices in the ONNX or the TF "mode".

masahi · 2021-04-09T23:16:52Z

@mbrookhart @jwfromm @trevor-m @Laurawly ready for review

mbrookhart

This looks good to me. I have two minor nitpicks:

The ONNX op has a center_point_box attribute to switch between TF-style boxes and Pytorch-style boxes. I never implemented the pytorch style, but I don't think we need to change this kernel, we should be able to transform them in the future.
It seems like we are now using exactly the same NMS tests for:
The ONNX Node tests I'm automatically importing
The manual ONNX importer tests
The relay tests
The topi tests

This feels...redundant. Do we need the same test at every level?
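
On the center_point_box point above: per the ONNX operator docs, center_point_box=0 means boxes are [y1, x1, y2, x2] and center_point_box=1 means [x_center, y_center, width, height]. A conversion ahead of the kernel could look roughly like this NumPy sketch; the helper name center_to_corner is hypothetical, and a real frontend would express the same arithmetic in Relay ops.

```python
import numpy as np


def center_to_corner(boxes):
    """Convert (..., 4) boxes from [x_center, y_center, width, height]
    to [y1, x1, y2, x2], the corner layout the kernel is assumed to expect."""
    xc, yc, w, h = np.moveaxis(boxes, -1, 0)
    return np.stack([yc - h / 2, xc - w / 2, yc + h / 2, xc + w / 2], axis=-1)
```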

masahi · 2021-04-13T19:07:21Z

Yes it is redundant, but it's simply too tedious to come up with good test cases for this op. Do you have any suggestion on how to go about this? Ideally I want to test on real workload, a small artificial test case like that doesn't really exercise all the tricky aspects in our NMS kernel.

mbrookhart · 2021-04-13T19:15:48Z

Honestly, I'd be okay with just dropping the topi tests, they seem especially redundant after relay. Perhaps there's a TF test or two we could include with the TF importer? I'm planning on dropping the onnx tests once I get the imported node tests working on more backends.

Again, this is a nitpick, I don't think it needs to be in this PR, I just think we generally have too much redundancy in testing and not enough coverage.

masahi · 2021-04-13T19:29:56Z

Yes, the way it always works is we first implement the topi op with tests, and then the relay op. So topi tests always come first. The two tests are indeed redundant and ideally we should have different test cases. But I'd say it is also weird to delete topi tests on purpose after we have Relay tests. It doesn't hurt to leave the topi tests anyway.

trevor-m · 2021-04-14T19:05:54Z

Thanks @masahi
Once this is merged I will open a PR for the TF frontend

masahi · 2021-04-14T19:19:16Z

Thanks @mbrookhart @trevor-m @tqchen @electriclilies

…ache#7796) * initial import * add c++ boilarplate * add python boilarpolate * update onnx frontend * fixing * update onnx frontend * fix shape * minor update * fix * fix shape func * fix for no box * more fix * made things 64 bit * int64 tweak * max_output_size doesn't need to be a callback * remove all_class_nms schedule * minor simplify * remove expand_dim * refactoring * simplify nms loop * cpu all_class_nms stub * updating ir for cpu * working with cpu * update cpu strategy, relay op also working * fix cpplint * fixing pylint * enable gpu test for onnx nms * tweak parallel * pyformat and lint * fix relay nms test * doc update for cpp relay * updating tests * updated tests * fix converting score_threshold to Expr * update doc * doc fix Co-authored-by: Masahiro Masuda <masahi@129@gmail.com>

masahi force-pushed the all-class-nms-final branch from da3eaf9 to b16f457 Compare April 5, 2021 06:25

masahi force-pushed the all-class-nms-final branch from b04d4dc to af5f821 Compare April 6, 2021 00:54

masahi changed the title ~~[TOPI, Relay] A new NMS op variant for ONNX / TF Combined NMS~~ [TOPI, Relay] A new NMS op variant for ONNX NMS / TF Combined NMS Apr 7, 2021

masahi and others added 18 commits April 8, 2021 17:30

initial import

64f5d50

add c++ boilarplate

d980761

add python boilarpolate

a1c3bf6

update onnx frontend

8af8079

fixing

d26d5b9

update onnx frontend

0c71339

fix shape

a40337a

minor update

71370cf

fix

15d3bd0

fix shape func

837ce76

fix for no box

e26bb4d

more fix

65b5bba

made things 64 bit

253629a

int64 tweak

9cb2505

max_output_size doesn't need to be a callback

ac5d79b

remove all_class_nms schedule

fd868a1

minor simplify

adaaf50

remove expand_dim

83aa4c2

fix relay nms test

2361321

masahi force-pushed the all-class-nms-final branch from af5f821 to 2361321 Compare April 8, 2021 08:31

masahi added 4 commits April 8, 2021 18:31

doc update for cpp relay

004145a

updating tests

d207c4d

updated tests

05fa415

fix converting score_threshold to Expr

6d314de

masahi force-pushed the all-class-nms-final branch from f71e619 to 6d314de Compare April 8, 2021 10:48

tqchen assigned mbrookhart Apr 9, 2021

tqchen added the status: need review label Apr 9, 2021

update doc

56531f7

masahi marked this pull request as ready for review April 9, 2021 23:16

doc fix

b174927

mbrookhart approved these changes Apr 13, 2021

trevor-m approved these changes Apr 14, 2021

masahi merged commit 390b4d1 into apache:main Apr 14, 2021

masahi mentioned this pull request Jun 2, 2021

[Relay, TF] Support converting TF combined_nms using Relay all_class_nms #8174

Merged

junrushao mentioned this pull request Nov 1, 2021

Apache TVM v0.8 Release Note Candidate #9416

Closed
