Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[TOPI, Relay] A new NMS op variant for ONNX NMS / TF Combined NMS #7796

Merged
merged 36 commits into from
Apr 14, 2021

Conversation

masahi
Copy link
Member

@masahi masahi commented Apr 5, 2021

This PR adds a new variant of NMS that better supports NMS in ONNX and combined NMS in TF than the existing NMS operator in TOPI/Relay. For now, I'm calling this variant of NMS "All class NMS".

https://github.com/onnx/onnx/blob/master/docs/Operators.md#NonMaxSuppression
https://www.tensorflow.org/api_docs/python/tf/image/combined_non_max_suppression

The biggest difference between our NMS and "All class NMS" is that in our NMS, a single box is associated with a single class, while in the latter case a single box has scores for all classes, and NMS is performed for each class separately. max_out_size parameter is also applied per class, rather than to all boxes as in our implementation.

Until now, we've been supporting this variant of NMS via complicated encodings using our implementation of NMS. It kind of "works" in practice, but there are many problems to it:

For these reasons, I decided it is better to introduce a new variant of NMS to overcome these pains. This breaks our general "one operator to support all frameworks" philosophy, but the two variants of NMS are so different it doesn't make sense to call them the same op.

The result is significant: using the new NMS, I got speedup of 1 second on mlperf SSD resnet34, running on vk + amd. This is an extreme case in that the existing approach that calls get_valid_counts and non_maximum_surpression 80 times in a while loop is extremely slow on vk + amd for some reason, taking literally 1 second. Now it is only 5.7 milli second.

Implementation details

The new NMS implementation consists of the following steps:

  • Sort scores and return both sorted scores and indices. The existing NMS only uses sorted indices, while I also used sorted scores to do binary search, next.
  • Do binary search on sorted scores to find the index of the box whose score is just below score_threshold. This gives what we call valid_count[i] in the existing NMS, computed by get_valid_counts.
  • Do NMS, parallelized over batch * class. The inner loop uses the same NMS IR as the existing one.
  • After the previous step, we end up with indices of size (batch * num_class, num_boxes) and a tensor num_detections of size (batch * num_class,) holding the number of survived boxes per row. We need to copy num_detections[i] indices from each row into a one linear output. This is a perfect application of exclusive scan: Doing ex scan on num_detections gives row offsets to write into for each row.

The efficiency of the new implementation comes from:

  • get_valid_count call is replaced with binary search
  • Per class independent NMS are done in parallel across different blocks on GPU. This alone gives num_classx speedup over the existing encoding in ONNX / TF frontend.
  • The ONNX NMS frontend is now trivial and there is no triple nested loop etc. It seems overhead on the host side is very large on vk + amd.

@mbrookhart
Copy link
Contributor

Exciting! I won't review too much because it's still in draft, but the overall structure looks good to me. I agree that that relay-level loop was a problem since we couldn't parallelize it.

I'm seeing some removed attributes from the normal NMS op, I didn't add those as part of the ONNX NMS PR, I imagine they're used by other frontends?

@tqchen
Copy link
Member

tqchen commented Apr 5, 2021

awesome. it would be great to also discuss the naming of the API(by checking with the naming in the existing libraries e.g. TF, Pytorch). e.g. shall we call it combined_non_max_suppression as per TF's naming convention?

@masahi
Copy link
Member Author

masahi commented Apr 5, 2021

I'm seeing some removed attributes from the normal NMS op, I didn't add those as part of the ONNX NMS PR, I imagine they're used by other frontends?

@mbrookhart These attributes are now runtime arguments and they are not used anymore, see

auto attrs = make_object<NonMaximumSuppressionAttrs>();
attrs->force_suppress = force_suppress;
attrs->top_k = top_k;
attrs->coord_start = coord_start;
attrs->score_index = score_index;
attrs->id_index = id_index;
attrs->return_indices = return_indices;
attrs->invalid_to_bottom = invalid_to_bottom;
To avoid confusion I removed them.

@masahi
Copy link
Member Author

masahi commented Apr 5, 2021

awesome. it would be great to also discuss the naming of the API(by checking with the naming in the existing libraries e.g. TF, Pytorch). e.g. shall we call it combined_non_max_suppression as per TF's naming convention?

combined_non_max_suppression was also the name I started with, but I didn't particularly like it because it doesn't tell what exactly is "combined".

Moreover, in TF the other NMS op is the standard, single class NMS, so for them combined or not is really single vs multiclass distinction. In contrast, our existing NMS also does multiclass NMS where a single box is associated with a single class. This variant of multiclass NMS has a different notion of "multiclass" and can support PyTorch and MXNet. So I don't think combined_nms is a good choice for us.

all_class_non_maximum_surpression is also what TensorRT calls it https://github.com/NVIDIA/TensorRT/blob/master/plugin/common/kernels/allClassNMS.cu I don't have a better alternative than this.

@tqchen
Copy link
Member

tqchen commented Apr 6, 2021

Thanks @masahi , can you do a quick round of search of pytorch, onnx, tensorrt and others and summarize the naming rationale? The main thing is to write down why we pick the name, I am less attached to a particular name.

@masahi
Copy link
Member Author

masahi commented Apr 7, 2021

ok I see roughly three categories of NMS used in frameworks:

  1. MXNet and TVM https://mxnet.apache.org/versions/1.7.0/api/python/docs/api/ndarray/contrib/index.html#mxnet.ndarray.contrib.box_nms This is what MXNet calls box_nms, this API is highly non-standard but unfortunately this is what we inherited from for topi/relay. It supports multiclass NMS but a single box is associated with only one class, unlike the third category below.

  2. PT torchvision.ops.nms and TF non_max_suppression, this is the standard, single class NMS.
    https://pytorch.org/vision/stable/ops.html#torchvision.ops.nms
    https://www.tensorflow.org/api_docs/python/tf/image/non_max_suppression

  3. ONNX NonMaxSuppression, TF combined_non_max_suppression, and TensorRT batchedNMSPlugin (the implementation calls it allClassNMS) This is the variant of multi class NMS where a single box can be selected multiple times per different classes.
    https://github.com/onnx/onnx/blob/master/docs/Operators.md#NonMaxSuppression
    https://www.tensorflow.org/api_docs/python/tf/image/combined_non_max_suppression
    https://github.com/NVIDIA/TensorRT/tree/master/plugin/batchedNMSPlugin, https://github.com/NVIDIA/TensorRT/blob/master/plugin/common/kernels/allClassNMS.cu

So the bottom line is, I think it is reasonable to say that NMS, without adjectives, should refer to the single class variant (category 2 above), but there isn't a consensus on what to call the third category one which this PR is about.

I think all_class_non_maximum_surpression or per_class_non_maximum_surpression are the most descriptive of what it does. Either one is fine for me and I'm open to other suggestions @mbrookhart @jwfromm

@masahi masahi changed the title [TOPI, Relay] A new NMS op variant for ONNX / TF Combined NMS [TOPI, Relay] A new NMS op variant for ONNX NMS / TF Combined NMS Apr 7, 2021
@electriclilies
Copy link
Contributor

I think that the name per_class_non_maximum_supression implies that there is one class label per bounding box, but the ONNX version allows a box to be selected by multiple classes, so I prefer all_class_non_maximum_supression over per_class_non_maximum_supression

@masahi
Copy link
Member Author

masahi commented Apr 7, 2021

By per_class I meant we do NMS for each class separately (unlike our existing NMS), but if this is already confusing then yes, all_class_non_maximum_supression might be better.

@mbrookhart
Copy link
Contributor

I think that sounds good.

@trevor-m
Copy link
Contributor

trevor-m commented Apr 8, 2021

Really nice! Just did first pass through the code and overall it looks good.

A couple of thoughts/questions:

  1. TF Combined NMS allows for both a) using same boxes for all classes and b) using different boxes for each class.
    For a) the shape of boxes input is (batch_size, 1, num_boxes, 4).
    For b) the shape of boxes input is (batch_size, num_classes, num_anchors, 4)
.
    Currently, your implementation only supports a) and has a box input shape of (batch_size, num_anchors, 4).
    I think all of the uses of combinedNMS I have seen in TF have only used a), but perhaps in the future we may want to also add support for b).
  2. This is regarding the output format. One common thing after getting the selected indices is using those to gather the coordinates and scores of the selected boxes.
    Since the selected indices are in the format [batch_id, class_id, index] we can actually use them directly to index into the scores tensors which has shape (batch, num_classes, num_boxes) to get the selected scores.
    But for the selected boxes, we need to slice out the batch_id and index and concat to get them into the format [batch_id, index] before we can index into boxes tensor to get selected boxes.
    Is my understanding here correct? I'm wondering if we can improve this somehow.

@masahi
Copy link
Member Author

masahi commented Apr 9, 2021

Thanks @trevor-m

  1. Yes, we should be able to support such variant easily. The only place we touch box coords is here (i is the batch or row index, j and k are the index of sorted scores)

    def calc_overlap(i, j, k):
    offset_j = sorted_indices[i, j] * 4
    offset_k = sorted_indices[i, k] * 4
    batch_id = i // num_class
    base_bbox_idx = batch_id * num_anchors * 4
    return calculate_overlap(
    boxes,
    base_bbox_idx + offset_j,
    base_bbox_idx + offset_k,
    )
    we can do arbitrary indirect indexing into boxes. We might need to update calculate_overlap function but that would be easy as well.

  2. Yes, the new NMS will return indices, and it would be the job of frontend importer to do the gather. For TF, in particular, the need to return per batch output would be a bit annoying. We can discuss what format of output would be most convenient for TF, and add ret_type attribute to specify which of ONNX or TF "mode" we return indices in.

@masahi masahi marked this pull request as ready for review April 9, 2021 23:16
@masahi
Copy link
Member Author

masahi commented Apr 9, 2021

@mbrookhart @jwfromm @trevor-m @Laurawly ready for review

Copy link
Contributor

@mbrookhart mbrookhart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. I have two minor nitpicks:

  1. The ONNX op has a center_point_box attribute to switch between TF-style boxes and Pytorch-style boxes. I never implemented the pytorch style, but I don't think we need to change this kernel, we should be able to transform them in the future.
  2. It seems like we are now using exactly the same NMS tests for:
    The ONNX Node tests I'm automatically importing
    The manual ONNX importer tests
    The relay tests
    The topi tests

This feels...redundant. Do we need the same test at every level?

@masahi
Copy link
Member Author

masahi commented Apr 13, 2021

Yes it is redundant, but it's simply too tedious to come up with good test cases for this op. Do you have any suggestion on how to go about this? Ideally I want to test on real workload, a small artificial test case like that doesn't really exercise all the tricky aspects in our NMS kernel.

@mbrookhart
Copy link
Contributor

Honestly, I'd be okay with just dropping the topi tests, they seem especially redundant after relay. Perhaps there's a TF test or two we could include with the TF importer? I'm planning on dropping the onnx tests once I get the imported node tests working on more backends.

Again, this a nitpick, I don't think it needs to be in this PR, I just think we generally have too much redundancy in testing and not enough coverage

@masahi
Copy link
Member Author

masahi commented Apr 13, 2021

Yes, the way it always works is we first implement topi op with tests, and then relay op. So topi tests always come first. Two tests are indeed redundant and ideally we should have difference test cases. But I'd say it is also weird to delete topi tests on purpose after we have Relay tests. It doesn't hurt to leave topi tests anyway.

@trevor-m
Copy link
Contributor

Thanks @masahi
Once this is merged I will open a PR for the TF frontend

@masahi masahi merged commit 390b4d1 into apache:main Apr 14, 2021
@masahi
Copy link
Member Author

masahi commented Apr 14, 2021

mehrdadh pushed a commit to mehrdadh/tvm that referenced this pull request Apr 22, 2021
…ache#7796)

* initial import

* add c++ boilarplate

* add python boilarpolate

* update onnx frontend

* fixing

* update onnx frontend

* fix shape

* minor update

* fix

* fix shape func

* fix for no box

* more fix

* made things 64 bit

* int64 tweak

* max_output_size doesn't need to be a callback

* remove all_class_nms schedule

* minor simplify

* remove expand_dim

* refactoring

* simplify nms loop

* cpu all_class_nms stub

* updating ir for cpu

* working with cpu

* update cpu strategy, relay op also working

* fix cpplint

* fixing pylint

* enable gpu test for onnx nms

* tweak parallel

* pyformat and lint

* fix relay nms test

* doc update for cpp relay

* updating tests

* updated tests

* fix converting score_threshold to Expr

* update doc

* doc fix

Co-authored-by: Masahiro Masuda <masahi@129@gmail.com>
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request May 6, 2021
…ache#7796)

* initial import

* add c++ boilarplate

* add python boilarpolate

* update onnx frontend

* fixing

* update onnx frontend

* fix shape

* minor update

* fix

* fix shape func

* fix for no box

* more fix

* made things 64 bit

* int64 tweak

* max_output_size doesn't need to be a callback

* remove all_class_nms schedule

* minor simplify

* remove expand_dim

* refactoring

* simplify nms loop

* cpu all_class_nms stub

* updating ir for cpu

* working with cpu

* update cpu strategy, relay op also working

* fix cpplint

* fixing pylint

* enable gpu test for onnx nms

* tweak parallel

* pyformat and lint

* fix relay nms test

* doc update for cpp relay

* updating tests

* updated tests

* fix converting score_threshold to Expr

* update doc

* doc fix

Co-authored-by: Masahiro Masuda <masahi@129@gmail.com>
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request May 6, 2021
…ache#7796)

* initial import

* add c++ boilarplate

* add python boilarpolate

* update onnx frontend

* fixing

* update onnx frontend

* fix shape

* minor update

* fix

* fix shape func

* fix for no box

* more fix

* made things 64 bit

* int64 tweak

* max_output_size doesn't need to be a callback

* remove all_class_nms schedule

* minor simplify

* remove expand_dim

* refactoring

* simplify nms loop

* cpu all_class_nms stub

* updating ir for cpu

* working with cpu

* update cpu strategy, relay op also working

* fix cpplint

* fixing pylint

* enable gpu test for onnx nms

* tweak parallel

* pyformat and lint

* fix relay nms test

* doc update for cpp relay

* updating tests

* updated tests

* fix converting score_threshold to Expr

* update doc

* doc fix

Co-authored-by: Masahiro Masuda <masahi@129@gmail.com>
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request May 6, 2021
…ache#7796)

* initial import

* add c++ boilarplate

* add python boilarpolate

* update onnx frontend

* fixing

* update onnx frontend

* fix shape

* minor update

* fix

* fix shape func

* fix for no box

* more fix

* made things 64 bit

* int64 tweak

* max_output_size doesn't need to be a callback

* remove all_class_nms schedule

* minor simplify

* remove expand_dim

* refactoring

* simplify nms loop

* cpu all_class_nms stub

* updating ir for cpu

* working with cpu

* update cpu strategy, relay op also working

* fix cpplint

* fixing pylint

* enable gpu test for onnx nms

* tweak parallel

* pyformat and lint

* fix relay nms test

* doc update for cpp relay

* updating tests

* updated tests

* fix converting score_threshold to Expr

* update doc

* doc fix

Co-authored-by: Masahiro Masuda <masahi@129@gmail.com>
trevor-m pushed a commit to neo-ai/tvm that referenced this pull request May 11, 2021
…ache#7796)

* initial import

* add c++ boilarplate

* add python boilarpolate

* update onnx frontend

* fixing

* update onnx frontend

* fix shape

* minor update

* fix

* fix shape func

* fix for no box

* more fix

* made things 64 bit

* int64 tweak

* max_output_size doesn't need to be a callback

* remove all_class_nms schedule

* minor simplify

* remove expand_dim

* refactoring

* simplify nms loop

* cpu all_class_nms stub

* updating ir for cpu

* working with cpu

* update cpu strategy, relay op also working

* fix cpplint

* fixing pylint

* enable gpu test for onnx nms

* tweak parallel

* pyformat and lint

* fix relay nms test

* doc update for cpp relay

* updating tests

* updated tests

* fix converting score_threshold to Expr

* update doc

* doc fix

Co-authored-by: Masahiro Masuda <masahi@129@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants