[Torch] More graph rewrites for Faster RCNN / MaskRCNN #7346
Conversation
Nice, it looks good to me.
I have one question regarding `topk_after_batch_nms_pattern`. It seems the topk slice will no longer get applied to the true branch. Before the rewrite, it was applied to the result of the if statement, i.e. both branches. After the rewrite, it is folded into NMS via `max_output_size`, but that is only in the false branch. Would that cause problems?
No, the true branch is for the case where there are zero boxes, so applying topk to an empty tensor is a nop anyway.
Got it, thanks! I guess the pattern does not guarantee that the true branch is for that 0-box case, but since this rewrite is only meant to be used for this particular model, it is fine.
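As an aside, a minimal sketch of why that claim holds, assuming the post-NMS topk is a plain slice as in torchvision's RPN (the shapes here are illustrative, not from the PR):

```python
import torch

# Zero boxes, as in the true branch after NMS filters everything out.
boxes = torch.empty(0, 4)

# The post-NMS topk is a slice; on an empty tensor it is a no-op.
kept = boxes[:1000]
assert kept.shape == (0, 4)  # still empty, nothing changed
```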
thanks for the effort. lgtm. just a nitpick, feel free to ignore.
* add post nms topk to max_out_size rewrite
* add argsort conversion
* scatter pattern first cut
* matching seems to working
* dup matching fixed
* add converter
* conversion seems working
* add reshape, use take
* remove pytorch argsort converter
* update test
* add doc
This PR adds two new graph rewrites to optimize Faster RCNN / MaskRCNN. Happy to split them into two PRs if preferred.
The first one exploits the fact that in PyTorch detection models, NMS is always followed by a post-NMS topk, as shown below.
https://github.com/pytorch/vision/blob/8ebfd2f5d5f1792ce2cf5a2329320f604530a68e/torchvision/models/detection/rpn.py#L272-L275
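For reference, the pattern at the linked lines looks roughly like this (paraphrased, not verbatim; see the link above for the exact code):

```python
# NMS is immediately followed by a topk, which is a plain slice on the
# kept indices.
keep = box_ops.batched_nms(boxes, scores, lvl, self.nms_thresh)
keep = keep[: self.post_nms_top_n()]  # post-NMS topk
```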
We can extract that topk parameter and use it as the `max_output_size` parameter in our NMS. This brings a good speedup, 4.51 ms -> 4.11 ms, and further speedup can easily be expected if we had a TIR while loop (cc @tqchen).

The second rewrite replaces the repeated scatter loop in
https://github.com/pytorch/vision/blob/6315358dd06e3a2bcbe9c1e8cdaa10898ac2b308/torchvision/ops/poolers.py#L20-L29
with something like this:
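A sketch of the idea (the function and variable names here are illustrative, not the PR's actual rewrite code): concatenate the per-level results together with their destination indices, argsort the indices, and gather the rows back into order.

```python
import torch

def merge_levels_scatter(levels, per_level_results, num_rois):
    # Original formulation: allocate zeros, then scatter each level's
    # rows into place.
    out = torch.zeros(num_rois, *per_level_results[0].shape[1:])
    for lvl, rows in enumerate(per_level_results):
        idx = torch.where(levels == lvl)[0]
        out[idx] = rows
    return out

def merge_levels_gather(levels, per_level_results):
    # Equivalent formulation: concat all rows and their destination
    # indices, argsort the indices, then gather. No zeros allocation,
    # no scatter.
    all_rows = torch.cat(per_level_results, dim=0)
    all_idx = torch.cat([torch.where(levels == lvl)[0]
                         for lvl in range(len(per_level_results))])
    order = torch.argsort(all_idx)
    return all_rows[order]
```

Both produce the same tensor whenever every destination index is written exactly once, which holds here since each RoI is assigned to exactly one level.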
i.e., we are able to remove `torch.zeros` (which turns out to be very expensive, due to too many `any_dim` generated by Relay) and the repeated 4D scatters (which are slow because scatters cannot be parallelized well). Instead, we can do a concat, an argsort, and a batched gather to get an equivalent result, which is much more efficient. This transformation is not at all obvious; I think it is a great example of the power of graph rewrite. It cuts more than 10 ms from MaskRCNN / FasterRCNN.

Unfortunately, I expect this PR to be hard to review; let me know if you have any questions. I tried to give detailed comments to aid understanding.
This concludes the series of PRs I did to optimize MaskRCNN on GPU + VM; here are the current numbers. Surprisingly, NVPTX generates much better code for the dynamic injective ops, which are one of the bottlenecks in MaskRCNN due to a certain limitation in Relay + TE (too many unnecessary `any_dim` generated). I hope we can discuss this performance result further in the forum.

please review @zhiics @kevinthesun @mbrookhart @jwfromm @anijain2305 @trevor-m