Non-Maximum Suppression on the GPU #392
That's definitely interesting. We should add it into PyTorch until we figure out ATen extensions.
Great, I'll read over that PR for starters.
This is all ready now. Where in the Python API would you like it to go? I actually have two implementations ready; one is quicker in principle, but it requires allocating global memory to store intermediate results, and freeing that memory at the end of each call completely kills performance. Is there anything one can do about this? If not, the other implementation (which doesn't allocate any intermediate global memory) is still fine and gives the kind of speedup I mentioned anyway, so it's no big deal.
Hi, is this going to be added soon? (I'm asking since pytorch/pytorch#5404 is closed now)
Yes, I think we should add efficient NMS in here, following the C++ extensions that were recently added.
@fmassa That's fine anyway.
Yes, once pytorch/pytorch#5673 is addressed.
Thanks!
What's the current status of this issue?
@ruotianluo I'm drafting the layers part of torchvision in https://github.com/pytorch/vision/tree/layers
When I built my faster-rcnn project, I ported the NMS from py-faster-rcnn; the main reason is that I'm not familiar with CUDA code and it was simpler to just copy.
Thanks, I'll take that into account when porting NMS to the GPU.
On the other hand, the py-faster-rcnn version throws data back and forth between GPU and CPU, which may not be ideal. It also requires a workspace of size O(n_boxes^2), and dynamically allocating/freeing this hurts performance.
We can use the PyTorch caching allocator to handle memory allocation (which will save us from the sync points caused by freeing the memory).
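For illustration, a minimal sketch of what routing the workspace through the caching allocator looks like (the helper name is hypothetical): allocate the py-faster-rcnn bitmask as an ordinary tensor, so that repeated calls reuse cached blocks instead of paying cudaMalloc/cudaFree each time.

```python
import torch

def nms_mask_workspace(n_boxes, device="cuda"):
    # py-faster-rcnn's kernel packs one bit per box pair into 64-bit
    # blocks, so the O(n_boxes^2) mask is n_boxes x ceil(n_boxes / 64)
    # words; allocating it with torch.empty routes it through the
    # caching allocator, so repeated calls with similar sizes reuse
    # the same block instead of hitting cudaMalloc/cudaFree sync points
    col_blocks = (n_boxes + 63) // 64
    return torch.empty(n_boxes, col_blocks, dtype=torch.int64, device=device)
```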
Cool, in that case you might also be interested in the implementation at https://github.com/dssa56/projects/tree/master/nms, under cuda_nms_workspace.cu. That uses the same technique as py-faster-rcnn, but keeps all data on the GPU. I didn't use it in the PR because of the workspace, but if that's a non-issue then it's potentially worth a look.
Does the current version of torchvision have the GPU-based NMS?
No, it doesn't.
@fmassa,
@fmassa Can I help you with implementing the GPU-based NMS? I'm also working on it, and I'm also using the GPU-based RoI Align and RoI Pool from your torchvision branch.
I have a version of it that I adapted from https://github.com/rbgirshick/py-faster-rcnn/blob/master/lib/nms/nms_kernel.cu
@fmassa Thank you. Actually, I'd prefer to try it out first, then consider how to fix the breakage in my production env. Looking forward to the gist version.
@Sucran here is the implementation: https://gist.github.com/fmassa/cf9ab87e4bd71e849655d5abb8cfd6c6
FYI, we have released our implementation of {Faster, Mask} R-CNN at https://github.com/facebookresearch/maskrcnn-benchmark, which contains CPU and CUDA implementations of Non-Maximum Suppression. I suggest we move this discussion there for now.
@fmassa, why close this issue though?
I'm curious: in which cases do people use NMS outside of object detection? But I'm reopening it anyway as a reminder.
#826 has been merged and adds support for NMS on both CPU and GPU.
FYI, this is located at `torchvision.ops.nms`.
It would be nice to mention the format for the boxes in the docstring, though. Is it `(x1, y1, x2, y2)` or `(x, y, w, h)`?
It should be `(x1, y1, x2, y2)`.
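For reference, a minimal usage sketch of the merged operator, assuming a torchvision build that includes #826:

```python
import torch
import torchvision

# boxes in (x1, y1, x2, y2) format, one score per box
boxes = torch.tensor([[0.0, 0.0, 10.0, 10.0],
                      [1.0, 1.0, 11.0, 11.0],
                      [50.0, 50.0, 60.0, 60.0]])
scores = torch.tensor([0.9, 0.8, 0.7])

# keep holds the indices of the retained boxes, in decreasing score order
keep = torchvision.ops.nms(boxes, scores, iou_threshold=0.5)
print(keep)  # tensor([0, 2]): box 1 overlaps box 0 with IoU ~0.68, so it is suppressed
```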
@fmassa @neighthan I've made the changes in #1110. It should be quick to review since it is mostly doc changes.
@neighthan
@varunagrawal @fmassa Can anyone tell me what should be used now to run NMS (CUDA) in batch mode while training? The NMS implementation provided by `torchvision.ops` is not batched and will be slow to use at training time.
@dishank-b what do you mean by batch mode? Also note that we have a `batched_nms` in `torchvision.ops.boxes`.
@fmassa By batch mode, I mean taking input in the form of `[batch_size, N, 4]` tensors. Also, the …
The ONNX implementation of NMS is generic enough; I wish torchvision had something similar.
@dishank-b @dashesy The limitation with having NMS take a `[batch_size, N, 4]` input is that each image may keep a different number of boxes. But you can emulate a batched NMS with `torchvision.ops.boxes.batched_nms`:

```python
import torch
import torchvision

def batched_nms(boxes, scores, iou_threshold):
    # boxes is a [batch_size, N, 4] tensor, and scores a
    # [batch_size, N] tensor.
    batch_size, N, _ = boxes.shape
    # give each image its own id so that boxes from different
    # images never suppress one another
    indices = torch.arange(batch_size, device=boxes.device)
    indices = indices[:, None].expand(batch_size, N).flatten()
    boxes_flat = boxes.flatten(0, 1)
    scores_flat = scores.flatten()
    indices_flat = torchvision.ops.boxes.batched_nms(
        boxes_flat, scores_flat, indices, iou_threshold)
    # now reshape the indices as you want, maybe
    # projecting back to the [batch_size, N] space
    # I'm omitting this here
    indices = indices_flat
    return indices
```

Let me know if you have questions.
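A hypothetical usage example of the sketch above (random boxes, just to show the shapes):

```python
boxes = torch.rand(2, 100, 4)                     # [batch_size, N, 4]
boxes[..., 2:] = boxes[..., :2] + boxes[..., 2:]  # ensure x2 > x1 and y2 > y1
scores = torch.rand(2, 100)                       # [batch_size, N]
keep = batched_nms(boxes, scores, iou_threshold=0.5)
# keep indexes into the flattened [batch_size * N] list of boxes
```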
@fmassa Thanks, this seems to work for batching over either images or classes. However, what if we have both, i.e. batching over images as well as classes?
@dishank-b in that case you can squash the two dimensions together with view: `boxes = boxes.view(-1, N, 4)`. Since they are independent, it would not matter whether they come from a different batch element or a different class. Still, it would be nice if nms had a …
@dashesy I am not sure that would work: let's say there are two boxes in two different images but of the same category; if they overlap, they would be merged using the above method, which should not happen since they are from different images. Let me know if I am wrong somewhere.
I think they will not be merged.
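For context, a sketch of the coordinate-offset trick that makes this work: each box is shifted by an amount proportional to its id, so boxes with different ids land in disjoint regions and plain nms can never suppress across them (a sketch of the idea, not necessarily torchvision's exact code):

```python
import torch
import torchvision

def batched_nms_via_offsets(boxes, scores, idxs, iou_threshold):
    # shift every box by an id-dependent offset larger than any
    # coordinate; boxes with different ids (image, class, or both)
    # end up in disjoint regions, so plain nms never crosses ids
    max_coordinate = boxes.max()
    offsets = idxs.to(boxes) * (max_coordinate + 1)
    shifted = boxes + offsets[:, None]
    return torchvision.ops.nms(shifted, scores, iou_threshold)
```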
@dashesy Yeah, you are right, I get it. I was doing much the same thing, but using the indices. Thanks, your way is easier.
This makes a few assumptions though, which might not always be desired. We need to either remove some boxes via a criterion (low scores? small boxes? something else?) or pad with zeros to the maximum size, which is not very convenient. I considered those cases when implementing …
@fmassa Is there, or will there be, an implementation of Soft-NMS in PyTorch too?
@fortunex3000 it's not in the plans for torchvision in the near future.
Soft-NMS is pretty straightforward. I might add it once academic winter break starts.
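For illustration, a minimal sketch of the linear variant from the Soft-NMS paper (`soft_nms` here is a hypothetical helper, not a torchvision API):

```python
import torch
from torchvision.ops import box_iou

def soft_nms(boxes, scores, iou_threshold=0.5, score_threshold=0.001):
    # boxes: [N, 4] in (x1, y1, x2, y2); scores: [N]
    scores = scores.clone()
    keep = []
    idxs = torch.arange(boxes.size(0))
    while idxs.numel() > 0:
        # take the highest-scoring box still in play
        top = scores[idxs].argmax()
        best = idxs[top]
        keep.append(best.item())
        idxs = torch.cat([idxs[:top], idxs[top + 1:]])
        if idxs.numel() == 0:
            break
        ious = box_iou(boxes[best][None], boxes[idxs])[0]
        # linearly decay the scores of overlapping boxes instead of
        # discarding them outright (the "linear" variant of the paper)
        decay = torch.where(ious > iou_threshold, 1.0 - ious,
                            torch.ones_like(ious))
        scores[idxs] = scores[idxs] * decay
        # drop boxes whose score has decayed below the threshold
        idxs = idxs[scores[idxs] > score_threshold]
    return torch.tensor(keep, dtype=torch.long)
```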
Is the documentation/code for torchvision.ops.boxes.batched_nms correct? The documentation suggests that the NMS IoU threshold discards boxes with IoU < iou_threshold (i.e. it only discards boxes that don't overlap, which would normally be considered unique). My own testing showed that the function gives fewer boxes with a smaller threshold (0.0 gave 1 box per image) and gives all boxes back with a threshold of 1.0. Is there something I am not understanding about the implementation or the documentation? This seems to be a non-standard choice, since you should want to keep boxes with a small IoU (i.e. less overlap, so more likely to be unique detections) rather than discard them. See Figure 2 of the Soft-NMS paper, comparing NMS and Soft-NMS, for how it seems to me it should behave: http://www.cs.umd.edu/~bharat/snms.pdf
@rmcavoy there is an error in the documentation at vision/torchvision/ops/boxes.py, line 27 in c05da2a; it should read …
I've sent a PR fixing the documentation in #1614, thanks!
@fmassa how can I map `indices_flat` back to `[batch_size, N]` if each image has a different number of valid boxes after NMS?
@Edwardmark can you open a new issue describing what you are trying to do?
@fmassa, I think @Edwardmark wants to ask how to unflatten the `indices_flat` returned by your snippet above.
@fmassa @dishank-b Yes, and I figured it out. We can unflatten the indices based on the returned index ranges. For example, indices within range(0, N) belong to img1, and indices within range(N, 2N) belong to img2. Thank you all very much.
@Edwardmark How did you unflatten the indices efficiently, without looping over each returned index value and checking whether it is in range?
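For illustration, one vectorized way to do that unflattening (`split_by_image` is a hypothetical helper; it assumes `keep` holds flat indices into `[batch_size * N]`):

```python
import torch

def split_by_image(keep, batch_size, N):
    # flat_index // N is the image id, flat_index % N the
    # within-image box index; no loop over individual indices
    image_ids = keep // N
    local = keep % N
    return [local[image_ids == b] for b in range(batch_size)]
```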
Is there any interest in an NMS layer that runs on the GPU for torchvision? I have one implemented; it gives a 1-2 order of magnitude speedup over a naive version composed from PyTorch ops. I'd be happy to contribute it if anyone's interested.
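For context, a sketch of the kind of naive PyTorch-ops baseline such a speedup would be measured against (greedy NMS in pure tensor ops; an illustration, not the implementation from the issue):

```python
import torch

def naive_nms(boxes, scores, iou_threshold):
    # boxes: [N, 4] in (x1, y1, x2, y2); greedy NMS written purely
    # in tensor ops, launching many small kernels per kept box
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        i = order[0]
        keep.append(i.item())
        if order.numel() == 1:
            break
        rest = order[1:]
        # IoU of the current best box against all remaining boxes
        xx1 = torch.max(boxes[i, 0], boxes[rest, 0])
        yy1 = torch.max(boxes[i, 1], boxes[rest, 1])
        xx2 = torch.min(boxes[i, 2], boxes[rest, 2])
        yy2 = torch.min(boxes[i, 3], boxes[rest, 3])
        inter = (xx2 - xx1).clamp(min=0) * (yy2 - yy1).clamp(min=0)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_rest = ((boxes[rest, 2] - boxes[rest, 0]) *
                     (boxes[rest, 3] - boxes[rest, 1]))
        iou = inter / (area_i + area_rest - inter)
        # keep only boxes that do not overlap the current best too much
        order = rest[iou <= iou_threshold]
    return torch.tensor(keep, dtype=torch.long)
```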