
Non-Maximum Suppression on the GPU #392

Closed
ghost opened this issue Jan 17, 2018 · 61 comments
@ghost

ghost commented Jan 17, 2018

Is there any interest in an NMS layer that runs on the GPU for torchvision? I have one implemented; it gives a 1-2 order of magnitude speedup over a naive version composed from pytorch ops. Would be happy to contribute it if anyone's interested.
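For reference, a naive NMS composed from pytorch ops, of the kind being compared against here, might look like the following sketch (the poster's exact baseline isn't shown in the thread); the sequential loop over surviving boxes is what a dedicated CUDA kernel parallelizes:

import torch

def nms_naive(boxes, scores, iou_threshold):
    # boxes: [N, 4] in (x1, y1, x2, y2) format, scores: [N]
    keep = []
    order = scores.argsort(descending=True)
    while order.numel() > 0:
        i = order[0]
        keep.append(i.item())
        if order.numel() == 1:
            break
        top, rest = boxes[i], boxes[order[1:]]
        # intersection of the top-scoring box with every remaining box
        lt = torch.max(top[:2], rest[:, :2])
        rb = torch.min(top[2:], rest[:, 2:])
        wh = (rb - lt).clamp(min=0)
        inter = wh[:, 0] * wh[:, 1]
        area_top = (top[2] - top[0]) * (top[3] - top[1])
        area_rest = (rest[:, 2] - rest[:, 0]) * (rest[:, 3] - rest[:, 1])
        iou = inter / (area_top + area_rest - inter)
        # drop every box that overlaps the kept box too much
        order = order[1:][iou <= iou_threshold]
    return torch.tensor(keep, dtype=torch.long)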

@soumith
Member

soumith commented Jan 17, 2018

That's definitely interesting. We should add it to pytorch until we figure out ATen extensions.
For example, like how we added ROIPooling: pytorch/pytorch#3672

@ghost
Author

ghost commented Jan 17, 2018

Great, I'll read over that PR for starters.

@ghost
Author

ghost commented Feb 20, 2018

This is all ready now. Where in the Python API would you like it to go? I actually have two implementations ready: one is quicker in principle, but it requires allocating global memory to store intermediate results, and freeing that memory at the end of each call completely kills performance. Is there anything one can do about this? If not, the other implementation (which doesn't allocate any intermediate global memory) is still fine and gives the kind of speedup I mentioned, so it's no big deal.

@aosokin
Contributor

aosokin commented Mar 13, 2018

Hi, is this going to be added soon? (I'm asking since pytorch/pytorch#5404 is closed now)

@fmassa
Member

fmassa commented Mar 14, 2018

Yes, I think we should add efficient NMS in here, following the Cpp extensions that were recently added.
Because Cpp extensions are only present in master, I'd wait until 0.4.0 is released before merging any C++ code into torchvision.
How does that sound?

@aosokin
Contributor

aosokin commented Mar 14, 2018

@fmassa That's fine anyway.
Btw, is 0.4 coming soon? :-)

@fmassa
Member

fmassa commented Mar 14, 2018

Yes, once pytorch/pytorch#5673 is addressed

@aosokin
Contributor

aosokin commented Mar 14, 2018

Thanks!

@ruotianluo

What's the current status of this issue?

@fmassa
Member

fmassa commented Apr 26, 2018

@ruotianluo I'm drafting the layers part of torchvision in https://github.com/pytorch/vision/tree/layers
I've added a CPU version of NMS, and I'll look into porting one of the following versions: the one from py-faster-rcnn, the one from this PR, and a few others.
Have you tried any of those?

@ruotianluo

ruotianluo commented Apr 26, 2018

When I built my faster-rcnn project, I ported the nms from py-faster-rcnn; the main reason is that I'm not familiar with CUDA code and it was simpler to just copy.
With regard to https://github.com/pytorch/pytorch/pull/5404/files, it uses a full byte to store each boolean value (mask), while py-faster-rcnn uses a single bit. (I can't comment on anything else because, as I said, I'm not familiar with it.)
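For scale, a rough sketch of the memory difference between the two mask layouts, with a hypothetical box count (the actual kernels pack the flags on the GPU):

import torch

N = 6000  # hypothetical number of candidate boxes
# one full byte per overlap flag, as in the PR: N*N bytes
mask_bytes = torch.zeros(N, N, dtype=torch.uint8)
# one bit per flag packed 64 to a word, as in py-faster-rcnn: ~N*N/8 bytes
words_per_row = (N + 63) // 64
mask_bits = torch.zeros(N, words_per_row, dtype=torch.int64)
print(mask_bytes.numel(), mask_bits.numel() * 8)  # 36000000 vs 4512000 bytes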

@fmassa
Member

fmassa commented Apr 27, 2018

Thanks, I'll take that into account when porting NMS to GPU.

@ghost
Author

ghost commented Apr 27, 2018 via email

@fmassa
Member

fmassa commented Apr 27, 2018

We can use the PyTorch caching allocator to handle memory allocation (which will save us from the sync points caused by freeing the memory).
I'll have a closer look at both implementations this weekend, but thanks for the feedback!
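A minimal sketch of the caching-allocator point: tensors allocated through PyTorch return their memory to a pool when freed rather than triggering cudaFree, so a per-call workspace is cheap after the first iteration. This assumes a CUDA build; memory_reserved is the name of the pool statistic in recent PyTorch versions.

import torch

for _ in range(10):
    # served from the caching allocator after the first iteration
    workspace = torch.empty(4096, 4096, device="cuda")
    # ... intermediate NMS results would be written here ...
    del workspace  # block returns to the pool; no cudaFree, no sync
print(torch.cuda.memory_reserved())  # the allocator still holds the cached block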

@ghost
Author

ghost commented Apr 27, 2018

Cool, in that case you might also be interested in the implementation at https://github.com/dssa56/projects/tree/master/nms, under cuda_nms_workspace.cu. It uses the same technique as py-faster-rcnn but keeps all data on the GPU. I didn't use it in the PR because of the workspace, but if that's a non-issue then it's potentially worth a look.

@ahkarami

ahkarami commented Jul 1, 2018

Does the current version of torchvision have GPU-based NMS?

@fmassa
Member

fmassa commented Jul 3, 2018

No, it doesn't.
I have it implemented locally (based on the implementations that were shared previously), but I haven't pushed it yet.

@ahkarami

ahkarami commented Jul 3, 2018

@fmassa,
Ok. Thank you for your reply. I hope the PyTorch team pushes it as soon as possible.

@Sucran

Sucran commented Jul 27, 2018

@fmassa Can I help with implementing the GPU-based NMS if you need it? I'm also working on it, and I'm using the GPU-based RoI align and RoI pool from your torchvision branch.

@fmassa
Member

fmassa commented Jul 27, 2018

I have a version that I adapted from https://github.com/rbgirshick/py-faster-rcnn/blob/master/lib/nms/nms_kernel.cu
But because pytorch is evolving so rapidly, it has already been broken by pytorch refactorings at least 3 times.
I'll hold off on pushing it to torchvision because I expect those breakages to keep happening.
I can push the current version I have in a gist, though, if you want to try it out.

@Sucran

Sucran commented Jul 27, 2018

@fmassa Thank you. Actually, I'd prefer to try it out and then work out how to fix any breakage in my production env. Looking forward to the gist version.

@fmassa
Member

fmassa commented Jul 27, 2018

@Sucran here is the implementation https://gist.github.com/fmassa/cf9ab87e4bd71e849655d5abb8cfd6c6
It was working, but I'm not sure anymore.
Also, it required PyTorch compiled from source.

@fmassa
Member

fmassa commented Oct 25, 2018

FYI, we have released our implementation of {Faster, Mask} R-CNN in https://github.com/facebookresearch/maskrcnn-benchmark, which contains CPU and CUDA implementations of non-maximum suppression.

I suggest we move this discussion there for now.

@fmassa fmassa closed this as completed Oct 25, 2018
@0phoff

0phoff commented Oct 25, 2018

@fmassa, why close this issue though?
Many people use torchvision but do not care about or need the maskrcnn repository.
I still think a pure CUDA implementation of NMS in torch/torchvision would be highly appreciated by lots of people!

@fmassa
Member

fmassa commented Oct 25, 2018

I'm curious: in which cases do people use NMS outside of object detection?
But at some point we might indeed add layers to torchvision; it's just that packaging torchvision with those is a bit more involved, so I'll leave it for when the C++ extensions API is more stable.

But I'm reopening it anyway as a reminder

@fmassa fmassa reopened this Oct 25, 2018
@fmassa
Member

fmassa commented May 7, 2019

#826 has been merged and adds support for NMS on both CPU and GPU

@fmassa fmassa closed this as completed May 7, 2019
@neighthan

FYI, this is located at torchvision.ops.nms.

@neighthan

It would be nice to mention the format for the boxes in the docstring, though. Is it [y1, x1, y2, x2] for some diagonal corners, as in TensorFlow? Or perhaps just top-left and bottom-right corners?

@varunagrawal
Contributor

It should be [x1, y1, x2, y2] but adding this to the docstring makes a lot of sense.
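A quick usage sketch of the op and its box format:

import torch
from torchvision.ops import nms

# boxes in (x1, y1, x2, y2) format, one confidence score per box
boxes = torch.tensor([[ 0.,  0., 10., 10.],
                      [ 1.,  1., 11., 11.],    # overlaps box 0 heavily
                      [50., 50., 60., 60.]])   # far from both
scores = torch.tensor([0.9, 0.8, 0.7])
keep = nms(boxes, scores, iou_threshold=0.5)
print(keep)  # tensor([0, 2]): box 1 is suppressed by the higher-scoring box 0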

@varunagrawal
Contributor

@fmassa @neighthan I've made the changes in #1110. It should be quick to review since it is mostly doc changes.

@dashesy

dashesy commented Aug 1, 2019

@neighthan torchvision.ops.nms seems to also support CUDA? The only issue is that it does not support batch indices; that would be another dimension that could benefit from CUDA parallelisation.
I have an nms implementation that does, and I use it to apply NMS to multiple images at once, each independently.

@dishank-b

@varunagrawal @fmassa Can anyone tell me what should be used now to run NMS (CUDA) in batch mode while training? The NMS implementation provided by torchvision.ops does not run in batch mode and will be slow to use at training time.

@fmassa
Member

fmassa commented Oct 23, 2019

@dishank-b what do you mean by batch mode?

Also note that we have a batched_nms implementation in torchvision.ops.batched_nms.

@dishank-b

@fmassa By batch mode, I mean taking input in the form [batch_size, N, 4]. The current torchvision.ops.nms takes boxes in the form [N, 4], where N is the number of boxes in a single image. Whereas, say, when training Faster R-CNN with a batch_size of 16, we would like NMS to run for those 16 images in one batch, although there is the problem that N may not be the same for all 16 images.

Also, torchvision.ops.batched_nms is different: it is for the case when there is more than one class in the image.

@dashesy

dashesy commented Oct 23, 2019

The ONNX implementation of NMS is generic enough; I wish torchvision had something similar.
There is batched NMS as described by @dishank-b, and there is per-class NMS, which is what torchvision.ops.batched_nms seems to do. The ONNX implementation supports both cases through one API.

@fmassa
Member

fmassa commented Oct 24, 2019

@dishank-b @dashesy torchvision.ops.batched_nms can be used to perform NMS in a batch, but it doesn't matter if it is per class or per image actually. It just performs NMS independently "per category", which can mean image, class, etc.

The limitation with having NMS taking [batch_size, N, 4] inputs is that you need to set a max number of boxes, and pad the images that do not have enough boxes with something (like zeros). This is actually a subset of what torchvision.ops.batched_nms can do. Let me show you an example of how to implement this NMS with torchvision.ops.batched_nms:

import torch
import torchvision

def batched_nms(boxes, scores, iou_threshold):
    # boxes is a [batch_size, N, 4] tensor, and scores a
    # [batch_size, N] tensor.
    batch_size, N, _ = boxes.shape
    # assign every box the index of the image it belongs to
    indices = torch.arange(batch_size, device=boxes.device)
    indices = indices[:, None].expand(batch_size, N).flatten()
    boxes_flat = boxes.flatten(0, 1)
    scores_flat = scores.flatten()
    # boxes with different indices are suppressed independently
    indices_flat = torchvision.ops.boxes.batched_nms(
        boxes_flat, scores_flat, indices, iou_threshold)
    # now reshape the indices as you want, maybe
    # projecting back to the [batch_size, N] space
    # I'm omitting this here
    return indices_flat

Let me know if you have questions

@dishank-b

@fmassa Thanks, this seems to work for batching over either images or classes. However, what if we have both cases, i.e. batching over images as well as classes?
One possible solution is to create indices = torch.arange(batch_size * classes), where each element of the indices represents one class in one image of the batch.
But this seems quite naive to me; do you have any other way of doing the same?

@dashesy

dashesy commented Oct 24, 2019

@dishank-b in that case you can squash the two dimensions together with view:

boxes = boxes.view(-1, N, 4)

Since the groups are independent, it then does not matter whether boxes come from different batch elements or different classes.

Still would be nice if nms had a max_boxes_per_batch parameter to avoid the zero padding.

@dishank-b

@dashesy I am not sure that would work. Let's say there are two boxes of the same category in two different images; if they overlap, they would be merged by the above method, which should not happen since they are from different images.

Let me know if I am wrong somewhere.

@dashesy

dashesy commented Oct 24, 2019

I think they will not be merged.
N is the maximum number of boxes per image per class. Now let's say you have 2 classes and 3 images; then you have boxes with shape 3x2xNx4, and after doing the view you will have 6xNx4, where the first dimension is handled independently.
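A sketch of that squash with hypothetical shapes, using the batched_nms op from earlier in the thread for the independent per-group suppression:

import torch
from torchvision.ops import batched_nms

B, C, N = 3, 2, 128  # 3 images, 2 classes, N boxes per image per class
xy1 = torch.rand(B, C, N, 2) * 100
wh = torch.rand(B, C, N, 2) * 20
boxes = torch.cat([xy1, xy1 + wh], dim=-1)  # valid (x1, y1, x2, y2) boxes
scores = torch.rand(B, C, N)

# squash image and class into one group dimension of size B*C
boxes_flat = boxes.view(-1, 4)    # [B*C*N, 4]
scores_flat = scores.view(-1)     # [B*C*N]
# group id = image_index * C + class_index, repeated for the N boxes per group
group = torch.arange(B * C).repeat_interleave(N)
keep = batched_nms(boxes_flat, scores_flat, group, iou_threshold=0.5)
# keep indexes into the flattened list; boxes in different groups never suppress each other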

@dishank-b

@dashesy Yeah, you are right, I get it. I was doing much the same thing but using the indices. Thanks, your way is easier.

@fmassa
Member

fmassa commented Oct 25, 2019

@dashesy

Still would be nice if nms had a max_boxes_per_batch parameter to avoid the zero padding.

This makes a few assumptions though, which might not always be desired. We need to either remove some boxes via a criterion (low scores? small boxes? something else?) or pad with zeros to the maximum size, which is not very convenient.

I considered those cases when implementing batched_nms and Faster R-CNN, and I came up with what we currently have, which seems like a good compromise on flexibility and ease-of-use.

@MarkusAmann

@fmassa Is or will there be any implementation of Soft-NMS in pytorch too?

@fmassa
Member

fmassa commented Nov 13, 2019

@fortunex3000

Is or will there be any implementation of Soft-NMS in pytorch too?

It's not in the plans for torchvision in the near future.

@varunagrawal
Contributor

Soft-NMS is pretty straightforward. I might add it once academic winter break starts.

@rmcavoy

rmcavoy commented Nov 24, 2019

Is the documentation/code for torchvision.ops.boxes.batched_nms correct? The documentation suggests that the NMS IoU threshold discards boxes with IoU < iou_threshold, i.e. that it only discards boxes that don't overlap, which would normally be considered unique. My own testing showed that the function gives fewer boxes with a smaller threshold (0.0 gave 1 box per image) and gives all boxes back with a threshold of 1.0. Is there something I am not understanding about the implementation or the documentation? This seems to be a non-standard choice, since you should want to keep boxes with a small IoU (i.e. less overlap, more likely to be unique) rather than discard them. See Figure 2 of the Soft-NMS paper, comparing NMS and Soft-NMS, for how it seems to me it should behave: http://www.cs.umd.edu/~bharat/snms.pdf

@fmassa
Member

fmassa commented Nov 26, 2019

@rmcavoy there is an error in the documentation.

boxes with IoU < iou_threshold

should read

boxes with IoU > iou_threshold

I've sent a PR fixing the documentation in #1614, thanks!

@Edwardmark

@fmassa how can I flatten indices_flat back to [batch_size, N] if each image has a different number of valid boxes after nms?

@fmassa
Member

fmassa commented Jan 8, 2020

@Edwardmark can you open a new issue describing what you are trying to do?

@dishank-b

[quotes fmassa's earlier reply and batched_nms example in full]

@fmassa, I think @Edwardmark wants to ask how to unflatten the indices from the above reply, so that you get the detections for each image, given that you don't know the number of final detections per image.

@Edwardmark

@fmassa @dishank-b Yes, and I figured it out. We can unflatten the indices based on the range they fall in: indices within range(0, N) belong to img1, and indices within range(N, 2N) belong to img2. Thank you all very much.
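A vectorized sketch of that range-based unflattening, with hypothetical values (torch.div with rounding_mode needs a reasonably recent PyTorch):

import torch

N = 128                                  # padded boxes per image
keep = torch.tensor([3, 17, 130, 260])   # hypothetical output of the flat nms
img_idx = torch.div(keep, N, rounding_mode="floor")  # image each kept box came from
box_idx = keep % N                                   # position within that image
counts = torch.bincount(img_idx, minlength=3)        # detections per image, no loop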

@lamhoangtung

@Edwardmark How did you unflatten the indices efficiently, without looping over each returned index value and checking which range it falls in?
