Add Quantized version of RoIAlign #3624
Conversation
I think this is ready for review - CI failures are due to #3631
quick benchmark on 30k boxes:
roialign float 19.7 ms ± 652 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
qroialign torch.qint8 29.7 ms ± 794 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
qroialign torch.quint8 29.8 ms ± 1.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
qroialign torch.qint32 29.9 ms ± 1.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
torchvision/csrc/ops/roi_align.h
Outdated
@@ -30,6 +30,123 @@ at::Tensor _roi_align_backward(
    int64_t sampling_ratio,
    bool aligned);

template <typename T>
I just moved the common code into the header, added some comments, and removed unused parameters. The rest of the code is unchanged.
Maybe there's a better place to put common code?
The header files located at the root of the /ops/ folder contain the "public" methods of the C++ API, while the cpp files contain the code necessary for calling the dispatcher and registering the operators. They are not meant to contain the actual implementation of the methods.
Up until now, there was very little need to share code between the CPU and CUDA implementations, and as a result you won't find an example in torchvision's codebase that does what you want to do. One approach is to follow the pattern used in PyTorch and move the common code into separate files in the /ops/cpu/ folder. See the Loops.h file, which is used by both the CPU and the quantized kernels.
namespace {

template <typename T>
void qroi_align_forward_kernel_impl(
Reviewing this file might be easier by doing a vimdiff or similar with ops/cpu/roi_align_kernel.cpp.
torchvision/csrc/ops/roi_align.h
Outdated
    int roi_bin_grid_h,
    int roi_bin_grid_w,
    std::vector<PreCalc<T>>& pre_calc) {
  int pre_calc_index = 0;
I would also consider moving the actual implementation of lengthy methods to a separate cpp file.
int roi_batch_ind = at::native::dequantize_val(
    rois_scale, rois_zp, offset_rois[0]);
I think we might have issues here with large images and batches.
Imagine that the image is of size 1000x1000, and the boxes thus can be from 0 to 1000 in value. The quantized boxes (which have the same quantizers for all boxes) would not be able to properly index the 0, 1, 2, etc values needed for the image index in the batch.
I think the solution for this is either
- [easy] disable batch support altogether in the quantized version of the op
- [medium] try using per-channel quantization of the boxes (so that the quantizer for the indices is different than the quantizer for the coordinates) (need to see how it integrates with the rest of the pipeline)
- [harder] break BC in the kernels and pass instead a list of tensors for the rois. This would maybe involve more changes down the road in the Faster R-CNN model and would probably be tricky, so I wouldn't recommend this.
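The representability problem described above can be sketched with plain affine quantization arithmetic (a toy quint8-style quantizer; the scale and clamping range here are illustrative, not taken from the PR):

```python
def quantize(v, scale, zero_point, qmin=0, qmax=255):
    """Affine quantization: round to the nearest representable level, then clamp."""
    return max(qmin, min(qmax, round(v / scale) + zero_point))

def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale

# Boxes on a 1000x1000 image: a single (scale, zero_point) pair must cover [0, 1000],
# so the quantization step is about 3.9 pixels.
scale, zp = 1000 / 255, 0

# Batch indices 0, 1, 2 stored in the same tensor collapse or drift:
for idx in (0, 1, 2):
    dq = dequantize(quantize(idx, scale, zp), scale, zp)
    print(idx, "->", dq)
# index 1 dequantizes back to 0.0, so the wrong batch element would be read
```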
Additionally, can you also add a quick test exercising this code-path that I mentioned (with a large image and a few batch elements)?
Yes, if I understand your concern correctly, this is similar to my previous #3624 (comment):
we can't represent 1 with (scale, zero_point) = (2, 0)
I pushed a check in d6f78ab as we discussed yesterday, but I realize now that it's not correct/enough anyway: it assumes that (scale, zero_point) == (1, 0), which is wrong in general.
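The failing case mentioned here, (scale, zero_point) = (2, 0), can be checked directly with a toy affine quantizer (a sketch, not the kernel's code; note that Python's round uses round-half-to-even, as PyTorch's quantization does):

```python
def quantize(v, scale, zero_point, qmin=0, qmax=255):
    # Round to the nearest quantization level and clamp to the integer range.
    return max(qmin, min(qmax, round(v / scale) + zero_point))

def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale

scale, zp = 2, 0
q = quantize(1, scale, zp)            # round(0.5) == 0 with round-half-to-even
assert dequantize(q, scale, zp) != 1  # 1 cannot be represented exactly
```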
Looks great, thanks!
Can you open an issue to track adding support for N > 1 for quantized tensors?
Summary:
* WIP
* clang
* docs
* extracted out common utils
* Use better quantization function and pass tensors as parameters
* proper dequantization
* Some tests
* Dequantization optimization, seems to gain a few ms
* clang-format
* again
* more correct test. Had to remove optimization although it almost works
* Also test aligned=True
* remove useless part
* more docs and comments
* Put back optimization with more robust test
* Added check for index upper bound
* avoid possible overflow
* Move common function into common.h
* oops
* scale=1,zero_point=0 makes more sense
* Force batch size of 1 to prevent any indexing bug
* format
* format again
* updated docstring
* put back description comment for pre_calc_bilinear_interpolate
* revert most changes to docstring as it's taken care of in another PR

Reviewed By: NicolasHug
Differential Revision: D27706946
fbshipit-source-id: 2ae1614c214ea676b4f7705dc0716efd9f34330e
This PR implements support for quantized integers for the forward pass of RoIAlign on CPU.