Add Quantized version of RoIAlign #3624

Merged: 32 commits into pytorch:master on Apr 8, 2021

Conversation

NicolasHug (Member):

This PR implements support for quantized integers for the forward pass of RoIAlign on CPU.
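
As an illustration (a minimal sketch, not taken from the PR; shapes and quantization parameters are arbitrary), the quantized path can be exercised like this:

```python
import torch
from torchvision.ops import roi_align

# Boxes are (batch_index, x1, y1, x2, y2); batch size 1, per the
# batch-index discussion further down in this thread.
x = torch.rand(1, 3, 32, 32)
rois = torch.tensor([[0.0, 1.0, 1.0, 10.0, 10.0]])

# Input and rois carry their own (scale, zero_point); the values here are arbitrary.
qx = torch.quantize_per_tensor(x, scale=0.05, zero_point=128, dtype=torch.quint8)
qrois = torch.quantize_per_tensor(rois, scale=0.1, zero_point=0, dtype=torch.quint8)

qout = roi_align(qx, qrois, output_size=(5, 5), spatial_scale=1.0, sampling_ratio=2)
print(qout.dtype)  # torch.quint8 -- the output stays quantized
```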

NicolasHug changed the title from "WIP Add Quantized version of RoIAlign" to "Add Quantized version of RoIAlign" on Apr 4, 2021
NicolasHug added the improvement, module: models.quantization, and module: ops labels on Apr 4, 2021
NicolasHug marked this pull request as ready for review on Apr 4, 2021 at 10:46
NicolasHug mentioned this pull request on Apr 4, 2021
NicolasHug (Member, Author) left a comment:

I think this is ready for review - CI failures are due to #3631

quick benchmark on 30k boxes:

roialign float            19.7 ms ± 652 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
qroialign torch.qint8     29.7 ms ± 794 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
qroialign torch.quint8    29.8 ms ± 1.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
qroialign torch.qint32    29.9 ms ± 1.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
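
For reference, a benchmark of this shape can be reproduced along the following lines (a sketch: the image size, box coordinates, and quantization parameters are assumptions, not the exact setup used above):

```python
import torch
from torchvision.ops import roi_align

n_boxes = 30_000  # matches the "30k boxes" above; everything else is assumed
x = torch.rand(1, 3, 200, 200)
xy1 = torch.rand(n_boxes, 2) * 100
wh = torch.rand(n_boxes, 2) * 100
# Rows are (batch_index, x1, y1, x2, y2), with x2 >= x1 and y2 >= y1.
rois = torch.cat([torch.zeros(n_boxes, 1), xy1, xy1 + wh], dim=1)

qx = torch.quantize_per_tensor(x, scale=0.01, zero_point=0, dtype=torch.quint8)
qrois = torch.quantize_per_tensor(rois, scale=1.0, zero_point=0, dtype=torch.quint8)

# In IPython:
# %timeit roi_align(x, rois, output_size=(5, 5))    # float baseline
# %timeit roi_align(qx, qrois, output_size=(5, 5))  # quantized (quint8)
```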

torchvision/csrc/ops/quantized/cpu/qroi_align_kernel.cpp (outdated review thread, resolved)
@@ -30,6 +30,123 @@ at::Tensor _roi_align_backward(
int64_t sampling_ratio,
bool aligned);

template <typename T>
NicolasHug (Member, Author):

I just moved the common code into the header, and added some comments / removed some useless parameters. The rest of the code is unchanged.

Maybe there's a better place to put common code?

Contributor:

The header files located at the root of the /ops/ folder contain the "public" methods of the C++ API, while the cpp files contain the code necessary for calling the dispatcher and registering the operators. They are not meant to contain the actual implementation of the methods.

Up until now there was very little need to share code between the CPU and CUDA implementations, so you won't find an example in torchvision's codebase that does what you want to do. One approach is to follow the pattern used in PyTorch and move the common code into separate files in the /ops/cpu/ folder. See the Loops.h file, which is used by both the CPU and the quantized kernels.

torchvision/ops/roi_align.py (review thread, resolved)
namespace {

template <typename T>
void qroi_align_forward_kernel_impl(
NicolasHug (Member, Author):

Reviewing this file might be easier with a vimdiff (or similar) against ops/cpu/roi_align_kernel.cpp.

int roi_bin_grid_h,
int roi_bin_grid_w,
std::vector<PreCalc<T>>& pre_calc) {
int pre_calc_index = 0;
Contributor:

I would also consider moving the actual implementation of lengthy methods to a separate cpp file.

Comment on lines 38 to 39
int roi_batch_ind = at::native::dequantize_val(
rois_scale, rois_zp, offset_rois[0]);
Member:

I think we might have issues here with large images and batches.

Imagine that the image is of size 1000x1000, so the box coordinates can range from 0 to 1000. The quantized boxes (which share a single quantizer across all boxes and all fields) would then not be able to exactly represent the small integer values 0, 1, 2, etc. needed for the image index in the batch.

I think the solution for this is one of the following:

  • [easy] disable batch support altogether in the quantized version of the op
  • [medium] try using per-channel quantization of the boxes (so that the quantizer for the indices is different from the quantizer for the coordinates) (need to see how it integrates with the rest of the pipeline)
  • [harder] break backward compatibility in the kernels and pass a list of tensors for the rois instead. This would probably involve more changes down the road in the Faster R-CNN model and would be tricky, so I wouldn't recommend it.

Additionally, can you also add a quick test exercising the code path I mentioned (with a large image and a few batch elements)?
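
A sketch of what such a test could look like (names and values are hypothetical; since the PR ultimately restricts the quantized op to a batch size of 1, the multi-batch call is expected to raise rather than silently mis-index the batch):

```python
import pytest
import torch
from torchvision.ops import roi_align

def test_qroi_align_multiple_batches_large_image():
    # A large image and several batch elements, so large coordinates and
    # small batch indices must share a single quantizer.
    x = torch.rand(3, 3, 1000, 1000)
    rois = torch.tensor([
        [0.0, 10.0, 10.0, 500.0, 500.0],
        [1.0, 10.0, 10.0, 500.0, 500.0],
        [2.0, 100.0, 100.0, 900.0, 900.0],
    ])
    qx = torch.quantize_per_tensor(x, scale=1.0, zero_point=0, dtype=torch.quint8)
    # With scale=4, batch indices 1 and 2 cannot be represented exactly.
    qrois = torch.quantize_per_tensor(rois, scale=4.0, zero_point=0, dtype=torch.quint8)
    with pytest.raises(RuntimeError):
        roi_align(qx, qrois, output_size=(5, 5))
```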

NicolasHug (Member, Author):

Yes, if I understand your concern correctly, this is similar to my previous #3624 (comment):

we can't represent 1 with (scale, zero_point) = (2, 0)

I pushed a check in d6f78ab as we discussed yesterday, but I realize now that it's not correct (or at least not sufficient) anyway: it assumes that (scale, zero_point) == (1, 0), which is wrong in general.
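
This is easy to demonstrate in isolation (a standalone illustration, not code from the PR):

```python
import torch

# With scale=2, zero_point=0, the value 1.0 quantizes to round(1.0 / 2) = 0
# (ties round to even), so a batch index of 1 silently collapses to 0.
q = torch.quantize_per_tensor(
    torch.tensor([0.0, 1.0, 2.0]), scale=2.0, zero_point=0, dtype=torch.quint8
)
print(q.dequantize())  # tensor([0., 0., 2.])
```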

fmassa (Member) left a comment:

Looks great, thanks!

Can you open an issue to track adding support for N > 1 for quantized tensors?

fmassa merged commit ad9cc62 into pytorch:master on Apr 8, 2021
facebook-github-bot pushed a commit that referenced this pull request Apr 13, 2021
Summary:
* WIP

* clang

* docs

* extracted out common utils

* Use better quantization function and pass tensors as parameters

* proper dequantization

* Some tests

* Dequantization optimization, seems to gain a few ms

* clang-format

* again

* more correct test. Had to remove optimization although it almost works

* Also test aligned=True

* remove useless part

* more docs and comments

* Put back optimization with more robust test

* Added check for index upper bound

* avoid possible overflow

* Move common function into common.h

* oops

* scale=1,zero_point=0 makes more sense

* Force batch size of 1 to prevent any indexing bug

* format

* format again

* updated docstring

* put back description comment for pre_calc_bilinear_interpolate

* revert most changes to docstring as it's taken care of in another PR

Reviewed By: NicolasHug

Differential Revision: D27706946

fbshipit-source-id: 2ae1614c214ea676b4f7705dc0716efd9f34330e
NicolasHug mentioned this pull request on Jul 16, 2021
Labels: cla signed, enhancement, module: models.quantization, module: ops