
Add modulation input for DeformConv2D #2791

Merged

Conversation

@Licht-T (Contributor) commented Oct 11, 2020

This is the implementation of Modulated Deformable Convolution, a.k.a. DCNv2. This closes #1788.
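A minimal usage sketch of the functional API with the new mask (modulation) input; the shapes follow the deform_conv2d docstring (a 10x10 input with a 3x3 kernel, stride 1 and no padding gives an 8x8 output), and the random tensors are illustrative only:

import torch
from torchvision.ops import deform_conv2d

input = torch.rand(4, 3, 10, 10)
kh, kw = 3, 3
weight = torch.rand(5, 3, kh, kw)
# offset and mask must match the spatial size of the convolution output (8x8 here)
offset = torch.rand(4, 2 * kh * kw, 8, 8)
mask = torch.rand(4, kh * kw, 8, 8)
out = deform_conv2d(input, offset, weight, mask=mask)
print(out.shape)  # torch.Size([4, 5, 8, 8])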

@fmassa (Member) left a comment:

Thanks a lot for the PR!

I haven't yet gone into detail in the C++ / CUDA parts, but I have a few comments on the Python part that could help fix the TorchScript tests.

torchvision/ops/deform_conv.py:
>>> print(out.shape)
>>> # returns
>>> torch.Size([4, 5, 8, 8])
"""

_assert_has_ops()
out_channels = weight.shape[0]

use_mask = mask is not None
fmassa (Member) commented:

Maybe a better, TorchScript-compliant way of doing this is to remove use_mask and instead do

if mask is not None:
    mask = ...

assert mask is not None

This way we don't need to change anything below.

Thoughts?
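A rough sketch of the TorchScript-friendly handling being suggested (illustrative only; the placeholder-mask shape below is an assumption, not the PR's actual code):

import torch
from torch import Tensor
from typing import Optional

def _prepare_mask(input: Tensor, mask: Optional[Tensor]) -> Tensor:
    # Substitute an empty placeholder when no mask is given, so the code below
    # can always treat `mask` as a plain Tensor.
    if mask is None:
        mask = torch.zeros((input.shape[0], 0), device=input.device, dtype=input.dtype)
    # The assert refines Optional[Tensor] to Tensor for TorchScript.
    assert mask is not None
    return mask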

@Licht-T (Contributor, Author) replied:

As mentioned above, I removed the shape assertion in the Python code. I also found that it is better to pass the use_mask flag into the C++ code for the shape assertions, so I did that.

@Licht-T (Contributor, Author) commented Oct 13, 2020

@fmassa Thanks for your review! I changed my code, and TorchScript now works well.

Also, I have changed some of the assertion logic (use_mask related) so that it correctly determines whether modulation is used or not. I've tested the CPU forward/backward. Unfortunately, since my GPU environment broke last night, I have not yet tested this on GPU. I'll fix up my environment in a couple of days and test on GPU.

@Licht-T commented Oct 13, 2020

Hmmm..., the GPU CI test on Windows works fine, but the Linux one doesn't. That's weird...

@Licht-T commented Oct 13, 2020

Anyway, I'll check as soon as my GPU environment is fixed!

@fmassa (Member) commented Oct 13, 2020

Sounds good.

The tests are randomized, so it might be that in some cases they pass and in others they fail. But the gradchecks that are erroring on Linux GPU don't seem to be failing due to numerical error, so there might be a case that has been missed somewhere.

I'll rerun the tests to see if the CI status changes.
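For context, the gradchecks being referred to look roughly like this (an illustrative sketch; the real tests live in test/test_ops.py, use different sizes, and also run on CUDA):

import torch
from torch.autograd import gradcheck
from torchvision.ops import deform_conv2d

kh, kw = 3, 3
x = torch.rand(1, 3, 6, 6, dtype=torch.float64, requires_grad=True)
weight = torch.rand(5, 3, kh, kw, dtype=torch.float64, requires_grad=True)
offset = torch.rand(1, 2 * kh * kw, 4, 4, dtype=torch.float64, requires_grad=True)
mask = torch.rand(1, kh * kw, 4, 4, dtype=torch.float64, requires_grad=True)

# Checks analytical gradients of all four inputs against numerical ones.
gradcheck(lambda x, w, o, m: deform_conv2d(x, o, w, mask=m), (x, weight, offset, mask))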

@Licht-T commented Oct 16, 2020

@fmassa Thanks for your CI re-run, but the situation didn't change.

My Linux GPU environment is finally back, and my PR passes the GPU tests locally! This is strange, and I am still digging into it.

(base) rito@LAPTOP-097573RD:~/vision$ pytest test/test_ops.py::DeformConvTester
================================================================================= test session starts ==================================================================================
platform linux -- Python 3.7.7, pytest-5.4.3, py-1.9.0, pluggy-0.13.1
rootdir: /home/rito/vision
plugins: forked-1.3.0, typeguard-2.9.1, extra-durations-0.1.3, xdist-1.34.0
collected 8 items

test/test_ops.py ........                                                                                                                                                        [100%]

============================================================================== sum of all tests durations ==============================================================================
1357.47s
============================================================================ 8 passed in 1358.45s (0:22:38) ============================================================================

@Licht-T commented Oct 17, 2020

I ran the test_ops.py tests on an AWS g4dn.xlarge instance, and no test failures occurred.

(base) ubuntu@ip-172-31-29-152:~/vision$ uname -a
Linux ip-172-31-29-152 5.4.0-1025-aws #25~18.04.1-Ubuntu SMP Fri Sep 11 12:03:04 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
(base) ubuntu@ip-172-31-29-152:~/vision$ nvidia-smi
Sat Oct 17 13:22:19 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   26C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
(base) ubuntu@ip-172-31-29-152:~/vision$ pytest test/test_ops.py
================================= test session starts ==================================
platform linux -- Python 3.8.3, pytest-6.1.1, py-1.9.0, pluggy-0.13.1
rootdir: /home/ubuntu/vision
plugins: cov-2.10.1
collected 62 items

test/test_ops.py ..............................................................  [100%]

============================ 62 passed in 321.95s (0:05:21) ============================
(base) ubuntu@ip-172-31-29-152:~/vision$

@Licht-T commented Oct 17, 2020

I found insufficient-resources errors in the CI logs. This test failure only happens in environments with small GPU resources, such as CI.
https://app.circleci.com/pipelines/github/pytorch/vision/4464/workflows/4d7daf2d-7466-460c-a1f7-76ae3495653c/jobs/266953/steps

error in compute_grad_offset_and_mask: too many resources requested for launch

@Licht-T commented Oct 17, 2020

@fmassa The Linux GPU CI runs on a small instance, while the Windows one runs on a medium instance. This caused the difference in test results between Linux and Windows. We could reduce the degree of parallelization, but that would hurt performance. I recommend that we use the medium instance for the Linux GPU CI.

resource_class: gpu.small

resource_class: windows.gpu.nvidia.medium

@Licht-T Licht-T force-pushed the add-modulation-for-deformable-convolution branch from 192b955 to 0707337 Compare October 18, 2020 15:42
@Licht-T commented Oct 18, 2020

@fmassa I tried running the GPU CI on a gpu.medium instance, but it did not help. I then found that the Linux environment has CUDA Compute Capability (CC) 5.2, while the Windows one has CC 7.5. The Linux instance has an old GPU even though it is "medium" 🤔.

According to the official occupancy calculator, devices with CC < 6 have fewer registers per block than those with CC >= 6 and cannot launch a kernel that uses a large number of registers. So I decided to select the thread block size based on whether the CC is 6 or higher.
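Roughly, the selection amounts to something like the following (a Python sketch for illustration only; the actual logic lives in the C++/CUDA extension, and the concrete thread counts here are placeholders rather than the values used in this PR):

import torch

def pick_threads_per_block(device: torch.device) -> int:
    major, _minor = torch.cuda.get_device_capability(device)
    # Devices with CC < 6 expose fewer registers per block, so use a smaller
    # block size to avoid "too many resources requested for launch".
    return 1024 if major >= 6 else 512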

And now the PR passes all tests. All green, except Travis CI!

@facebook-github-bot commented:
Hi @Licht-T!

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file.

In order for us to review and merge your code, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

@fmassa (Member) left a comment:

Thanks a lot for the PR, and sorry for the delay in getting back to you; the release was taking most of our time over the past couple of weeks.

This PR looks good to merge, but there are some merge conflicts now. Could you look into fixing those?

@Licht-T Licht-T force-pushed the add-modulation-for-deformable-convolution branch from 292c4f6 to 2285693 Compare November 8, 2020 09:55
@Licht-T Licht-T force-pushed the add-modulation-for-deformable-convolution branch from 3bc4dfd to 5aa7d87 Compare November 8, 2020 11:29
@codecov bot commented Nov 8, 2020

Codecov Report

Merging #2791 (5aa7d87) into master (052edce) will decrease coverage by 1.06%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master    #2791      +/-   ##
==========================================
- Coverage   73.42%   72.35%   -1.07%     
==========================================
  Files          99       99              
  Lines        8817     8820       +3     
  Branches     1389     1390       +1     
==========================================
- Hits         6474     6382      -92     
- Misses       1917     1999      +82     
- Partials      426      439      +13     
Impacted Files Coverage Δ
torchvision/ops/deform_conv.py 72.30% <100.00%> (+1.33%) ⬆️
torchvision/ops/_register_onnx_ops.py 48.78% <0.00%> (-36.59%) ⬇️
torchvision/models/detection/transform.py 78.33% <0.00%> (-17.23%) ⬇️
torchvision/ops/poolers.py 86.53% <0.00%> (-11.54%) ⬇️
torchvision/ops/boxes.py 87.35% <0.00%> (-8.05%) ⬇️
torchvision/models/detection/roi_heads.py 77.23% <0.00%> (-5.11%) ⬇️
torchvision/models/detection/rpn.py 90.18% <0.00%> (-3.69%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@Licht-T commented Nov 8, 2020

@fmassa No problem! I have now fixed the conflicts, and CI is all green except Codecov!

Also, I opened #2973 since the GPU CI was not working well.

@fmassa (Member) left a comment:

Thanks a lot for the PR!

I'm accepting the config.yml changes for now to get this PR merged, but there might be better ways of handling the caching and CI.

I'll create an issue to track those down.

- restore_cache:
{% raw %}
keys:
- env-v2-linux-{{ arch }}-py<< parameters.python_version >>-{{ checksum ".circleci/unittest/linux/scripts/environment.yml" }}-{{ checksum ".circleci-weekly" }}
fmassa (Member) commented:

I believe the recommended way to handle invalid caches on CI is to change the cache key name: instead of having it be env-v2, make it env-v3, for example.

@@ -455,21 +455,17 @@ jobs:
resource_class: gpu.small
environment:
image_name: "pytorch/manylinux-cuda101"
PYTHON_VERSION: << parameters.python_version >>
fmassa (Member) commented:

Those changes are also present in #2973, and there is some discussion going on with @mthrok on the best way to solve this.

In order to move forward with this PR, I'll be merging this as is, but let's revisit this implementation in a follow-up PR based on feedback from @mthrok and @seemethere.

@@ -33,6 +34,9 @@ def deform_conv2d(
padding (int or Tuple[int, int]): height/width of padding of zeroes around
each image. Default: 0
dilation (int or Tuple[int, int]): the spacing between kernel elements. Default: 1
mask (Tensor[batch_size, offset_groups * kernel_height * kernel_width,
out_height, out_width]): masks to be applied for each position in the
convolution kernel.
fmassa (Member) commented:

For a follow-up PR, it might be good to mention that DeformConv2d now implements modulation from https://arxiv.org/abs/1811.11168 as well (also known as DeformConv v2).
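For reference, a minimal sketch of the module-level usage with modulation (shapes mirror the functional example above; producing the mask with a sigmoid is an illustrative choice, not something this PR mandates):

import torch
from torchvision.ops import DeformConv2d

conv = DeformConv2d(in_channels=3, out_channels=5, kernel_size=3)
x = torch.rand(4, 3, 10, 10)
offset = torch.rand(4, 2 * 3 * 3, 8, 8)            # 2 * kh * kw offset channels
mask = torch.sigmoid(torch.rand(4, 3 * 3, 8, 8))   # kh * kw modulation channels
out = conv(x, offset, mask)
print(out.shape)  # torch.Size([4, 5, 8, 8])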

@fmassa fmassa merged commit 5a4bb19 into pytorch:master Nov 9, 2020
bryant1410 pushed a commit to bryant1410/vision-1 that referenced this pull request Nov 22, 2020
* Add modulation input for DeformConv2D

* lint

* Patch for GPU CI

* Remove bad cache on CI
vfdev-5 pushed a commit to Quansight/vision that referenced this pull request Dec 4, 2020
* Add modulation input for DeformConv2D

* lint

* Patch for GPU CI

* Remove bad cache on CI
facebook-github-bot pushed a commit that referenced this pull request May 27, 2021
Summary:
`test_backward_cuda_contiguous` and `test_backward_cuda_non_contiguous` have been failing on fbcode for a while with the following error, `too many resources requested for launch`, which suggests that too many threads per block are requested.

This issue was already causing problems in the original PR #2791 (comment), where the author decided that CC >= 6 was a good threshold because with CC >= 6 GPUs have more registers. (CC = Compute Capability)

However, I'm not certain that this is actually true: if we look at https://en.wikipedia.org/wiki/CUDA#Version_features_and_specifications, it's clear that 6.2 has fewer registers per thread block than 6.0. So I'm not sure this threshold completely makes sense.

Moreover, let's note that the current tests (as on `master`):

- **pass** on OSS linux CI which relies on a P4 GPU (up to last week), i.e. **CC = 6.1**
- **pass** on OSS windows CI which relies on a T4 GPU, i.e. **CC = 7.5**
- **fail** on the AWS cluster which relies on a V100 GPU, i.e. **CC = 7.0**

It is quite unclear to me what kind of resource is "enough" for the tests to pass on both 6.1 and 7.5 but not on 7.0. As a result, I think it's safer to just reduce the number of threads per block, irrespective of the CC.

ngimel,  fmassa suggested that I tag you here since you could have some valuable insight for us. Thanks!

Reviewed By: fmassa

Differential Revision: D28641626

fbshipit-source-id: 2618c366c5d18bbb7ebafc33032e7ac6c0404d0b
NicolasHug added a commit to NicolasHug/vision that referenced this pull request Jun 1, 2021