Add modulation input for DeformConv2D #2791
Conversation
Thanks a lot for the PR!
I haven't yet gone into detail on the C++ / CUDA parts, but I have a few comments on the Python part that could help fix the TorchScript tests.
>>> print(out.shape)
>>> # returns
>>> torch.Size([4, 5, 8, 8])
"""

_assert_has_ops()
out_channels = weight.shape[0]

use_mask = mask is not None
Maybe a better way of doing this that is TorchScript-compliant is to remove use_mask and instead do

if mask is not None:
    mask = ...
assert mask is not None

This way we don't need to change anything further down.
Thoughts?
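For context, TorchScript does not narrow an Optional[Tensor] based on a separate boolean flag, but it does refine the type through `is not None` checks and asserts. Below is a minimal sketch of that refinement pattern, not the actual torchvision code; the helper name `_prepare_mask` and the dummy-mask shape are hypothetical:

```python
from typing import Optional

import torch
from torch import Tensor


@torch.jit.script
def _prepare_mask(inp: Tensor, mask: Optional[Tensor]) -> Tensor:
    # Hypothetical helper, not the torchvision implementation.
    if mask is None:
        # Placeholder dummy mask; the real op derives the expected shape
        # from the offset/kernel configuration.
        mask = torch.zeros([inp.shape[0], 0], device=inp.device, dtype=inp.dtype)
    assert mask is not None  # refines Optional[Tensor] to Tensor for TorchScript
    return mask
```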
As mentioned above, I removed the shape assertion in the Python code. I also found that it is better to pass the use_mask flag into the C++ code for shape assertions, so I did that.
@fmassa Thanks for your review! I changed my code and TorchScript now works well. I have also changed some of the assertion logic.
Hmmm, the GPU CI test on Windows works fine, but the Linux one doesn't. That's weird...
Anyway, I'll check as soon as my GPU environment is fixed!
Sounds good. The tests are randomized, so it might be that in some cases they pass and in others they fail. But the gradchecks that are erroring on Linux GPU don't seem to be due to numerical error, so there might be a case that has been missed somewhere. I'll rerun the tests to see if the CI status changes.
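For context, the gradchecks mentioned here are numerical gradient checks of the op. The sketch below is an illustration only: the real tests live in test/test_ops.py with different shapes and tolerances, and the mask keyword assumes a torchvision build that includes this PR.

```python
import torch
from torchvision.ops import deform_conv2d

# Double-precision inputs so the finite-difference gradcheck is numerically stable.
torch.manual_seed(0)
inp = torch.rand(1, 2, 5, 5, dtype=torch.double, requires_grad=True)
weight = torch.rand(3, 2, 3, 3, dtype=torch.double, requires_grad=True)
offset = torch.rand(1, 2 * 3 * 3, 3, 3, dtype=torch.double, requires_grad=True)  # output is 3x3 without padding
mask = torch.rand(1, 3 * 3, 3, 3, dtype=torch.double, requires_grad=True)

# Compares analytical gradients of the op against numerical ones; raises on mismatch.
torch.autograd.gradcheck(
    lambda i, w, o, m: deform_conv2d(i, o, w, mask=m),
    (inp, weight, offset, mask),
)
```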
@fmassa Thanks for your CI re-run, but the situation didn't change. My Linux GPU environment is finally back, and my PR passes the GPU tests! This is strange, and I am still digging into it.

(base) rito@LAPTOP-097573RD:~/vision$ pytest test/test_ops.py::DeformConvTester
================================================================================= test session starts ==================================================================================
platform linux -- Python 3.7.7, pytest-5.4.3, py-1.9.0, pluggy-0.13.1
rootdir: /home/rito/vision
plugins: forked-1.3.0, typeguard-2.9.1, extra-durations-0.1.3, xdist-1.34.0
collected 8 items
test/test_ops.py ........ [100%]
============================================================================== sum of all tests durations ==============================================================================
1357.47s
============================================================================ 8 passed in 1358.45s (0:22:38) ============================================================================
I also ran the tests on an AWS Linux GPU instance:

(base) ubuntu@ip-172-31-29-152:~/vision$ uname -a
Linux ip-172-31-29-152 5.4.0-1025-aws #25~18.04.1-Ubuntu SMP Fri Sep 11 12:03:04 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
(base) ubuntu@ip-172-31-29-152:~/vision$ nvidia-smi
Sat Oct 17 13:22:19 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 26C P8 9W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
(base) ubuntu@ip-172-31-29-152:~/vision$ pytest test/test_ops.py
================================= test session starts ==================================
platform linux -- Python 3.8.3, pytest-6.1.1, py-1.9.0, pluggy-0.13.1
rootdir: /home/ubuntu/vision
plugins: cov-2.10.1
collected 62 items
test/test_ops.py .............................................................. [100%]
============================ 62 passed in 321.95s (0:05:21) ============================
(base) ubuntu@ip-172-31-29-152:~/vision$

I found insufficient-resources errors in the CI logs. This test failure only happens in environments with limited GPU resources, such as the CI machines.
@fmassa The Linux GPU CI runs on a small instance, while the Windows one runs on a medium instance. This led to the test result difference between Linux and Windows. We could reduce the degree of parallelization, but that would reduce performance. I recommend that we use the medium instance for the Linux GPU CI as well.

Line 442 in a9c78f1
Line 18 in a9c78f1
Force-pushed from 192b955 to 0707337.
@fmassa I tried running the GPU CI. According to the official calculator, GPUs with compute capability (CC) below 6 have fewer registers per block than those with CC >= 6 and cannot launch a kernel that uses a large number of registers. So I decided to select the thread block size based on whether the CC is 6 or higher. Now the PR passes all tests: all green, except Travis CI!
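For reference, here is a hedged Python sketch of the capability check described above. The actual selection happens in the C++/CUDA code of this PR; the function name and thread counts below are illustrative, not the real values used in torchvision.

```python
import torch


def pick_threads_per_block(device: torch.device) -> int:
    # Devices with compute capability < 6 expose fewer registers per block,
    # so use a smaller launch configuration there (illustrative numbers).
    major, _minor = torch.cuda.get_device_capability(device)
    return 1024 if major >= 6 else 512
```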
Hi @Licht-T! Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!
Thanks a lot for the PR, and sorry for the delay in getting back to you; the release was taking most of our time over the past couple of weeks.
This PR looks good to merge, but there are some merge conflicts now. Could you look into fixing those?
Force-pushed from 292c4f6 to 2285693.
Force-pushed from 3bc4dfd to 5aa7d87.
Codecov Report

@@            Coverage Diff             @@
##           master    #2791      +/-   ##
==========================================
- Coverage   73.42%   72.35%    -1.07%
==========================================
  Files          99       99
  Lines        8817     8820        +3
  Branches     1389     1390        +1
==========================================
- Hits         6474     6382       -92
- Misses       1917     1999       +82
- Partials      426      439       +13

Continue to review full report at Codecov.
Thanks a lot for the PR!
I'm accepting the config.yml changes for now to get this PR merged, but there might be better ways of handling the caching and CI. I'll create an issue to track those down.
- restore_cache:
    {% raw %}
    keys:
      - env-v2-linux-{{ arch }}-py<< parameters.python_version >>-{{ checksum ".circleci/unittest/linux/scripts/environment.yml" }}-{{ checksum ".circleci-weekly" }}
I believe the recommended way to handle invalid caches on CI is to change the key name: instead of having it be `env-v2`, let it be `env-v3`, for example.
@@ -455,21 +455,17 @@ jobs:
      resource_class: gpu.small
      environment:
        image_name: "pytorch/manylinux-cuda101"
        PYTHON_VERSION: << parameters.python_version >>
Those changes are also present in #2973, and there is some discussion going on with @mthrok about the best way to solve this.
In order to move forward with this PR, I'll merge it as is, but let's revisit this implementation in a follow-up PR based on @mthrok's and @seemethere's feedback.
@@ -33,6 +34,9 @@ def deform_conv2d(
        padding (int or Tuple[int, int]): height/width of padding of zeroes around
            each image. Default: 0
        dilation (int or Tuple[int, int]): the spacing between kernel elements. Default: 1
        mask (Tensor[batch_size, offset_groups * kernel_height * kernel_width,
            out_height, out_width]): masks to be applied for each position in the
            convolution kernel.
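As a quick illustration of the documented mask argument (this example is not part of the PR; the shapes assume a 3x3 kernel, stride 1, padding 1, and a single offset group):

```python
import torch
from torchvision.ops import deform_conv2d

batch, in_ch, out_ch, h, w, kh, kw = 4, 3, 5, 8, 8, 3, 3
inp = torch.rand(batch, in_ch, h, w)
weight = torch.rand(out_ch, in_ch, kh, kw)
offset = torch.rand(batch, 2 * kh * kw, h, w)            # 2 * offset_groups * kh * kw
mask = torch.sigmoid(torch.rand(batch, kh * kw, h, w))   # offset_groups * kh * kw, values in [0, 1]
out = deform_conv2d(inp, offset, weight, padding=1, mask=mask)
print(out.shape)  # torch.Size([4, 5, 8, 8])
```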
For a follow-up PR, it might be good to mention that DeformConv2d now also implements the modulation from https://arxiv.org/abs/1811.11168 (also known as Deformable ConvNets v2).
* Add modulation input for DeformConv2D
* lint
* Patch for GPU CI
* Remove bad cache on CI
Summary: `test_backward_cuda_contiguous` and `test_backward_cuda_non_contiguous` have been failing on fbcode for a while with the error `too many resources requested for launch`, which suggests that too many threads per block are requested. This issue was already causing problems in the original PR #2791 (comment), where the author decided that CC >= 6 was a good threshold because GPUs with CC >= 6 have more registers. (CC = Compute Capability) However, I'm not certain that this is actually true: if we look at https://en.wikipedia.org/wiki/CUDA#Version_features_and_specifications, it's clear that 6.2 has fewer registers per thread block than 6.0. So I'm not sure this threshold completely makes sense. Moreover, let's note that the current tests (as on `master`):

- **pass** on the OSS Linux CI, which relied on a P4 GPU up to last week, i.e. **CC = 6.1**
- **pass** on the OSS Windows CI, which relies on a T4 GPU, i.e. **CC = 7.5**
- **fail** on the AWS cluster, which relies on a V100 GPU, i.e. **CC = 7.0**

It is quite unclear to me what kind of resource is "enough" for the tests to pass on both 6.1 and 7.5 but not on 7.0. As a result, I think it's safer to just reduce the number of threads per block, irrespective of the CC. ngimel, fmassa suggested that I tag you here since you could have some valuable insight for us. Thanks!

Reviewed By: fmassa
Differential Revision: D28641626
fbshipit-source-id: 2618c366c5d18bbb7ebafc33032e7ac6c0404d0b
This is the implementation of Modulated Deformable Convolution, a.k.a. DCNv2. This closes #1788.