Ensure torchvision operators are added in C++ #2798

bmanga · 2020-10-13T09:50:37Z

The torchvision operators are now registered with the TORCH_LIBRARY macro. However, the macro relies on generating a static variable for the actual registration. This static variable is however removed by the linker because it is never referenced, preventing the actual registration.

This PR adds a reference to the created variable through a function that is used to initialize a dummy inline variable (to avoid redefintion errors). The approach is the same as the old PR #2253.
This breaks the TORCH_LIBRARY macro abstraction, but gets the C++ ops registered.

torchvision/csrc/vision.h

fmassa · 2020-10-13T12:25:47Z

@ezyang can you have a look at this PR? For more context on this PR, see #2796 (comment)

vfdev-5 · 2020-10-13T12:55:25Z

@bmanga I was playing around that in a separate repository and there can be a solution to split declaration/definition of the function that call the dispatcher:

Drawback is that we have to add another cpp file.

What do you think ?

bmanga · 2020-10-13T13:16:52Z

@vfdev-5 I don't think that solves the problem of the registration done by the static variable created by the TORCH_LIBRARY macro, because within vision.cpp it's still never referenced.

vfdev-5 · 2020-10-13T13:49:58Z

@vfdev-5 I don't think that solves the problem of the registration done by the static variable created by the TORCH_LIBRARY macro, because within vision.cpp it's still never referenced.

I see more or less what you mean. Can't explain why it links with my example...
Anyway, and if we call TORCH_LIBRARY and TORCH_LIBRARY_IMPL in vision.h and include it in the app ?

bmanga · 2020-10-13T14:00:37Z

@vfdev-5 What would happen if it's included twice?

vfdev-5 · 2020-10-13T16:10:14Z

@bmanga yes, seems like it is not a good idea to use it in the header.

ezyang · 2020-10-13T16:38:55Z

In the old issue I stated I was OK with trying out the inline variable approach #2134 which solves the problem of including the header multiple times in different compilation units.

However, I feel that this PR is getting a bit confused about which problem it is trying to solve. Remember there are two distinct pruning problems:

The dependency on libtorchvision.so getting pruned
Unused static initializers / functions getting pruned when you statically link against torchvision

I think only (1) is the most pressing, but you don't actually need anything interesting in https://github.com/pytorch/vision/pull/2798/files#diff-99d0948938535154c28f8a0d6a7ca8b3bbdeae5effcafefffae0531a9f7169adR91 to keep libtorchvision.so; literally ANY symbol will do.

The other thing that I am most concerned about is whether or not this works in all situations we care about. So it will be good to explicate a comprehensive testing strategy that includes Linux, OS X and Windows, as well as a the variety of compiler versions we support.

fmassa · 2020-10-13T18:09:20Z

I agree with Ed on having good CI testing for this.

We have added a cmake-specific CI for OSX / Linux / Windows in #2577 , but we can extend it further for different configurations if needed

bmanga · 2020-10-14T11:19:02Z

@ezyang Yes that's true, those are two different problems (I usually also test with a static version of the library).
Would you prefer that I drop the reference to the static init variable? I think it would be good if we kept the door open for static builds of the lib.

@fmassa I have added a PR (#2807) to test successful registration in C++.

ezyang · 2020-10-15T03:02:52Z

Would you prefer that I drop the reference to the static init variable? I think it would be good if we kept the door open for static builds of the lib.

Sure. But we can probably do something simpler. IIRC, if we have a reference to any symbol in the object file that does the registration, that's sufficient to retain it (and its static initializers). So you don't have to grovel in the internal symbols.

…ailable in C++

bmanga · 2020-10-15T09:06:33Z

@ezyang I tested it without the reference and it seems to work fine. I'm just going to just use cuda_version then.

…lizer

codecov · 2020-10-15T10:05:46Z

Codecov Report

Merging #2798 into master will decrease coverage by 0.02%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #2798      +/-   ##
==========================================
- Coverage   73.31%   73.29%   -0.03%     
==========================================
  Files          99       99              
  Lines        8724     9181     +457     
  Branches     1373     1491     +118     
==========================================
+ Hits         6396     6729     +333     
- Misses       1909     2028     +119     
- Partials      419      424       +5

Impacted Files	Coverage Δ
torchvision/io/image.py	`76.74% <0.00%> (-12.39%)`	⬇️
torchvision/extension.py	`58.66% <0.00%> (-9.52%)`	⬇️
torchvision/io/_video_opt.py	`30.79% <0.00%> (-8.58%)`	⬇️
torchvision/transforms/functional_tensor.py	`75.93% <0.00%> (+2.02%)`	⬆️
torchvision/transforms/transforms.py	`85.73% <0.00%> (+4.74%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3f222e8...f8b3c7d. Read the comment docs.

ezyang · 2020-10-15T14:36:10Z

torchvision/csrc/vision.h

+// Dummy variable to reference a symbol from vision.cpp.
+// This ensures that the torchvision library and the ops registration
+// initializers are not pruned.
+VISION_INLINE_VARIABLE int64_t _cuda_version = cuda_version();


I think it will be better to store a reference to the function pointer, rather than doing a full static initializer (which will bang on cuda version even though there's no reason to do so.)

fmassa · 2020-10-16T10:00:32Z

Merging following @ezyang approval. Thanks a lot @bmanga for the PR and @ezyang and @vfdev-5 for the review!

vfdev-5 · 2020-10-16T11:23:04Z

@bmanga I'm still getting the issue with libtorchvision.so being pruned while testing torchvision master with this PR inside.
Here is what I'm doing:

git clone https://github.com/pytorch/vision
cd vision

# build C++ library
export TORCH_PATH=$(dirname $(python -c "import torch; print(torch.__file__)"))
# /opt/conda/lib/python3.7/site-packages/torch
rm -rf cpp_build && mkdir cpp_build && cd cpp_build && cmake .. -DTorch_DIR=$TORCH_PATH/share/cmake/Torch -DWITH_CUDA=on && make install

# build test_frcnn_tracing
cd ../test/tracing/frcnn && rm -rf build && mkdir build && cd build && cmake .. -DTorch_DIR=$TORCH_PATH/share/cmake/Torch -DWITH_CUDA=on && make


ldd test_frcnn_tracing
        linux-vdso.so.1 (0x00007ffe64755000)
        libc10.so => /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so (0x00007fb5c9c24000)
        libtorch.so => /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch.so (0x00007fb5c9a10000)
        libtorch_cpu.so => /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so (0x00007fb5c2b28000)
        libtorch_cuda.so => /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so (0x00007fb58cd15000)
        libstdc++.so.6 => /opt/conda/lib/libstdc++.so.6 (0x00007fb5ca1fa000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fb58c977000)
        libgcc_s.so.1 => /opt/conda/lib/libgcc_s.so.1 (0x00007fb5ca1de000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fb58c586000)
        libgomp.so.1 => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libgomp.so.1 (0x00007fb5ca1b1000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fb58c367000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fb5ca149000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fb58c163000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fb58bf5b000)
        libmkl_intel_lp64.so => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libmkl_intel_lp64.so (0x00007fb58b24a000)
        libmkl_gnu_thread.so => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libmkl_gnu_thread.so (0x00007fb58951f000)
        libmkl_core.so => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libmkl_core.so (0x00007fb584f59000)
        libcudart.so.10.1 => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libcudart.so.10.1 (0x00007fb584cda000)
        libc10_cuda.so => /opt/conda/lib/python3.7/site-packages/torch/lib/libc10_cuda.so (0x00007fb584aab000)
        libcusparse.so.10 => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libcusparse.so.10 (0x00007fb57d820000)
        libcurand.so.10 => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libcurand.so.10 (0x00007fb5797be000)
        libcusolver.so.10 => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libcusolver.so.10 (0x00007fb56ecad000)
        libnvToolsExt.so.1 => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libnvToolsExt.so.1 (0x00007fb56eaa3000)
        libcufft.so.10 => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libcufft.so.10 (0x00007fb566467000)
        libcublas.so.10 => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libcublas.so.10 (0x00007fb5626c9000)
        libcublasLt.so.10 => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../.././libcublasLt.so.10 (0x00007fb560823000)

g++ --version
# g++ (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0

Any ideas ?

PS: I'm inside pytorch/pytorch:1.6.0-cuda10.1-cudnn7-devel docker container and have cmake 3.13.1

bmanga · 2020-10-16T11:37:32Z

@vfdev-5 can you try including <torchvision/vision.h> in test_frcnn_tracing.cpp?
I don't think it's indirectly included by any other torchvision header file.

vfdev-5 · 2020-10-16T11:42:12Z

@bmanga same even with including <torchvision/vision.h>. Can you check it from your side if you could reproduce it, please ?

vfdev-5 · 2020-10-16T11:48:49Z

@bmanga sorry, my bad. Inserted into another file, not the one I builded.

In this case, libtorchvision.so is present but there is a small messing with cuda

 ldd test_frcnn_tracing
        linux-vdso.so.1 (0x00007fffab186000)
        libc10.so => /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so (0x00007fcd4c92b000)
        libtorchvision.so => /usr/local/lib/libtorchvision.so (0x00007fcd4c2dc000)
        libtorch.so => /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch.so (0x00007fcd4c0c8000)
        libtorch_cpu.so => /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so (0x00007fcd451e0000)
        libtorch_cuda.so => /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so (0x00007fcd0f3cd000)
        libstdc++.so.6 => /opt/conda/lib/libstdc++.so.6 (0x00007fcd4ceff000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fcd0f02f000)
        libgcc_s.so.1 => /opt/conda/lib/libgcc_s.so.1 (0x00007fcd4cee5000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fcd0ec3e000)
        libgomp.so.1 => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libgomp.so.1 (0x00007fcd4ceb8000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fcd0ea1f000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fcd4ce50000)
        libcudart.so.10.1 => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudart.so.10.1 (0x00007fcd0e7a3000)
        libc10_cuda.so => not found
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fcd0e59f000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fcd0e397000)
        libmkl_intel_lp64.so => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libmkl_intel_lp64.so (0x00007fcd0d686000)
        libmkl_gnu_thread.so => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libmkl_gnu_thread.so (0x00007fcd0b95b000)
        libmkl_core.so => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libmkl_core.so (0x00007fcd07395000)
        libc10_cuda.so => /opt/conda/lib/python3.7/site-packages/torch/lib/libc10_cuda.so (0x00007fcd07166000)
        libcusparse.so.10 => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libcusparse.so.10 (0x00007fccffedb000)
        libcurand.so.10 => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libcurand.so.10 (0x00007fccfbe79000)
        libcusolver.so.10 => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libcusolver.so.10 (0x00007fccf1368000)
        libnvToolsExt.so.1 => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libnvToolsExt.so.1 (0x00007fccf115e000)
        libcufft.so.10 => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libcufft.so.10 (0x00007fcce8b22000)
        libcublas.so.10 => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libcublas.so.10 (0x00007fcce4d84000)
        libcublasLt.so.10 => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../.././libcublasLt.so.10 (0x00007fcce2ede000)

Probably, this is for another issue to file.

bmanga · 2020-10-16T11:59:26Z

@vfdev-5 No problem, glad it works :). I wonder if <torchvision/vision.h> should be indirectly included by the other headers by default. Either way it's interesting that test_frcnn_tracing works at all in the current CI, and I'm not yet sure why that is.

@fmassa thanks for the merge! I didn't get around applying the last suggestion by @ezyang but that can be done in a separate PR.

* Ensure torchvision operators are registered in C++ via weak symbols * Add note to README on how to ensure that torchvision operators are available in C++ * Fix dllimport/dllexport on windows, format files * Factor out common macros in single file * Expose cuda_version in the API, use it to avoid pruning of ops initializer

datumbox · 2020-12-02T16:19:09Z

@bmanga

Originally we thought that this PR would help us eliminate the need of the below:

vision/test/tracing/frcnn/test_frcnn_tracing.cpp

Lines 6 to 12 in a075d62

    
           #include <torchvision/nms.h> 
        
           #ifdef _WIN32 
        
           // Windows only 
        
           // This is necessary until operators are automatically registered on include 
        
           static auto _nms = &nms_cpu; 
        
           #endif

and replace it with #include <torchvision/vision.h>. Unfortunately that's not the case. Any ideas?

* Ensure torchvision operators are registered in C++ via weak symbols * Add note to README on how to ensure that torchvision operators are available in C++ * Fix dllimport/dllexport on windows, format files * Factor out common macros in single file * Expose cuda_version in the API, use it to avoid pruning of ops initializer

bmanga · 2020-12-04T19:28:39Z

That's odd, it should work. I will have a look tomorrow.

datumbox · 2020-12-15T13:20:37Z

FYI @bmanga we have recently made a major refactoring on the C++ codebase but the problem still persists.

fmassa · 2021-02-10T11:04:30Z

@bmanga FYI we are still having reports that the current approach is not enough on some systems to avoid the symbols to be stripped, and users are having to do things like this to get it working on all systems.

So I'll be moving forward with the approach proposed by @ezyang in #2134 (comment) and include --no-as-needed. Let me know if you have any concerns about this.

bmanga · 2021-02-10T12:47:30Z

@datumbox @fmassa sorry about being MIA until now. If you give me until this evening I can try to figure out what is going on, otherwise feel free to adjust the cmake options.

bmanga · 2021-02-10T12:55:24Z

@datumbox @fmassa Could you point me to some issues/code that are exposing the problems? I have been a bit out of the loop recently and I am finding it hard to find the relevant discussions.

fmassa · 2021-02-10T13:10:43Z

@bmanga if you could have the time to look at this tonight it would be great, otherwise I'll try to push the --no-as-needed myself.

Here is one example where we need explicit linkage for things to work

vision/test/tracing/frcnn/test_frcnn_tracing.cpp

Lines 6 to 10 in 1b7c0f5

    
           #ifdef _WIN32 
        
           // Windows only 
        
           // This is necessary until operators are automatically registered on include 
        
           static auto _nms = &vision::ops::nms; 
        
           #endif

Without this line, CI fails on Windows, so just including torchvision is not enough.

Another example is in detectron2, where they had to add --no-as-needed in their CMakeLists otherwise it wouldn't work.

bmanga · 2021-02-10T13:14:23Z

@fmassa I'm confused. --no-as-needed is a GNU ld option, but you are saying things are failing on windows. Does adding that option fix the windows build?

fmassa · 2021-02-10T13:41:11Z

@bmanga I haven't tried that yet on torchvision, and I don't know in which systems torchvision ops were not being correctly picked by detectron2, @ppwwyyxx can you comment on that?

bmanga · 2021-02-10T22:50:39Z

I managed to set up the windows environment and reproduce the error, but I couldn't find a fix today. I will investigate more tomorrow if you can wait a bit longer.

ppwwyyxx · 2021-02-11T00:25:52Z

When I added the flag in detectron2 I was testing with the released torchvision at that time, which I think doesn't include this PR.

bmanga · 2021-02-11T15:02:41Z

Tentative fix in #3380

vfdev-5 reviewed Oct 13, 2020

View reviewed changes

torchvision/csrc/vision.h Outdated Show resolved Hide resolved

bmanga force-pushed the register-ops-inline-var-2 branch from 3343df4 to f205585 Compare October 13, 2020 12:51

bmanga force-pushed the register-ops-inline-var-2 branch from f205585 to 0044787 Compare October 13, 2020 13:22

bmanga mentioned this pull request Oct 14, 2020

Add test that returns number of registered vision ops #2807

Open

bmanga added 4 commits October 15, 2020 10:41

Ensure torchvision operators are registered in C++ via weak symbols

fb893e7

Add note to README on how to ensure that torchvision operators are av…

eca76e6

…ailable in C++

Fix dllimport/dllexport on windows, format files

535504d

Factor out common macros in single file

2c3b9d0

Expose cuda_version in the API, use it to avoid pruning of ops initia…

f8b3c7d

…lizer

bmanga force-pushed the register-ops-inline-var-2 branch from 0044787 to f8b3c7d Compare October 15, 2020 09:12

ezyang reviewed Oct 15, 2020

View reviewed changes

ezyang approved these changes Oct 15, 2020

View reviewed changes

fmassa merged commit adfc15c into pytorch:master Oct 16, 2020

This was referenced Oct 28, 2020

Serious problem !!! C++ torchvision does not register nms in Torchscript #2915

Closed

Cannot load scripted Keypoint-RCNN model in C++ #2736

Closed

fmassa mentioned this pull request Dec 9, 2020

Fix windows build issues #3087

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ensure torchvision operators are added in C++ #2798

Ensure torchvision operators are added in C++ #2798

bmanga commented Oct 13, 2020

fmassa commented Oct 13, 2020

vfdev-5 commented Oct 13, 2020

bmanga commented Oct 13, 2020

vfdev-5 commented Oct 13, 2020 •

edited

Loading

bmanga commented Oct 13, 2020

vfdev-5 commented Oct 13, 2020

ezyang commented Oct 13, 2020

fmassa commented Oct 13, 2020 •

edited

Loading

bmanga commented Oct 14, 2020

ezyang commented Oct 15, 2020

bmanga commented Oct 15, 2020

codecov bot commented Oct 15, 2020 •

edited

Loading

ezyang Oct 15, 2020

fmassa commented Oct 16, 2020

vfdev-5 commented Oct 16, 2020 •

edited

Loading

bmanga commented Oct 16, 2020

vfdev-5 commented Oct 16, 2020

vfdev-5 commented Oct 16, 2020 •

edited

Loading

bmanga commented Oct 16, 2020

datumbox commented Dec 2, 2020

bmanga commented Dec 4, 2020

datumbox commented Dec 15, 2020

fmassa commented Feb 10, 2021

bmanga commented Feb 10, 2021

bmanga commented Feb 10, 2021

fmassa commented Feb 10, 2021

bmanga commented Feb 10, 2021

fmassa commented Feb 10, 2021

bmanga commented Feb 10, 2021

ppwwyyxx commented Feb 11, 2021

bmanga commented Feb 11, 2021

Ensure torchvision operators are added in C++ #2798

Ensure torchvision operators are added in C++ #2798

Conversation

bmanga commented Oct 13, 2020

fmassa commented Oct 13, 2020

vfdev-5 commented Oct 13, 2020

bmanga commented Oct 13, 2020

vfdev-5 commented Oct 13, 2020 • edited Loading

bmanga commented Oct 13, 2020

vfdev-5 commented Oct 13, 2020

ezyang commented Oct 13, 2020

fmassa commented Oct 13, 2020 • edited Loading

bmanga commented Oct 14, 2020

ezyang commented Oct 15, 2020

bmanga commented Oct 15, 2020

codecov bot commented Oct 15, 2020 • edited Loading

Codecov Report

ezyang Oct 15, 2020

Choose a reason for hiding this comment

fmassa commented Oct 16, 2020

vfdev-5 commented Oct 16, 2020 • edited Loading

bmanga commented Oct 16, 2020

vfdev-5 commented Oct 16, 2020

vfdev-5 commented Oct 16, 2020 • edited Loading

bmanga commented Oct 16, 2020

datumbox commented Dec 2, 2020

bmanga commented Dec 4, 2020

datumbox commented Dec 15, 2020

fmassa commented Feb 10, 2021

bmanga commented Feb 10, 2021

bmanga commented Feb 10, 2021

fmassa commented Feb 10, 2021

bmanga commented Feb 10, 2021

fmassa commented Feb 10, 2021

bmanga commented Feb 10, 2021

ppwwyyxx commented Feb 11, 2021

bmanga commented Feb 11, 2021

vfdev-5 commented Oct 13, 2020 •

edited

Loading

fmassa commented Oct 13, 2020 •

edited

Loading

codecov bot commented Oct 15, 2020 •

edited

Loading

vfdev-5 commented Oct 16, 2020 •

edited

Loading

vfdev-5 commented Oct 16, 2020 •

edited

Loading