Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure torchvision operators are added in C++ #2798

Merged
merged 5 commits into from
Oct 16, 2020

Conversation

bmanga
Copy link
Contributor

@bmanga bmanga commented Oct 13, 2020

The torchvision operators are now registered with the TORCH_LIBRARY macro. However, the macro relies on generating a static variable for the actual registration. This static variable is however removed by the linker because it is never referenced, preventing the actual registration.

This PR adds a reference to the created variable through a function that is used to initialize a dummy inline variable (to avoid redefintion errors). The approach is the same as the old PR #2253.
This breaks the TORCH_LIBRARY macro abstraction, but gets the C++ ops registered.

torchvision/csrc/vision.h Outdated Show resolved Hide resolved
@fmassa
Copy link
Member

fmassa commented Oct 13, 2020

@ezyang can you have a look at this PR? For more context on this PR, see #2796 (comment)

@bmanga bmanga force-pushed the register-ops-inline-var-2 branch from 3343df4 to f205585 Compare October 13, 2020 12:51
@vfdev-5
Copy link
Collaborator

vfdev-5 commented Oct 13, 2020

@bmanga I was playing around that in a separate repository and there can be a solution to split declaration/definition of the function that call the dispatcher:

Drawback is that we have to add another cpp file.

What do you think ?

@bmanga
Copy link
Contributor Author

bmanga commented Oct 13, 2020

@vfdev-5 I don't think that solves the problem of the registration done by the static variable created by the TORCH_LIBRARY macro, because within vision.cpp it's still never referenced.

@bmanga bmanga force-pushed the register-ops-inline-var-2 branch from f205585 to 0044787 Compare October 13, 2020 13:22
@vfdev-5
Copy link
Collaborator

vfdev-5 commented Oct 13, 2020

@vfdev-5 I don't think that solves the problem of the registration done by the static variable created by the TORCH_LIBRARY macro, because within vision.cpp it's still never referenced.

I see more or less what you mean. Can't explain why it links with my example...
Anyway, and if we call TORCH_LIBRARY and TORCH_LIBRARY_IMPL in vision.h and include it in the app ?

@bmanga
Copy link
Contributor Author

bmanga commented Oct 13, 2020

@vfdev-5 What would happen if it's included twice?

@vfdev-5
Copy link
Collaborator

vfdev-5 commented Oct 13, 2020

@bmanga yes, seems like it is not a good idea to use it in the header.

@ezyang
Copy link
Contributor

ezyang commented Oct 13, 2020

In the old issue I stated I was OK with trying out the inline variable approach #2134 which solves the problem of including the header multiple times in different compilation units.

However, I feel that this PR is getting a bit confused about which problem it is trying to solve. Remember there are two distinct pruning problems:

  1. The dependency on libtorchvision.so getting pruned
  2. Unused static initializers / functions getting pruned when you statically link against torchvision

I think only (1) is the most pressing, but you don't actually need anything interesting in https://github.com/pytorch/vision/pull/2798/files#diff-99d0948938535154c28f8a0d6a7ca8b3bbdeae5effcafefffae0531a9f7169adR91 to keep libtorchvision.so; literally ANY symbol will do.

The other thing that I am most concerned about is whether or not this works in all situations we care about. So it will be good to explicate a comprehensive testing strategy that includes Linux, OS X and Windows, as well as a the variety of compiler versions we support.

@fmassa
Copy link
Member

fmassa commented Oct 13, 2020

I agree with Ed on having good CI testing for this.

We have added a cmake-specific CI for OSX / Linux / Windows in #2577 , but we can extend it further for different configurations if needed

@bmanga
Copy link
Contributor Author

bmanga commented Oct 14, 2020

@ezyang Yes that's true, those are two different problems (I usually also test with a static version of the library).
Would you prefer that I drop the reference to the static init variable? I think it would be good if we kept the door open for static builds of the lib.

@fmassa I have added a PR (#2807) to test successful registration in C++.

@ezyang
Copy link
Contributor

ezyang commented Oct 15, 2020

Would you prefer that I drop the reference to the static init variable? I think it would be good if we kept the door open for static builds of the lib.

Sure. But we can probably do something simpler. IIRC, if we have a reference to any symbol in the object file that does the registration, that's sufficient to retain it (and its static initializers). So you don't have to grovel in the internal symbols.

@bmanga
Copy link
Contributor Author

bmanga commented Oct 15, 2020

@ezyang I tested it without the reference and it seems to work fine. I'm just going to just use cuda_version then.

@bmanga bmanga force-pushed the register-ops-inline-var-2 branch from 0044787 to f8b3c7d Compare October 15, 2020 09:12
@codecov
Copy link

codecov bot commented Oct 15, 2020

Codecov Report

Merging #2798 into master will decrease coverage by 0.02%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2798      +/-   ##
==========================================
- Coverage   73.31%   73.29%   -0.03%     
==========================================
  Files          99       99              
  Lines        8724     9181     +457     
  Branches     1373     1491     +118     
==========================================
+ Hits         6396     6729     +333     
- Misses       1909     2028     +119     
- Partials      419      424       +5     
Impacted Files Coverage Δ
torchvision/io/image.py 76.74% <0.00%> (-12.39%) ⬇️
torchvision/extension.py 58.66% <0.00%> (-9.52%) ⬇️
torchvision/io/_video_opt.py 30.79% <0.00%> (-8.58%) ⬇️
torchvision/transforms/functional_tensor.py 75.93% <0.00%> (+2.02%) ⬆️
torchvision/transforms/transforms.py 85.73% <0.00%> (+4.74%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3f222e8...f8b3c7d. Read the comment docs.

// Dummy variable to reference a symbol from vision.cpp.
// This ensures that the torchvision library and the ops registration
// initializers are not pruned.
VISION_INLINE_VARIABLE int64_t _cuda_version = cuda_version();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it will be better to store a reference to the function pointer, rather than doing a full static initializer (which will bang on cuda version even though there's no reason to do so.)

@fmassa
Copy link
Member

fmassa commented Oct 16, 2020

Merging following @ezyang approval. Thanks a lot @bmanga for the PR and @ezyang and @vfdev-5 for the review!

@fmassa fmassa merged commit adfc15c into pytorch:master Oct 16, 2020
@vfdev-5
Copy link
Collaborator

vfdev-5 commented Oct 16, 2020

@bmanga I'm still getting the issue with libtorchvision.so being pruned while testing torchvision master with this PR inside.
Here is what I'm doing:

git clone https://github.com/pytorch/vision
cd vision

# build C++ library
export TORCH_PATH=$(dirname $(python -c "import torch; print(torch.__file__)"))
# /opt/conda/lib/python3.7/site-packages/torch
rm -rf cpp_build && mkdir cpp_build && cd cpp_build && cmake .. -DTorch_DIR=$TORCH_PATH/share/cmake/Torch -DWITH_CUDA=on && make install

# build test_frcnn_tracing
cd ../test/tracing/frcnn && rm -rf build && mkdir build && cd build && cmake .. -DTorch_DIR=$TORCH_PATH/share/cmake/Torch -DWITH_CUDA=on && make


ldd test_frcnn_tracing
        linux-vdso.so.1 (0x00007ffe64755000)
        libc10.so => /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so (0x00007fb5c9c24000)
        libtorch.so => /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch.so (0x00007fb5c9a10000)
        libtorch_cpu.so => /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so (0x00007fb5c2b28000)
        libtorch_cuda.so => /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so (0x00007fb58cd15000)
        libstdc++.so.6 => /opt/conda/lib/libstdc++.so.6 (0x00007fb5ca1fa000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fb58c977000)
        libgcc_s.so.1 => /opt/conda/lib/libgcc_s.so.1 (0x00007fb5ca1de000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fb58c586000)
        libgomp.so.1 => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libgomp.so.1 (0x00007fb5ca1b1000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fb58c367000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fb5ca149000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fb58c163000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fb58bf5b000)
        libmkl_intel_lp64.so => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libmkl_intel_lp64.so (0x00007fb58b24a000)
        libmkl_gnu_thread.so => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libmkl_gnu_thread.so (0x00007fb58951f000)
        libmkl_core.so => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libmkl_core.so (0x00007fb584f59000)
        libcudart.so.10.1 => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libcudart.so.10.1 (0x00007fb584cda000)
        libc10_cuda.so => /opt/conda/lib/python3.7/site-packages/torch/lib/libc10_cuda.so (0x00007fb584aab000)
        libcusparse.so.10 => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libcusparse.so.10 (0x00007fb57d820000)
        libcurand.so.10 => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libcurand.so.10 (0x00007fb5797be000)
        libcusolver.so.10 => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libcusolver.so.10 (0x00007fb56ecad000)
        libnvToolsExt.so.1 => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libnvToolsExt.so.1 (0x00007fb56eaa3000)
        libcufft.so.10 => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libcufft.so.10 (0x00007fb566467000)
        libcublas.so.10 => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libcublas.so.10 (0x00007fb5626c9000)
        libcublasLt.so.10 => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../.././libcublasLt.so.10 (0x00007fb560823000)

g++ --version
# g++ (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0

Any ideas ?

PS: I'm inside pytorch/pytorch:1.6.0-cuda10.1-cudnn7-devel docker container and have cmake 3.13.1

@bmanga
Copy link
Contributor Author

bmanga commented Oct 16, 2020

@vfdev-5 can you try including <torchvision/vision.h> in test_frcnn_tracing.cpp?
I don't think it's indirectly included by any other torchvision header file.

@vfdev-5
Copy link
Collaborator

vfdev-5 commented Oct 16, 2020

@bmanga same even with including <torchvision/vision.h>. Can you check it from your side if you could reproduce it, please ?

@vfdev-5
Copy link
Collaborator

vfdev-5 commented Oct 16, 2020

@bmanga sorry, my bad. Inserted into another file, not the one I builded.

In this case, libtorchvision.so is present but there is a small messing with cuda

 ldd test_frcnn_tracing
        linux-vdso.so.1 (0x00007fffab186000)
        libc10.so => /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so (0x00007fcd4c92b000)
        libtorchvision.so => /usr/local/lib/libtorchvision.so (0x00007fcd4c2dc000)
        libtorch.so => /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch.so (0x00007fcd4c0c8000)
        libtorch_cpu.so => /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so (0x00007fcd451e0000)
        libtorch_cuda.so => /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so (0x00007fcd0f3cd000)
        libstdc++.so.6 => /opt/conda/lib/libstdc++.so.6 (0x00007fcd4ceff000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fcd0f02f000)
        libgcc_s.so.1 => /opt/conda/lib/libgcc_s.so.1 (0x00007fcd4cee5000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fcd0ec3e000)
        libgomp.so.1 => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libgomp.so.1 (0x00007fcd4ceb8000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fcd0ea1f000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fcd4ce50000)
        libcudart.so.10.1 => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudart.so.10.1 (0x00007fcd0e7a3000)
        libc10_cuda.so => not found
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fcd0e59f000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fcd0e397000)
        libmkl_intel_lp64.so => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libmkl_intel_lp64.so (0x00007fcd0d686000)
        libmkl_gnu_thread.so => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libmkl_gnu_thread.so (0x00007fcd0b95b000)
        libmkl_core.so => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libmkl_core.so (0x00007fcd07395000)
        libc10_cuda.so => /opt/conda/lib/python3.7/site-packages/torch/lib/libc10_cuda.so (0x00007fcd07166000)
        libcusparse.so.10 => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libcusparse.so.10 (0x00007fccffedb000)
        libcurand.so.10 => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libcurand.so.10 (0x00007fccfbe79000)
        libcusolver.so.10 => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libcusolver.so.10 (0x00007fccf1368000)
        libnvToolsExt.so.1 => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libnvToolsExt.so.1 (0x00007fccf115e000)
        libcufft.so.10 => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libcufft.so.10 (0x00007fcce8b22000)
        libcublas.so.10 => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libcublas.so.10 (0x00007fcce4d84000)
        libcublasLt.so.10 => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../.././libcublasLt.so.10 (0x00007fcce2ede000)

Probably, this is for another issue to file.

@bmanga
Copy link
Contributor Author

bmanga commented Oct 16, 2020

@vfdev-5 No problem, glad it works :). I wonder if <torchvision/vision.h> should be indirectly included by the other headers by default. Either way it's interesting that test_frcnn_tracing works at all in the current CI, and I'm not yet sure why that is.

@fmassa thanks for the merge! I didn't get around applying the last suggestion by @ezyang but that can be done in a separate PR.

bryant1410 pushed a commit to bryant1410/vision-1 that referenced this pull request Nov 22, 2020
* Ensure torchvision operators are registered in C++ via weak symbols

* Add note to README on how to ensure that torchvision operators are available in C++

* Fix dllimport/dllexport on windows, format files

* Factor out common macros in single file

* Expose cuda_version in the API, use it to avoid pruning of ops initializer
@datumbox
Copy link
Contributor

datumbox commented Dec 2, 2020

@bmanga

Originally we thought that this PR would help us eliminate the need of the below:

#include <torchvision/nms.h>
#ifdef _WIN32
// Windows only
// This is necessary until operators are automatically registered on include
static auto _nms = &nms_cpu;
#endif

and replace it with #include <torchvision/vision.h>. Unfortunately that's not the case. Any ideas?

vfdev-5 pushed a commit to Quansight/vision that referenced this pull request Dec 4, 2020
* Ensure torchvision operators are registered in C++ via weak symbols

* Add note to README on how to ensure that torchvision operators are available in C++

* Fix dllimport/dllexport on windows, format files

* Factor out common macros in single file

* Expose cuda_version in the API, use it to avoid pruning of ops initializer
@bmanga
Copy link
Contributor Author

bmanga commented Dec 4, 2020

That's odd, it should work. I will have a look tomorrow.

@fmassa fmassa mentioned this pull request Dec 9, 2020
@datumbox
Copy link
Contributor

FYI @bmanga we have recently made a major refactoring on the C++ codebase but the problem still persists.

@fmassa
Copy link
Member

fmassa commented Feb 10, 2021

@bmanga FYI we are still having reports that the current approach is not enough on some systems to avoid the symbols to be stripped, and users are having to do things like this to get it working on all systems.

So I'll be moving forward with the approach proposed by @ezyang in #2134 (comment) and include --no-as-needed. Let me know if you have any concerns about this.

@bmanga
Copy link
Contributor Author

bmanga commented Feb 10, 2021

@datumbox @fmassa sorry about being MIA until now. If you give me until this evening I can try to figure out what is going on, otherwise feel free to adjust the cmake options.

@bmanga
Copy link
Contributor Author

bmanga commented Feb 10, 2021

@datumbox @fmassa Could you point me to some issues/code that are exposing the problems? I have been a bit out of the loop recently and I am finding it hard to find the relevant discussions.

@fmassa
Copy link
Member

fmassa commented Feb 10, 2021

@bmanga if you could have the time to look at this tonight it would be great, otherwise I'll try to push the --no-as-needed myself.

Here is one example where we need explicit linkage for things to work

#ifdef _WIN32
// Windows only
// This is necessary until operators are automatically registered on include
static auto _nms = &vision::ops::nms;
#endif

Without this line, CI fails on Windows, so just including torchvision is not enough.

Another example is in detectron2, where they had to add --no-as-needed in their CMakeLists otherwise it wouldn't work.

@bmanga
Copy link
Contributor Author

bmanga commented Feb 10, 2021

@fmassa I'm confused. --no-as-needed is a GNU ld option, but you are saying things are failing on windows. Does adding that option fix the windows build?

@fmassa
Copy link
Member

fmassa commented Feb 10, 2021

@bmanga I haven't tried that yet on torchvision, and I don't know in which systems torchvision ops were not being correctly picked by detectron2, @ppwwyyxx can you comment on that?

@bmanga
Copy link
Contributor Author

bmanga commented Feb 10, 2021

I managed to set up the windows environment and reproduce the error, but I couldn't find a fix today. I will investigate more tomorrow if you can wait a bit longer.

@ppwwyyxx
Copy link
Contributor

When I added the flag in detectron2 I was testing with the released torchvision at that time, which I think doesn't include this PR.

@bmanga
Copy link
Contributor Author

bmanga commented Feb 11, 2021

Tentative fix in #3380

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants