Ensure torchvision operators are added in C++ #2798
@ezyang can you have a look at this PR? For more context on this PR, see #2796 (comment) |
3343df4 to f205585
@bmanga I was playing around with that in a separate repository, and one possible solution is to split the declaration and the definition of the function that calls the dispatcher:
The drawback is that we have to add another cpp file. What do you think? |
@vfdev-5 I don't think that solves the problem of the registration done by the static variable created by the TORCH_LIBRARY macro, because within vision.cpp it's still never referenced. |
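For readers unfamiliar with the mechanism being discussed, here is a minimal single-file sketch of registration through a static variable. It is not the TORCH_LIBRARY implementation: `registry`, `Registrar`, `register_double`, and `call_op` are hypothetical names invented for illustration. The point is only that the registration runs as a side effect of constructing a static object that no other code refers to by name.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>

namespace demo {

using Op = std::function<int(int)>;

// Hypothetical stand-in for the dispatcher's operator table.
// A function-local static avoids depending on initialization order.
std::map<std::string, Op>& registry() {
  static std::map<std::string, Op> table;
  return table;
}

// Hypothetical registrar: its constructor performs the registration.
struct Registrar {
  Registrar(const std::string& name, Op fn) {
    registry().emplace(name, std::move(fn));
  }
};

// The only thing that triggers registration is this static object.
// No other code names it, so a linker that sees no reference into this
// object file may leave the file (and this initializer) out.
static Registrar register_double("demo::double", [](int x) { return 2 * x; });

// Looks an operator up by name, roughly the way a dispatcher call would.
int call_op(const std::string& name, int arg) {
  auto it = registry().find(name);
  assert(it != registry().end() && "operator was never registered");
  return it->second(arg);
}

}  // namespace demo
```

In a single translation unit the static object is always constructed, so `demo::call_op("demo::double", 21)` finds the operator. The pruning discussed in this thread only appears once the registrar lives in a separate library that the application never references.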
f205585 to 0044787
I see more or less what you mean. Can't explain why it links with my example... |
@vfdev-5 What would happen if it's included twice? |
@bmanga yes, seems like it is not a good idea to use it in the header. |
In the old issue I stated I was OK with trying out the inline variable approach #2134 which solves the problem of including the header multiple times in different compilation units. However, I feel that this PR is getting a bit confused about which problem it is trying to solve. Remember there are two distinct pruning problems:
I think only (1) is the most pressing, but you don't actually need anything interesting in https://github.com/pytorch/vision/pull/2798/files#diff-99d0948938535154c28f8a0d6a7ca8b3bbdeae5effcafefffae0531a9f7169adR91 to keep libtorchvision.so; literally ANY symbol will do. The other thing that I am most concerned about is whether or not this works in all situations we care about. So it will be good to explicate a comprehensive testing strategy that includes Linux, OS X and Windows, as well as the variety of compiler versions we support. |
I agree with Ed on having good CI testing for this. We have added a cmake-specific CI for OSX / Linux / Windows in #2577 , but we can extend it further for different configurations if needed |
@ezyang Yes that's true, those are two different problems (I usually also test with a static version of the library). @fmassa I have added a PR (#2807) to test successful registration in C++. |
Sure. But we can probably do something simpler. IIRC, if we have a reference to any symbol in the object file that does the registration, that's sufficient to retain it (and its static initializers). So you don't have to grovel in the internal symbols. |
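The idea in the comment above can be sketched in one file as follows. This is an illustration under assumptions, not torchvision's code: `lib::registered_ops`, `lib::anchor_version`, and `lib_header::dummy_anchor` are hypothetical names, and the two commented sections stand in for what would be a library source file and a public header in a real build. The anchor function does nothing interesting; it only has to be defined in the same object file as the static initializer.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// ---- Part that would live in the library's .cpp file (hypothetical) ----
namespace lib {

// Names of the operators registered so far.
std::vector<std::string>& registered_ops() {
  static std::vector<std::string> ops;
  return ops;
}

namespace {
// Static initializer that performs the registration. Nothing refers to it.
struct Registrar {
  Registrar() { registered_ops().push_back("demo::nms"); }
};
Registrar registrar;
}  // namespace

// Any exported symbol defined in the same object file can act as the anchor.
std::int64_t anchor_version() { return 1; }

}  // namespace lib

// ---- Part that would live in a public header (hypothetical) ----
namespace lib_header {
// C++17 inline variable: safe to define in a header included by many files.
// Its initializer refers to lib::anchor_version, so the application's object
// file carries a reference into the library's object file.
inline std::int64_t dummy_anchor = lib::anchor_version();
}  // namespace lib_header
```

A single file cannot reproduce the pruning itself, since everything here ends up in one object file. Whether the reference is enough on a particular linker and platform still has to be checked in a real multi-library build, which is what the testing discussion in this thread is about.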
@ezyang I tested it without the reference and it seems to work fine. I'm just going to use cuda_version then. |
0044787 to f8b3c7d
Codecov Report
@@ Coverage Diff @@
## master #2798 +/- ##
==========================================
- Coverage 73.31% 73.29% -0.03%
==========================================
Files 99 99
Lines 8724 9181 +457
Branches 1373 1491 +118
==========================================
+ Hits 6396 6729 +333
- Misses 1909 2028 +119
- Partials 419 424 +5
|
// Dummy variable to reference a symbol from vision.cpp.
// This ensures that the torchvision library and the ops registration
// initializers are not pruned.
VISION_INLINE_VARIABLE int64_t _cuda_version = cuda_version();
I think it will be better to store a reference to the function pointer, rather than doing a full static initializer (which will bang on cuda version even though there's no reason to do so.)
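The difference between the two options can be sketched as below. This is one possible reading of the suggestion, not code from the PR: the namespace, the call counter `g_calls`, the variable names, and the returned value are all made up for illustration, and the real `cuda_version()` is replaced by a placeholder.

```cpp
#include <cstdint>

namespace ptr_demo {

// Counts how many times cuda_version() has run. It is zero-initialized
// before any dynamic initializer executes.
int g_calls = 0;

// Placeholder for the real function; the return value is made up.
std::int64_t cuda_version() {
  ++g_calls;
  return 10010;
}

// Variant A, as in the diff above: a dynamic initializer that calls the
// function once during program startup.
inline std::int64_t dummy_by_call = cuda_version();

// Variant B, one reading of the suggestion: keep only the function's
// address. This is a constant initializer, so nothing is called at startup.
inline std::int64_t (*const dummy_by_pointer)() = &cuda_version;

}  // namespace ptr_demo
```

After startup `g_calls` is 1 because of variant A alone; variant B adds nothing until someone calls through the pointer. The sketch shows only the runtime difference. Whether a given compiler and linker keep the reference from variant B when the variable is otherwise unused is a separate question that would need testing on each supported toolchain.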
@bmanga I'm still getting the issue with libtorchvision.so being pruned while testing torchvision master with this PR inside.
Any ideas ? PS: I'm inside |
@vfdev-5 can you try including |
@bmanga same even with including |
@bmanga sorry, my bad. I inserted it into another file, not the one I built. In this case, ldd test_frcnn_tracing
linux-vdso.so.1 (0x00007fffab186000)
libc10.so => /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so (0x00007fcd4c92b000)
libtorchvision.so => /usr/local/lib/libtorchvision.so (0x00007fcd4c2dc000)
libtorch.so => /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch.so (0x00007fcd4c0c8000)
libtorch_cpu.so => /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so (0x00007fcd451e0000)
libtorch_cuda.so => /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so (0x00007fcd0f3cd000)
libstdc++.so.6 => /opt/conda/lib/libstdc++.so.6 (0x00007fcd4ceff000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fcd0f02f000)
libgcc_s.so.1 => /opt/conda/lib/libgcc_s.so.1 (0x00007fcd4cee5000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fcd0ec3e000)
libgomp.so.1 => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libgomp.so.1 (0x00007fcd4ceb8000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fcd0ea1f000)
/lib64/ld-linux-x86-64.so.2 (0x00007fcd4ce50000)
libcudart.so.10.1 => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudart.so.10.1 (0x00007fcd0e7a3000)
libc10_cuda.so => not found
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fcd0e59f000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fcd0e397000)
libmkl_intel_lp64.so => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libmkl_intel_lp64.so (0x00007fcd0d686000)
libmkl_gnu_thread.so => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libmkl_gnu_thread.so (0x00007fcd0b95b000)
libmkl_core.so => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libmkl_core.so (0x00007fcd07395000)
libc10_cuda.so => /opt/conda/lib/python3.7/site-packages/torch/lib/libc10_cuda.so (0x00007fcd07166000)
libcusparse.so.10 => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libcusparse.so.10 (0x00007fccffedb000)
libcurand.so.10 => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libcurand.so.10 (0x00007fccfbe79000)
libcusolver.so.10 => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libcusolver.so.10 (0x00007fccf1368000)
libnvToolsExt.so.1 => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libnvToolsExt.so.1 (0x00007fccf115e000)
libcufft.so.10 => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libcufft.so.10 (0x00007fcce8b22000)
libcublas.so.10 => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libcublas.so.10 (0x00007fcce4d84000)
libcublasLt.so.10 => /opt/conda/lib/python3.7/site-packages/torch/lib/../../../.././libcublasLt.so.10 (0x00007fcce2ede000) Probably, this is for another issue to file. |
@vfdev-5 No problem, glad it works :). I wonder if @fmassa thanks for the merge! I didn't get around to applying the last suggestion by @ezyang but that can be done in a separate PR. |
* Ensure torchvision operators are registered in C++ via weak symbols
* Add note to README on how to ensure that torchvision operators are available in C++
* Fix dllimport/dllexport on windows, format files
* Factor out common macros in single file
* Expose cuda_version in the API, use it to avoid pruning of ops initializer
Originally we thought that this PR would help us eliminate the need of the below: vision/test/tracing/frcnn/test_frcnn_tracing.cpp Lines 6 to 12 in a075d62
and replace it with |
That's odd, it should work. I will have a look tomorrow. |
FYI @bmanga we have recently made a major refactoring on the C++ codebase but the problem still persists. |
@bmanga FYI we are still getting reports that on some systems the current approach is not enough to keep the symbols from being stripped, and users have to do things like this to get it working on all systems. So I'll be moving forward with the approach proposed by @ezyang in #2134 (comment) and include |
@bmanga if you have time to look at this tonight, that would be great; otherwise I'll try to push the Here is one example where we need explicit linkage for things to work vision/test/tracing/frcnn/test_frcnn_tracing.cpp Lines 6 to 10 in 1b7c0f5
Without this line, CI fails on Windows, so just including torchvision is not enough. Another example is in detectron2, where they had to add |
@fmassa I'm confused. --no-as-needed is a GNU ld option, but you are saying things are failing on windows. Does adding that option fix the windows build? |
I managed to set up the windows environment and reproduce the error, but I couldn't find a fix today. I will investigate more tomorrow if you can wait a bit longer. |
When I added the flag in detectron2 I was testing with the released torchvision at that time, which I think doesn't include this PR. |
Tentative fix in #3380 |
The torchvision operators are now registered with the TORCH_LIBRARY macro. However, the macro relies on generating a static variable for the actual registration. This static variable is however removed by the linker because it is never referenced, preventing the actual registration.
This PR adds a reference to the created variable through a function that is used to initialize a dummy inline variable (inline to avoid redefinition errors). The approach is the same as in the old PR #2253.
This breaks the TORCH_LIBRARY macro abstraction, but gets the C++ ops registered.