
Compile using the toolkit, not the driver. #892

Merged · 3 commits · May 28, 2021

Conversation

@maleadt (Member) commented May 7, 2021

In response to the driver's compiler bug in #891, which doesn't manifest on 11.3's ptxas. Should also enable #832 without having to use libnvptxcompiler. Might also help towards fixing #812, because if we additionally expose libnvvm, we can do LTO with nvlink (this is not exposed by the linker API).

Currently stuck on https://forums.developer.nvidia.com/t/manually-perform-separate-compilation-with-ptxas-and-nvlink/177183: I can't figure out how to use nvlink to link against libcudadevrt.

Artifacts have not been adapted, so this only works with `JULIA_CUDA_USE_BINARYBUILDER=false`. It does pass all tests, though, except for the ones requiring the device runtime.
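For readers unfamiliar with the approach: the core idea is to replace the driver's embedded PTX JIT with the toolkit's standalone `ptxas`. A rough sketch of what that looks like from Julia — a simplified illustration, not the actual PR code; the temp-file handling and helper name are assumptions:

```julia
using CUDA

# Hedged sketch: compile PTX with the toolkit's standalone ptxas instead of
# letting the driver JIT it at module-load time. Not the PR's actual code.
function compile_with_ptxas(ptx::String; cap = capability(device()))
    arch = "sm_$(cap.major)$(cap.minor)"
    mktempdir() do dir
        ptx_path = joinpath(dir, "kernel.ptx")
        cubin_path = joinpath(dir, "kernel.cubin")
        write(ptx_path, ptx)
        # ptxas from the CUDA toolkit turns PTX into SASS (a cubin)
        run(`ptxas --gpu-name $arch --output-file $cubin_path $ptx_path`)
        # the driver then only has to load the precompiled cubin
        CuModule(read(cubin_path))
    end
end
```

Because the heavy lifting happens in a toolkit binary rather than in the driver, the ptxas version can be upgraded independently of the installed driver.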

@maleadt added the `cuda kernels` label (Stuff about writing CUDA kernels) on May 7, 2021
@maleadt (Member Author) commented May 7, 2021

Not too much slower:

julia> using BenchmarkTools

julia> using CUDA
[ Info: Precompiling CUDA [052768ef-5323-5732-b1bb-66c8b64840ba]

julia> kernel() = nothing
kernel (generic function with 1 method)

julia> @cuda kernel()
CUDA.HostKernel{typeof(kernel), Tuple{}}(kernel, CuContext(0x0000000002d01a60, instance 8571ef5868491e3a), CuModule(Ptr{Nothing} @0x00000000050624c0, CuContext(0x0000000002d01a60, instance 8571ef5868491e3a)), CuFunction(Ptr{Nothing} @0x000000000524ddb0, CuModule(Ptr{Nothing} @0x00000000050624c0, CuContext(0x0000000002d01a60, instance 8571ef5868491e3a))))

julia> @benchmark begin empty!(CUDA.cufunction_cache[device()]); cufunction(kernel, Tuple{}) end
BenchmarkTools.Trial: 
  memory estimate:  1.15 MiB
  allocs estimate:  1144
  --------------
  minimum time:     41.905 ms (0.00% GC)
  median time:      44.220 ms (0.00% GC)
  mean time:        45.104 ms (0.07% GC)
  maximum time:     68.945 ms (0.00% GC)
  --------------
  samples:          111
  evals/sample:     1

Whereas before:

julia> @benchmark begin empty!(CUDA.cufunction_cache[device()]); cufunction(kernel, Tuple{}) end
BenchmarkTools.Trial: 
  memory estimate:  1.07 MiB
  allocs estimate:  963
  --------------
  minimum time:     39.197 ms (0.00% GC)
  median time:      40.030 ms (0.00% GC)
  mean time:        40.669 ms (0.05% GC)
  maximum time:     54.726 ms (0.00% GC)
  --------------
  samples:          123
  evals/sample:     1

@maleadt maleadt force-pushed the tb/ptxas branch 4 times, most recently from 12795c7 to 2df1a9f Compare May 7, 2021 12:13
@maleadt maleadt marked this pull request as ready for review May 7, 2021 12:59
@maleadt (Member Author) commented May 7, 2021

CI times seem to suggest this does hurt compile-time performance...

@codecov bot commented May 7, 2021

Codecov Report

Merging #892 (8bca8f9) into master (eb7c326) will decrease coverage by 0.67%.
The diff coverage is 93.18%.

❗ Current head 8bca8f9 differs from the pull request's most recent head 69cf7ec. Consider uploading reports for the commit 69cf7ec to get more accurate results.
Impacted file tree graph

@@            Coverage Diff             @@
##           master     #892      +/-   ##
==========================================
- Coverage   77.00%   76.32%   -0.68%     
==========================================
  Files         121      121              
  Lines        7706     7811     +105     
==========================================
+ Hits         5934     5962      +28     
- Misses       1772     1849      +77     
Impacted Files Coverage Δ
src/compiler/execution.jl 89.50% <93.18%> (+0.78%) ⬆️
src/reverse.jl 69.49% <0.00%> (-30.51%) ⬇️
src/indexing.jl 76.38% <0.00%> (-21.83%) ⬇️
lib/cudnn/dropout.jl 87.50% <0.00%> (-12.50%) ⬇️
lib/cublas/util.jl 82.22% <0.00%> (-8.03%) ⬇️
lib/cusolver/base.jl 52.63% <0.00%> (-6.20%) ⬇️
lib/cudnn/softmax.jl 69.23% <0.00%> (-5.77%) ⬇️
lib/cudnn/activation.jl 71.42% <0.00%> (-5.50%) ⬇️
lib/cudnn/optensor.jl 76.47% <0.00%> (-4.78%) ⬇️
lib/cusolver/linalg.jl 85.24% <0.00%> (-2.89%) ⬇️
... and 15 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@maleadt (Member Author) commented May 10, 2021

There still seems to be some difference, about 10% in compile time, but that appears to be caused by ptxas not using `~/.nv/ComputeCache`. I don't think that's a reason to abandon this change, and we can always implement our own cache if it turns out to be a problem.
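The "our own cache" idea could look roughly like the following — a hypothetical on-disk cache keyed by a hash of the PTX source. The directory location and function names are illustrative assumptions, not anything CUDA.jl actually ships:

```julia
using SHA  # Julia standard library

# Hypothetical cache location; illustrative only.
const CACHE_DIR = joinpath(homedir(), ".julia", "cuda_compile_cache")

# Memoize an expensive PTX -> cubin compilation on disk, keyed by content hash.
function cached_compile(ptx::String, compile::Function)
    key = bytes2hex(sha256(ptx))
    path = joinpath(CACHE_DIR, key * ".cubin")
    isfile(path) && return read(path)   # cache hit: skip ptxas entirely
    cubin = compile(ptx)                # cache miss: run the real compiler
    mkpath(CACHE_DIR)
    write(path, cubin)
    return cubin
end
```

Content-addressing on the PTX makes invalidation automatic: any change to the generated code produces a new key.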

@maleadt (Member Author) commented May 10, 2021

I'm still bothered by the performance hit... I'm first going to add some timers to master to report on the current compilation times before blindly merging this.

@maleadt maleadt marked this pull request as draft May 10, 2021 15:58
@maleadt maleadt force-pushed the tb/ptxas branch 4 times, most recently from 0b7331c to 62cb88e Compare May 12, 2021 08:23
@maleadt maleadt force-pushed the tb/ptxas branch 2 times, most recently from 8bca8f9 to 69cf7ec Compare May 28, 2021 13:13
@maleadt (Member Author) commented May 28, 2021

Turns out argument interpolation is ridiculously expensive to infer...
[screenshot: inference profile]
That whole additional inference subtree on the left is entirely coming from argument interpolation...

@maleadt (Member Author) commented May 28, 2021

Avoiding cmd-in-cmd interpolation seems to make inference happy, and with it I think this PR is basically at performance parity 🎉

Some context on why this is exciting: if we can use ptxas from the toolkit, not only will it be easier to upgrade the compiler, it will also make it possible to fix #832 and use newer CUDA toolkits semver-style (e.g. 11.3 on a driver that only supports 11.0).
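To illustrate the cmd-in-cmd issue (a minimal reconstruction under assumptions; the PR's actual code differs): interpolating one `Cmd` object into another backtick literal routes through Julia's generic command-interpolation machinery, which is what blew up inference here, whereas building the command from plain strings in one step stays cheap to infer:

```julia
# Hypothetical minimal example of the pattern; not the PR's actual code.
ptxas = `ptxas`  # base command, e.g. resolved from the toolkit installation

# cmd-in-cmd interpolation: a Cmd spliced into another backtick literal,
# expensive for the compiler to infer
expensive = `$ptxas --gpu-name sm_70 -o out.cubin in.ptx`

# constructing the command directly from strings: inference-friendly
cheap = Cmd(["ptxas", "--gpu-name", "sm_70", "-o", "out.cubin", "in.ptx"])
```

Both forms produce an equivalent command to run; the difference is purely in how much work type inference has to do at the call site.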

@maleadt maleadt marked this pull request as ready for review May 28, 2021 15:56
@maleadt maleadt merged commit 83c8dfc into master May 28, 2021
@maleadt maleadt deleted the tb/ptxas branch May 28, 2021 16:08