
Compile using the toolkit, not the driver. #892

Merged · 3 commits · May 28, 2021

Conversation

@maleadt (Member) commented May 7, 2021

In response to the driver's compiler bug in #891, which doesn't manifest on 11.3's ptxas. Should also enable #832 without having to use libnvptxcompiler. Might also help towards fixing #812, because if we additionally expose libnvvm, we can do LTO with nvlink (this is not exposed by the linker API).

Currently stuck on https://forums.developer.nvidia.com/t/manually-perform-separate-compilation-with-ptxas-and-nvlink/177183: I can't figure out how to use nvlink to link against libcudadevrt.

Artifacts have not been adapted, so this only works with `JULIA_CUDA_USE_BINARYBUILDER=false`. It does pass all tests, though, except for the ones requiring the device runtime.
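For readers unfamiliar with the approach: the core idea is to replace the driver's embedded PTX JIT with the toolkit's standalone `ptxas`. A rough sketch of what that looks like from Julia — a simplified illustration, not the actual PR code; the temp-file handling and helper name are assumptions:

```julia
using CUDA

# Hedged sketch: compile PTX with the toolkit's standalone ptxas instead of
# letting the driver JIT it at module-load time. Not the PR's actual code.
function compile_with_ptxas(ptx::String; cap = capability(device()))
    arch = "sm_$(cap.major)$(cap.minor)"
    mktempdir() do dir
        ptx_path = joinpath(dir, "kernel.ptx")
        cubin_path = joinpath(dir, "kernel.cubin")
        write(ptx_path, ptx)
        # ptxas from the CUDA toolkit turns PTX into SASS (a cubin)
        run(`ptxas --gpu-name $arch --output-file $cubin_path $ptx_path`)
        # the driver then only has to load the precompiled cubin
        CuModule(read(cubin_path))
    end
end
```

Because the heavy lifting happens in a toolkit binary rather than in the driver, the ptxas version can be upgraded independently of the installed driver.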

@maleadt added the `cuda kernels` label (Stuff about writing CUDA kernels) on May 7, 2021
@maleadt (Member Author) commented May 7, 2021

Not too much slower:

julia> using BenchmarkTools

julia> using CUDA
[ Info: Precompiling CUDA [052768ef-5323-5732-b1bb-66c8b64840ba]

julia> kernel() = nothing
kernel (generic function with 1 method)

julia> @cuda kernel()
CUDA.HostKernel{typeof(kernel), Tuple{}}(kernel, CuContext(0x0000000002d01a60, instance 8571ef5868491e3a), CuModule(Ptr{Nothing} @0x00000000050624c0, CuContext(0x0000000002d01a60, instance 8571ef5868491e3a)), CuFunction(Ptr{Nothing} @0x000000000524ddb0, CuModule(Ptr{Nothing} @0x00000000050624c0, CuContext(0x0000000002d01a60, instance 8571ef5868491e3a))))

julia> @benchmark begin empty!(CUDA.cufunction_cache[device()]); cufunction(kernel, Tuple{}) end
BenchmarkTools.Trial: 
  memory estimate:  1.15 MiB
  allocs estimate:  1144
  --------------
  minimum time:     41.905 ms (0.00% GC)
  median time:      44.220 ms (0.00% GC)
  mean time:        45.104 ms (0.07% GC)
  maximum time:     68.945 ms (0.00% GC)
  --------------
  samples:          111
  evals/sample:     1

Whereas before:

julia> @benchmark begin empty!(CUDA.cufunction_cache[device()]); cufunction(kernel, Tuple{}) end
BenchmarkTools.Trial: 
  memory estimate:  1.07 MiB
  allocs estimate:  963
  --------------
  minimum time:     39.197 ms (0.00% GC)
  median time:      40.030 ms (0.00% GC)
  mean time:        40.669 ms (0.05% GC)
  maximum time:     54.726 ms (0.00% GC)
  --------------
  samples:          123
  evals/sample:     1

@maleadt maleadt force-pushed the tb/ptxas branch 4 times, most recently from 12795c7 to 2df1a9f Compare May 7, 2021 12:13
@maleadt maleadt marked this pull request as ready for review May 7, 2021 12:59
@maleadt (Member Author) commented May 7, 2021

CI times seem to suggest this does hurt compile-time performance...

@codecov bot commented May 7, 2021

Codecov Report

Merging #892 (8bca8f9) into master (eb7c326) will decrease coverage by 0.67%.
The diff coverage is 93.18%.

❗ Current head 8bca8f9 differs from the pull request's most recent head 69cf7ec. Consider uploading reports for the commit 69cf7ec to get more accurate results.
Impacted file tree graph

@@            Coverage Diff             @@
##           master     #892      +/-   ##
==========================================
- Coverage   77.00%   76.32%   -0.68%     
==========================================
  Files         121      121              
  Lines        7706     7811     +105     
==========================================
+ Hits         5934     5962      +28     
- Misses       1772     1849      +77     
Impacted Files Coverage Δ
src/compiler/execution.jl 89.50% <93.18%> (+0.78%) ⬆️
src/reverse.jl 69.49% <0.00%> (-30.51%) ⬇️
src/indexing.jl 76.38% <0.00%> (-21.83%) ⬇️
lib/cudnn/dropout.jl 87.50% <0.00%> (-12.50%) ⬇️
lib/cublas/util.jl 82.22% <0.00%> (-8.03%) ⬇️
lib/cusolver/base.jl 52.63% <0.00%> (-6.20%) ⬇️
lib/cudnn/softmax.jl 69.23% <0.00%> (-5.77%) ⬇️
lib/cudnn/activation.jl 71.42% <0.00%> (-5.50%) ⬇️
lib/cudnn/optensor.jl 76.47% <0.00%> (-4.78%) ⬇️
lib/cusolver/linalg.jl 85.24% <0.00%> (-2.89%) ⬇️
... and 15 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@maleadt (Member Author) commented May 10, 2021

There still seems to be some difference, about 10% in compile time, but that appears to be caused by ptxas not using `~/.nv/ComputeCache`. I don't think that's a reason to abandon this change, and we can always implement our own cache if it turns out to be a problem.
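The "our own cache" idea could look roughly like the following — a hypothetical on-disk cache keyed by a hash of the PTX source. The directory location and function names are illustrative assumptions, not anything CUDA.jl actually ships:

```julia
using SHA  # Julia standard library

# Hypothetical cache location; illustrative only.
const CACHE_DIR = joinpath(homedir(), ".julia", "cuda_compile_cache")

# Memoize an expensive PTX -> cubin compilation on disk, keyed by content hash.
function cached_compile(ptx::String, compile::Function)
    key = bytes2hex(sha256(ptx))
    path = joinpath(CACHE_DIR, key * ".cubin")
    isfile(path) && return read(path)   # cache hit: skip ptxas entirely
    cubin = compile(ptx)                # cache miss: run the real compiler
    mkpath(CACHE_DIR)
    write(path, cubin)
    return cubin
end
```

Content-addressing on the PTX makes invalidation automatic: any change to the generated code produces a new key.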

@maleadt (Member Author) commented May 10, 2021

I'm still bothered by the performance hit... I'm first going to add some timers to master to report on the current compilation times before blindly merging this.

@maleadt maleadt marked this pull request as draft May 10, 2021 15:58
@maleadt maleadt force-pushed the tb/ptxas branch 4 times, most recently from 0b7331c to 62cb88e Compare May 12, 2021 08:23
@maleadt maleadt force-pushed the tb/ptxas branch 2 times, most recently from 8bca8f9 to 69cf7ec Compare May 28, 2021 13:13
@maleadt (Member Author) commented May 28, 2021

Turns out argument interpolation is ridiculously expensive to infer...
[screenshot: inference profile]
That whole additional inference subtree on the left is entirely coming from argument interpolation...

@maleadt (Member Author) commented May 28, 2021

Avoiding cmd-in-cmd interpolation seems to make inference happy, and with it I think this PR is basically at performance parity 🎉

Some context on why this is exciting: if we can use ptxas from the toolkit, not only will it be easier to upgrade the compiler, it will also make it possible to fix #832 and use newer CUDA toolkits semver-style (e.g. 11.3 on a driver that only supports 11.0).
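To illustrate the cmd-in-cmd issue (a minimal reconstruction under assumptions; the PR's actual code differs): interpolating one `Cmd` object into another backtick literal routes through Julia's generic command-interpolation machinery, which is what blew up inference here, whereas building the command from plain strings in one step stays cheap to infer:

```julia
# Hypothetical minimal example of the pattern; not the PR's actual code.
ptxas = `ptxas`  # base command, e.g. resolved from the toolkit installation

# cmd-in-cmd interpolation: a Cmd spliced into another backtick literal,
# expensive for the compiler to infer
expensive = `$ptxas --gpu-name sm_70 -o out.cubin in.ptx`

# constructing the command directly from strings: inference-friendly
cheap = Cmd(["ptxas", "--gpu-name", "sm_70", "-o", "out.cubin", "in.ptx"])
```

Both forms produce an equivalent command to run; the difference is purely in how much work type inference has to do at the call site.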

@maleadt maleadt marked this pull request as ready for review May 28, 2021 15:56
@maleadt maleadt merged commit 83c8dfc into master May 28, 2021
@maleadt maleadt deleted the tb/ptxas branch May 28, 2021 16:08