Performance regression when performing multiple copyto! from GPU to CPU on CUDA v5.0.0 vs CUDA v4.4.1.
To reproduce
```julia
using CUDA, BenchmarkTools

function gpu_copy!(h, h∇, jsc)
    CUDA.@sync blocking = true for j in jsc
        nbins = size(h[j], 2)
        copyto!(h[j], view(h∇, :, 1:nbins, j))
    end
    return nothing
end

h∇ = [zeros(Float32, 3, 32) for n in 1:100];
h∇_gpu = CUDA.zeros(Float32, 3, 32, 100);
js = 1:100

@btime gpu_copy!(h∇, h∇_gpu, js)
# CUDA v4: 534.480 μs (100 allocations: 4.69 KiB)
# CUDA v5: 1.203 ms (1600 allocations: 68.75 KiB)
```
The above regression can be partially alleviated by setting nonblocking_synchronization = false in LocalPreferences.toml. A remaining difference (but without material adverse effect) is a higher allocation count compared to CUDA v4. It was also mentioned that relying on nonblocking_synchronization = false should be discouraged:
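For reference, this is roughly what the preference looks like in the project's LocalPreferences.toml (section name per CUDA.jl's package-level preferences; adjust to your setup):

```toml
[CUDA]
nonblocking_synchronization = false
```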
Note that the impact of the regression described here on the training time of a gradient boosted tree could be mitigated by adding a CPU allocation, resulting in a single copyto! operation of the full CuArray, and performing the multiple per-feature copies entirely on the CPU:
```julia
copyto!(h∇_cpu, h∇)
@threads for j in jsc
    nbins = size(h[j], 2)
    @views h[j] .= h∇_cpu[:, 1:nbins, j]
end
```
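Putting the snippet above into a self-contained form, this is a minimal sketch of the staged-copy workaround (the function name `gpu_copy_staged!` and the preallocation of `h∇_cpu` are mine, matching the shapes from the repro; untested here since it needs a CUDA device):

```julia
using CUDA, Base.Threads

# One bulk device-to-host transfer, then per-feature copies on the CPU.
function gpu_copy_staged!(h, h∇_cpu, h∇_gpu, jsc)
    copyto!(h∇_cpu, h∇_gpu)   # single GPU → CPU copy of the full array
    @threads for j in jsc
        nbins = size(h[j], 2)
        @views h[j] .= h∇_cpu[:, 1:nbins, j]
    end
    return nothing
end

h = [zeros(Float32, 3, 32) for n in 1:100]
h∇_cpu = zeros(Float32, 3, 32, 100)   # preallocated once, reused across calls
h∇_gpu = CUDA.zeros(Float32, 3, 32, 100)
gpu_copy_staged!(h, h∇_cpu, h∇_gpu, 1:100)
```

The point is that the cost of synchronizing a GPU copy is paid once rather than once per feature, at the price of one extra host-side buffer.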
As such, this issue no longer has a material impact on EvoTrees.jl, so I can close if not deemed worth investigating further.
Version info
Details on Julia:
```
Julia Version 1.10.0-beta2
Commit a468aa198d0 (2023-08-17 06:27 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 24 × AMD Ryzen 9 5900X 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 21 on 24 virtual cores
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 12
```