Performance regression for multiple @sync copyto! on CUDA v5 #2112

Closed
jeremiedb opened this issue Oct 13, 2023 · 3 comments · Fixed by #2143
Labels: performance (How fast can we go?), regression (Something that used to work, doesn't anymore.)

Comments

@jeremiedb

Describe the bug

Performance regression when performing multiple copyto! operations from GPU to CPU on CUDA v5.0.0 compared to CUDA v4.4.1.

To reproduce

using CUDA, BenchmarkTools 
function gpu_copy!(h, h∇, jsc)
    CUDA.@sync blocking=true for j in jsc
        nbins = size(h[j], 2)
        copyto!(h[j], view(h∇, :, 1:nbins, j))
    end
    return nothing
end

h∇ = [zeros(Float32, 3, 32) for n in 1:100];
h∇_gpu = CUDA.zeros(Float32, 3, 32, 100);
js = 1:100

@btime gpu_copy!(h∇, h∇_gpu, js)
# CUDA v4: 534.480 μs (100 allocations: 4.69 KiB)
# CUDA v5: 1.203 ms (1600 allocations: 68.75 KiB)
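
For anyone digging into where the extra time goes, here is a minimal sketch continuing the script above (not part of the original report); depending on the CUDA.jl version, CUDA.@profile either prints an integrated host/device trace or defers to an external profiler such as Nsight Systems:

# profile a single call to see how much time is spent synchronizing
# versus in the actual device-to-host copies
CUDA.@profile gpu_copy!(h∇, h∇_gpu, js)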

The above regression can be partially alleviated by setting nonblocking_synchronization = false in LocalPreferences.toml. A remaining difference (though without material adverse effect) is the higher allocation count compared to CUDA v4. It was also mentioned that using nonblocking_synchronization = false is discouraged:

# with nonblocking_synchronization = false
@btime gpu_copy!(h∇, h∇_gpu, js)
# v5: 585.943 μs (1500 allocations: 67.19 KiB)
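
For reference, a minimal sketch of setting that preference via Preferences.jl; the section name and restart requirement below are assumptions based on the standard Preferences.jl mechanism:

using CUDA, Preferences

# writes the entry below into the active project's LocalPreferences.toml;
# Julia needs to be restarted for CUDA.jl to pick it up:
#
#   [CUDA]
#   nonblocking_synchronization = false
set_preferences!(CUDA, "nonblocking_synchronization" => false; force=true)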

Note that the impact of the regression described here on the training time of a gradient-boosted tree could be mitigated by adding a CPU allocation, resulting in a single copyto! operation for the full CuArray and keeping the multiple per-histogram copies strictly on the CPU:

using Base.Threads: @threads

h∇_cpu = zeros(Float32, size(h∇))  # pre-allocated CPU buffer matching the GPU array (h∇ here)
copyto!(h∇_cpu, h∇)                # single device-to-host copy of the full CuArray
@threads for j in jsc              # per-histogram copies then run on the CPU only
    nbins = size(h[j], 2)
    @views h[j] .= h∇_cpu[:, 1:nbins, j]
end

As such, this issue no longer has a material impact on EvoTrees.jl, so I can close it if it's not deemed worth investigating further.

Version info

Details on Julia:

Julia Version 1.10.0-beta2
Commit a468aa198d0 (2023-08-17 06:27 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 24 × AMD Ryzen 9 5900X 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
  Threads: 21 on 24 virtual cores
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 12

Details on CUDA:

CUDA runtime 12.2, artifact installation
CUDA driver 12.2
NVIDIA driver 535.113.1

CUDA libraries: 
- CUBLAS: 12.2.5
- CURAND: 10.3.3
- CUFFT: 11.0.8
- CUSOLVER: 11.5.2
- CUSPARSE: 12.1.2
- CUPTI: 20.0.0
- NVML: 12.0.0+535.113.1

Julia packages: 
- CUDA: 5.0.0
- CUDA_Driver_jll: 0.6.0+3
- CUDA_Runtime_jll: 0.9.2+1

Toolchain:
- Julia: 1.10.0-beta2
- LLVM: 15.0.7
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2, 7.3, 7.4, 7.5
- Device capability support: sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

1 device:
  0: NVIDIA RTX A4000 (sm_86, 12.067 GiB / 15.992 GiB available)
jeremiedb added the bug label on Oct 13, 2023
jeremiedb changed the title from "Performance regression for copyto! on CUDA v5" to "Performance regression for multiple @sync copyto! on CUDA v5" on Oct 13, 2023
maleadt added the performance and regression labels and removed the bug label on Nov 1, 2023
@maleadt (Member) commented on Nov 1, 2023

#2143 should recover the performance; please verify.

@jeremiedb (Author)

I can confirm the fix solved the regression I had observed:

# v4: 1.123 ms (200 allocations: 6.25 KiB)
# v5.0: 9.399 ms (6724 allocations: 416.94 KiB)
# main: 1.161 ms (300 allocations: 9.38 KiB)

Thanks!

@maleadt (Member) commented on Nov 3, 2023

Great, thanks for confirming!
