Performance regression when performing multiple copyto! from GPU to CPU on CUDA v5.0.0 vs CUDA v4.4.1.
To reproduce
```julia
using CUDA, BenchmarkTools

function gpu_copy!(h, h∇, jsc)
    CUDA.@sync blocking = true for j in jsc
        nbins = size(h[j], 2)
        copyto!(h[j], view(h∇, :, 1:nbins, j))
    end
    return nothing
end

h∇ = [zeros(Float32, 3, 32) for n in 1:100];
h∇_gpu = CUDA.zeros(Float32, 3, 32, 100);
js = 1:100

@btime gpu_copy!(h∇, h∇_gpu, js)
# CUDA v4: 534.480 μs (100 allocations: 4.69 KiB)
# CUDA v5: 1.203 ms (1600 allocations: 68.75 KiB)
```
The above regression can be partially alleviated by setting nonblocking_synchronization = false in LocalPreferences.toml. A remaining difference (but without material adverse effect) is a higher allocation count compared to CUDA v4. It was also mentioned that relying on nonblocking_synchronization = false should be discouraged:
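For reference, this is roughly what the preference looks like in the project's LocalPreferences.toml (section name per CUDA.jl's package-level preferences; adjust to your setup):

```toml
[CUDA]
nonblocking_synchronization = false
```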
Note that the impact of the regression described here on the training time of a gradient boosted tree could be mitigated by adding a CPU allocation, resulting in a single copyto! operation of the full CuArray, and performing the multiple per-feature copies entirely on the CPU:
```julia
copyto!(h∇_cpu, h∇)
@threads for j in jsc
    nbins = size(h[j], 2)
    @views h[j] .= h∇_cpu[:, 1:nbins, j]
end
```
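Putting the snippet above into a self-contained form, this is a minimal sketch of the staged-copy workaround (the function name `gpu_copy_staged!` and the preallocation of `h∇_cpu` are mine, matching the shapes from the repro; untested here since it needs a CUDA device):

```julia
using CUDA, Base.Threads

# One bulk device-to-host transfer, then per-feature copies on the CPU.
function gpu_copy_staged!(h, h∇_cpu, h∇_gpu, jsc)
    copyto!(h∇_cpu, h∇_gpu)   # single GPU → CPU copy of the full array
    @threads for j in jsc
        nbins = size(h[j], 2)
        @views h[j] .= h∇_cpu[:, 1:nbins, j]
    end
    return nothing
end

h = [zeros(Float32, 3, 32) for n in 1:100]
h∇_cpu = zeros(Float32, 3, 32, 100)   # preallocated once, reused across calls
h∇_gpu = CUDA.zeros(Float32, 3, 32, 100)
gpu_copy_staged!(h, h∇_cpu, h∇_gpu, 1:100)
```

The point is that the cost of synchronizing a GPU copy is paid once rather than once per feature, at the price of one extra host-side buffer.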
As such, this issue no longer has a material impact on EvoTrees.jl, so I can close if not deemed worth investigating further.
Version info
Details on Julia:
```
Julia Version 1.10.0-beta2
Commit a468aa198d0 (2023-08-17 06:27 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 24 × AMD Ryzen 9 5900X 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 21 on 24 virtual cores
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 12
```