WIP: Add an index typevar to CuDeviceArray. #1895
I was sceptical that this change would do much without a thorough pass over all of Base, so I did a test using this PR + the LLVM.jl and GPUCompiler.jl PRs above on Broadcast (which does a whole lot of 64-bit integer stuff): Reference:
This PR:
So not a spectacular reduction, but better than I expected nonetheless. KA.jl-heavy code would probably benefit much more (unless KA.jl itself assumes Int64). |
Currently it does, but we can change that. |
@maleadt What you did already has a drastic impact on some kernels! 👍 The following example now requires half as many registers (down from 21 to 10):
using CUDA
function copy3D!(T2, T, Ci)
ix = (blockIdx().x-UInt32(1)) * blockDim().x + threadIdx().x
iy = (blockIdx().y-UInt32(1)) * blockDim().y + threadIdx().y
iz = (blockIdx().z-UInt32(1)) * blockDim().z + threadIdx().z
@inbounds T2[ix-1,iy-2,iz+1] = 3.4
@inbounds T[ix-1,iy-2,iz+1] = 3.4
return
end
T = CUDA.zeros(Float64, 2,2,2);
T2 = CUDA.zeros(Float64, 2,2,2);
Ci = CUDA.zeros(Float64, 2,2,2);
@device_code_llvm debuginfo=:none @cuda launch=false copy3D!(T2, T, Ci)
kernel = @cuda launch=false copy3D!(T2, T, Ci)
@show CUDA.registers(kernel);
@show CUDA.memory(kernel); |
@maleadt When I remove the UInt32 casts in the above code, some i64 operations are introduced. Here is the code:
using CUDA
function copy3D!(T2, T, Ci)
ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
iy = (blockIdx().y-1) * blockDim().y + threadIdx().y
iz = (blockIdx().z-1) * blockDim().z + threadIdx().z
@inbounds T2[ix-1,iy-2,iz+1] = 3.4
@inbounds T[ix-1,iy-2,iz+1] = 3.4
return
end
T = CUDA.zeros(Float64, 2,2,2);
T2 = CUDA.zeros(Float64, 2,2,2);
Ci = CUDA.zeros(Float64, 2,2,2);
@device_code_llvm debuginfo=:none @cuda launch=false copy3D!(T2, T, Ci)
kernel = @cuda launch=false copy3D!(T2, T, Ci)
@show CUDA.registers(kernel);
@show CUDA.memory(kernel);
And here is the output:
julia> using CUDA
julia> function copy3D!(T2, T, Ci)
ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
iy = (blockIdx().y-1) * blockDim().y + threadIdx().y
iz = (blockIdx().z-1) * blockDim().z + threadIdx().z
@inbounds T2[ix-1,iy-2,iz+1] = 3.4
@inbounds T[ix-1,iy-2,iz+1] = 3.4
return
end
copy3D! (generic function with 1 method)
julia> T = CUDA.zeros(Float64, 2,2,2);
julia> T2 = CUDA.zeros(Float64, 2,2,2);
julia> Ci = CUDA.zeros(Float64, 2,2,2);
julia> @device_code_llvm debuginfo=:none @cuda launch=false copy3D!(T2, T, Ci)
; PTX CompilerJob of MethodInstance for copy3D!(::CuDeviceArray{Float64, 3, 1, Int32}, ::CuDeviceArray{Float64, 3, 1, Int32}, ::CuDeviceArray{Float64, 3, 1, Int32}) for sm_60
define ptx_kernel void @_Z7copy3D_13CuDeviceArrayI7Float64Li3ELi1E5Int32ES_IS0_Li3ELi1ES1_ES_IS0_Li3ELi1ES1_E([1 x i64] %state, { i8 addrspace(1)*, i32, [3 x i32], i32 } %0, { i8 addrspace(1)*, i32, [3 x i32], i32 } %1, { i8 addrspace(1)*, i32, [3 x i32], i32 } %2) local_unnamed_addr #1 {
conversion:
%.fca.0.extract12 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %0, 0
%.fca.2.0.extract14 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %0, 2, 0
%.fca.2.1.extract15 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %0, 2, 1
%.fca.0.extract1 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %1, 0
%.fca.2.0.extract3 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %1, 2, 0
%.fca.2.1.extract4 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %1, 2, 1
%3 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
%4 = zext i32 %3 to i64
%5 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
%6 = zext i32 %5 to i64
%7 = mul nuw nsw i64 %6, %4
%8 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
%9 = add nuw nsw i32 %8, 1
%10 = zext i32 %9 to i64
%11 = add nuw nsw i64 %7, %10
%12 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.y()
%13 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.y()
%narrow = mul nuw nsw i32 %13, %12
%14 = call i32 @llvm.nvvm.read.ptx.sreg.tid.y()
%15 = add nuw nsw i32 %14, 1
%narrow26 = add nuw nsw i32 %15, %narrow
%16 = zext i32 %narrow26 to i64
%17 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.z()
%18 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.z()
%narrow27 = mul nuw nsw i32 %18, %17
%19 = call i32 @llvm.nvvm.read.ptx.sreg.tid.z()
%20 = add nuw nsw i32 %19, 1
%narrow28 = add nuw nsw i32 %20, %narrow27
%21 = zext i32 %narrow28 to i64
%22 = icmp sgt i32 %.fca.2.0.extract14, 0
%23 = select i1 %22, i32 %.fca.2.0.extract14, i32 0
%24 = icmp sgt i32 %.fca.2.1.extract15, 0
%25 = select i1 %24, i32 %.fca.2.1.extract15, i32 0
%26 = zext i32 %23 to i64
%27 = zext i32 %25 to i64
%28 = add nsw i64 %16, -3
%29 = mul nuw nsw i64 %21, %27
%reass.add = add nsw i64 %28, %29
%reass.mul = mul i64 %reass.add, %26
%30 = add nuw nsw i64 %11, 4294967295
%31 = add i64 %30, %reass.mul
%32 = bitcast i8 addrspace(1)* %.fca.0.extract12 to double addrspace(1)*
%33 = trunc i64 %31 to i32
%34 = add i32 %33, -1
%35 = getelementptr inbounds double, double addrspace(1)* %32, i32 %34
store double 3.400000e+00, double addrspace(1)* %35, align 8
%36 = icmp sgt i32 %.fca.2.0.extract3, 0
%37 = select i1 %36, i32 %.fca.2.0.extract3, i32 0
%38 = icmp sgt i32 %.fca.2.1.extract4, 0
%39 = select i1 %38, i32 %.fca.2.1.extract4, i32 0
%40 = zext i32 %37 to i64
%41 = zext i32 %39 to i64
%42 = mul nuw nsw i64 %21, %41
%reass.add29 = add nsw i64 %28, %42
%reass.mul30 = mul i64 %reass.add29, %40
%43 = add i64 %30, %reass.mul30
%44 = bitcast i8 addrspace(1)* %.fca.0.extract1 to double addrspace(1)*
%45 = trunc i64 %43 to i32
%46 = add i32 %45, -1
%47 = getelementptr inbounds double, double addrspace(1)* %44, i32 %46
store double 3.400000e+00, double addrspace(1)* %47, align 8
ret void
}
julia> kernel = @cuda launch=false copy3D!(T2, T, Ci)
CUDA.HostKernel{typeof(copy3D!), Tuple{CuDeviceArray{Float64, 3, 1, Int32}, CuDeviceArray{Float64, 3, 1, Int32}, CuDeviceArray{Float64, 3, 1, Int32}}}(copy3D!, CuFunction(Ptr{CUDA.CUfunc_st} @0x0000000005f1de40, CuModule(Ptr{CUDA.CUmod_st} @0x0000000005f0fac0, CuContext(0x0000000001401640, instance a86b98eac7129536))), CUDA.KernelState(Ptr{Nothing} @0x00001553b4a00000))
julia> @show CUDA.registers(kernel);
CUDA.registers(kernel) = 12
julia> @show CUDA.memory(kernel);
CUDA.memory(kernel) = (local = 0, shared = 0, constant = 0) |
Well, yeah, because you're introducing |
I originally expected literals to be treated as Int64. However, since no Int64s are introduced in the first example (see below), I thought you had taken care of this to some extent. If there is no simple solution, I don't think having to cast literals is a priority problem to solve now. Here is the LLVM code of the first example (note that no Int64 is introduced, even though the array assignments contain uncast literals):
julia> using CUDA
julia> function copy3D!(T2, T, Ci)
ix = (blockIdx().x-UInt32(1)) * blockDim().x + threadIdx().x
iy = (blockIdx().y-UInt32(1)) * blockDim().y + threadIdx().y
iz = (blockIdx().z-UInt32(1)) * blockDim().z + threadIdx().z
@inbounds T2[ix-1,iy-2,iz+1] = 3.4
@inbounds T[ix-1,iy-2,iz+1] = 3.4
return
end
copy3D! (generic function with 2 methods)
julia> T = CUDA.zeros(Float64, 2,2,2);
julia> T2 = CUDA.zeros(Float64, 2,2,2);
julia> Ci = CUDA.zeros(Float64, 2,2,2);
julia> @device_code_llvm debuginfo=:none @cuda launch=false copy3D!(T2, T, Ci)
; PTX CompilerJob of MethodInstance for copy3D!(::CuDeviceArray{Float64, 3, 1, Int32}, ::CuDeviceArray{Float64, 3, 1, Int32}, ::CuDeviceArray{Float64, 3, 1, Int32}) for sm_60
define ptx_kernel void @_Z7copy3D_13CuDeviceArrayI7Float64Li3ELi1E5Int32ES_IS0_Li3ELi1ES1_ES_IS0_Li3ELi1ES1_E([1 x i64] %state, { i8 addrspace(1)*, i32, [3 x i32], i32 } %0, { i8 addrspace(1)*, i32, [3 x i32], i32 } %1, { i8 addrspace(1)*, i32, [3 x i32], i32 } %2) local_unnamed_addr #1 {
conversion:
%.fca.0.extract12 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %0, 0
%.fca.2.0.extract14 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %0, 2, 0
%.fca.2.1.extract15 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %0, 2, 1
%.fca.0.extract1 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %1, 0
%.fca.2.0.extract3 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %1, 2, 0
%.fca.2.1.extract4 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %1, 2, 1
%3 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
%4 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
%5 = mul i32 %4, %3
%6 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
%7 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.y()
%8 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.y()
%9 = mul nuw nsw i32 %8, %7
%10 = call i32 @llvm.nvvm.read.ptx.sreg.tid.y()
%11 = add nuw nsw i32 %10, 1
%12 = add nuw nsw i32 %11, %9
%13 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.z()
%14 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.z()
%15 = mul nuw nsw i32 %14, %13
%16 = call i32 @llvm.nvvm.read.ptx.sreg.tid.z()
%17 = add nuw nsw i32 %16, 1
%18 = add nuw nsw i32 %17, %15
%19 = icmp sgt i32 %.fca.2.0.extract14, 0
%20 = select i1 %19, i32 %.fca.2.0.extract14, i32 0
%21 = icmp sgt i32 %.fca.2.1.extract15, 0
%22 = select i1 %21, i32 %.fca.2.1.extract15, i32 0
%23 = add nsw i32 %12, -3
%24 = mul i32 %18, %22
%reass.add = add i32 %23, %24
%reass.mul = mul i32 %reass.add, %20
%25 = add i32 %5, %6
%26 = add i32 %25, -1
%27 = add i32 %26, %reass.mul
%28 = bitcast i8 addrspace(1)* %.fca.0.extract12 to double addrspace(1)*
%29 = getelementptr inbounds double, double addrspace(1)* %28, i32 %27
store double 3.400000e+00, double addrspace(1)* %29, align 8
%30 = icmp sgt i32 %.fca.2.0.extract3, 0
%31 = select i1 %30, i32 %.fca.2.0.extract3, i32 0
%32 = icmp sgt i32 %.fca.2.1.extract4, 0
%33 = select i1 %32, i32 %.fca.2.1.extract4, i32 0
%34 = mul i32 %18, %33
%reass.add26 = add i32 %23, %34
%reass.mul27 = mul i32 %reass.add26, %31
%35 = add i32 %26, %reass.mul27
%36 = bitcast i8 addrspace(1)* %.fca.0.extract1 to double addrspace(1)*
%37 = getelementptr inbounds double, double addrspace(1)* %36, i32 %35
store double 3.400000e+00, double addrspace(1)* %37, align 8
ret void
}
julia> kernel = @cuda launch=false copy3D!(T2, T, Ci)
CUDA.HostKernel{typeof(copy3D!), Tuple{CuDeviceArray{Float64, 3, 1, Int32}, CuDeviceArray{Float64, 3, 1, Int32}, CuDeviceArray{Float64, 3, 1, Int32}}}(copy3D!, CuFunction(Ptr{CUDA.CUfunc_st} @0x000000000649b690, CuModule(Ptr{CUDA.CUmod_st} @0x0000000006523ff0, CuContext(0x00000000014542b0, instance 2d6c2e17d661c696))), CUDA.KernelState(Ptr{Nothing} @0x00001553b4a00000))
julia> @show CUDA.registers(kernel);
CUDA.registers(kernel) = 10
julia> @show CUDA.memory(kernel);
CUDA.memory(kernel) = (local = 0, shared = 0, constant = 0) |
LLVM probably managed to optimize them away. We can't change the fundamental nature of integer literals being |
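The promotion behaviour discussed in this exchange can be reproduced on the CPU without a GPU. The following is a minimal sketch, assuming a 64-bit platform (where `Int` is `Int64`); the variable `ix` is only a stand-in for a 32-bit index value and is not taken from the kernels above.

```julia
# CPU-only illustration; assumes a 64-bit platform where Int === Int64.
ix = UInt32(5)              # stand-in for a 32-bit index value

widened   = ix - 1          # the literal 1 is an Int64, so the result is promoted
preserved = ix - UInt32(1)  # an explicit cast keeps the arithmetic in UInt32
generic   = ix - one(ix)    # one(ix) has the same type as ix

println(typeof(widened))
println(typeof(preserved))
println(typeof(generic))
```

Whether the widened arithmetic survives into the final code depends on what LLVM can prove and narrow again, which is why the two kernels above differ.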
@luraess, @utkinis, @albert-de-montserrat: could you please run some of your codes with CUDA.jl from this PR and the corresponding GPUCompiler and LLVM branches? In summary, the branches are the following:
[052768ef] CUDA v4.2.0 `https://github.com/JuliaGPU/CUDA.jl.git#tb/32bit_device_array`
[61eb1bfa] GPUCompiler v0.19.3 `https://github.com/JuliaGPU/GPUCompiler.jl.git#tb/ptx_dl_32bit`
[929cbde3] LLVM v5.0.0 `https://github.com/maleadt/LLVM.jl.git#tb/pointerref_int32` |
@maleadt: I have run some small test codes of increasing complexity (including some examples from https://github.com/omlins/julia-gpu-course/blob/main/solutions/4_datatransfer_optimisations_advanced_part2.ipynb), always casting integer literals to a 32-bit integer type. I have encountered a first issue. The following small diffusion code requires fewer registers with this PR, as expected (and does not include any i64 operations); however, this leads to worse rather than better performance. Thus, something is not going as it should. Here is the code:
using CUDA
using BenchmarkTools
function diffusion3D_step!(T2, T, Ci, lam, dt, _dx, _dy, _dz)
ix = (blockIdx().x-Int32(1)) * blockDim().x + threadIdx().x
iy = (blockIdx().y-Int32(1)) * blockDim().y + threadIdx().y
T_ix_iy_izm1 = 0.0
T_ix_iy_iz = 0.0
T_ix_iy_izp1 = T[ix,iy,Int32(1)]
for iz = Int32(1):size(T2,3)
T_ix_iy_izm1 = T_ix_iy_iz
T_ix_iy_iz = T_ix_iy_izp1
T_ix_iy_izp1 = iz<size(T2,3) ? T[ix,iy,iz+Int32(1)] : 0.0
if (ix>Int32(1) && ix<size(T2,1) && iy>Int32(1) && iy<size(T2,2) && iz>Int32(1) && iz<size(T2,3))
T2[ix,iy,iz] = T_ix_iy_iz + dt*(Ci[ix,iy,iz]*(
- ((-lam*(T[ix+Int32(1),iy,iz] - T_ix_iy_iz)*_dx) - (-lam*(T_ix_iy_iz - T[ix-Int32(1),iy,iz])*_dx))*_dx
- ((-lam*(T[ix,iy+Int32(1),iz] - T_ix_iy_iz)*_dy) - (-lam*(T_ix_iy_iz - T[ix,iy-Int32(1),iz])*_dy))*_dy
- ((-lam*(T_ix_iy_izp1 - T_ix_iy_iz)*_dz) - (-lam*(T_ix_iy_iz - T_ix_iy_izm1)*_dz))*_dz
));
end
end
return
end
function diffusion3D()
# Physics
lam = 1.0; # Thermal conductivity
c0 = 2.0; # Heat capacity
lx, ly, lz = 1.0, 1.0, 1.0; # Length of computational domain in dimension x, y and z
# Numerics
nx, ny, nz = 512, 512, 512; # Number of gridpoints in dimensions x, y and z
nt = 100; # Number of time steps
dx = lx/(nx-1); # Space step in x-dimension
dy = ly/(ny-1); # Space step in y-dimension
dz = lz/(nz-1); # Space step in z-dimension
_dx, _dy, _dz = 1.0/dx, 1.0/dy, 1.0/dz;
# Array initializations
T = CUDA.zeros(Float64, nx, ny, nz);
T2 = CUDA.zeros(Float64, nx, ny, nz);
Ci = CUDA.zeros(Float64, nx, ny, nz);
# Initial conditions
Ci .= 1/c0; # 1/Heat capacity
T .= 1.7;
T2 .= T; # Assign also T2 to get correct boundary conditions.
# GPU launch parameters
threads = (32, 8)
blocks = (nx, ny) .÷ threads
# Time loop
dt = 0.000001 #min(dx^2,dy^2,dz^2)/lam/maximum(Ci)/6.1; # Time step for 3D Heat diffusion
for it = 1:nt
if (it == 11) GC.gc(); global t_tic=time(); end # Start measuring time.
@cuda blocks=blocks threads=threads diffusion3D_step!(T2, T, Ci, lam, dt, _dx, _dy, _dz);
synchronize()
T, T2 = T2, T;
end
time_s = time() - t_tic
# Performance
A_eff = (2*1+1)*1/1e9*nx*ny*nz*sizeof(eltype(T)); # Effective main memory access per iteration [GB] (Lower bound of required memory access: T has to be read and written: 2 whole-array memaccess; Ci has to be read: : 1 whole-array memaccess)
t_it = time_s/(nt-10); # Execution time per iteration [s]
T_eff = A_eff/t_it; # Effective memory throughput [GB/s]
println("time_s=$time_s t_it=$t_it T_eff=$T_eff");
# Performance
A_eff = (2*1+1)*1/1e9*nx*ny*nz*sizeof(eltype(T)); # Effective main memory access per iteration [GB] (Lower bound of required memory access: T has to be read and written: 2 whole-array memaccess; Ci has to be read: : 1 whole-array memaccess)
t_it = @belapsed begin @cuda blocks=$blocks threads=$threads diffusion3D_step!($T2, $T, $Ci, $lam, $dt, $_dx, $_dy, $_dz); synchronize() end
println("Benchmarktools (min): t_it=$t_it T_eff=$(A_eff/t_it)");
# Resource usage
@show kernel = @cuda launch=false diffusion3D_step!(T2, T, Ci, lam, dt, _dx, _dy, _dz);
@show CUDA.registers(kernel);
@show CUDA.memory(kernel);
@device_code_llvm debuginfo=:none @cuda diffusion3D_step!(T2, T, Ci, lam, dt, _dx, _dy, _dz);
end
diffusion3D()
Here is the output from running it with this PR:
omlins@nid00000:~/tmpwdir/cuda_perf> julia -O3 --check-bounds=no diffusion3D_cuda_3regqueue_novis_int32.jl
time_s=1.0009851455688477 t_it=0.011122057172987197 T_eff=289.6249697244483
Benchmarktools (min): t_it=0.010862522 T_eff=296.54489740043795
kernel = #= /users/omlins/tmpwdir/cuda_perf/diffusion3D_cuda_3regqueue_novis_int32.jl:75 =# @cuda(launch = false, diffusion3D_step!(T2, T, Ci, lam, dt, _dx, _dy, _dz)) = CUDA.HostKernel{typeof(diffusion3D_step!), Tuple{CuDeviceArray{Float64, 3, 1, Int32}, CuDeviceArray{Float64, 3, 1, Int32}, CuDeviceArray{Float64, 3, 1, Int32}, Float64, Float64, Float64, Float64, Float64}}(diffusion3D_step!, CuFunction(Ptr{CUDA.CUfunc_st} @0x0000000006e97940, CuModule(Ptr{CUDA.CUmod_st} @0x000000000777fd50, CuContext(0x0000000001488260, instance caed86fb0f770aba))), CUDA.KernelState(Ptr{Nothing} @0x00001553b4800000))
CUDA.registers(kernel) = 32
CUDA.memory(kernel) = (local = 0, shared = 0, constant = 0)
; PTX CompilerJob of MethodInstance for diffusion3D_step!(::CuDeviceArray{Float64, 3, 1, Int32}, ::CuDeviceArray{Float64, 3, 1, Int32}, ::CuDeviceArray{Float64, 3, 1, Int32}, ::Float64, ::Float64, ::Float64, ::Float64, ::Float64) for sm_60
define ptx_kernel void @_Z17diffusion3D_step_13CuDeviceArrayI7Float64Li3ELi1E5Int32ES_IS0_Li3ELi1ES1_ES_IS0_Li3ELi1ES1_ES0_S0_S0_S0_S0_([1 x i64] %state, { i8 addrspace(1)*, i32, [3 x i32], i32 } %0, { i8 addrspace(1)*, i32, [3 x i32], i32 } %1, { i8 addrspace(1)*, i32, [3 x i32], i32 } %2, double %3, double %4, double %5, double %6, double %7) local_unnamed_addr #1 {
conversion:
%.fca.0.extract38 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %0, 0
%.fca.2.0.extract40 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %0, 2, 0
%.fca.2.1.extract41 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %0, 2, 1
%.fca.2.2.extract42 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %0, 2, 2
%.fca.0.extract4 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %1, 0
%.fca.2.0.extract6 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %1, 2, 0
%.fca.2.1.extract7 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %1, 2, 1
%.fca.2.0.extract = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %2, 2, 0
%.fca.2.1.extract = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %2, 2, 1
%8 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
%9 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
%10 = mul i32 %9, %8
%11 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
%12 = add i32 %10, %11
%13 = add i32 %12, 1
%14 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.y()
%15 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.y()
%16 = mul nuw nsw i32 %15, %14
%17 = call i32 @llvm.nvvm.read.ptx.sreg.tid.y()
%18 = add nuw nsw i32 %16, %17
%19 = add nuw nsw i32 %18, 1
%20 = icmp sgt i32 %.fca.2.0.extract6, 0
%21 = select i1 %20, i32 %.fca.2.0.extract6, i32 0
%22 = mul i32 %18, %21
%23 = add i32 %12, %22
%24 = bitcast i8 addrspace(1)* %.fca.0.extract4 to double addrspace(1)*
%25 = getelementptr inbounds double, double addrspace(1)* %24, i32 %23
%26 = load double, double addrspace(1)* %25, align 8
%.inv = icmp sgt i32 %.fca.2.2.extract42, 0
%value_phi = select i1 %.inv, i32 %.fca.2.2.extract42, i32 0
%27 = icmp slt i32 %value_phi, 1
%28 = bitcast i8 addrspace(1)* %.fca.0.extract38 to double addrspace(1)*
br i1 %27, label %L560, label %L133.preheader
L133.preheader: ; preds = %conversion
%.fca.0.extract = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %2, 0
%29 = icmp sgt i32 %.fca.2.1.extract7, 0
%30 = select i1 %29, i32 %.fca.2.1.extract7, i32 0
%31 = icmp slt i32 %13, 2
%.not47 = icmp sge i32 %13, %.fca.2.0.extract40
%32 = icmp eq i32 %18, 0
%or.cond = select i1 %.not47, i1 true, i1 %32
%.not48 = icmp sge i32 %19, %.fca.2.1.extract41
%33 = icmp sgt i32 %.fca.2.0.extract, 0
%34 = select i1 %33, i32 %.fca.2.0.extract, i32 0
%35 = icmp sgt i32 %.fca.2.1.extract, 0
%36 = select i1 %35, i32 %.fca.2.1.extract, i32 0
%37 = bitcast i8 addrspace(1)* %.fca.0.extract to double addrspace(1)*
%38 = fneg double %3
%39 = add i32 %12, -1
%40 = add nsw i32 %18, -1
%41 = icmp sgt i32 %.fca.2.0.extract40, 0
%42 = select i1 %41, i32 %.fca.2.0.extract40, i32 0
%43 = icmp sgt i32 %.fca.2.1.extract41, 0
%44 = select i1 %43, i32 %.fca.2.1.extract41, i32 0
br i1 %31, label %L560, label %L133
L133: ; preds = %L547, %L133.preheader
%value_phi4 = phi i32 [ %101, %L547 ], [ 1, %L133.preheader ]
%value_phi6 = phi double [ %value_phi8, %L547 ], [ %26, %L133.preheader ]
%value_phi7 = phi double [ %value_phi6, %L547 ], [ 0.000000e+00, %L133.preheader ]
%.not44 = icmp slt i32 %value_phi4, %.fca.2.2.extract42
br i1 %.not44, label %L141, label %L196
L141: ; preds = %L133
%45 = mul i32 %value_phi4, %30
%reass.add = add i32 %18, %45
%reass.mul = mul i32 %reass.add, %21
%46 = add i32 %12, %reass.mul
%47 = getelementptr inbounds double, double addrspace(1)* %24, i32 %46
%48 = load double, double addrspace(1)* %47, align 8
br label %L196
L196: ; preds = %L141, %L133
%value_phi8 = phi double [ %48, %L141 ], [ 0.000000e+00, %L133 ]
br i1 %or.cond, label %L547, label %L202
L202: ; preds = %L196
%49 = icmp ult i32 %value_phi4, 2
%or.cond63 = select i1 %.not48, i1 true, i1 %49
%.not44.not = xor i1 %.not44, true
%brmerge = select i1 %or.cond63, i1 true, i1 %.not44.not
br i1 %brmerge, label %L547, label %L212
L212: ; preds = %L202
%50 = add nsw i32 %value_phi4, -1
%51 = mul i32 %50, %36
%reass.add64 = add i32 %18, %51
%reass.mul65 = mul i32 %reass.add64, %34
%52 = add i32 %12, %reass.mul65
%53 = getelementptr inbounds double, double addrspace(1)* %37, i32 %52
%54 = load double, double addrspace(1)* %53, align 8
%55 = mul i32 %50, %30
%reass.add66 = add i32 %18, %55
%reass.mul67 = mul i32 %reass.add66, %21
%56 = add i32 %reass.mul67, %13
%57 = getelementptr inbounds double, double addrspace(1)* %24, i32 %56
%58 = load double, double addrspace(1)* %57, align 8
%59 = fsub double %58, %value_phi6
%60 = fmul double %59, %38
%61 = fmul double %60, %5
%62 = add i32 %39, %reass.mul67
%63 = getelementptr inbounds double, double addrspace(1)* %24, i32 %62
%64 = load double, double addrspace(1)* %63, align 8
%65 = fsub double %value_phi6, %64
%66 = fmul double %65, %38
%67 = fmul double %66, %5
%68 = fsub double %61, %67
%69 = fneg double %68
%70 = fmul double %69, %5
%reass.add70 = add i32 %55, %19
%reass.mul71 = mul i32 %reass.add70, %21
%71 = add i32 %12, %reass.mul71
%72 = getelementptr inbounds double, double addrspace(1)* %24, i32 %71
%73 = load double, double addrspace(1)* %72, align 8
%74 = fsub double %73, %value_phi6
%75 = fmul double %74, %38
%76 = fmul double %75, %6
%reass.add72 = add i32 %40, %55
%reass.mul73 = mul i32 %reass.add72, %21
%77 = add i32 %12, %reass.mul73
%78 = getelementptr inbounds double, double addrspace(1)* %24, i32 %77
%79 = load double, double addrspace(1)* %78, align 8
%80 = fsub double %value_phi6, %79
%81 = fmul double %80, %38
%82 = fmul double %81, %6
%83 = fsub double %76, %82
%84 = fmul double %83, %6
%85 = fsub double %70, %84
%86 = fsub double %value_phi8, %value_phi6
%87 = fmul double %86, %38
%88 = fmul double %87, %7
%89 = fsub double %value_phi6, %value_phi7
%90 = fmul double %89, %38
%91 = fmul double %90, %7
%92 = fsub double %88, %91
%93 = fmul double %92, %7
%94 = fsub double %85, %93
%95 = fmul double %54, %94
%96 = fmul double %95, %4
%97 = fadd double %value_phi6, %96
%98 = mul i32 %44, %50
%reass.add74 = add i32 %18, %98
%reass.mul75 = mul i32 %reass.add74, %42
%99 = add i32 %12, %reass.mul75
%100 = getelementptr inbounds double, double addrspace(1)* %28, i32 %99
store double %97, double addrspace(1)* %100, align 8
br label %L547
L547: ; preds = %L212, %L202, %L196
%.not62.not = icmp eq i32 %value_phi4, %value_phi
%101 = add nuw i32 %value_phi4, 1
br i1 %.not62.not, label %L560, label %L133
L560: ; preds = %L547, %L133.preheader, %conversion
ret void
}
... and here is the output from running it with CUDA.jl v4.2.0:
omlins@nid02027:~/tmpwdir/cuda_perf> julia -O3 --check-bounds=no diffusion3D_cuda_3regqueue_novis_int32.jl
time_s=0.7261550426483154 t_it=0.008068389362759061 T_eff=399.240211047335
Benchmarktools (min): t_it=0.007945697 T_eff=405.4050226179025
kernel = #= /users/omlins/tmpwdir/cuda_perf/diffusion3D_cuda_3regqueue_novis_int32.jl:75 =# @cuda(launch = false, diffusion3D_step!(T2, T, Ci, lam, dt, _dx, _dy, _dz)) = CUDA.HostKernel{typeof(diffusion3D_step!), Tuple{CuDeviceArray{Float64, 3, 1}, CuDeviceArray{Float64, 3, 1}, CuDeviceArray{Float64, 3, 1}, Float64, Float64, Float64, Float64, Float64}}(diffusion3D_step!, CuFunction(Ptr{CUDA.CUfunc_st} @0x00000000075f7cd0, CuModule(Ptr{CUDA.CUmod_st} @0x0000000005c6ed10, CuContext(0x000000000135f910, instance ea771e7429a2560b))), CUDA.KernelState(Ptr{Nothing} @0x00001553a4800000))
CUDA.registers(kernel) = 48
CUDA.memory(kernel) = (local = 0, shared = 0, constant = 0)
; PTX CompilerJob of MethodInstance for diffusion3D_step!(::CuDeviceArray{Float64, 3, 1}, ::CuDeviceArray{Float64, 3, 1}, ::CuDeviceArray{Float64, 3, 1}, ::Float64, ::Float64, ::Float64, ::Float64, ::Float64) for sm_60
define ptx_kernel void @_Z17diffusion3D_step_13CuDeviceArrayI7Float64Li3ELi1EES_IS0_Li3ELi1EES_IS0_Li3ELi1EES0_S0_S0_S0_S0_([1 x i64] %state, { i8 addrspace(1)*, i64, [3 x i64], i64 } %0, { i8 addrspace(1)*, i64, [3 x i64], i64 } %1, { i8 addrspace(1)*, i64, [3 x i64], i64 } %2, double %3, double %4, double %5, double %6, double %7) local_unnamed_addr #1 {
conversion:
%.fca.0.extract38 = extractvalue { i8 addrspace(1)*, i64, [3 x i64], i64 } %0, 0
%.fca.2.0.extract40 = extractvalue { i8 addrspace(1)*, i64, [3 x i64], i64 } %0, 2, 0
%.fca.2.1.extract41 = extractvalue { i8 addrspace(1)*, i64, [3 x i64], i64 } %0, 2, 1
%.fca.2.2.extract42 = extractvalue { i8 addrspace(1)*, i64, [3 x i64], i64 } %0, 2, 2
%.fca.0.extract4 = extractvalue { i8 addrspace(1)*, i64, [3 x i64], i64 } %1, 0
%.fca.2.0.extract6 = extractvalue { i8 addrspace(1)*, i64, [3 x i64], i64 } %1, 2, 0
%.fca.2.1.extract7 = extractvalue { i8 addrspace(1)*, i64, [3 x i64], i64 } %1, 2, 1
%.fca.2.0.extract = extractvalue { i8 addrspace(1)*, i64, [3 x i64], i64 } %2, 2, 0
%.fca.2.1.extract = extractvalue { i8 addrspace(1)*, i64, [3 x i64], i64 } %2, 2, 1
%8 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
%9 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
%10 = mul i32 %9, %8
%11 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
%12 = add i32 %10, %11
%13 = add i32 %12, 1
%14 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.y()
%15 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.y()
%16 = mul nuw nsw i32 %15, %14
%17 = call i32 @llvm.nvvm.read.ptx.sreg.tid.y()
%18 = add nuw nsw i32 %16, %17
%19 = add nuw nsw i32 %18, 1
%20 = icmp sgt i64 %.fca.2.0.extract6, 0
%21 = select i1 %20, i64 %.fca.2.0.extract6, i64 0
%22 = sext i32 %13 to i64
%23 = zext i32 %19 to i64
%24 = add nsw i64 %23, -1
%25 = add nsw i64 %22, -1
%26 = bitcast i8 addrspace(1)* %.fca.0.extract4 to double addrspace(1)*
%.inv = icmp sgt i64 %.fca.2.2.extract42, 0
%value_phi = select i1 %.inv, i64 %.fca.2.2.extract42, i64 0
%27 = icmp slt i64 %value_phi, 1
%28 = bitcast i8 addrspace(1)* %.fca.0.extract38 to double addrspace(1)*
br i1 %27, label %L532, label %L131.preheader
L131.preheader: ; preds = %conversion
%29 = mul i64 %24, %21
%30 = add i64 %25, %29
%31 = getelementptr inbounds double, double addrspace(1)* %26, i64 %30
%32 = load double, double addrspace(1)* %31, align 8
%.fca.0.extract = extractvalue { i8 addrspace(1)*, i64, [3 x i64], i64 } %2, 0
%33 = icmp sgt i64 %.fca.2.1.extract7, 0
%34 = select i1 %33, i64 %.fca.2.1.extract7, i64 0
%35 = icmp slt i32 %13, 2
%36 = zext i32 %13 to i64
%.not47 = icmp sle i64 %.fca.2.0.extract40, %36
%37 = icmp eq i32 %18, 0
%.not48 = icmp sle i64 %.fca.2.1.extract41, %23
%38 = icmp sgt i64 %.fca.2.0.extract, 0
%39 = select i1 %38, i64 %.fca.2.0.extract, i64 0
%40 = icmp sgt i64 %.fca.2.1.extract, 0
%41 = select i1 %40, i64 %.fca.2.1.extract, i64 0
%42 = add nsw i64 %36, -1
%43 = bitcast i8 addrspace(1)* %.fca.0.extract to double addrspace(1)*
%44 = fneg double %3
%45 = add i32 %12, 2
%46 = sext i32 %45 to i64
%47 = add nsw i64 %46, -1
%48 = sext i32 %12 to i64
%49 = add nsw i64 %48, -1
%50 = zext i32 %18 to i64
%51 = add nsw i64 %50, -1
%52 = icmp sgt i64 %.fca.2.0.extract40, 0
%53 = select i1 %52, i64 %.fca.2.0.extract40, i64 0
%54 = icmp sgt i64 %.fca.2.1.extract41, 0
%55 = select i1 %54, i64 %.fca.2.1.extract41, i64 0
%56 = select i1 %35, i1 true, i1 %.not47
%brmerge = select i1 %56, i1 true, i1 %37
br label %L131
L131: ; preds = %L519, %L131.preheader
%value_phi4 = phi i64 [ %113, %L519 ], [ 1, %L131.preheader ]
%value_phi6 = phi double [ %value_phi8, %L519 ], [ %32, %L131.preheader ]
%value_phi7 = phi double [ %value_phi6, %L519 ], [ 0.000000e+00, %L131.preheader ]
%.not44 = icmp slt i64 %value_phi4, %.fca.2.2.extract42
br i1 %.not44, label %L139, label %L187
L139: ; preds = %L131
%57 = mul i64 %value_phi4, %34
%reass.add = add i64 %24, %57
%reass.mul = mul i64 %reass.add, %21
%58 = add i64 %25, %reass.mul
%59 = getelementptr inbounds double, double addrspace(1)* %26, i64 %58
%60 = load double, double addrspace(1)* %59, align 8
br label %L187
L187: ; preds = %L139, %L131
%value_phi8 = phi double [ %60, %L139 ], [ 0.000000e+00, %L131 ]
br i1 %brmerge, label %L519, label %L197
L197: ; preds = %L187
%61 = icmp ult i64 %value_phi4, 2
%or.cond63 = select i1 %.not48, i1 true, i1 %61
%.not44.not = xor i1 %.not44, true
%brmerge76 = select i1 %or.cond63, i1 true, i1 %.not44.not
br i1 %brmerge76, label %L519, label %L208
L208: ; preds = %L197
%62 = add nsw i64 %value_phi4, -1
%63 = mul i64 %62, %41
%reass.add64 = add i64 %24, %63
%reass.mul65 = mul i64 %reass.add64, %39
%64 = add i64 %42, %reass.mul65
%65 = getelementptr inbounds double, double addrspace(1)* %43, i64 %64
%66 = load double, double addrspace(1)* %65, align 8
%67 = mul i64 %62, %34
%reass.add66 = add i64 %24, %67
%reass.mul67 = mul i64 %reass.add66, %21
%68 = add i64 %47, %reass.mul67
%69 = getelementptr inbounds double, double addrspace(1)* %26, i64 %68
%70 = load double, double addrspace(1)* %69, align 8
%71 = fsub double %70, %value_phi6
%72 = fmul double %71, %44
%73 = fmul double %72, %5
%74 = add i64 %49, %reass.mul67
%75 = getelementptr inbounds double, double addrspace(1)* %26, i64 %74
%76 = load double, double addrspace(1)* %75, align 8
%77 = fsub double %value_phi6, %76
%78 = fmul double %77, %44
%79 = fmul double %78, %5
%80 = fsub double %73, %79
%81 = fneg double %80
%82 = fmul double %81, %5
%reass.add70 = add i64 %67, %23
%reass.mul71 = mul i64 %reass.add70, %21
%83 = add i64 %42, %reass.mul71
%84 = getelementptr inbounds double, double addrspace(1)* %26, i64 %83
%85 = load double, double addrspace(1)* %84, align 8
%86 = fsub double %85, %value_phi6
%87 = fmul double %86, %44
%88 = fmul double %87, %6
%reass.add72 = add i64 %51, %67
%reass.mul73 = mul i64 %reass.add72, %21
%89 = add i64 %42, %reass.mul73
%90 = getelementptr inbounds double, double addrspace(1)* %26, i64 %89
%91 = load double, double addrspace(1)* %90, align 8
%92 = fsub double %value_phi6, %91
%93 = fmul double %92, %44
%94 = fmul double %93, %6
%95 = fsub double %88, %94
%96 = fmul double %95, %6
%97 = fsub double %82, %96
%98 = fsub double %value_phi8, %value_phi6
%99 = fmul double %98, %44
%100 = fmul double %99, %7
%101 = fsub double %value_phi6, %value_phi7
%102 = fmul double %101, %44
%103 = fmul double %102, %7
%104 = fsub double %100, %103
%105 = fmul double %104, %7
%106 = fsub double %97, %105
%107 = fmul double %66, %106
%108 = fmul double %107, %4
%109 = fadd double %value_phi6, %108
%110 = mul i64 %55, %62
%reass.add74 = add i64 %24, %110
%reass.mul75 = mul i64 %reass.add74, %53
%111 = add i64 %42, %reass.mul75
%112 = getelementptr inbounds double, double addrspace(1)* %28, i64 %111
store double %109, double addrspace(1)* %112, align 8
br label %L519
L519: ; preds = %L208, %L197, %L187
%.not62.not = icmp eq i64 %value_phi4, %value_phi
%113 = add nuw i64 %value_phi4, 1
br i1 %.not62.not, label %L532, label %L131
L532: ; preds = %L519, %conversion
ret void
} |
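To make the size of the regression concrete, the reported throughput figures can be recomputed from the timings printed above. This is a small CPU-only sketch that uses only numbers from the two runs; the variable names (`t_pr`, `t_ref`, `slowdown`) are introduced here for illustration.

```julia
# Values copied from the two runs reported above.
nx, ny, nz = 512, 512, 512
bytes_per_elem = sizeof(Float64)

# Lower bound on memory traffic per iteration in GB: read T, write T2, read Ci.
A_eff = (2 * 1 + 1) * nx * ny * nz * bytes_per_elem / 1e9

t_pr  = 0.010862522   # @belapsed minimum with this PR [s]
t_ref = 0.007945697   # @belapsed minimum with CUDA.jl v4.2.0 [s]

T_eff_pr  = A_eff / t_pr    # effective throughput with this PR [GB/s]
T_eff_ref = A_eff / t_ref   # effective throughput with CUDA.jl v4.2.0 [GB/s]
slowdown  = t_pr / t_ref    # ratio of per-iteration times

println("A_eff = $A_eff GB")
println("T_eff (PR) = $T_eff_pr GB/s, T_eff (v4.2.0) = $T_eff_ref GB/s")
println("slowdown = $slowdown")
```

So the kernel takes roughly 37% longer per iteration with this PR, even though register usage drops from 48 to 32.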
That's surprising. Nothing in the code points to an obvious performance issue though; on the contrary, actually. Try running with Nsight Compute to compare kernel execution times. Maybe reuse of i32 registers complicates ILP, and mixing both integer widths inadvertently creates more opportunity for parallelism? In any case, profiling kernels seems necessary here. |
This PR makes it possible to customize the index type of `CuDeviceArray`, which is a requirement for performing index calculations in 32 bits. It should improve performance by lowering register pressure, and because certain NVIDIA GPUs can execute fp32 and int32 ops in parallel. This has been requested by HPC people (@luraess or @omlins maybe, I don't remember, and Slack has eaten the conversation), but note that this PR is only the first step, as much of Julia's indexing logic assumes it can use machine-native integers (and `Int` is 64 bits on all platforms that support CUDA).
As such, this is only a test, and will need work both on CUDA.jl, to ensure that `CuDeviceArray` with an additional typevar is supported (i.e. this PR), and on CUDA.jl and Base, such that 32-bit indices are preserved longer than they currently are (where I'm hoping people will help).
So if you're interested in this feature, please contribute by taking your code, running it with CUDA.jl from this PR, inspecting the generated code (e.g. with Cthulhu using `@device_code_warntype interactive=true`, or using `@device_code_llvm`), finding where the 32-bit indices get widened to 64 bits, and opening PRs on relevant repositories to try and preserve the index type.
Demo of the above:
Note for example how the Julia IR doesn't have any 64-bit integers, yet the LLVM code still promotes. Inspecting with Cthulhu reveals:
This would require a change to LLVM.jl's `pointerref` in order to pass a 32-bit integer to `getelementptr`.
cc @jpsamaroo, I think you were also part of the Slack conversation
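The idea of carrying the index type as a type parameter can be illustrated on the CPU. The sketch below is not CUDA.jl's implementation: `IndexedArray` and `linear_index` are hypothetical names, and the struct is only a stand-in for an array type with an extra index typevar `I`. It shows that when the dimensions are stored as `I` and all index arguments are `I`, the column-major linear index stays in `I`.

```julia
# Hypothetical CPU-only stand-in for an array type with an index typevar `I`.
struct IndexedArray{T,N,I<:Integer}
    data::Vector{T}
    dims::NTuple{N,I}
end

Base.size(A::IndexedArray) = A.dims

# Column-major linear index computed entirely in the index type `I`.
function linear_index(A::IndexedArray{T,3,I}, i::I, j::I, k::I) where {T,I}
    nx, ny, _ = A.dims
    o = one(I)  # a typed 1, so no Int64 literal enters the arithmetic
    return i + (j - o) * nx + (k - o) * nx * ny
end

A32 = IndexedArray{Float64,3,Int32}(zeros(8), Int32.((2, 2, 2)))
idx = linear_index(A32, Int32(2), Int32(1), Int32(2))

println(typeof(size(A32)))
println(idx, " :: ", typeof(idx))
```

Any step that mixes in a plain integer literal or an `Int`-returning helper would widen the result again, which is the kind of spot this PR asks contributors to look for.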