WIP: Add an index typevar to CuDeviceArray. #1895

Draft: wants to merge 5 commits into master

maleadt
Member

@maleadt maleadt commented May 3, 2023

This PR makes it possible to customize the index type of CuDeviceArray, which is a requirement for performing index calculations in 32 bits. It should improve performance by lowering register pressure, and because certain NVIDIA GPUs can execute fp32 and int32 ops in parallel. This has been requested by HPC people (@luraess or @omlins maybe, I don't remember, and Slack has eaten the conversation), but note that this PR is only the first step, as much of Julia's indexing logic assumes it can use machine-native integers (and Int is 64 bits on all platforms that support CUDA).

As such, this is only a test. It will need work in CUDA.jl to ensure that CuDeviceArray with an additional typevar is supported (i.e. this PR), as well as improvements to CUDA.jl and Base such that 32-bit indices are preserved longer than they currently are (where I'm hoping people will help).

So if you're interested in this feature, please contribute by taking your code, running it with CUDA.jl from this PR, inspecting the generated code (e.g. with Cthulhu using @device_code_warntype interactive=true, or using @device_code_llvm), finding where the 32-bit indices get widened to 64 bits, and opening PRs on relevant repositories to try and preserve the index type.


Demo of the above:

julia> typeof(d_a)
CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}

# note how the host array doesn't have an index type. maybe we should?

julia> @device_code_warntype @cuda threads=len vadd(d_a, d_b, d_c)
PTX CompilerJob of MethodInstance for vadd(::CuDeviceMatrix{Float32, 1, Int32}, ::CuDeviceMatrix{Float32, 1, Int32}, ::CuDeviceMatrix{Float32, 1, Int32}) for sm_86

MethodInstance for vadd(::CuDeviceMatrix{Float32, 1, Int32}, ::CuDeviceMatrix{Float32, 1, Int32}, ::CuDeviceMatrix{Float32, 1, Int32})
  from vadd(a, b, c) in Main at REPL[21]:1
Arguments
  #self#::Core.Const(vadd)
  a::CuDeviceMatrix{Float32, 1, Int32}
  b::CuDeviceMatrix{Float32, 1, Int32}
  c::CuDeviceMatrix{Float32, 1, Int32}
Locals
  val::Float32
  i::Int32
Body::Nothing
1 ─ %1  = Main.blockIdx()::NamedTuple{(:x, :y, :z), Tuple{Int32, Int32, Int32}}
│   %2  = Base.getproperty(%1, :x)::Int32
│   %3  = (1 * Main.i32)::Core.Const(1)
│   %4  = (%2 - %3)::Int32
│   %5  = Main.blockDim()::NamedTuple{(:x, :y, :z), Tuple{Int32, Int32, Int32}}
│   %6  = Base.getproperty(%5, :x)::Int32
│   %7  = (%4 * %6)::Int32
│   %8  = Main.threadIdx()::NamedTuple{(:x, :y, :z), Tuple{Int32, Int32, Int32}}
│   %9  = Base.getproperty(%8, :x)::Int32
│         (i = %7 + %9)
│         nothing
│   %12 = Base.getindex(a, i)::Float32
│   %13 = Base.getindex(b, i)::Float32
│   %14 = (%12 + %13)::Float32
│         Base.setindex!(c, %14, i)
│         (val = %14)
│         nothing
│         val
└──       return nothing


julia> @device_code_llvm debuginfo=:none @cuda threads=len vadd(d_a, d_b, d_c)
; PTX CompilerJob of MethodInstance for vadd(::CuDeviceMatrix{Float32, 1, Int32}, ::CuDeviceMatrix{Float32, 1, Int32}, ::CuDeviceMatrix{Float32, 1, Int32}) for sm_86
define ptx_kernel void @_Z4vadd13CuDeviceArrayI7Float32Li2ELi1E5Int32ES_IS0_Li2ELi1ES1_ES_IS0_Li2ELi1ES1_E([1 x i64] %state, { i8 addrspace(1)*, i32, [2 x i32], i32 } %0, { i8 addrspace(1)*, i32, [2 x i32], i32 } %1, { i8 addrspace(1)*, i32, [2 x i32], i32 } %2) local_unnamed_addr #1 {
conversion:
  %.fca.0.extract11 = extractvalue { i8 addrspace(1)*, i32, [2 x i32], i32 } %0, 0
  %.fca.0.extract1 = extractvalue { i8 addrspace(1)*, i32, [2 x i32], i32 } %1, 0
  %.fca.0.extract = extractvalue { i8 addrspace(1)*, i32, [2 x i32], i32 } %2, 0
  %3 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
  %4 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
  %5 = mul i32 %4, %3
  %6 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %7 = add i32 %5, %6
  %8 = sext i32 %7 to i64
  %9 = bitcast i8 addrspace(1)* %.fca.0.extract11 to float addrspace(1)*
  %10 = getelementptr inbounds float, float addrspace(1)* %9, i64 %8
  %11 = load float, float addrspace(1)* %10, align 4
  %12 = bitcast i8 addrspace(1)* %.fca.0.extract1 to float addrspace(1)*
  %13 = getelementptr inbounds float, float addrspace(1)* %12, i64 %8
  %14 = load float, float addrspace(1)* %13, align 4
  %15 = fadd float %11, %14
  %16 = bitcast i8 addrspace(1)* %.fca.0.extract to float addrspace(1)*
  %17 = getelementptr inbounds float, float addrspace(1)* %16, i64 %8
  store float %15, float addrspace(1)* %17, align 4
  ret void
}

Note, for example, how the Julia IR doesn't have any 64-bit integers, yet the LLVM code still promotes. Inspecting with Cthulhu reveals:

  3 ── %35 = Base.getfield(a, :ptr)::Core.LLVMPtr{Float32, 1}                                                                                                                                                                               ││╻╷╷╷     #arrayref
  │    %36 = Base.llvmcall::Core.Const(Core.Intrinsics.llvmcall)                                                                                                                                                                            │││╻╷╷      arrayref_bits
  │    %37 = Core.tuple("; ModuleID = 'llvmcall'\nsource_filename = \"llvmcall\"\n\n; Function Attrs: alwaysinline\ndefine float @entry(i8 addrspace(1)* %0, i64 %1) #0 {\nentry:\n  %2 = bitcast i8 addrspace(1)* %0 to float addrspace(1)*\n  %3 = getelementptr inbounds float, float addrspace(1)* %2, i64 %1\n  %4 = load float, float addrspace(1)* %3, align 4, !tbaa !0\n  ret float %4\n}\n\nattributes #0 = { alwaysinline }\n\n!0 = !{!1, !1, i64 0, i64 0}\n!1 = !{!\"custom_tbaa_addrspace(1)\", !2, i64 0}\n!2 = !{!\"custom_tbaa\"}\n", "entry")::Core.Const(("; ModuleID = 'llvmcall'\nsource_filename = \"llvmcall\"\n\n; Function Attrs: alwaysinline\ndefine float @entry(i8 addrspace(1)* %0, i64 %1) #0 {\nentry:\n  %2 = bitcast i8 addrspace(1)* %0 to float addrspace(1)*\n  %3 = getelementptr inbounds float, float addrspace(1)* %2, i64 %1\n  %4 = load float, float addrspace(1)* %3, align 4, !tbaa !0\n  ret float %4\n}\n\nattributes #0 = { alwaysinline }\n\n!0 = !{!1, !1, i64 0, i64 0}\n!1 = !{!\"custom_tbaa_addrspace(1)\", !2, i64 0}\n!2 = !{!\"custom_tbaa\"}\n", "entry"))
  │    %38 = Base.sub_int(%32, 1)::Int32                                                                                                                                                                                                    │││││┃│││     pointerref
  │    %39 = Core.sext_int(Core.Int64, %38)::Int64                                                                                                                                                                                          ││││││╻        macro expansion
  │    %40 = (%36)(%37, Float32, Tuple{Core.LLVMPtr{Float32, 1}, Int64}, %35, %39)::Float32                                                                                                                                                 │││││││┃        macro expansion
  └───       goto #4                                                                                                                                                                                                                        │││

This would require a change to LLVM.jl's pointerref in order to pass a 32-bit integer to getelementptr.


cc @jpsamaroo, I think you were also part of the Slack conversation

@maleadt added the enhancement, performance, and help wanted labels May 3, 2023
@maleadt force-pushed the tb/32bit_device_array branch from 254bc7e to db2723f May 3, 2023 15:15
@maleadt
Member Author

maleadt commented May 3, 2023

Debugging breadcrumb: mapreduce fails to compile due to broadcast's getindex returning either an Int64 or Int32:

    72 ┄─ %173 = φ (#64 => %119, #71 => %119)::Int64                                                                                                                                                                        │
121 │     %174 = Base.slt_int(%173, %112)::Bool                                                                                                                                                                             │╻╷╷╷           max
    │     %175 = Core.ifelse::Core.Const(Core.ifelse)                                                                                                                                                                       ││╻              map
    │     %176 = (%175)(%174, %112, %173)::Int64                                                                                                                                                                            │││┃│             max
    │            nothing                                                                                                                                                                                                    │││
    │            nothing                                                                                                                                                                                                    ││╻              CartesianIndex
122 │     %179 = Base.getfield(As, 1, false)::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, typeof(identity), Tuple{CuDeviceVector{Int64, 1, Int32}}}                                          │╻╷             _map_getindex
    └────        goto #73                                                                                                                                                                                                   ││╻              getindex
    73 ── %181 = Base.getfield(%179, :args)::Tuple{CuDeviceVector{Int64, 1, Int32}}                                                                                                                                         │││╻╷             _broadcast_getindex
    │     %182 = Base.getfield(%181, 1, false)::CuDeviceVector{Int64, 1, Int32}                                                                                                                                             ││││╻              _getindex
    │     %183 = Base.getfield(%182, :dims)::Tuple{Int32}                                                                                                                                                                   │││││╻╷╷╷           _broadcast_getindex
    └──── %184 = Base.getfield(%183, 1, true)::Int32                                                                                                                                                                        ││││││╻              newindex
    74 ── %185 = Base.slt_int(%184, 0)::Bool                                                                                                                                                                                │││││││╻╷╷╷           axes
    │     %186 = Core.ifelse::Core.Const(Core.ifelse)                                                                                                                                                                       ││││││││╻              map
    │     %187 = (%186)(%185, 0, %184)::Int32                                                                                                                                                                               │││││││││┃││││          oneto
    │            nothing                                                                                                                                                                                                    ││││││││││┃│             OneTo
    └────        goto #75                                                                                                                                                                                                   │││││││││││┃              OneTo
    75 ──        goto #76                                                                                                                                                                                                   │││││││││││
    76 ──        goto #77                                                                                                                                                                                                   ││││││││││
    77 ──        goto #78                                                                                                                                                                                                   │││││││││
    78 ──        goto #79                                                                                                                                                                                                   ││││││││
    79 ── %194 = Core.sext_int(Core.Int64, %187)::Int64                                                                                                                                                                     ││││││││╻╷╷            length
    └──── %195 = (%194 === 1)::Bool                                                                                                                                                                                         ││││││││╻              ==
    80 ──        goto #81                                                                                                                                                                                                   ││││││││╻              getindex
    81 ──        goto #82                                                                                                                                                                                                   │││││││││
    82 ── %198 = Core.ifelse::Core.Const(Core.ifelse)                                                                                                                                                                       ││││││││╻              ifelse
    │     %199 = (%198)(%195, 1, %176)::Union{Int32, Int64}                                                                                                                                                                 │││││││││
    │     %200 = Core.tuple(%199)::Tuple{Union{Int32, Int64}}                                                                                                                                                               ││││││││
    └────        goto #83                                                                                                                                                                                                   ││││││││
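
The `Union{Int32, Int64}` at the end of that listing comes from `ifelse` being called with one `Int32` branch and one `Int64` branch. Below is a minimal CPU-only sketch of the same inference result. The helper names `pick` and `pick_narrow` are made up for illustration, `Base.return_types` is just a reflection helper, and the `oftype` conversion is one possible workaround rather than what Base or this PR does (it would throw if the wide value did not fit in 32 bits).

```julia
# Hypothetical helper mirroring the `ifelse(cond, a, b)` pattern in the IR above.
pick(cond, a, b) = ifelse(cond, a, b)

# With branches of different integer widths, inference can only report a Union.
mixed = only(Base.return_types(pick, (Bool, Int32, Int64)))
@show mixed

# Assumed workaround: convert the wide branch to the narrow branch's type first.
pick_narrow(cond, a, b) = ifelse(cond, a, oftype(a, b))
narrow = only(Base.return_types(pick_narrow, (Bool, Int32, Int64)))
@show narrow
```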

@maleadt
Member Author

maleadt commented May 9, 2023

I was sceptical that this change would do much without a thorough pass over all of Base, so I did a test using this PR + the LLVM.jl and GPUCompiler.jl PRs above on Broadcast (which does a whole lot of 64-bit integer stuff):

Reference:

PTX (i.e., virtual registers):
    .reg .pred 	%p<10>;
    .reg .b16 	%rs<3>;
    .reg .f32 	%f<7>;
    .reg .b32 	%r<7>;
    .reg .b64 	%rd<58>;

Effective:
    registers(kernel) = 17

This PR:

PTX:
    .reg .pred 	%p<10>;
    .reg .b16 	%rs<3>;
    .reg .f32 	%f<7>;
    .reg .b32 	%r<29>;
    .reg .b64 	%rd<37>;

Effective:
    registers(kernel) = 15

So not a spectacular reduction, but better than I expected nonetheless. KA.jl-heavy code probably would benefit much more (unless KA.jl itself assumes Int64).

@vchuravy
Member

vchuravy commented May 9, 2023

KA.jl-heavy code probably would benefit much more (unless KA.jl itself assumes Int64)

Currently it does, but we can change that.

@omlins
Contributor

omlins commented May 10, 2023

I was sceptical that this change would do much without a thorough pass over all of Base

@maleadt What you did already has a drastic impact on some kernels! 👍 The following example now requires half as many registers (improving from 21 to 10 registers):

using CUDA

function copy3D!(T2, T, Ci)
    ix = (blockIdx().x-UInt32(1)) * blockDim().x + threadIdx().x
    iy = (blockIdx().y-UInt32(1)) * blockDim().y + threadIdx().y
    iz = (blockIdx().z-UInt32(1)) * blockDim().z + threadIdx().z
    @inbounds T2[ix-1,iy-2,iz+1] = 3.4
    @inbounds T[ix-1,iy-2,iz+1] = 3.4
    return
end

T  = CUDA.zeros(Float64, 2,2,2);
T2 = CUDA.zeros(Float64, 2,2,2);
Ci = CUDA.zeros(Float64, 2,2,2);
@device_code_llvm debuginfo=:none @cuda launch=false copy3D!(T2, T, Ci)
kernel = @cuda launch=false copy3D!(T2, T, Ci)
@show CUDA.registers(kernel);
@show CUDA.memory(kernel);

@omlins
Contributor

omlins commented May 10, 2023

@maleadt When I remove the UInt32 casting in the above code, it introduces some i64 operations. Here is the code:

using CUDA

function copy3D!(T2, T, Ci)
    ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
    iy = (blockIdx().y-1) * blockDim().y + threadIdx().y
    iz = (blockIdx().z-1) * blockDim().z + threadIdx().z
    @inbounds T2[ix-1,iy-2,iz+1] = 3.4
    @inbounds T[ix-1,iy-2,iz+1] = 3.4
    return
end

T  = CUDA.zeros(Float64, 2,2,2);
T2 = CUDA.zeros(Float64, 2,2,2);
Ci = CUDA.zeros(Float64, 2,2,2);
@device_code_llvm debuginfo=:none @cuda launch=false copy3D!(T2, T, Ci)
kernel = @cuda launch=false copy3D!(T2, T, Ci)
@show CUDA.registers(kernel);
@show CUDA.memory(kernel);

And here is the output:

julia> using CUDA

julia> function copy3D!(T2, T, Ci)
           ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
           iy = (blockIdx().y-1) * blockDim().y + threadIdx().y
           iz = (blockIdx().z-1) * blockDim().z + threadIdx().z
           @inbounds T2[ix-1,iy-2,iz+1] = 3.4
           @inbounds T[ix-1,iy-2,iz+1] = 3.4
           return
       end
copy3D! (generic function with 1 method)

julia> T  = CUDA.zeros(Float64, 2,2,2);

julia> T2 = CUDA.zeros(Float64, 2,2,2);

julia> Ci = CUDA.zeros(Float64, 2,2,2);

julia> @device_code_llvm debuginfo=:none @cuda launch=false copy3D!(T2, T, Ci)
; PTX CompilerJob of MethodInstance for copy3D!(::CuDeviceArray{Float64, 3, 1, Int32}, ::CuDeviceArray{Float64, 3, 1, Int32}, ::CuDeviceArray{Float64, 3, 1, Int32}) for sm_60
define ptx_kernel void @_Z7copy3D_13CuDeviceArrayI7Float64Li3ELi1E5Int32ES_IS0_Li3ELi1ES1_ES_IS0_Li3ELi1ES1_E([1 x i64] %state, { i8 addrspace(1)*, i32, [3 x i32], i32 } %0, { i8 addrspace(1)*, i32, [3 x i32], i32 } %1, { i8 addrspace(1)*, i32, [3 x i32], i32 } %2) local_unnamed_addr #1 {
conversion:
  %.fca.0.extract12 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %0, 0
  %.fca.2.0.extract14 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %0, 2, 0
  %.fca.2.1.extract15 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %0, 2, 1
  %.fca.0.extract1 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %1, 0
  %.fca.2.0.extract3 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %1, 2, 0
  %.fca.2.1.extract4 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %1, 2, 1
  %3 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
  %4 = zext i32 %3 to i64
  %5 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
  %6 = zext i32 %5 to i64
  %7 = mul nuw nsw i64 %6, %4
  %8 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %9 = add nuw nsw i32 %8, 1
  %10 = zext i32 %9 to i64
  %11 = add nuw nsw i64 %7, %10
  %12 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.y()
  %13 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.y()
  %narrow = mul nuw nsw i32 %13, %12
  %14 = call i32 @llvm.nvvm.read.ptx.sreg.tid.y()
  %15 = add nuw nsw i32 %14, 1
  %narrow26 = add nuw nsw i32 %15, %narrow
  %16 = zext i32 %narrow26 to i64
  %17 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.z()
  %18 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.z()
  %narrow27 = mul nuw nsw i32 %18, %17
  %19 = call i32 @llvm.nvvm.read.ptx.sreg.tid.z()
  %20 = add nuw nsw i32 %19, 1
  %narrow28 = add nuw nsw i32 %20, %narrow27
  %21 = zext i32 %narrow28 to i64
  %22 = icmp sgt i32 %.fca.2.0.extract14, 0
  %23 = select i1 %22, i32 %.fca.2.0.extract14, i32 0
  %24 = icmp sgt i32 %.fca.2.1.extract15, 0
  %25 = select i1 %24, i32 %.fca.2.1.extract15, i32 0
  %26 = zext i32 %23 to i64
  %27 = zext i32 %25 to i64
  %28 = add nsw i64 %16, -3
  %29 = mul nuw nsw i64 %21, %27
  %reass.add = add nsw i64 %28, %29
  %reass.mul = mul i64 %reass.add, %26
  %30 = add nuw nsw i64 %11, 4294967295
  %31 = add i64 %30, %reass.mul
  %32 = bitcast i8 addrspace(1)* %.fca.0.extract12 to double addrspace(1)*
  %33 = trunc i64 %31 to i32
  %34 = add i32 %33, -1
  %35 = getelementptr inbounds double, double addrspace(1)* %32, i32 %34
  store double 3.400000e+00, double addrspace(1)* %35, align 8
  %36 = icmp sgt i32 %.fca.2.0.extract3, 0
  %37 = select i1 %36, i32 %.fca.2.0.extract3, i32 0
  %38 = icmp sgt i32 %.fca.2.1.extract4, 0
  %39 = select i1 %38, i32 %.fca.2.1.extract4, i32 0
  %40 = zext i32 %37 to i64
  %41 = zext i32 %39 to i64
  %42 = mul nuw nsw i64 %21, %41
  %reass.add29 = add nsw i64 %28, %42
  %reass.mul30 = mul i64 %reass.add29, %40
  %43 = add i64 %30, %reass.mul30
  %44 = bitcast i8 addrspace(1)* %.fca.0.extract1 to double addrspace(1)*
  %45 = trunc i64 %43 to i32
  %46 = add i32 %45, -1
  %47 = getelementptr inbounds double, double addrspace(1)* %44, i32 %46
  store double 3.400000e+00, double addrspace(1)* %47, align 8
  ret void
}

julia> kernel = @cuda launch=false copy3D!(T2, T, Ci)
CUDA.HostKernel{typeof(copy3D!), Tuple{CuDeviceArray{Float64, 3, 1, Int32}, CuDeviceArray{Float64, 3, 1, Int32}, CuDeviceArray{Float64, 3, 1, Int32}}}(copy3D!, CuFunction(Ptr{CUDA.CUfunc_st} @0x0000000005f1de40, CuModule(Ptr{CUDA.CUmod_st} @0x0000000005f0fac0, CuContext(0x0000000001401640, instance a86b98eac7129536))), CUDA.KernelState(Ptr{Nothing} @0x00001553b4a00000))

julia> @show CUDA.registers(kernel);
CUDA.registers(kernel) = 12

julia> @show CUDA.memory(kernel);
CUDA.memory(kernel) = (local = 0, shared = 0, constant = 0)

@maleadt
Member Author

maleadt commented May 10, 2023

When i remove the UInt32 casting in the above code, then it introduces some i64 operations.

Well, yeah, because you're introducing Int64s. Or what did you expect?
That's the whole problem with expecting Int32s; Julia's integer literals are 64 bits. And that's why I asked for help to audit existing code, which instead of - 1 will have to do stuff like - one(T) (as far as the dispatch allows, because Dims is hard-coded to Int).
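
To make the literal issue concrete, here is a small CPU-only sketch of Julia's promotion rules, assuming a 64-bit host. The function names `prev_wide` and `prev_same` are invented for this illustration and do not come from CUDA.jl or Base.

```julia
# A bare literal `1` is an Int (Int64 on 64-bit systems), so it widens an Int32 index.
prev_wide(i) = i - 1

# `one(i)` has the same type as `i`, so the index type is preserved.
prev_same(i) = i - one(i)

i = Int32(5)
@show typeof(prev_wide(i))
@show typeof(prev_same(i))
```

Writing `Int32(1)` instead of `one(i)` has the same effect when the index type is known up front; `one(i)` keeps the code generic over the index type.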

@omlins
Contributor

omlins commented May 10, 2023

Well, yeah, because you're introducing Int64s. Or what did you expect?

I originally did expect literals to be treated as Int64. However, as in the first example there are no Int64s introduced (see below), I thought you had taken care of this somehow to some extent. If there is no simple solution to that, I don't think having to cast literals is a priority problem to solve now.

Here is the LLVM code of the first example (note that there are no Int64 introduced, even though we have uncasted literals in the array assignments):

julia> using CUDA

julia> function copy3D!(T2, T, Ci)
           ix = (blockIdx().x-UInt32(1)) * blockDim().x + threadIdx().x
           iy = (blockIdx().y-UInt32(1)) * blockDim().y + threadIdx().y
           iz = (blockIdx().z-UInt32(1)) * blockDim().z + threadIdx().z
           @inbounds T2[ix-1,iy-2,iz+1] = 3.4
           @inbounds T[ix-1,iy-2,iz+1] = 3.4
           return
       end
copy3D! (generic function with 2 methods)

julia> T  = CUDA.zeros(Float64, 2,2,2);

julia> T2 = CUDA.zeros(Float64, 2,2,2);

julia> Ci = CUDA.zeros(Float64, 2,2,2);

julia> @device_code_llvm debuginfo=:none @cuda launch=false copy3D!(T2, T, Ci)

; PTX CompilerJob of MethodInstance for copy3D!(::CuDeviceArray{Float64, 3, 1, Int32}, ::CuDeviceArray{Float64, 3, 1, Int32}, ::CuDeviceArray{Float64, 3, 1, Int32}) for sm_60
define ptx_kernel void @_Z7copy3D_13CuDeviceArrayI7Float64Li3ELi1E5Int32ES_IS0_Li3ELi1ES1_ES_IS0_Li3ELi1ES1_E([1 x i64] %state, { i8 addrspace(1)*, i32, [3 x i32], i32 } %0, { i8 addrspace(1)*, i32, [3 x i32], i32 } %1, { i8 addrspace(1)*, i32, [3 x i32], i32 } %2) local_unnamed_addr #1 {
conversion:
  %.fca.0.extract12 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %0, 0
  %.fca.2.0.extract14 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %0, 2, 0
  %.fca.2.1.extract15 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %0, 2, 1
  %.fca.0.extract1 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %1, 0
  %.fca.2.0.extract3 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %1, 2, 0
  %.fca.2.1.extract4 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %1, 2, 1
  %3 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
  %4 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
  %5 = mul i32 %4, %3
  %6 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %7 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.y()
  %8 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.y()
  %9 = mul nuw nsw i32 %8, %7
  %10 = call i32 @llvm.nvvm.read.ptx.sreg.tid.y()
  %11 = add nuw nsw i32 %10, 1
  %12 = add nuw nsw i32 %11, %9
  %13 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.z()
  %14 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.z()
  %15 = mul nuw nsw i32 %14, %13
  %16 = call i32 @llvm.nvvm.read.ptx.sreg.tid.z()
  %17 = add nuw nsw i32 %16, 1
  %18 = add nuw nsw i32 %17, %15
  %19 = icmp sgt i32 %.fca.2.0.extract14, 0
  %20 = select i1 %19, i32 %.fca.2.0.extract14, i32 0
  %21 = icmp sgt i32 %.fca.2.1.extract15, 0
  %22 = select i1 %21, i32 %.fca.2.1.extract15, i32 0
  %23 = add nsw i32 %12, -3
  %24 = mul i32 %18, %22
  %reass.add = add i32 %23, %24
  %reass.mul = mul i32 %reass.add, %20
  %25 = add i32 %5, %6
  %26 = add i32 %25, -1
  %27 = add i32 %26, %reass.mul
  %28 = bitcast i8 addrspace(1)* %.fca.0.extract12 to double addrspace(1)*
  %29 = getelementptr inbounds double, double addrspace(1)* %28, i32 %27
  store double 3.400000e+00, double addrspace(1)* %29, align 8
  %30 = icmp sgt i32 %.fca.2.0.extract3, 0
  %31 = select i1 %30, i32 %.fca.2.0.extract3, i32 0
  %32 = icmp sgt i32 %.fca.2.1.extract4, 0
  %33 = select i1 %32, i32 %.fca.2.1.extract4, i32 0
  %34 = mul i32 %18, %33
  %reass.add26 = add i32 %23, %34
  %reass.mul27 = mul i32 %reass.add26, %31
  %35 = add i32 %26, %reass.mul27
  %36 = bitcast i8 addrspace(1)* %.fca.0.extract1 to double addrspace(1)*
  %37 = getelementptr inbounds double, double addrspace(1)* %36, i32 %35
  store double 3.400000e+00, double addrspace(1)* %37, align 8
  ret void
}

julia> kernel = @cuda launch=false copy3D!(T2, T, Ci)
CUDA.HostKernel{typeof(copy3D!), Tuple{CuDeviceArray{Float64, 3, 1, Int32}, CuDeviceArray{Float64, 3, 1, Int32}, CuDeviceArray{Float64, 3, 1, Int32}}}(copy3D!, CuFunction(Ptr{CUDA.CUfunc_st} @0x000000000649b690, CuModule(Ptr{CUDA.CUmod_st} @0x0000000006523ff0, CuContext(0x00000000014542b0, instance 2d6c2e17d661c696))), CUDA.KernelState(Ptr{Nothing} @0x00001553b4a00000))

julia> @show CUDA.registers(kernel);
CUDA.registers(kernel) = 10

julia> @show CUDA.memory(kernel);
CUDA.memory(kernel) = (local = 0, shared = 0, constant = 0)

@maleadt
Member Author

maleadt commented May 10, 2023

Here is the LLVM code of the first example (note that there are no Int64 introduced, even though we have uncasted literals in the array assignments):

LLVM probably managed to optimize them away. We can't change the fundamental nature of integer literals being Int64 on 64-bit systems from the GPUCompiler side.
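
At the Julia level the first example does mix widths: subtracting `UInt32(1)` from an `Int32` gives a `UInt32`, and `ix-1` with a plain literal then promotes to `Int64`. Any narrowing back to `i32` happens later, in LLVM's optimizer. A CPU-only sketch of the types involved, assuming a 64-bit host (the values are arbitrary stand-ins for the thread and block indices):

```julia
bx = Int32(3)                                # stands in for blockIdx().x (an Int32)
ix = (bx - UInt32(1)) * Int32(4) + Int32(2)  # Int32 and UInt32 promote to UInt32
@show typeof(ix)

shifted = ix - 1                             # UInt32 and an Int64 literal promote to Int64
@show typeof(shifted)
```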

@omlins
Contributor

omlins commented May 10, 2023

So if you're interested in this feature, please contribute by taking your code, running it with CUDA.jl from this PR, inspecting the generated code (e.g. with Cthulhu using @device_code_warntype interactive=true, or using @device_code_llvm), finding where the 32-bit indices get widened to 64-bits, and opening PRs on relevant repositories to try and preserve the index type.

@luraess, @utkinis, @albert-de-montserrat: could you please run some of your codes with CUDA.jl from this PR and the corresponding GPUCompiler and LLVM branches? In summary, the branches are the following:

  [052768ef] CUDA v4.2.0 `https://github.com/JuliaGPU/CUDA.jl.git#tb/32bit_device_array`
  [61eb1bfa] GPUCompiler v0.19.3 `https://github.com/JuliaGPU/GPUCompiler.jl.git#tb/ptx_dl_32bit`
  [929cbde3] LLVM v5.0.0 `https://github.com/maleadt/LLVM.jl.git#tb/pointerref_int32`

@omlins
Contributor

omlins commented May 10, 2023

@maleadt: I have run some little test codes with increasing complexity (including some examples from https://github.com/omlins/julia-gpu-course/blob/main/solutions/4_datatransfer_optimisations_advanced_part2.ipynb), always casting literal integers to Int32.

I have encountered a first issue. The following little diffusion code requires fewer registers with this PR, as expected (and does not include any i64 operations); however, this leads to worse performance, not better. Thus, something is not going as it should. Here is the code:

using CUDA
using BenchmarkTools

function diffusion3D_step!(T2, T, Ci, lam, dt, _dx, _dy, _dz)
    ix = (blockIdx().x-Int32(1)) * blockDim().x + threadIdx().x
    iy = (blockIdx().y-Int32(1)) * blockDim().y + threadIdx().y
    T_ix_iy_izm1 = 0.0
    T_ix_iy_iz   = 0.0
    T_ix_iy_izp1 = T[ix,iy,Int32(1)]
    for iz = Int32(1):size(T2,3)
        T_ix_iy_izm1   = T_ix_iy_iz
        T_ix_iy_iz     = T_ix_iy_izp1
        T_ix_iy_izp1   = iz<size(T2,3) ? T[ix,iy,iz+Int32(1)] : 0.0
        if (ix>Int32(1) && ix<size(T2,1) && iy>Int32(1) && iy<size(T2,2) && iz>Int32(1) && iz<size(T2,3))
            T2[ix,iy,iz] = T_ix_iy_iz + dt*(Ci[ix,iy,iz]*(
                            - ((-lam*(T[ix+Int32(1),iy,iz] - T_ix_iy_iz)*_dx) - (-lam*(T_ix_iy_iz - T[ix-Int32(1),iy,iz])*_dx))*_dx
                            - ((-lam*(T[ix,iy+Int32(1),iz] - T_ix_iy_iz)*_dy) - (-lam*(T_ix_iy_iz - T[ix,iy-Int32(1),iz])*_dy))*_dy
                            - ((-lam*(T_ix_iy_izp1 - T_ix_iy_iz)*_dz) - (-lam*(T_ix_iy_iz - T_ix_iy_izm1)*_dz))*_dz
                            ));
        end
    end
    return
end

function diffusion3D()
# Physics
lam        = 1.0;                                        # Thermal conductivity
c0         = 2.0;                                        # Heat capacity
lx, ly, lz = 1.0, 1.0, 1.0;                              # Length of computational domain in dimension x, y and z

# Numerics
nx, ny, nz = 512, 512, 512;                              # Number of gridpoints in dimensions x, y and z
nt         = 100;                                        # Number of time steps
dx         = lx/(nx-1);                                  # Space step in x-dimension
dy         = ly/(ny-1);                                  # Space step in y-dimension
dz         = lz/(nz-1);                                  # Space step in z-dimension
_dx, _dy, _dz = 1.0/dx, 1.0/dy, 1.0/dz;

# Array initializations
T   = CUDA.zeros(Float64, nx, ny, nz);
T2  = CUDA.zeros(Float64, nx, ny, nz);
Ci  = CUDA.zeros(Float64, nx, ny, nz);

# Initial conditions
Ci .= 1/c0;                                              # 1/Heat capacity
T  .= 1.7;
T2 .= T;                                                 # Assign also T2 to get correct boundary conditions.

# GPU launch parameters
threads = (32, 8)
blocks  = (nx, ny) .÷ threads

# Time loop
dt   = 0.000001 #min(dx^2,dy^2,dz^2)/lam/maximum(Ci)/6.1;          # Time step for 3D Heat diffusion
for it = 1:nt
    if (it == 11) GC.gc(); global t_tic=time(); end      # Start measuring time.
    @cuda blocks=blocks threads=threads diffusion3D_step!(T2, T, Ci, lam, dt, _dx, _dy, _dz);
    synchronize()
    T, T2 = T2, T;
end
time_s = time() - t_tic

# Performance
A_eff = (2*1+1)*1/1e9*nx*ny*nz*sizeof(eltype(T));        # Effective main memory access per iteration [GB] (Lower bound of required memory access: T has to be read and written: 2 whole-array memaccess; Ci has to be read: 1 whole-array memaccess)
t_it  = time_s/(nt-10);                                  # Execution time per iteration [s]
T_eff = A_eff/t_it;                                      # Effective memory throughput [GB/s]
println("time_s=$time_s t_it=$t_it T_eff=$T_eff");

# Performance
A_eff = (2*1+1)*1/1e9*nx*ny*nz*sizeof(eltype(T));        # Effective main memory access per iteration [GB] (Lower bound of required memory access: T has to be read and written: 2 whole-array memaccess; Ci has to be read: 1 whole-array memaccess)
t_it = @belapsed begin @cuda blocks=$blocks threads=$threads diffusion3D_step!($T2, $T, $Ci, $lam, $dt, $_dx, $_dy, $_dz); synchronize() end
println("Benchmarktools (min): t_it=$t_it T_eff=$(A_eff/t_it)");

# Resource usage
@show kernel = @cuda launch=false diffusion3D_step!(T2, T, Ci, lam, dt, _dx, _dy, _dz);
@show CUDA.registers(kernel);
@show CUDA.memory(kernel);
@device_code_llvm debuginfo=:none @cuda diffusion3D_step!(T2, T, Ci, lam, dt, _dx, _dy, _dz);

end

diffusion3D()

Here is the output from running it with this PR:

omlins@nid00000:~/tmpwdir/cuda_perf> julia -O3 --check-bounds=no diffusion3D_cuda_3regqueue_novis_int32.jl
time_s=1.0009851455688477 t_it=0.011122057172987197 T_eff=289.6249697244483
Benchmarktools (min): t_it=0.010862522 T_eff=296.54489740043795
kernel = #= /users/omlins/tmpwdir/cuda_perf/diffusion3D_cuda_3regqueue_novis_int32.jl:75 =# @cuda(launch = false, diffusion3D_step!(T2, T, Ci, lam, dt, _dx, _dy, _dz)) = CUDA.HostKernel{typeof(diffusion3D_step!), Tuple{CuDeviceArray{Float64, 3, 1, Int32}, CuDeviceArray{Float64, 3, 1, Int32}, CuDeviceArray{Float64, 3, 1, Int32}, Float64, Float64, Float64, Float64, Float64}}(diffusion3D_step!, CuFunction(Ptr{CUDA.CUfunc_st} @0x0000000006e97940, CuModule(Ptr{CUDA.CUmod_st} @0x000000000777fd50, CuContext(0x0000000001488260, instance caed86fb0f770aba))), CUDA.KernelState(Ptr{Nothing} @0x00001553b4800000))
CUDA.registers(kernel) = 32
CUDA.memory(kernel) = (local = 0, shared = 0, constant = 0)
; PTX CompilerJob of MethodInstance for diffusion3D_step!(::CuDeviceArray{Float64, 3, 1, Int32}, ::CuDeviceArray{Float64, 3, 1, Int32}, ::CuDeviceArray{Float64, 3, 1, Int32}, ::Float64, ::Float64, ::Float64, ::Float64, ::Float64) for sm_60
define ptx_kernel void @_Z17diffusion3D_step_13CuDeviceArrayI7Float64Li3ELi1E5Int32ES_IS0_Li3ELi1ES1_ES_IS0_Li3ELi1ES1_ES0_S0_S0_S0_S0_([1 x i64] %state, { i8 addrspace(1)*, i32, [3 x i32], i32 } %0, { i8 addrspace(1)*, i32, [3 x i32], i32 } %1, { i8 addrspace(1)*, i32, [3 x i32], i32 } %2, double %3, double %4, double %5, double %6, double %7) local_unnamed_addr #1 {
conversion:
  %.fca.0.extract38 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %0, 0
  %.fca.2.0.extract40 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %0, 2, 0
  %.fca.2.1.extract41 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %0, 2, 1
  %.fca.2.2.extract42 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %0, 2, 2
  %.fca.0.extract4 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %1, 0
  %.fca.2.0.extract6 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %1, 2, 0
  %.fca.2.1.extract7 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %1, 2, 1
  %.fca.2.0.extract = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %2, 2, 0
  %.fca.2.1.extract = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %2, 2, 1
  %8 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
  %9 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
  %10 = mul i32 %9, %8
  %11 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %12 = add i32 %10, %11
  %13 = add i32 %12, 1
  %14 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.y()
  %15 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.y()
  %16 = mul nuw nsw i32 %15, %14
  %17 = call i32 @llvm.nvvm.read.ptx.sreg.tid.y()
  %18 = add nuw nsw i32 %16, %17
  %19 = add nuw nsw i32 %18, 1
  %20 = icmp sgt i32 %.fca.2.0.extract6, 0
  %21 = select i1 %20, i32 %.fca.2.0.extract6, i32 0
  %22 = mul i32 %18, %21
  %23 = add i32 %12, %22
  %24 = bitcast i8 addrspace(1)* %.fca.0.extract4 to double addrspace(1)*
  %25 = getelementptr inbounds double, double addrspace(1)* %24, i32 %23
  %26 = load double, double addrspace(1)* %25, align 8
  %.inv = icmp sgt i32 %.fca.2.2.extract42, 0
  %value_phi = select i1 %.inv, i32 %.fca.2.2.extract42, i32 0
  %27 = icmp slt i32 %value_phi, 1
  %28 = bitcast i8 addrspace(1)* %.fca.0.extract38 to double addrspace(1)*
  br i1 %27, label %L560, label %L133.preheader

L133.preheader:                                   ; preds = %conversion
  %.fca.0.extract = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %2, 0
  %29 = icmp sgt i32 %.fca.2.1.extract7, 0
  %30 = select i1 %29, i32 %.fca.2.1.extract7, i32 0
  %31 = icmp slt i32 %13, 2
  %.not47 = icmp sge i32 %13, %.fca.2.0.extract40
  %32 = icmp eq i32 %18, 0
  %or.cond = select i1 %.not47, i1 true, i1 %32
  %.not48 = icmp sge i32 %19, %.fca.2.1.extract41
  %33 = icmp sgt i32 %.fca.2.0.extract, 0
  %34 = select i1 %33, i32 %.fca.2.0.extract, i32 0
  %35 = icmp sgt i32 %.fca.2.1.extract, 0
  %36 = select i1 %35, i32 %.fca.2.1.extract, i32 0
  %37 = bitcast i8 addrspace(1)* %.fca.0.extract to double addrspace(1)*
  %38 = fneg double %3
  %39 = add i32 %12, -1
  %40 = add nsw i32 %18, -1
  %41 = icmp sgt i32 %.fca.2.0.extract40, 0
  %42 = select i1 %41, i32 %.fca.2.0.extract40, i32 0
  %43 = icmp sgt i32 %.fca.2.1.extract41, 0
  %44 = select i1 %43, i32 %.fca.2.1.extract41, i32 0
  br i1 %31, label %L560, label %L133

L133:                                             ; preds = %L547, %L133.preheader
  %value_phi4 = phi i32 [ %101, %L547 ], [ 1, %L133.preheader ]
  %value_phi6 = phi double [ %value_phi8, %L547 ], [ %26, %L133.preheader ]
  %value_phi7 = phi double [ %value_phi6, %L547 ], [ 0.000000e+00, %L133.preheader ]
  %.not44 = icmp slt i32 %value_phi4, %.fca.2.2.extract42
  br i1 %.not44, label %L141, label %L196

L141:                                             ; preds = %L133
  %45 = mul i32 %value_phi4, %30
  %reass.add = add i32 %18, %45
  %reass.mul = mul i32 %reass.add, %21
  %46 = add i32 %12, %reass.mul
  %47 = getelementptr inbounds double, double addrspace(1)* %24, i32 %46
  %48 = load double, double addrspace(1)* %47, align 8
  br label %L196

L196:                                             ; preds = %L141, %L133
  %value_phi8 = phi double [ %48, %L141 ], [ 0.000000e+00, %L133 ]
  br i1 %or.cond, label %L547, label %L202

L202:                                             ; preds = %L196
  %49 = icmp ult i32 %value_phi4, 2
  %or.cond63 = select i1 %.not48, i1 true, i1 %49
  %.not44.not = xor i1 %.not44, true
  %brmerge = select i1 %or.cond63, i1 true, i1 %.not44.not
  br i1 %brmerge, label %L547, label %L212

L212:                                             ; preds = %L202
  %50 = add nsw i32 %value_phi4, -1
  %51 = mul i32 %50, %36
  %reass.add64 = add i32 %18, %51
  %reass.mul65 = mul i32 %reass.add64, %34
  %52 = add i32 %12, %reass.mul65
  %53 = getelementptr inbounds double, double addrspace(1)* %37, i32 %52
  %54 = load double, double addrspace(1)* %53, align 8
  %55 = mul i32 %50, %30
  %reass.add66 = add i32 %18, %55
  %reass.mul67 = mul i32 %reass.add66, %21
  %56 = add i32 %reass.mul67, %13
  %57 = getelementptr inbounds double, double addrspace(1)* %24, i32 %56
  %58 = load double, double addrspace(1)* %57, align 8
  %59 = fsub double %58, %value_phi6
  %60 = fmul double %59, %38
  %61 = fmul double %60, %5
  %62 = add i32 %39, %reass.mul67
  %63 = getelementptr inbounds double, double addrspace(1)* %24, i32 %62
  %64 = load double, double addrspace(1)* %63, align 8
  %65 = fsub double %value_phi6, %64
  %66 = fmul double %65, %38
  %67 = fmul double %66, %5
  %68 = fsub double %61, %67
  %69 = fneg double %68
  %70 = fmul double %69, %5
  %reass.add70 = add i32 %55, %19
  %reass.mul71 = mul i32 %reass.add70, %21
  %71 = add i32 %12, %reass.mul71
  %72 = getelementptr inbounds double, double addrspace(1)* %24, i32 %71
  %73 = load double, double addrspace(1)* %72, align 8
  %74 = fsub double %73, %value_phi6
  %75 = fmul double %74, %38
  %76 = fmul double %75, %6
  %reass.add72 = add i32 %40, %55
  %reass.mul73 = mul i32 %reass.add72, %21
  %77 = add i32 %12, %reass.mul73
  %78 = getelementptr inbounds double, double addrspace(1)* %24, i32 %77
  %79 = load double, double addrspace(1)* %78, align 8
  %80 = fsub double %value_phi6, %79
  %81 = fmul double %80, %38
  %82 = fmul double %81, %6
  %83 = fsub double %76, %82
  %84 = fmul double %83, %6
  %85 = fsub double %70, %84
  %86 = fsub double %value_phi8, %value_phi6
  %87 = fmul double %86, %38
  %88 = fmul double %87, %7
  %89 = fsub double %value_phi6, %value_phi7
  %90 = fmul double %89, %38
  %91 = fmul double %90, %7
  %92 = fsub double %88, %91
  %93 = fmul double %92, %7
  %94 = fsub double %85, %93
  %95 = fmul double %54, %94
  %96 = fmul double %95, %4
  %97 = fadd double %value_phi6, %96
  %98 = mul i32 %44, %50
  %reass.add74 = add i32 %18, %98
  %reass.mul75 = mul i32 %reass.add74, %42
  %99 = add i32 %12, %reass.mul75
  %100 = getelementptr inbounds double, double addrspace(1)* %28, i32 %99
  store double %97, double addrspace(1)* %100, align 8
  br label %L547

L547:                                             ; preds = %L212, %L202, %L196
  %.not62.not = icmp eq i32 %value_phi4, %value_phi
  %101 = add nuw i32 %value_phi4, 1
  br i1 %.not62.not, label %L560, label %L133

L560:                                             ; preds = %L547, %L133.preheader, %conversion
  ret void
}

... and here is the output from running it with CUDA.jl v4.2.0:

omlins@nid02027:~/tmpwdir/cuda_perf> julia -O3 --check-bounds=no diffusion3D_cuda_3regqueue_novis_int32.jl
time_s=0.7261550426483154 t_it=0.008068389362759061 T_eff=399.240211047335
Benchmarktools (min): t_it=0.007945697 T_eff=405.4050226179025
kernel = #= /users/omlins/tmpwdir/cuda_perf/diffusion3D_cuda_3regqueue_novis_int32.jl:75 =# @cuda(launch = false, diffusion3D_step!(T2, T, Ci, lam, dt, _dx, _dy, _dz)) = CUDA.HostKernel{typeof(diffusion3D_step!), Tuple{CuDeviceArray{Float64, 3, 1}, CuDeviceArray{Float64, 3, 1}, CuDeviceArray{Float64, 3, 1}, Float64, Float64, Float64, Float64, Float64}}(diffusion3D_step!, CuFunction(Ptr{CUDA.CUfunc_st} @0x00000000075f7cd0, CuModule(Ptr{CUDA.CUmod_st} @0x0000000005c6ed10, CuContext(0x000000000135f910, instance ea771e7429a2560b))), CUDA.KernelState(Ptr{Nothing} @0x00001553a4800000))
CUDA.registers(kernel) = 48
CUDA.memory(kernel) = (local = 0, shared = 0, constant = 0)
; PTX CompilerJob of MethodInstance for diffusion3D_step!(::CuDeviceArray{Float64, 3, 1}, ::CuDeviceArray{Float64, 3, 1}, ::CuDeviceArray{Float64, 3, 1}, ::Float64, ::Float64, ::Float64, ::Float64, ::Float64) for sm_60
define ptx_kernel void @_Z17diffusion3D_step_13CuDeviceArrayI7Float64Li3ELi1EES_IS0_Li3ELi1EES_IS0_Li3ELi1EES0_S0_S0_S0_S0_([1 x i64] %state, { i8 addrspace(1)*, i64, [3 x i64], i64 } %0, { i8 addrspace(1)*, i64, [3 x i64], i64 } %1, { i8 addrspace(1)*, i64, [3 x i64], i64 } %2, double %3, double %4, double %5, double %6, double %7) local_unnamed_addr #1 {
conversion:
  %.fca.0.extract38 = extractvalue { i8 addrspace(1)*, i64, [3 x i64], i64 } %0, 0
  %.fca.2.0.extract40 = extractvalue { i8 addrspace(1)*, i64, [3 x i64], i64 } %0, 2, 0
  %.fca.2.1.extract41 = extractvalue { i8 addrspace(1)*, i64, [3 x i64], i64 } %0, 2, 1
  %.fca.2.2.extract42 = extractvalue { i8 addrspace(1)*, i64, [3 x i64], i64 } %0, 2, 2
  %.fca.0.extract4 = extractvalue { i8 addrspace(1)*, i64, [3 x i64], i64 } %1, 0
  %.fca.2.0.extract6 = extractvalue { i8 addrspace(1)*, i64, [3 x i64], i64 } %1, 2, 0
  %.fca.2.1.extract7 = extractvalue { i8 addrspace(1)*, i64, [3 x i64], i64 } %1, 2, 1
  %.fca.2.0.extract = extractvalue { i8 addrspace(1)*, i64, [3 x i64], i64 } %2, 2, 0
  %.fca.2.1.extract = extractvalue { i8 addrspace(1)*, i64, [3 x i64], i64 } %2, 2, 1
  %8 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
  %9 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
  %10 = mul i32 %9, %8
  %11 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %12 = add i32 %10, %11
  %13 = add i32 %12, 1
  %14 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.y()
  %15 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.y()
  %16 = mul nuw nsw i32 %15, %14
  %17 = call i32 @llvm.nvvm.read.ptx.sreg.tid.y()
  %18 = add nuw nsw i32 %16, %17
  %19 = add nuw nsw i32 %18, 1
  %20 = icmp sgt i64 %.fca.2.0.extract6, 0
  %21 = select i1 %20, i64 %.fca.2.0.extract6, i64 0
  %22 = sext i32 %13 to i64
  %23 = zext i32 %19 to i64
  %24 = add nsw i64 %23, -1
  %25 = add nsw i64 %22, -1
  %26 = bitcast i8 addrspace(1)* %.fca.0.extract4 to double addrspace(1)*
  %.inv = icmp sgt i64 %.fca.2.2.extract42, 0
  %value_phi = select i1 %.inv, i64 %.fca.2.2.extract42, i64 0
  %27 = icmp slt i64 %value_phi, 1
  %28 = bitcast i8 addrspace(1)* %.fca.0.extract38 to double addrspace(1)*
  br i1 %27, label %L532, label %L131.preheader

L131.preheader:                                   ; preds = %conversion
  %29 = mul i64 %24, %21
  %30 = add i64 %25, %29
  %31 = getelementptr inbounds double, double addrspace(1)* %26, i64 %30
  %32 = load double, double addrspace(1)* %31, align 8
  %.fca.0.extract = extractvalue { i8 addrspace(1)*, i64, [3 x i64], i64 } %2, 0
  %33 = icmp sgt i64 %.fca.2.1.extract7, 0
  %34 = select i1 %33, i64 %.fca.2.1.extract7, i64 0
  %35 = icmp slt i32 %13, 2
  %36 = zext i32 %13 to i64
  %.not47 = icmp sle i64 %.fca.2.0.extract40, %36
  %37 = icmp eq i32 %18, 0
  %.not48 = icmp sle i64 %.fca.2.1.extract41, %23
  %38 = icmp sgt i64 %.fca.2.0.extract, 0
  %39 = select i1 %38, i64 %.fca.2.0.extract, i64 0
  %40 = icmp sgt i64 %.fca.2.1.extract, 0
  %41 = select i1 %40, i64 %.fca.2.1.extract, i64 0
  %42 = add nsw i64 %36, -1
  %43 = bitcast i8 addrspace(1)* %.fca.0.extract to double addrspace(1)*
  %44 = fneg double %3
  %45 = add i32 %12, 2
  %46 = sext i32 %45 to i64
  %47 = add nsw i64 %46, -1
  %48 = sext i32 %12 to i64
  %49 = add nsw i64 %48, -1
  %50 = zext i32 %18 to i64
  %51 = add nsw i64 %50, -1
  %52 = icmp sgt i64 %.fca.2.0.extract40, 0
  %53 = select i1 %52, i64 %.fca.2.0.extract40, i64 0
  %54 = icmp sgt i64 %.fca.2.1.extract41, 0
  %55 = select i1 %54, i64 %.fca.2.1.extract41, i64 0
  %56 = select i1 %35, i1 true, i1 %.not47
  %brmerge = select i1 %56, i1 true, i1 %37
  br label %L131

L131:                                             ; preds = %L519, %L131.preheader
  %value_phi4 = phi i64 [ %113, %L519 ], [ 1, %L131.preheader ]
  %value_phi6 = phi double [ %value_phi8, %L519 ], [ %32, %L131.preheader ]
  %value_phi7 = phi double [ %value_phi6, %L519 ], [ 0.000000e+00, %L131.preheader ]
  %.not44 = icmp slt i64 %value_phi4, %.fca.2.2.extract42
  br i1 %.not44, label %L139, label %L187

L139:                                             ; preds = %L131
  %57 = mul i64 %value_phi4, %34
  %reass.add = add i64 %24, %57
  %reass.mul = mul i64 %reass.add, %21
  %58 = add i64 %25, %reass.mul
  %59 = getelementptr inbounds double, double addrspace(1)* %26, i64 %58
  %60 = load double, double addrspace(1)* %59, align 8
  br label %L187

L187:                                             ; preds = %L139, %L131
  %value_phi8 = phi double [ %60, %L139 ], [ 0.000000e+00, %L131 ]
  br i1 %brmerge, label %L519, label %L197

L197:                                             ; preds = %L187
  %61 = icmp ult i64 %value_phi4, 2
  %or.cond63 = select i1 %.not48, i1 true, i1 %61
  %.not44.not = xor i1 %.not44, true
  %brmerge76 = select i1 %or.cond63, i1 true, i1 %.not44.not
  br i1 %brmerge76, label %L519, label %L208

L208:                                             ; preds = %L197
  %62 = add nsw i64 %value_phi4, -1
  %63 = mul i64 %62, %41
  %reass.add64 = add i64 %24, %63
  %reass.mul65 = mul i64 %reass.add64, %39
  %64 = add i64 %42, %reass.mul65
  %65 = getelementptr inbounds double, double addrspace(1)* %43, i64 %64
  %66 = load double, double addrspace(1)* %65, align 8
  %67 = mul i64 %62, %34
  %reass.add66 = add i64 %24, %67
  %reass.mul67 = mul i64 %reass.add66, %21
  %68 = add i64 %47, %reass.mul67
  %69 = getelementptr inbounds double, double addrspace(1)* %26, i64 %68
  %70 = load double, double addrspace(1)* %69, align 8
  %71 = fsub double %70, %value_phi6
  %72 = fmul double %71, %44
  %73 = fmul double %72, %5
  %74 = add i64 %49, %reass.mul67
  %75 = getelementptr inbounds double, double addrspace(1)* %26, i64 %74
  %76 = load double, double addrspace(1)* %75, align 8
  %77 = fsub double %value_phi6, %76
  %78 = fmul double %77, %44
  %79 = fmul double %78, %5
  %80 = fsub double %73, %79
  %81 = fneg double %80
  %82 = fmul double %81, %5
  %reass.add70 = add i64 %67, %23
  %reass.mul71 = mul i64 %reass.add70, %21
  %83 = add i64 %42, %reass.mul71
  %84 = getelementptr inbounds double, double addrspace(1)* %26, i64 %83
  %85 = load double, double addrspace(1)* %84, align 8
  %86 = fsub double %85, %value_phi6
  %87 = fmul double %86, %44
  %88 = fmul double %87, %6
  %reass.add72 = add i64 %51, %67
  %reass.mul73 = mul i64 %reass.add72, %21
  %89 = add i64 %42, %reass.mul73
  %90 = getelementptr inbounds double, double addrspace(1)* %26, i64 %89
  %91 = load double, double addrspace(1)* %90, align 8
  %92 = fsub double %value_phi6, %91
  %93 = fmul double %92, %44
  %94 = fmul double %93, %6
  %95 = fsub double %88, %94
  %96 = fmul double %95, %6
  %97 = fsub double %82, %96
  %98 = fsub double %value_phi8, %value_phi6
  %99 = fmul double %98, %44
  %100 = fmul double %99, %7
  %101 = fsub double %value_phi6, %value_phi7
  %102 = fmul double %101, %44
  %103 = fmul double %102, %7
  %104 = fsub double %100, %103
  %105 = fmul double %104, %7
  %106 = fsub double %97, %105
  %107 = fmul double %66, %106
  %108 = fmul double %107, %4
  %109 = fadd double %value_phi6, %108
  %110 = mul i64 %55, %62
  %reass.add74 = add i64 %24, %110
  %reass.mul75 = mul i64 %reass.add74, %53
  %111 = add i64 %42, %reass.mul75
  %112 = getelementptr inbounds double, double addrspace(1)* %28, i64 %111
  store double %109, double addrspace(1)* %112, align 8
  br label %L519

L519:                                             ; preds = %L208, %L197, %L187
  %.not62.not = icmp eq i64 %value_phi4, %value_phi
  %113 = add nuw i64 %value_phi4, 1
  br i1 %.not62.not, label %L532, label %L131

L532:                                             ; preds = %L519, %conversion
  ret void
}
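
To put the two runs side by side, the reported throughput figures can be recomputed from the BenchmarkTools timings in the logs above. This is only a sketch of the arithmetic: the variable names `t_it_pr`, `t_it_v420` and `slowdown` are illustrative and do not appear in the script; the timings and register counts are copied from the two outputs.

```julia
# Recompute the effective memory throughput for both runs from the logged timings.
nx, ny, nz = 512, 512, 512

# Lower bound on memory traffic per iteration [GB]:
# T is read, T2 is written (2 whole-array accesses) and Ci is read (1 access).
A_eff = (2*1 + 1) * 1/1e9 * nx*ny*nz * sizeof(Float64)

t_it_pr   = 0.010862522   # min time per iteration with this PR [s]       (32 registers)
t_it_v420 = 0.007945697   # min time per iteration with CUDA.jl v4.2.0 [s] (48 registers)

T_eff_pr   = A_eff / t_it_pr     # effective throughput with this PR [GB/s]
T_eff_v420 = A_eff / t_it_v420   # effective throughput with v4.2.0 [GB/s]
slowdown   = t_it_pr / t_it_v420 # > 1 means the PR build is slower

println("T_eff (PR)     = $T_eff_pr GB/s")
println("T_eff (v4.2.0) = $T_eff_v420 GB/s")
println("slowdown       = $slowdown")
```

With these numbers the Int32 build takes roughly 1.37 times as long per iteration, even though it uses 32 registers instead of 48.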

@maleadt
Member Author

maleadt commented May 10, 2023

That's surprising. Nothing in the generated code points to an obvious performance issue; if anything, the opposite: the index arithmetic now stays in i32, and register usage drops from 48 to 32. Try running with Nsight Compute to compare kernel execution times. Maybe reuse of i32 registers complicates ILP, and mixing both integer widths inadvertently creates more opportunity for parallelism? In any case, profiling the kernels seems necessary here.
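
One quick way to check how the two IR listings differ in integer width is to count the `sext`/`zext` instructions that widen `i32` values to `i64`. The helper below is a minimal sketch, not part of CUDA.jl: `count_widenings` is a hypothetical name, and it only does a textual scan of IR that has been saved as a string.

```julia
# Count lines of LLVM IR text that widen an i32 value to i64 via sext or zext.
# `count_widenings` is an illustrative helper, not a CUDA.jl or GPUCompiler API.
function count_widenings(ir::AbstractString)
    pattern = r"\b[sz]ext i32 %\S+ to i64\b"
    return count(line -> occursin(pattern, line), eachline(IOBuffer(ir)))
end

# A few lines taken from the v4.2.0 listing above.
sample = """
  %22 = sext i32 %13 to i64
  %23 = zext i32 %19 to i64
  %24 = add nsw i64 %23, -1
"""

n = count_widenings(sample)
println("widening instructions: ", n)
```

Applied to the full listings, the v4.2.0 IR contains six such widenings, while the IR generated with this PR contains none, which is consistent with the indices staying 32-bit throughout the kernel.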

@maleadt maleadt force-pushed the tb/32bit_device_array branch from 7e37a56 to 4b07020 Compare January 9, 2024 16:00
@maleadt maleadt force-pushed the master branch 15 times, most recently from 5d585c4 to c850163 Compare December 20, 2024 08:18
Labels
enhancement New feature or request help wanted Extra attention is needed performance How fast can we go?