Inference failure when multiple structs are broadcasted via tuples #2623

Open

charleskawczynski opened this issue Jan 16, 2025 · 3 comments

Labels: upstream (Somebody else's problem.)

charleskawczynski (Contributor) commented:

I'm not sure if this is the best place for this issue, so please let me know and I can move it if it belongs somewhere else.

I'm running into an inference failure when multiple structs are broadcasted via tuples. The CPU, ordinary-array version of this looks like the following:

import Base.Broadcast: instantiate, broadcasted, materialize!
struct MyParams1{A}
  a::A
end;
struct MyParams2{B}
  b::B
end;
Base.Broadcast.broadcastable(x::MyParams1) = tuple(x);
Base.Broadcast.broadcastable(x::MyParams2) = tuple(x);

foo(f, p1, p2) = f + p1.a - p2.b;
bar(p1, p2, f) = f + p1.a - p2.b;

FT = Float64;
p1 = MyParams1{FT}(1);
p2 = MyParams2{FT}(2);

b = zeros(FT, 5,5); # Ordinary CPU array works
a = similar(b);
bc = instantiate(broadcasted(foo, b, p1, p2));
materialize!(a, bc)

Here is a reproducer that has all 4 cases I'm looking at.

AFAICT, the actual error seems to be an inference failure caused by the tuple recursion depth limit in the recursive broadcast `getindex`, which is surprising because the tuple being indexed is only `((MyParams1,), (MyParams2,))`.
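For what it's worth, a possible workaround (my sketch, not from the original post) is to wrap the scalar-like structs in `Ref` instead of a one-element tuple, which is the idiom the Julia broadcasting interface documents for treating a value as a scalar; that avoids the recursive tuple `_broadcast_getindex` path entirely. Names here (`MyParamsA`, `foo2`) are renamed stand-ins to avoid clashing with the definitions above:

```julia
import Base.Broadcast

# Same shape as the structs above; only `broadcastable` changes.
struct MyParamsA{A}
    a::A
end
struct MyParamsB{B}
    b::B
end
# `Ref` makes broadcast treat the whole struct as a scalar, so no
# tuple ever needs to be indexed recursively.
Base.Broadcast.broadcastable(x::MyParamsA) = Ref(x)
Base.Broadcast.broadcastable(x::MyParamsB) = Ref(x)

foo2(f, p1, p2) = f + p1.a - p2.b

p1 = MyParamsA{Float64}(1)
p2 = MyParamsB{Float64}(2)
b = zeros(Float64, 5, 5)
a = similar(b)
a .= foo2.(b, p1, p2)   # each element: 0.0 + 1.0 - 2.0 == -1.0
```

Whether this also sidesteps the GPU failure in the full `VF` reproducer below is untested here; on plain arrays it produces the same result as the tuple version.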

In summary, this is what is working / not working:

|                | CPU | GPU |
|----------------|-----|-----|
| Ordinary array | ✅ | ✅ |
| My struct (VF) | ✅ | ❌ |
`CUDA.@device_code_warntype` does seem to detect the issue:
julia> CUDA.@device_code_warntype materialize!(a, bc)
PTX CompilerJob of MethodInstance for knl_copyto!(::VF{Float64, 4, CUDA.CuDeviceMatrix{Float64, 1}}, ::Broadcasted{VFStyle{4, CUDA.CuArray{Float64, N, CUDA.DeviceMemory} where N}, NTuple{5, Base.OneTo{Int64}}, typeof(foo), Tuple{VF{Float64, 4, CUDA.CuDeviceMatrix{Float64, 1}}, Tuple{MyParams1{Float64}}, Tuple{MyParams2{Float64}}}}) for sm_80

MethodInstance for knl_copyto!(::VF{Float64, 4, CUDA.CuDeviceMatrix{Float64, 1}}, ::Broadcasted{VFStyle{4, CUDA.CuArray{Float64, N, CUDA.DeviceMemory} where N}, NTuple{5, Base.OneTo{Int64}}, typeof(foo), Tuple{VF{Float64, 4, CUDA.CuDeviceMatrix{Float64, 1}}, Tuple{MyParams1{Float64}}, Tuple{MyParams2{Float64}}}})
  from knl_copyto!(dest::VF{S, Nv}, src) where {S, Nv} @ Main ~/CliMA/ClimaCore.jl/broadcast_inference_repro.jl:113
Static Parameters
  S = Float64
  Nv = 4
Arguments
  #self#::Core.Const(knl_copyto!)
  dest::VF{Float64, 4, CUDA.CuDeviceMatrix{Float64, 1}}
  src::Broadcasted{VFStyle{4, CUDA.CuArray{Float64, N, CUDA.DeviceMemory} where N}, NTuple{5, Base.OneTo{Int64}}, typeof(foo), Tuple{VF{Float64, 4, CUDA.CuDeviceMatrix{Float64, 1}}, Tuple{MyParams1{Float64}}, Tuple{MyParams2{Float64}}}}
Locals
  val::Float64
  @_5::Union{}
  @_6::Union{}
  I::CartesianIndex{5}
  v::Int64
  bv::Int32
  tv::Int32
  @_11::Bool
Body::Nothing
1 ─       Core.NewvarNode(:(val))
│         Core.NewvarNode(:(@_5))
│         Core.NewvarNode(:(@_6))
│   %4  = CUDA.threadIdx::Core.Const(CUDA.threadIdx)
│   %5  = (%4)()::@NamedTuple{x::Int32, y::Int32, z::Int32}
│   %6  = Base.indexed_iterate(%5, 1)::Core.PartialStruct(Tuple{Int32, Int64}, Any[Int32, Core.Const(2)])
│         (tv = Core.getfield(%6, 1))
│   %8  = CUDA.blockIdx::Core.Const(CUDA.blockIdx)
│   %9  = (%8)()::@NamedTuple{x::Int32, y::Int32, z::Int32}
│   %10 = Base.indexed_iterate(%9, 1)::Core.PartialStruct(Tuple{Int32, Int64}, Any[Int32, Core.Const(2)])
│         (bv = Core.getfield(%10, 1))
│   %12 = tv::Int32
│   %13 = (bv - 1)::Int64
│   %14 = CUDA.blockDim::Core.Const(CUDA.blockDim)
│   %15 = (%14)()::@NamedTuple{x::Int32, y::Int32, z::Int32}
│   %16 = Base.getproperty(%15, :x)::Int32
│   %17 = (%13 * %16)::Int64
│         (v = %12 + %17)
│   %19 = Core.tuple(1, 1, 1, v, 1)::Core.PartialStruct(NTuple{5, Int64}, Any[Core.Const(1), Core.Const(1), Core.Const(1), Int64, Core.Const(1)])
│         (I = Main.CartesianIndex(%19))
│   %21 = Base.getproperty(I::Core.PartialStruct(CartesianIndex{5}, Any[Core.PartialStruct(NTuple{5, Int64}, Any[Core.Const(1), Core.Const(1), Core.Const(1), Int64, Core.Const(1)])]), :I)::Core.PartialStruct(NTuple{5, Int64}, Any[Core.Const(1), Core.Const(1), Core.Const(1), Int64, Core.Const(1)])
│   %22 = Base.getindex(%21, 4)::Int64
│   %23 = (1 ≤ %22)::Bool
└──       goto #3 if not %23
2 ─       (@_11 = %22 ≤ $(Expr(:static_parameter, 2)))
└──       goto #4
3 ─       (@_11 = false)
4 ┄       goto #6 if not @_11
5 ─       nothing
│   %30 = Base.getindex(src, I::Core.PartialStruct(CartesianIndex{5}, Any[Core.PartialStruct(NTuple{5, Int64}, Any[Core.Const(1), Core.Const(1), Core.Const(1), Int64, Core.Const(1)])]))::Float64
│         Base.setindex!(dest, %30, I::Core.PartialStruct(CartesianIndex{5}, Any[Core.PartialStruct(NTuple{5, Int64}, Any[Core.Const(1), Core.Const(1), Core.Const(1), Int64, Core.Const(1)])]))
│         (val = %30)
│         nothing
└──       val
6 ┄       return Main.nothing

ERROR: InvalidIRError: compiling MethodInstance for knl_copyto!(::VF{Float64, 4, CUDA.CuDeviceMatrix{…}}, ::Broadcasted{VFStyle{…}, NTuple{…}, typeof(foo), Tuple{…}}) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to getindex(t::Tuple, i::Int64) @ Base tuple.jl:31)
Stacktrace:
 [1] _getindex (repeats 2 times)
   @ ./broadcast.jl:705
 [2] _broadcast_getindex
   @ ./broadcast.jl:681
 [3] getindex
   @ ./broadcast.jl:636
 [4] knl_copyto!
   @ ~/CliMA/ClimaCore.jl/broadcast_inference_repro.jl:119
Reason: unsupported call to an unknown function (call to jl_f_getfield)
Stacktrace:
 [1] getproperty
   @ ./Base.jl:37
 [2] foo
   @ ~/CliMA/ClimaCore.jl/broadcast_inference_repro.jl:149
 [3] _broadcast_getindex_evalf
   @ ./broadcast.jl:709
 [4] _broadcast_getindex
   @ ./broadcast.jl:682
 [5] getindex
   @ ./broadcast.jl:636
 [6] knl_copyto!
   @ ~/CliMA/ClimaCore.jl/broadcast_inference_repro.jl:119
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code with Cthulhu.jl
Stacktrace:
  [1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, args::LLVM.Module)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/GnbhK/src/validation.jl:151
  [2] macro expansion
    @ ~/.julia/packages/GPUCompiler/GnbhK/src/driver.jl:382 [inlined]
  [3] macro expansion
    @ ~/.julia/packages/TimerOutputs/6KVfH/src/TimerOutput.jl:253 [inlined]
  [4] macro expansion
    @ ~/.julia/packages/GPUCompiler/GnbhK/src/driver.jl:381 [inlined]
  [5] emit_llvm(job::GPUCompiler.CompilerJob; toplevel::Bool, libraries::Bool, optimize::Bool, cleanup::Bool, validate::Bool, only_entry::Bool)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/GnbhK/src/utils.jl:108
  [6] emit_llvm
    @ ~/.julia/packages/GPUCompiler/GnbhK/src/utils.jl:106 [inlined]
  [7] 
    @ GPUCompiler ~/.julia/packages/GPUCompiler/GnbhK/src/driver.jl:100
  [8] codegen
    @ ~/.julia/packages/GPUCompiler/GnbhK/src/driver.jl:82 [inlined]
  [9] compile(target::Symbol, job::GPUCompiler.CompilerJob; kwargs::@Kwargs{})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/GnbhK/src/driver.jl:79
 [10] compile
    @ ~/.julia/packages/GPUCompiler/GnbhK/src/driver.jl:74 [inlined]
 [11] #1145
    @ ~/.julia/packages/CUDA/L1qZp/src/compiler/compilation.jl:250 [inlined]
 [12] JuliaContext(f::CUDA.var"#1145#1148"{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}}; kwargs::@Kwargs{})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/GnbhK/src/driver.jl:34
 [13] JuliaContext(f::Function)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/GnbhK/src/driver.jl:25
 [14] compile(job::GPUCompiler.CompilerJob)
    @ CUDA ~/.julia/packages/CUDA/L1qZp/src/compiler/compilation.jl:249
 [15] actual_compilation(cache::Dict{…}, src::Core.MethodInstance, world::UInt64, cfg::GPUCompiler.CompilerConfig{…}, compiler::typeof(CUDA.compile), linker::typeof(CUDA.link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/GnbhK/src/execution.jl:237
 [16] cached_compilation(cache::Dict{…}, src::Core.MethodInstance, cfg::GPUCompiler.CompilerConfig{…}, compiler::Function, linker::Function)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/GnbhK/src/execution.jl:151
 [17] macro expansion
    @ ~/.julia/packages/CUDA/L1qZp/src/compiler/execution.jl:380 [inlined]
 [18] macro expansion
    @ ./lock.jl:267 [inlined]
 [19] cufunction(f::typeof(knl_copyto!), tt::Type{Tuple{VF{…}, Broadcasted{…}}}; kwargs::@Kwargs{always_inline::Bool})
    @ CUDA ~/.julia/packages/CUDA/L1qZp/src/compiler/execution.jl:375
 [20] cufunction
    @ ~/.julia/packages/CUDA/L1qZp/src/compiler/execution.jl:372 [inlined]
 [21] macro expansion
    @ ~/.julia/packages/CUDA/L1qZp/src/compiler/execution.jl:112 [inlined]
 [22] copyto!(dest::VF{…}, bc::Broadcasted{…}, to::CUDA.CuArray{…})
    @ Main ~/CliMA/ClimaCore.jl/broadcast_inference_repro.jl:125
 [23] copyto!
    @ ~/CliMA/ClimaCore.jl/broadcast_inference_repro.jl:88 [inlined]
 [24] materialize!
    @ ./broadcast.jl:914 [inlined]
 [25] materialize!(dest::VF{Float64, 4, CUDA.CuArray{…}}, bc::Broadcasted{VFStyle{…}, NTuple{…}, typeof(foo), Tuple{…}})
    @ Base.Broadcast ./broadcast.jl:911
 [26] macro expansion
    @ ~/.julia/packages/GPUCompiler/GnbhK/src/reflection.jl:185 [inlined]
 [27] top-level scope
    @ REPL[30]:1
Some type information was truncated. Use `show(err)` to see complete types.

Version info:
julia> CUDA.versioninfo()
CUDA runtime 12.6, artifact installation
CUDA driver 12.5
NVIDIA driver 555.42.2

CUDA libraries: 
- CUBLAS: 12.5.2
- CURAND: 10.3.7
- CUFFT: 11.3.0
- CUSOLVER: 11.7.1
- CUSPARSE: 12.5.4
- CUPTI: 2024.3.2 (API 24.0.0)
- NVML: 12.0.0+555.42.2

Julia packages: 
- CUDA: 5.6.0
- CUDA_Driver_jll: 0.10.4+0
- CUDA_Runtime_jll: 0.15.5+0

Toolchain:
- Julia: 1.10.7
- LLVM: 15.0.7

Manifest.

maleadt (Member) commented Jan 17, 2025:

> The CPU, ordinary array version of this looks like the following

What is the failing GPU version of that simple reproducer? Switching the inputs to GPU arrays works here:

julia> gb = cu(b)
5×5 CuArray{Float32, 2, CUDA.DeviceMemory}:
 0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0

julia> ga = cu(a)
5×5 CuArray{Float32, 2, CUDA.DeviceMemory}:
 -1.0  -1.0  -1.0  -1.0  -1.0
 -1.0  -1.0  -1.0  -1.0  -1.0
 -1.0  -1.0  -1.0  -1.0  -1.0
 -1.0  -1.0  -1.0  -1.0  -1.0
 -1.0  -1.0  -1.0  -1.0  -1.0

julia> gbc = instantiate(broadcasted(foo, gb, p1, p2));

julia> materialize!(ga, gbc)
5×5 CuArray{Float32, 2, CUDA.DeviceMemory}:
 -1.0  -1.0  -1.0  -1.0  -1.0
 -1.0  -1.0  -1.0  -1.0  -1.0
 -1.0  -1.0  -1.0  -1.0  -1.0
 -1.0  -1.0  -1.0  -1.0  -1.0
 -1.0  -1.0  -1.0  -1.0  -1.0

In any case, the inference failure can manifest in the CPU case as well, it just executes with dynamic calls.
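As a rough illustration of that point (my sketch, not part of the thread): on the CPU an inference failure does not abort compilation, so one way to probe for it is `Test.@inferred` on the scalar `getindex` that the kernel performs. `P1`/`P2` below are stand-ins mirroring the structs from the original post:

```julia
import Base.Broadcast: instantiate, broadcasted
using Test

struct P1{A}; a::A; end   # stand-ins for MyParams1/MyParams2 above
struct P2{B}; b::B; end
Base.Broadcast.broadcastable(x::P1) = tuple(x)
Base.Broadcast.broadcastable(x::P2) = tuple(x)
foo(f, p1, p2) = f + p1.a - p2.b

bc = instantiate(broadcasted(foo, zeros(5, 5), P1{Float64}(1), P2{Float64}(2)))

# `@inferred` throws an ErrorException if the element computation's
# return type is not concretely inferred; with plain arrays it passes,
# matching the JET result reported below.
x = Test.@inferred bc[CartesianIndex(1, 1)]
```

If inference did fail here, each element access would go through dynamic dispatch on the CPU rather than erroring, which is exactly the silent slowdown being described.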

@maleadt maleadt added upstream Somebody else's problem. and removed bug Something isn't working labels Jan 17, 2025
charleskawczynski (Contributor, Author) commented:

I'll copy it from the gist here:

#=
using Revise; include("cuda_broadcast_inference_reproducer.jl")
julia --project=.buildkite
julia --project=.buildkite cuda_broadcast_inference_reproducer.jl
julia +1.11 --project=.buildkite cuda_broadcast_inference_reproducer.jl
=#

@show VERSION
@static if !(VERSION ≥ v"1.11.0-beta")
    using JET;
end
import CUDA # comment to run without CUDA
using Test
import Adapt
import Base
import Base.Broadcast: BroadcastStyle,
	Broadcasted, instantiate, broadcasted, materialize, materialize!

struct VF{S <: AbstractFloat, Nv, A}
    array::A
end
struct VFStyle{Nv, A} <: Base.BroadcastStyle end

function VF{S, Nv}(array::AbstractArray{T, 2}) where {S, Nv, T}
    @assert size(array, 1) == Nv
    @assert size(array, 2) == typesize(T, S)
    VF{S, Nv, typeof(array)}(array)
end

function VF{S}(
    ::Type{ArrayType};
    Nv::Integer,
) where {S, ArrayType}
    Nf = typesize(eltype(ArrayType), S)
    array = similar(ArrayType, Nv, Nf)
    fill!(array, 0)
    VF{S, Nv}(array)
end

typesize(::Type{T}, ::Type{S}) where {T, S} = div(sizeof(S), sizeof(T))
parent_array_type(::Type{<:Array{T}}) where {T} = Array{T}
Base.eltype(::Type{<:VF{S}}) where {S} = S
Base.parent(data::VF) = getfield(data, :array)
Base.similar(data::VF{S}) where {S} = similar(data, S)
@inline Base.size(data::VF, i::Integer) = size(data)[i]
@inline Base.size(data::VF{S, Nv}) where {S, Nv} = (1, 1, 1, Nv, 1)
Base.length(data::VF{S, Nv}) where {S, Nv} = Nv
Base.lastindex(data::VF) = length(data)
Base.copy(data::VF{S, NV}) where {S, NV} = VF{S, NV}(copy(parent(data)))
Base.Broadcast.BroadcastStyle(::Type{VF{S, Nv, A}}) where {S, Nv, A} = VFStyle{Nv, parent_array_type(A)}()
Base.Broadcast.BroadcastStyle(::Base.Broadcast.Style{<:Tuple}, ds::VFStyle) = ds
Base.Broadcast.broadcastable(data::VF) = data
Adapt.adapt_structure(to, data::VF{S, NV}) where {S, NV} = VF{S, NV}(Adapt.adapt(to, parent(data)))
@inline parent_array_type(::Type{VF{S, Nv, A}}) where {S, Nv, A} = A
Base.ndims(data::VF) = Base.ndims(typeof(data))
Base.ndims(::Type{T}) where {T <: VF} = Base.ndims(parent_array_type(T))

function Base.similar(
    bc::Union{Base.Broadcast.Broadcasted{VFStyle{Nv, A}}, VF{S, Nv, A}},
    ::Type{S},
) where {Nv, A, S}
    PA = parent_array_type(A)
    array = similar(PA, (Nv, typesize(eltype(A), S)))
    return VF{S, Nv}(array)
end

@inline function Base.getindex(
    data::VF{S, Nv},
    I::CartesianIndex,
) where {S, Nv}
    @boundscheck 1 <= I.I[4] <= Nv || throw(BoundsError(data, I))
    return parent(data)[I.I[4], 1]
end

@inline function Base.setindex!(
    data::VF{S, Nv},
    val,
    I::CartesianIndex,
) where {S, Nv}
    @boundscheck 1 <= I.I[4] <= Nv || throw(BoundsError(data, I))
    parent(data)[I.I[4], 1] = val
end

function Base.copyto!(
    dest::VF{S},
    bc::Union{VF, Base.Broadcast.Broadcasted},
) where {S}
    Base.copyto!(dest, bc, parent(dest))
    dest
end

function Base.copyto!(
    dest::VF{S, Nv},
    bc::Union{Base.Broadcast.Broadcasted{VFStyle{Nv, A}}, VF{S, Nv, A}},
    ::Array,
) where {S, Nv, A}
    @inbounds for v in 1:Nv
        idx = CartesianIndex(1, 1, 1, v, 1)
        dest[idx] = convert(S, bc[idx])
    end
    return dest
end

# Extension
@static if @isdefined(CUDA)
    
    parent_array_type(::Type{<:CUDA.CuArray{T, N, B} where {N}}) where {T, B} = CUDA.CuArray{T, N, B} where {N}
    Base.similar(
        ::Type{CUDA.CuArray{T, N′, B} where {N′}},
        dims::Dims{N},
    ) where {T, N, B} = similar(CUDA.CuArray{T, N, B}, dims)

    function knl_copyto!(dest::VF{S, Nv}, src) where {S, Nv}
        (tv,) = CUDA.threadIdx()
        (bv,) = CUDA.blockIdx()
        v = tv + (bv - 1) * CUDA.blockDim().x
        I = CartesianIndex((1, 1, 1, v, 1))
        if 1 ≤ I.I[4] ≤ Nv
            @inbounds dest[I] = src[I]
        end
        return nothing
    end

    function Base.copyto!(dest::VF{S, Nv}, bc, to::CUDA.CuArray) where {S, Nv}
        kernel = CUDA.@cuda always_inline = true launch = false knl_copyto!(dest, bc)
        config = CUDA.launch_configuration(kernel.fun)

        n_max_threads = min(config.threads, Nv)
        Nvt = fld(n_max_threads, Nv)
        Nv_thread = Nvt == 0 ? n_max_threads : min(Int(Nvt), Nv)
        Nv_blocks = cld(Nv, Nv_thread)
        @assert Nv_thread ≤ n_max_threads "threads,n_max_threads=($(Nv_thread),$n_max_threads)"
        p = (; threads = (Nv_thread,), blocks = (Nv_blocks,))

        kernel(dest, bc; threads = p.threads, blocks = p.blocks)
        return dest
    end
end

struct MyParams1{A}
  a::A
end;
struct MyParams2{B}
  b::B
end;
Base.Broadcast.broadcastable(x::MyParams1) = tuple(x);
Base.Broadcast.broadcastable(x::MyParams2) = tuple(x);

foo(f, p1, p2) = f + p1.a - p2.b;
bar(p1, p2, f) = f + p1.a - p2.b;

FT = Float64;
p1 = MyParams1{FT}(1);
p2 = MyParams2{FT}(2);

@testset "Broken test" begin
    b = zeros(FT, 5,5); # Ordinary CPU array works
    a = similar(b);
    bc = instantiate(broadcasted(foo, b, p1, p2));
    materialize!(a, bc)
    @static if !(VERSION ≥ v"1.11.0-beta")
        @test_opt materialize!(a, bc) # also passes inference
    end

    b = VF{FT}(Array{FT}; Nv=4); # VF with CPU array works
    a = similar(b);
    bc = instantiate(broadcasted(foo, b, p1, p2));
    materialize!(a, bc)
    # @code_warntype materialize!(a, bc) # looks fine
    @static if !(VERSION ≥ v"1.11.0-beta")
        @test_opt materialize!(a, bc) # also passes inference
    end

    @static if @isdefined(CUDA)
        b = CUDA.zeros(FT, 5,5); # CUDA.CuArray works
        a = similar(b);
        bc = instantiate(broadcasted(foo, b, p1, p2));
        materialize!(a, bc)

        b = VF{FT}(CUDA.CuArray{FT}; Nv=4); # VF with CUDA.CuArray fails
        a = similar(b);
        bc = instantiate(broadcasted(foo, b, p1, p2));
        @test_throws CUDA.InvalidIRError materialize!(a, bc) # fails to compile
        # CUDA.@device_code_warntype materialize!(a, bc)
    end
end

#=
# re-run the last, breaking, part:
b = VF{FT}(CUDA.CuArray{FT}; Nv=4); # VF with CUDA.CuArray fails
a = similar(b);
bc = instantiate(broadcasted(foo, b, p1, p2));
materialize!(a, bc) # fails to compile
=#
nothing

Note the `@test_throws CUDA.InvalidIRError materialize!(a, bc) # fails to compile` line near the end.

charleskawczynski (Contributor, Author) commented:

> In any case, the inference failure can manifest in the CPU case as well, it just executes with dynamic calls.

I suppose that's possible, but I don't think that's what is happening here, because it passes `JET.@test_opt`.
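For context (my illustration, not from the thread): JET's `@test_opt` / `@report_opt` analyze the whole inferred call graph and flag each runtime-dispatch site, which is why a clean report is stronger evidence than `@code_warntype` on just the outermost call. A minimal sketch, assuming JET.jl is installed; `Holder`, `add1`, and `stable` are made-up names:

```julia
using JET

struct Holder
    x               # untyped field, so `h.x` infers as `Any`
end
add1(h::Holder) = h.x + 1

# `+` on an `Any` value is genuine runtime dispatch, which JET flags
# even though it sits one call below the entry point:
JET.@report_opt add1(Holder(1))

stable(n::Int) = n + 1
JET.@report_opt stable(1)   # clean report: fully inferable
```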
