Inference failure when multiple structs are broadcasted via tuples #2623

Open

charleskawczynski opened this issue Jan 16, 2025 · 3 comments

Labels: upstream (Somebody else's problem.)

charleskawczynski (Contributor) commented:

I'm not sure if this is the best place for this issue, so please let me know and I can move it if it belongs somewhere else.

I'm running into an inference failure when multiple structs are broadcasted via tuples. The CPU, ordinary-array version of this looks like the following:

import Base.Broadcast: instantiate, broadcasted, materialize!
struct MyParams1{A}
  a::A
end;
struct MyParams2{B}
  b::B
end;
Base.Broadcast.broadcastable(x::MyParams1) = tuple(x);
Base.Broadcast.broadcastable(x::MyParams2) = tuple(x);

foo(f, p1, p2) = f + p1.a - p2.b;
bar(p1, p2, f) = f + p1.a - p2.b;

FT = Float64;
p1 = MyParams1{FT}(1);
p2 = MyParams2{FT}(2);

b = zeros(FT, 5,5); # Ordinary CPU array works
a = similar(b);
bc = instantiate(broadcasted(foo, b, p1, p2));
materialize!(a, bc)

Here is a reproducer that has all 4 cases I'm looking at.

AFAICT, the actual error seems to be an inference failure caused by the tuple recursion depth limit in the recursive broadcast `getindex`, which is surprising because the tuple being indexed is only `((MyParams1,), (MyParams2,))`.
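For what it's worth, a possible workaround (my sketch, not from the original post) is to wrap the scalar-like structs in `Ref` instead of a one-element tuple, which is the idiom the Julia broadcasting interface documents for treating a value as a scalar; that avoids the recursive tuple `_broadcast_getindex` path entirely. Names here (`MyParamsA`, `foo2`) are renamed stand-ins to avoid clashing with the definitions above:

```julia
import Base.Broadcast

# Same shape as the structs above; only `broadcastable` changes.
struct MyParamsA{A}
    a::A
end
struct MyParamsB{B}
    b::B
end
# `Ref` makes broadcast treat the whole struct as a scalar, so no
# tuple ever needs to be indexed recursively.
Base.Broadcast.broadcastable(x::MyParamsA) = Ref(x)
Base.Broadcast.broadcastable(x::MyParamsB) = Ref(x)

foo2(f, p1, p2) = f + p1.a - p2.b

p1 = MyParamsA{Float64}(1)
p2 = MyParamsB{Float64}(2)
b = zeros(Float64, 5, 5)
a = similar(b)
a .= foo2.(b, p1, p2)   # each element: 0.0 + 1.0 - 2.0 == -1.0
```

Whether this also sidesteps the GPU failure in the full `VF` reproducer below is untested here; on plain arrays it produces the same result as the tuple version.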

In summary, this is what is working / not working:

|                | CPU | GPU |
|----------------|-----|-----|
| Ordinary array | ✅ | ✅ |
| My struct (VF) | ✅ | ❌ |
`CUDA.@device_code_warntype` does seem to detect the issue:
julia> CUDA.@device_code_warntype materialize!(a, bc)
PTX CompilerJob of MethodInstance for knl_copyto!(::VF{Float64, 4, CUDA.CuDeviceMatrix{Float64, 1}}, ::Broadcasted{VFStyle{4, CUDA.CuArray{Float64, N, CUDA.DeviceMemory} where N}, NTuple{5, Base.OneTo{Int64}}, typeof(foo), Tuple{VF{Float64, 4, CUDA.CuDeviceMatrix{Float64, 1}}, Tuple{MyParams1{Float64}}, Tuple{MyParams2{Float64}}}}) for sm_80

MethodInstance for knl_copyto!(::VF{Float64, 4, CUDA.CuDeviceMatrix{Float64, 1}}, ::Broadcasted{VFStyle{4, CUDA.CuArray{Float64, N, CUDA.DeviceMemory} where N}, NTuple{5, Base.OneTo{Int64}}, typeof(foo), Tuple{VF{Float64, 4, CUDA.CuDeviceMatrix{Float64, 1}}, Tuple{MyParams1{Float64}}, Tuple{MyParams2{Float64}}}})
  from knl_copyto!(dest::VF{S, Nv}, src) where {S, Nv} @ Main ~/CliMA/ClimaCore.jl/broadcast_inference_repro.jl:113
Static Parameters
  S = Float64
  Nv = 4
Arguments
  #self#::Core.Const(knl_copyto!)
  dest::VF{Float64, 4, CUDA.CuDeviceMatrix{Float64, 1}}
  src::Broadcasted{VFStyle{4, CUDA.CuArray{Float64, N, CUDA.DeviceMemory} where N}, NTuple{5, Base.OneTo{Int64}}, typeof(foo), Tuple{VF{Float64, 4, CUDA.CuDeviceMatrix{Float64, 1}}, Tuple{MyParams1{Float64}}, Tuple{MyParams2{Float64}}}}
Locals
  val::Float64
  @_5::Union{}
  @_6::Union{}
  I::CartesianIndex{5}
  v::Int64
  bv::Int32
  tv::Int32
  @_11::Bool
Body::Nothing
1 ─       Core.NewvarNode(:(val))
│         Core.NewvarNode(:(@_5))
│         Core.NewvarNode(:(@_6))
│   %4  = CUDA.threadIdx::Core.Const(CUDA.threadIdx)
│   %5  = (%4)()::@NamedTuple{x::Int32, y::Int32, z::Int32}
│   %6  = Base.indexed_iterate(%5, 1)::Core.PartialStruct(Tuple{Int32, Int64}, Any[Int32, Core.Const(2)])
│         (tv = Core.getfield(%6, 1))
│   %8  = CUDA.blockIdx::Core.Const(CUDA.blockIdx)
│   %9  = (%8)()::@NamedTuple{x::Int32, y::Int32, z::Int32}
│   %10 = Base.indexed_iterate(%9, 1)::Core.PartialStruct(Tuple{Int32, Int64}, Any[Int32, Core.Const(2)])
│         (bv = Core.getfield(%10, 1))
│   %12 = tv::Int32
│   %13 = (bv - 1)::Int64
│   %14 = CUDA.blockDim::Core.Const(CUDA.blockDim)
│   %15 = (%14)()::@NamedTuple{x::Int32, y::Int32, z::Int32}
│   %16 = Base.getproperty(%15, :x)::Int32
│   %17 = (%13 * %16)::Int64
│         (v = %12 + %17)
│   %19 = Core.tuple(1, 1, 1, v, 1)::Core.PartialStruct(NTuple{5, Int64}, Any[Core.Const(1), Core.Const(1), Core.Const(1), Int64, Core.Const(1)])
│         (I = Main.CartesianIndex(%19))
│   %21 = Base.getproperty(I::Core.PartialStruct(CartesianIndex{5}, Any[Core.PartialStruct(NTuple{5, Int64}, Any[Core.Const(1), Core.Const(1), Core.Const(1), Int64, Core.Const(1)])]), :I)::Core.PartialStruct(NTuple{5, Int64}, Any[Core.Const(1), Core.Const(1), Core.Const(1), Int64, Core.Const(1)])
│   %22 = Base.getindex(%21, 4)::Int64
│   %23 = (1 ≤ %22)::Bool
└──       goto #3 if not %23
2 ─       (@_11 = %22 ≤ $(Expr(:static_parameter, 2)))
└──       goto #4
3 ─       (@_11 = false)
4 ┄       goto #6 if not @_11
5 ─       nothing
│   %30 = Base.getindex(src, I::Core.PartialStruct(CartesianIndex{5}, Any[Core.PartialStruct(NTuple{5, Int64}, Any[Core.Const(1), Core.Const(1), Core.Const(1), Int64, Core.Const(1)])]))::Float64
│         Base.setindex!(dest, %30, I::Core.PartialStruct(CartesianIndex{5}, Any[Core.PartialStruct(NTuple{5, Int64}, Any[Core.Const(1), Core.Const(1), Core.Const(1), Int64, Core.Const(1)])]))
│         (val = %30)
│         nothing
└──       val
6 ┄       return Main.nothing

ERROR: InvalidIRError: compiling MethodInstance for knl_copyto!(::VF{Float64, 4, CUDA.CuDeviceMatrix{…}}, ::Broadcasted{VFStyle{…}, NTuple{…}, typeof(foo), Tuple{…}}) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to getindex(t::Tuple, i::Int64) @ Base tuple.jl:31)
Stacktrace:
 [1] _getindex (repeats 2 times)
   @ ./broadcast.jl:705
 [2] _broadcast_getindex
   @ ./broadcast.jl:681
 [3] getindex
   @ ./broadcast.jl:636
 [4] knl_copyto!
   @ ~/CliMA/ClimaCore.jl/broadcast_inference_repro.jl:119
Reason: unsupported call to an unknown function (call to jl_f_getfield)
Stacktrace:
 [1] getproperty
   @ ./Base.jl:37
 [2] foo
   @ ~/CliMA/ClimaCore.jl/broadcast_inference_repro.jl:149
 [3] _broadcast_getindex_evalf
   @ ./broadcast.jl:709
 [4] _broadcast_getindex
   @ ./broadcast.jl:682
 [5] getindex
   @ ./broadcast.jl:636
 [6] knl_copyto!
   @ ~/CliMA/ClimaCore.jl/broadcast_inference_repro.jl:119
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code with Cthulhu.jl
Stacktrace:
  [1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, args::LLVM.Module)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/GnbhK/src/validation.jl:151
  [2] macro expansion
    @ ~/.julia/packages/GPUCompiler/GnbhK/src/driver.jl:382 [inlined]
  [3] macro expansion
    @ ~/.julia/packages/TimerOutputs/6KVfH/src/TimerOutput.jl:253 [inlined]
  [4] macro expansion
    @ ~/.julia/packages/GPUCompiler/GnbhK/src/driver.jl:381 [inlined]
  [5] emit_llvm(job::GPUCompiler.CompilerJob; toplevel::Bool, libraries::Bool, optimize::Bool, cleanup::Bool, validate::Bool, only_entry::Bool)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/GnbhK/src/utils.jl:108
  [6] emit_llvm
    @ ~/.julia/packages/GPUCompiler/GnbhK/src/utils.jl:106 [inlined]
  [7] 
    @ GPUCompiler ~/.julia/packages/GPUCompiler/GnbhK/src/driver.jl:100
  [8] codegen
    @ ~/.julia/packages/GPUCompiler/GnbhK/src/driver.jl:82 [inlined]
  [9] compile(target::Symbol, job::GPUCompiler.CompilerJob; kwargs::@Kwargs{})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/GnbhK/src/driver.jl:79
 [10] compile
    @ ~/.julia/packages/GPUCompiler/GnbhK/src/driver.jl:74 [inlined]
 [11] #1145
    @ ~/.julia/packages/CUDA/L1qZp/src/compiler/compilation.jl:250 [inlined]
 [12] JuliaContext(f::CUDA.var"#1145#1148"{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}}; kwargs::@Kwargs{})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/GnbhK/src/driver.jl:34
 [13] JuliaContext(f::Function)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/GnbhK/src/driver.jl:25
 [14] compile(job::GPUCompiler.CompilerJob)
    @ CUDA ~/.julia/packages/CUDA/L1qZp/src/compiler/compilation.jl:249
 [15] actual_compilation(cache::Dict{…}, src::Core.MethodInstance, world::UInt64, cfg::GPUCompiler.CompilerConfig{…}, compiler::typeof(CUDA.compile), linker::typeof(CUDA.link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/GnbhK/src/execution.jl:237
 [16] cached_compilation(cache::Dict{…}, src::Core.MethodInstance, cfg::GPUCompiler.CompilerConfig{…}, compiler::Function, linker::Function)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/GnbhK/src/execution.jl:151
 [17] macro expansion
    @ ~/.julia/packages/CUDA/L1qZp/src/compiler/execution.jl:380 [inlined]
 [18] macro expansion
    @ ./lock.jl:267 [inlined]
 [19] cufunction(f::typeof(knl_copyto!), tt::Type{Tuple{VF{…}, Broadcasted{…}}}; kwargs::@Kwargs{always_inline::Bool})
    @ CUDA ~/.julia/packages/CUDA/L1qZp/src/compiler/execution.jl:375
 [20] cufunction
    @ ~/.julia/packages/CUDA/L1qZp/src/compiler/execution.jl:372 [inlined]
 [21] macro expansion
    @ ~/.julia/packages/CUDA/L1qZp/src/compiler/execution.jl:112 [inlined]
 [22] copyto!(dest::VF{…}, bc::Broadcasted{…}, to::CUDA.CuArray{…})
    @ Main ~/CliMA/ClimaCore.jl/broadcast_inference_repro.jl:125
 [23] copyto!
    @ ~/CliMA/ClimaCore.jl/broadcast_inference_repro.jl:88 [inlined]
 [24] materialize!
    @ ./broadcast.jl:914 [inlined]
 [25] materialize!(dest::VF{Float64, 4, CUDA.CuArray{…}}, bc::Broadcasted{VFStyle{…}, NTuple{…}, typeof(foo), Tuple{…}})
    @ Base.Broadcast ./broadcast.jl:911
 [26] macro expansion
    @ ~/.julia/packages/GPUCompiler/GnbhK/src/reflection.jl:185 [inlined]
 [27] top-level scope
    @ REPL[30]:1
Some type information was truncated. Use `show(err)` to see complete types.

Version info:
julia> CUDA.versioninfo()
CUDA runtime 12.6, artifact installation
CUDA driver 12.5
NVIDIA driver 555.42.2

CUDA libraries: 
- CUBLAS: 12.5.2
- CURAND: 10.3.7
- CUFFT: 11.3.0
- CUSOLVER: 11.7.1
- CUSPARSE: 12.5.4
- CUPTI: 2024.3.2 (API 24.0.0)
- NVML: 12.0.0+555.42.2

Julia packages: 
- CUDA: 5.6.0
- CUDA_Driver_jll: 0.10.4+0
- CUDA_Runtime_jll: 0.15.5+0

Toolchain:
- Julia: 1.10.7
- LLVM: 15.0.7

Manifest.

maleadt (Member) commented Jan 17, 2025:

> The CPU, ordinary array version of this looks like the following

What is the failing GPU version of that simple reproducer? Switching the inputs to GPU arrays works here:

julia> gb = cu(b)
5×5 CuArray{Float32, 2, CUDA.DeviceMemory}:
 0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0

julia> ga = cu(a)
5×5 CuArray{Float32, 2, CUDA.DeviceMemory}:
 -1.0  -1.0  -1.0  -1.0  -1.0
 -1.0  -1.0  -1.0  -1.0  -1.0
 -1.0  -1.0  -1.0  -1.0  -1.0
 -1.0  -1.0  -1.0  -1.0  -1.0
 -1.0  -1.0  -1.0  -1.0  -1.0

julia> gbc = instantiate(broadcasted(foo, gb, p1, p2));

julia> materialize!(ga, gbc)
5×5 CuArray{Float32, 2, CUDA.DeviceMemory}:
 -1.0  -1.0  -1.0  -1.0  -1.0
 -1.0  -1.0  -1.0  -1.0  -1.0
 -1.0  -1.0  -1.0  -1.0  -1.0
 -1.0  -1.0  -1.0  -1.0  -1.0
 -1.0  -1.0  -1.0  -1.0  -1.0

In any case, the inference failure can manifest in the CPU case as well, it just executes with dynamic calls.
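As a rough illustration of that point (my sketch, not part of the thread): on the CPU an inference failure does not abort compilation, so one way to probe for it is `Test.@inferred` on the scalar `getindex` that the kernel performs. `P1`/`P2` below are stand-ins mirroring the structs from the original post:

```julia
import Base.Broadcast: instantiate, broadcasted
using Test

struct P1{A}; a::A; end   # stand-ins for MyParams1/MyParams2 above
struct P2{B}; b::B; end
Base.Broadcast.broadcastable(x::P1) = tuple(x)
Base.Broadcast.broadcastable(x::P2) = tuple(x)
foo(f, p1, p2) = f + p1.a - p2.b

bc = instantiate(broadcasted(foo, zeros(5, 5), P1{Float64}(1), P2{Float64}(2)))

# `@inferred` throws an ErrorException if the element computation's
# return type is not concretely inferred; with plain arrays it passes,
# matching the JET result reported below.
x = Test.@inferred bc[CartesianIndex(1, 1)]
```

If inference did fail here, each element access would go through dynamic dispatch on the CPU rather than erroring, which is exactly the silent slowdown being described.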

@maleadt maleadt added upstream Somebody else's problem. and removed bug Something isn't working labels Jan 17, 2025
charleskawczynski (Contributor, Author) commented:

I'll copy it from the gist here:

#=
using Revise; include("cuda_broadcast_inference_reproducer.jl")
julia --project=.buildkite
julia --project=.buildkite cuda_broadcast_inference_reproducer.jl
julia +1.11 --project=.buildkite cuda_broadcast_inference_reproducer.jl
=#

@show VERSION
@static if !(VERSION ≥ v"1.11.0-beta")
    using JET;
end
import CUDA # comment to run without CUDA
using Test
import Adapt
import Base
import Base.Broadcast: BroadcastStyle,
	Broadcasted, instantiate, broadcasted, materialize, materialize!

struct VF{S <: AbstractFloat, Nv, A}
    array::A
end
struct VFStyle{Nv, A} <: Base.BroadcastStyle end

function VF{S, Nv}(array::AbstractArray{T, 2}) where {S, Nv, T}
    @assert size(array, 1) == Nv
    @assert size(array, 2) == typesize(T, S)
    VF{S, Nv, typeof(array)}(array)
end

function VF{S}(
    ::Type{ArrayType};
    Nv::Integer,
) where {S, ArrayType}
    Nf = typesize(eltype(ArrayType), S)
    array = similar(ArrayType, Nv, Nf)
    fill!(array, 0)
    VF{S, Nv}(array)
end

typesize(::Type{T}, ::Type{S}) where {T, S} = div(sizeof(S), sizeof(T))
parent_array_type(::Type{<:Array{T}}) where {T} = Array{T}
Base.eltype(::Type{<:VF{S}}) where {S} = S
Base.parent(data::VF) = getfield(data, :array)
Base.similar(data::VF{S}) where {S} = similar(data, S)
@inline Base.size(data::VF, i::Integer) = size(data)[i]
@inline Base.size(data::VF{S, Nv}) where {S, Nv} = (1, 1, 1, Nv, 1)
Base.length(data::VF{S, Nv}) where {S, Nv} = Nv
Base.lastindex(data::VF) = length(data)
Base.copy(data::VF{S, NV}) where {S, NV} = VF{S, NV}(copy(parent(data)))
Base.Broadcast.BroadcastStyle(::Type{VF{S, Nv, A}}) where {S, Nv, A} = VFStyle{Nv, parent_array_type(A)}()
Base.Broadcast.BroadcastStyle(::Base.Broadcast.Style{<:Tuple}, ds::VFStyle) = ds
Base.Broadcast.broadcastable(data::VF) = data
Adapt.adapt_structure(to, data::VF{S, NV}) where {S, NV} = VF{S, NV}(Adapt.adapt(to, parent(data)))
@inline parent_array_type(::Type{VF{S, Nv, A}}) where {S, Nv, A} = A
Base.ndims(data::VF) = Base.ndims(typeof(data))
Base.ndims(::Type{T}) where {T <: VF} = Base.ndims(parent_array_type(T))

function Base.similar(
    bc::Union{Base.Broadcast.Broadcasted{VFStyle{Nv, A}}, VF{S, Nv, A}},
    ::Type{S},
) where {Nv, A, S}
    PA = parent_array_type(A)
    array = similar(PA, (Nv, typesize(eltype(A), S)))
    return VF{S, Nv}(array)
end

@inline function Base.getindex(
    data::VF{S, Nv},
    I::CartesianIndex,
) where {S, Nv}
    @boundscheck 1 <= I.I[4] <= Nv || throw(BoundsError(data, I))
    return parent(data)[I.I[4], 1]
end

@inline function Base.setindex!(
    data::VF{S, Nv},
    val,
    I::CartesianIndex,
) where {S, Nv}
    @boundscheck 1 <= I.I[4] <= Nv || throw(BoundsError(data, I))
    parent(data)[I.I[4], 1] = val
end

function Base.copyto!(
    dest::VF{S},
    bc::Union{VF, Base.Broadcast.Broadcasted},
) where {S}
    Base.copyto!(dest, bc, parent(dest))
    dest
end

function Base.copyto!(
    dest::VF{S, Nv},
    bc::Union{Base.Broadcast.Broadcasted{VFStyle{Nv, A}}, VF{S, Nv, A}},
    ::Array,
) where {S, Nv, A}
    @inbounds for v in 1:Nv
        idx = CartesianIndex(1, 1, 1, v, 1)
        dest[idx] = convert(S, bc[idx])
    end
    return dest
end

# Extension
@static if @isdefined(CUDA)
    
    parent_array_type(::Type{<:CUDA.CuArray{T, N, B} where {N}}) where {T, B} = CUDA.CuArray{T, N, B} where {N}
    Base.similar(
        ::Type{CUDA.CuArray{T, N′, B} where {N′}},
        dims::Dims{N},
    ) where {T, N, B} = similar(CUDA.CuArray{T, N, B}, dims)

    function knl_copyto!(dest::VF{S, Nv}, src) where {S, Nv}
        (tv,) = CUDA.threadIdx()
        (bv,) = CUDA.blockIdx()
        v = tv + (bv - 1) * CUDA.blockDim().x
        I = CartesianIndex((1, 1, 1, v, 1))
        if 1 ≤ I.I[4] ≤ Nv
            @inbounds dest[I] = src[I]
        end
        return nothing
    end

    function Base.copyto!(dest::VF{S, Nv}, bc, to::CUDA.CuArray) where {S, Nv}
        kernel = CUDA.@cuda always_inline = true launch = false knl_copyto!(dest, bc)
        config = CUDA.launch_configuration(kernel.fun)

        n_max_threads = min(config.threads, Nv)
        Nvt = fld(n_max_threads, Nv)
        Nv_thread = Nvt == 0 ? n_max_threads : min(Int(Nvt), Nv)
        Nv_blocks = cld(Nv, Nv_thread)
        @assert Nv_thread ≤ n_max_threads "threads,n_max_threads=($(Nv_thread),$n_max_threads)"
        p = (; threads = (Nv_thread,), blocks = (Nv_blocks,))

        kernel(dest, bc; threads = p.threads, blocks = p.blocks)
        return dest
    end
end

struct MyParams1{A}
  a::A
end;
struct MyParams2{B}
  b::B
end;
Base.Broadcast.broadcastable(x::MyParams1) = tuple(x);
Base.Broadcast.broadcastable(x::MyParams2) = tuple(x);

foo(f, p1, p2) = f + p1.a - p2.b;
bar(p1, p2, f) = f + p1.a - p2.b;

FT = Float64;
p1 = MyParams1{FT}(1);
p2 = MyParams2{FT}(2);

@testset "Broken test" begin
    b = zeros(FT, 5,5); # Ordinary CPU array works
    a = similar(b);
    bc = instantiate(broadcasted(foo, b, p1, p2));
    materialize!(a, bc)
    @static if !(VERSION ≥ v"1.11.0-beta")
        @test_opt materialize!(a, bc) # also passes inference
    end

    b = VF{FT}(Array{FT}; Nv=4); # VF with CPU array works
    a = similar(b);
    bc = instantiate(broadcasted(foo, b, p1, p2));
    materialize!(a, bc)
    # @code_warntype materialize!(a, bc) # looks fine
    @static if !(VERSION ≥ v"1.11.0-beta")
        @test_opt materialize!(a, bc) # also passes inference
    end

    @static if @isdefined(CUDA)
        b = CUDA.zeros(FT, 5,5); # CUDA.CuArray works
        a = similar(b);
        bc = instantiate(broadcasted(foo, b, p1, p2));
        materialize!(a, bc)

        b = VF{FT}(CUDA.CuArray{FT}; Nv=4); # VF with CUDA.CuArray fails
        a = similar(b);
        bc = instantiate(broadcasted(foo, b, p1, p2));
        @test_throws CUDA.InvalidIRError materialize!(a, bc) # fails to compile
        # CUDA.@device_code_warntype materialize!(a, bc)
    end
end

#=
# re-run the last, breaking, part:
b = VF{FT}(CUDA.CuArray{FT}; Nv=4); # VF with CUDA.CuArray fails
a = similar(b);
bc = instantiate(broadcasted(foo, b, p1, p2));
materialize!(a, bc) # fails to compile
=#
nothing

Note the `@test_throws CUDA.InvalidIRError materialize!(a, bc) # fails to compile` line near the end.

charleskawczynski (Contributor, Author) commented:

> In any case, the inference failure can manifest in the CPU case as well, it just executes with dynamic calls.

I suppose that's possible, but I don't think that's what is happening here, because it passes `JET.@test_opt`.
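For context (my illustration, not from the thread): JET's `@test_opt` / `@report_opt` analyze the whole inferred call graph and flag each runtime-dispatch site, which is why a clean report is stronger evidence than `@code_warntype` on just the outermost call. A minimal sketch, assuming JET.jl is installed; `Holder`, `add1`, and `stable` are made-up names:

```julia
using JET

struct Holder
    x               # untyped field, so `h.x` infers as `Any`
end
add1(h::Holder) = h.x + 1

# `+` on an `Any` value is genuine runtime dispatch, which JET flags
# even though it sits one call below the entry point:
JET.@report_opt add1(Holder(1))

stable(n::Int) = n + 1
JET.@report_opt stable(1)   # clean report: fully inferable
```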
