
Rework the GPUCompiler interface to avoid needless compiler specialization #227

Closed
wants to merge 8 commits

Conversation

maleadt (Member) commented Jul 26, 2021

I'm trying to make it possible to precompile most of the compiler, but it's proving to be hard. First, I removed the function from the CompilerJob, since it's otherwise too easy to invalidate the precompilation results. I think that we also have to use @invokelatest with all of the GPUCompiler interfaces, because even with type assertions and @noinline the generic implementations of the interface are referred to literally in the compiled code.

However, I'm still seeing emit_llvm getting re-compiled when using CUDA.jl, even though I don't immediately spot invalidations that would explain this (is it possible to go backwards -- start from a method that got invalidated to find the definition that invalidated it?).
Strangely, the method instance also lists the specific compiler target, even though job is passed as @nospecialize in emit_llvm...

MethodInstance for GPUCompiler.emit_llvm(::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, ::Core.MethodInstance, ::Bool, ::Bool, ::Bool, ::Bool)

@maleadt changed the title from "More latency fixes." to "Rework the GPUCompiler interface to avoid needless compiler specialization" on Jul 27, 2021
maleadt (Member, Author) commented Jul 27, 2021

Turns out @nospecialize is only for codegen, and we don't have a @noinfer. So instead I reworked the API: I removed all typevars from CompilerJob and instead introduced a parametric Compiler that carries the target and params for dispatch. Overridable interfaces will need to use that Compiler object, whereas the parts of GPUCompiler that shouldn't specialize based on the active compiler can continue to use the CompilerJob objects.
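A minimal sketch of what that split might look like (names and fields here are illustrative, not the actual GPUCompiler definitions):

```julia
abstract type AbstractCompilerTarget end
abstract type AbstractCompilerParams end

# Parametric: overridable interfaces dispatch on this, so back-ends like
# CUDA.jl can extend them for their own target/params combination.
struct Compiler{T<:AbstractCompilerTarget,P<:AbstractCompilerParams}
    target::T
    params::P
end

# Non-parametric: internals that must not re-specialize per back-end take
# this instead, so their precompiled code stays reusable across back-ends.
struct CompilerJob
    compiler::Compiler   # deliberately left abstract in the field type
    source::Core.MethodInstance
end
```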

maleadt (Member, Author) commented Jul 28, 2021

Despite the refactor, I'm still seeing excessive compilation by Julia when launching the first CUDA.jl kernel. Going to summarize my findings here so that I can link this to a couple of people.

The problem is that certain methods that are covered by GPUCompiler's precompilation directives get recompiled when used from CUDA.jl. That used to be caused by specialization and by invalidations, but I think I have eliminated those in this PR (and in the CUDA.jl counterpart, JuliaGPU/CUDA.jl#1066). As an example, let's look at the emit_llvm function:

julia> using GPUCompiler, MethodAnalysis

julia> mi = only(methodinstances(GPUCompiler.emit_llvm))
MethodInstance for GPUCompiler.emit_llvm(::CompilerJob, ::Core.MethodInstance)

# a method was precompiled

julia> mi.cache
Core.CodeInstance(MethodInstance for GPUCompiler.emit_llvm(::CompilerJob, ::Core.MethodInstance), #undef, 0x00000000000079fd, 0x0000000000000000, Tuple{Any, NamedTuple{(:entry, :compiled), _A} where _A<:Tuple{LLVM.Function, Any}}, #undef, UInt8[0x0c, 0x03, 0x00, 0x00, 0x00, 0x08, 0x08, 0x08, 0x16, 0x88  …  0x01, 0x11, 0x02, 0x2b, 0x3c, 0x00, 0xbf, 0x3d, 0x01, 0x01], false, true, Ptr{Nothing} @0x0000000000000000, Ptr{Nothing} @0x0000000000000000)

julia> mi.cache.min_world
0x00000000000079fd

julia> mi.cache.max_world
0x0000000000000000

IIUC, loading CUDA.jl does not invalidate that method instance since the world bounds of the cached code instance remain the same:

julia> using CUDA

julia> mi2 = only(methodinstances(GPUCompiler.emit_llvm))
MethodInstance for GPUCompiler.emit_llvm(::CompilerJob, ::Core.MethodInstance)

julia> mi2 === mi
true

# precompilation result still valid

julia> mi.cache
Core.CodeInstance(MethodInstance for GPUCompiler.emit_llvm(::CompilerJob, ::Core.MethodInstance), #undef, 0x00000000000079fd, 0x0000000000000000, Tuple{Any, NamedTuple{(:entry, :compiled), _A} where _A<:Tuple{LLVM.Function, Any}}, #undef, UInt8[0x0c, 0x03, 0x00, 0x00, 0x00, 0x08, 0x08, 0x08, 0x16, 0x88  …  0x01, 0x11, 0x02, 0x2b, 0x3c, 0x00, 0xbf, 0x3d, 0x01, 0x01], false, true, Ptr{Nothing} @0x0000000000000000, Ptr{Nothing} @0x0000000000000000)

julia> mi.cache.min_world
0x00000000000079fd

julia> mi.cache.max_world
0x0000000000000000

HOWEVER, when I actually trigger compilation, the CodeInstance cache gets a second entry! The only fields that differ between the two CodeInstances are precompile and the max_world bound:

julia> @cuda identity(nothing)
CUDA.HostKernel{typeof(identity), Tuple{Nothing}}(identity, CuContext(0x00000000019ee6d0, instance b1996a79f885a5d8), CuModule(Ptr{Nothing} @0x00000000037c7100, CuContext(0x00000000019ee6d0, instance b1996a79f885a5d8)), CuFunction(Ptr{Nothing} @0x0000000004f97390, CuModule(Ptr{Nothing} @0x00000000037c7100, CuContext(0x00000000019ee6d0, instance b1996a79f885a5d8))))

julia> mi3 = only(methodinstances(GPUCompiler.emit_llvm))
MethodInstance for GPUCompiler.emit_llvm(::CompilerJob, ::Core.MethodInstance)

julia> mi3 === mi
true

julia> mi.cache
Core.CodeInstance(MethodInstance for GPUCompiler.emit_llvm(::CompilerJob, ::Core.MethodInstance), Core.CodeInstance(MethodInstance for GPUCompiler.emit_llvm(::CompilerJob, ::Core.MethodInstance), #undef, 0x00000000000079fd, 0x0000000000000000, Tuple{Any, NamedTuple{(:entry, :compiled), _A} where _A<:Tuple{LLVM.Function, Any}}, #undef, UInt8[0x0c, 0x03, 0x00, 0x00, 0x00, 0x08, 0x08, 0x08, 0x16, 0x88  …  0x01, 0x11, 0x02, 0x2b, 0x3c, 0x00, 0xbf, 0x3d, 0x01, 0x01], false, true, Ptr{Nothing} @0x0000000000000000, Ptr{Nothing} @0x0000000000000000), 0x00000000000079fd, 0xffffffffffffffff, Tuple{Any, NamedTuple{(:entry, :compiled), _A} where _A<:Tuple{LLVM.Function, Any}}, #undef, UInt8[0x0c, 0x03, 0x00, 0x00, 0x00, 0x08, 0x08, 0x08, 0x16, 0x88  …  0x01, 0x11, 0x02, 0x2b, 0x3c, 0x00, 0xbf, 0x3d, 0x01, 0x01], false, false, Ptr{Nothing} @0x0000000000000000, Ptr{Nothing} @0x0000000000000000)

julia> for field in (:def, :inferred, :invoke, :isspecsig, :max_world, :min_world, :rettype, :specptr, :precompile)
         @show field getfield(mi.cache, field) == getfield(mi.cache.next, field)
       end
field = :def
getfield(mi.cache, field) == getfield(mi.cache.next, field) = true
field = :inferred
getfield(mi.cache, field) == getfield(mi.cache.next, field) = true
field = :invoke
getfield(mi.cache, field) == getfield(mi.cache.next, field) = true
field = :isspecsig
getfield(mi.cache, field) == getfield(mi.cache.next, field) = true
field = :max_world
getfield(mi.cache, field) == getfield(mi.cache.next, field) = false
field = :min_world
getfield(mi.cache, field) == getfield(mi.cache.next, field) = true
field = :rettype
getfield(mi.cache, field) == getfield(mi.cache.next, field) = true
field = :specptr
getfield(mi.cache, field) == getfield(mi.cache.next, field) = true
field = :precompile
getfield(mi.cache, field) == getfield(mi.cache.next, field) = false

julia> mi.cache.precompile
false

julia> mi.cache.max_world
0xffffffffffffffff

julia> mi.cache.next.precompile
true

julia> mi.cache.next.max_world
0x0000000000000000

I don't understand why we are inferring a new version of emit_llvm here...

timholy (Member) commented Jul 31, 2021

Need @noinfer

You can achieve this with the following combination:

f(@nospecialize(x)) = 1
g(x) = f(Base.inferencebarrier(x))   # this causes inference to use `Any` as the type of `x` when inferring `f`
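Applied to the scenario above, an entry point can erase the argument's type before calling the @nospecialize'd worker (illustrative sketch; `process` merely stands in for an interface like emit_llvm):

```julia
# @nospecialize stops codegen from emitting a specialized body, while the
# inference barrier stops inference from computing a specialized instance,
# so a single `::Any` MethodInstance gets cached and reused.
process(@nospecialize(job)) = string(typeof(job))

entry(job) = process(Base.inferencebarrier(job))
```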

I don't understand why we are inferring a new version of emit_llvm here...

Possibly const-prop? Try breaking that by pushing/popping the value through a Ref, maybe?
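That Ref trick could be sketched as follows (hypothetical names; the assumption is that the goal is just to hide a constant argument from inference):

```julia
# Round-tripping a value through a global Ref hides its constant value from
# inference: reading FLAG[] yields the field type Bool, not Const(flag),
# which defeats constant propagation into the callee.
const FLAG = Ref{Bool}(false)

branchy(flag::Bool) = flag ? "fast" : "slow"

function call_without_constprop(flag::Bool)
    FLAG[] = flag        # push
    branchy(FLAG[])      # pop: inference sees Bool, not the constant
end
```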

maleadt (Member, Author) commented Mar 15, 2023

After the recent refactor of CompilerJob, I don't see unnecessary specialization anymore:

# define a new kernel
julia> bar(x) = nothing
bar (generic function with 1 method)

julia> Metal.mtlfunction(bar, Tuple{Nothing})
precompile(Tuple{typeof(GPUCompiler.get_world_generator), Any, Type{Type{typeof(Main.bar)}}, Type{Type{Tuple{Nothing}}}})
precompile(Tuple{typeof(Metal.mtlfunction), typeof(Main.bar), Type{Tuple{Nothing}}})
precompile(Tuple{typeof(Base.vect), Type{typeof(Main.bar)}, Vararg{DataType}})
precompile(Tuple{Type{Metal.HostKernel{typeof(Main.bar), Tuple{Nothing}}}, Function, Metal.MTL.MTLComputePipelineStateInstance})
precompile(Tuple{typeof(Base.show), Base.IOContext{Base.TTY}, Base.Multimedia.MIME{Symbol("text/plain")}, Metal.HostKernel{typeof(Main.bar), Tuple{Nothing}}})
precompile(Tuple{typeof(Base.sizeof), Metal.HostKernel{typeof(Main.bar), Tuple{Nothing}}})
Metal.HostKernel{typeof(bar), Tuple{Nothing}}(bar, Metal.MTL.MTLComputePipelineStateInstance (object of type AGXG13XFamilyComputePipeline))

@maleadt maleadt closed this Mar 15, 2023