
Rework the GPUCompiler interface to avoid needless compiler specialization #227

Closed
wants to merge 8 commits

Conversation

maleadt (Member) commented Jul 26, 2021

I'm trying to make it possible to precompile most of the compiler, but it's proving to be hard. First, I removed the function from the CompilerJob, since it's otherwise too easy to invalidate the precompilation results. I think that we also have to use @invokelatest with all of the GPUCompiler interfaces, because even with type assertions and @noinline the generic implementations of the interface are referred to literally in the compiled code.

However, I'm still seeing emit_llvm getting re-compiled when using CUDA.jl, even though I don't immediately spot invalidations that would explain this (is it possible to go backwards -- start from a method that got invalidated to find the definition that invalidated it?).
Strangely, the method instance also lists the specific compiler target, even though job is passed as @nospecialize in emit_llvm...

MethodInstance for GPUCompiler.emit_llvm(::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, ::Core.MethodInstance, ::Bool, ::Bool, ::Bool, ::Bool)

@maleadt changed the title from "More latency fixes." to "Rework the GPUCompiler interface to avoid needless compiler specialization" on Jul 27, 2021
maleadt (Member, Author) commented Jul 27, 2021

Turns out @nospecialize is only for codegen, and we don't have a @noinfer. So instead I reworked the API: I removed all typevars from CompilerJob and instead introduced a parametric Compiler that carries the target and params for dispatch. Overridable interfaces will need to use that Compiler object, whereas the parts of GPUCompiler that shouldn't specialize based on the active compiler can continue to use the CompilerJob objects.
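A minimal sketch of what that split might look like (names and fields here are illustrative, not the actual GPUCompiler definitions):

```julia
abstract type AbstractCompilerTarget end
abstract type AbstractCompilerParams end

# Parametric: overridable interfaces dispatch on this, so back-ends like
# CUDA.jl can extend them for their own target/params combination.
struct Compiler{T<:AbstractCompilerTarget,P<:AbstractCompilerParams}
    target::T
    params::P
end

# Non-parametric: internals that must not re-specialize per back-end take
# this instead, so their precompiled code stays reusable across back-ends.
struct CompilerJob
    compiler::Compiler   # deliberately left abstract in the field type
    source::Core.MethodInstance
end
```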

maleadt (Member, Author) commented Jul 28, 2021

Despite the refactor, I'm still seeing excessive compilation by Julia when launching the first CUDA.jl kernel. Going to summarize my findings here so that I can link this to a couple of people.

The problem is that certain methods that are covered by GPUCompiler's precompilation directives get recompiled when used from CUDA.jl. That used to be caused by specialization and by invalidations, but I think I have eliminated those in this PR (and in the CUDA.jl counterpart, JuliaGPU/CUDA.jl#1066). As an example, let's look at the emit_llvm function:

julia> using GPUCompiler, MethodAnalysis

julia> mi = only(methodinstances(GPUCompiler.emit_llvm))
MethodInstance for GPUCompiler.emit_llvm(::CompilerJob, ::Core.MethodInstance)

# a method was precompiled

julia> mi.cache
Core.CodeInstance(MethodInstance for GPUCompiler.emit_llvm(::CompilerJob, ::Core.MethodInstance), #undef, 0x00000000000079fd, 0x0000000000000000, Tuple{Any, NamedTuple{(:entry, :compiled), _A} where _A<:Tuple{LLVM.Function, Any}}, #undef, UInt8[0x0c, 0x03, 0x00, 0x00, 0x00, 0x08, 0x08, 0x08, 0x16, 0x88  …  0x01, 0x11, 0x02, 0x2b, 0x3c, 0x00, 0xbf, 0x3d, 0x01, 0x01], false, true, Ptr{Nothing} @0x0000000000000000, Ptr{Nothing} @0x0000000000000000)

julia> mi.cache.min_world
0x00000000000079fd

julia> mi.cache.max_world
0x0000000000000000

IIUC, loading CUDA.jl does not invalidate that method instance since the world bounds of the cached code instance remain the same:

julia> using CUDA

julia> mi2 = only(methodinstances(GPUCompiler.emit_llvm))
MethodInstance for GPUCompiler.emit_llvm(::CompilerJob, ::Core.MethodInstance)

julia> mi2 === mi
true

# precompilation result still valid

julia> mi.cache
Core.CodeInstance(MethodInstance for GPUCompiler.emit_llvm(::CompilerJob, ::Core.MethodInstance), #undef, 0x00000000000079fd, 0x0000000000000000, Tuple{Any, NamedTuple{(:entry, :compiled), _A} where _A<:Tuple{LLVM.Function, Any}}, #undef, UInt8[0x0c, 0x03, 0x00, 0x00, 0x00, 0x08, 0x08, 0x08, 0x16, 0x88  …  0x01, 0x11, 0x02, 0x2b, 0x3c, 0x00, 0xbf, 0x3d, 0x01, 0x01], false, true, Ptr{Nothing} @0x0000000000000000, Ptr{Nothing} @0x0000000000000000)

julia> mi.cache.min_world
0x00000000000079fd

julia> mi.cache.max_world
0x0000000000000000

HOWEVER, when I actually trigger compilation, the CodeInstance cache gets a second entry! The only fields that differ between the two CodeInstances are precompile and the max_world bound:

julia> @cuda identity(nothing)
CUDA.HostKernel{typeof(identity), Tuple{Nothing}}(identity, CuContext(0x00000000019ee6d0, instance b1996a79f885a5d8), CuModule(Ptr{Nothing} @0x00000000037c7100, CuContext(0x00000000019ee6d0, instance b1996a79f885a5d8)), CuFunction(Ptr{Nothing} @0x0000000004f97390, CuModule(Ptr{Nothing} @0x00000000037c7100, CuContext(0x00000000019ee6d0, instance b1996a79f885a5d8))))

julia> mi3 = only(methodinstances(GPUCompiler.emit_llvm))
MethodInstance for GPUCompiler.emit_llvm(::CompilerJob, ::Core.MethodInstance)

julia> mi3 === mi
true

julia> mi.cache
Core.CodeInstance(MethodInstance for GPUCompiler.emit_llvm(::CompilerJob, ::Core.MethodInstance), Core.CodeInstance(MethodInstance for GPUCompiler.emit_llvm(::CompilerJob, ::Core.MethodInstance), #undef, 0x00000000000079fd, 0x0000000000000000, Tuple{Any, NamedTuple{(:entry, :compiled), _A} where _A<:Tuple{LLVM.Function, Any}}, #undef, UInt8[0x0c, 0x03, 0x00, 0x00, 0x00, 0x08, 0x08, 0x08, 0x16, 0x88  …  0x01, 0x11, 0x02, 0x2b, 0x3c, 0x00, 0xbf, 0x3d, 0x01, 0x01], false, true, Ptr{Nothing} @0x0000000000000000, Ptr{Nothing} @0x0000000000000000), 0x00000000000079fd, 0xffffffffffffffff, Tuple{Any, NamedTuple{(:entry, :compiled), _A} where _A<:Tuple{LLVM.Function, Any}}, #undef, UInt8[0x0c, 0x03, 0x00, 0x00, 0x00, 0x08, 0x08, 0x08, 0x16, 0x88  …  0x01, 0x11, 0x02, 0x2b, 0x3c, 0x00, 0xbf, 0x3d, 0x01, 0x01], false, false, Ptr{Nothing} @0x0000000000000000, Ptr{Nothing} @0x0000000000000000)

julia> for field in (:def, :inferred, :invoke, :isspecsig, :max_world, :min_world, :rettype, :specptr, :precompile)
         @show field getfield(mi.cache, field) == getfield(mi.cache.next, field)
       end
field = :def
getfield(mi.cache, field) == getfield(mi.cache.next, field) = true
field = :inferred
getfield(mi.cache, field) == getfield(mi.cache.next, field) = true
field = :invoke
getfield(mi.cache, field) == getfield(mi.cache.next, field) = true
field = :isspecsig
getfield(mi.cache, field) == getfield(mi.cache.next, field) = true
field = :max_world
getfield(mi.cache, field) == getfield(mi.cache.next, field) = false
field = :min_world
getfield(mi.cache, field) == getfield(mi.cache.next, field) = true
field = :rettype
getfield(mi.cache, field) == getfield(mi.cache.next, field) = true
field = :specptr
getfield(mi.cache, field) == getfield(mi.cache.next, field) = true
field = :precompile
getfield(mi.cache, field) == getfield(mi.cache.next, field) = false

julia> mi.cache.precompile
false

julia> mi.cache.max_world
0xffffffffffffffff

julia> mi.cache.next.precompile
true

julia> mi.cache.next.max_world
0x0000000000000000

I don't understand why we are inferring a new version of emit_llvm here...

timholy (Member) commented Jul 31, 2021

Need @noinfer

You can achieve this with the following combination:

f(@nospecialize(x)) = 1
g(x) = f(Base.inferencebarrier(x))   # this causes inference to use `Any` as the type of `x` when inferring `f`
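Applied to the scenario above, an entry point can erase the argument's type before calling the @nospecialize'd worker (illustrative sketch; `process` merely stands in for an interface like emit_llvm):

```julia
# @nospecialize stops codegen from emitting a specialized body, while the
# inference barrier stops inference from computing a specialized instance,
# so a single `::Any` MethodInstance gets cached and reused.
process(@nospecialize(job)) = string(typeof(job))

entry(job) = process(Base.inferencebarrier(job))
```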

I don't understand why we are inferring a new version of emit_llvm here...

Possibly const-prop? Try breaking that by pushing/popping the value through a Ref, maybe?
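That Ref trick could be sketched as follows (hypothetical names; the assumption is that the goal is just to hide a constant argument from inference):

```julia
# Round-tripping a value through a global Ref hides its constant value from
# inference: reading FLAG[] yields the field type Bool, not Const(flag),
# which defeats constant propagation into the callee.
const FLAG = Ref{Bool}(false)

branchy(flag::Bool) = flag ? "fast" : "slow"

function call_without_constprop(flag::Bool)
    FLAG[] = flag        # push
    branchy(FLAG[])      # pop: inference sees Bool, not the constant
end
```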

maleadt (Member, Author) commented Mar 15, 2023

After the recent refactor of CompilerJob, I don't see unnecessary specialization anymore:

# define a new kernel
julia> bar(x) = nothing
bar (generic function with 1 method)

julia> Metal.mtlfunction(bar, Tuple{Nothing})
precompile(Tuple{typeof(GPUCompiler.get_world_generator), Any, Type{Type{typeof(Main.bar)}}, Type{Type{Tuple{Nothing}}}})
precompile(Tuple{typeof(Metal.mtlfunction), typeof(Main.bar), Type{Tuple{Nothing}}})
precompile(Tuple{typeof(Base.vect), Type{typeof(Main.bar)}, Vararg{DataType}})
precompile(Tuple{Type{Metal.HostKernel{typeof(Main.bar), Tuple{Nothing}}}, Function, Metal.MTL.MTLComputePipelineStateInstance})
precompile(Tuple{typeof(Base.show), Base.IOContext{Base.TTY}, Base.Multimedia.MIME{Symbol("text/plain")}, Metal.HostKernel{typeof(Main.bar), Tuple{Nothing}}})
precompile(Tuple{typeof(Base.sizeof), Metal.HostKernel{typeof(Main.bar), Tuple{Nothing}}})
Metal.HostKernel{typeof(bar), Tuple{Nothing}}(bar, Metal.MTL.MTLComputePipelineStateInstance (object of type AGXG13XFamilyComputePipeline))

@maleadt maleadt closed this Mar 15, 2023