Fine-grained fast-math flags #1991

Open
lcw opened this issue Jul 5, 2023 · 11 comments · May be fixed by #2037
Labels
cuda kernels (Stuff about writing CUDA kernels.), enhancement (New feature or request), upstream (Somebody else's problem.)

Comments

@lcw
Contributor

lcw commented Jul 5, 2023

Is your feature request related to a problem? Please describe.

To get kernel performance matching clang, we have had to add fast-math flags such as contract (which clang and nvcc enable by default). Currently we do this with an ugly hack; see for example:

    # HACK: module-local versions of core arithmetic; needed to get FMA
    for (jlf, f) in zip((:+, :*, :-), (:add, :mul, :sub))
        for (T, llvmT) in ((:Float32, "float"), (:Float64, "double"))
            ir = """
                %x = f$f contract nsz $llvmT %0, %1
                ret $llvmT %x
            """
            @eval begin
                # the @pure is necessary so that we can constant propagate.
                @inline Base.@pure function $jlf(a::$T, b::$T)
                    Base.llvmcall($ir, $T, Tuple{$T, $T}, a, b)
                end
            end
        end
        @eval function $jlf(args...)
            Base.$jlf(args...)
        end
    end
    let (jlf, f) = (:div_arcp, :div)
        for (T, llvmT) in ((:Float32, "float"), (:Float64, "double"))
            ir = """
                %x = f$f fast $llvmT %0, %1
                ret $llvmT %x
            """
            @eval begin
                # the @pure is necessary so that we can constant propagate.
                @inline Base.@pure function $jlf(a::$T, b::$T)
                    Base.llvmcall($ir, $T, Tuple{$T, $T}, a, b)
                end
            end
        end
        @eval function $jlf(args...)
            Base.$jlf(args...)
        end
    end
    rcp(x) = div_arcp(one(x), x) # still leads to rcp.rn which is also a function call

Describe the solution you'd like

I would like a macro like @fastmath that had fine-grained control over the fast-math flags.
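A minimal sketch of what such a macro could look like, assuming a hypothetical module `ContractMath` and macro `@contract` (neither exists in Base or CUDA.jl). It follows the same idea as `@fastmath`, which rewrites operator calls to flagged variants, but attaches only the `contract` flag. The sketch handles binary `+`, `-`, and `*` on `Float32`/`Float64` and leaves everything else untouched.

```julia
# Hypothetical module: these names are illustrative and not part of any package.
module ContractMath

export @contract

# Scalar arithmetic that carries only the `contract` fast-math flag.
for (name, op) in ((:add_contract, "fadd"), (:sub_contract, "fsub"), (:mul_contract, "fmul"))
    for (T, llvmT) in ((Float32, "float"), (Float64, "double"))
        ir = """
            %x = $op contract $llvmT %0, %1
            ret $llvmT %x
            """
        @eval @inline $name(a::$T, b::$T) = Base.llvmcall($ir, $T, Tuple{$T, $T}, a, b)
    end
end

# Every other argument type falls back to the ordinary operators.
add_contract(a, b) = a + b
sub_contract(a, b) = a - b
mul_contract(a, b) = a * b

# Operator symbol => name of its contract variant.
const REWRITES = Dict(zip((:+, :-, :*), (:add_contract, :sub_contract, :mul_contract)))

# Recursively replace binary +, -, * calls. N-ary calls such as `a + b + c`
# and unary minus are left as they are in this sketch.
rewrite(x) = x
function rewrite(ex::Expr)
    args = map(rewrite, ex.args)
    if ex.head === :call && length(args) == 3
        name = get(REWRITES, args[1], nothing)
        if name !== nothing
            return Expr(:call, GlobalRef(@__MODULE__, name), args[2], args[3])
        end
    end
    return Expr(ex.head, args...)
end

macro contract(ex)
    esc(rewrite(ex))
end

end # module

using .ContractMath

# The multiply and the add both carry `contract`, so LLVM may fuse them into an FMA.
axpy(a, x, y) = @contract a * x + y
```

A macro like this only sees the expression it wraps. Functions called from inside that expression keep their default floating-point semantics, which is the reduced optimization scope mentioned under the alternatives.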

Describe alternatives you've considered

KernelAbstractions used to do this with https://github.com/JuliaLabs/Cassette.jl. Other people use macros, although that opens up fewer optimizations and is thus not desired (https://github.com/JuliaLabs/Cassette.jl). I don't know if https://github.com/JuliaDebug/CassetteOverlay.jl can be used with kernels, but it might be a possible way to implement this.

It would be nice if this functionality eventually got added to base julia.

@lcw lcw added the enhancement New feature or request label Jul 5, 2023
@maleadt
Member

maleadt commented Jul 6, 2023

It would be nice if this functionality eventually got added to base julia.

I agree, so better file this on the Julia repository?

@lcw
Contributor Author

lcw commented Jul 6, 2023

Looks like there already is at least one: JuliaLang/julia#49890.

@maleadt
Member

maleadt commented Jul 6, 2023

I think we can close this issue then?

@lcw
Contributor Author

lcw commented Jul 6, 2023

Sure.

@lcw lcw closed this as completed Jul 6, 2023
@vchuravy
Member

vchuravy commented Jul 7, 2023

I was thinking we could do @cuda math=(:contract, :reassoc) and then use an overlay table to switch the implementation.

@lcw lcw reopened this Jul 7, 2023
@lcw
Contributor Author

lcw commented Jul 7, 2023

I like that idea. So all of the code in the kernel (even within function calls) would use contract and reassoc?

@maleadt maleadt added the cuda kernels Stuff about writing CUDA kernels. label Jul 7, 2023
@vchuravy
Member

vchuravy commented Jul 7, 2023

Yeah, kinda inspired by JuliaLang/julia#50239. I think we could solve this with stacked OverlayMethodTables.
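For readers unfamiliar with overlay tables, here is a rough sketch of how a single overlay definition could be declared, assuming a hypothetical table named `ContractTable`. It uses Julia's experimental `Base.Experimental.@MethodTable` and `Base.Experimental.@overlay` macros (Julia 1.7 or newer). It only declares the table; wiring the table into kernel compilation is not shown and would be up to the compiler integration.

```julia
# Hypothetical table name. The two macros are experimental Julia internals
# and may change between Julia versions.
Base.Experimental.@MethodTable(ContractTable)

# Overlay for Float32 addition that carries only the `contract` flag.
# This method lives in ContractTable, not in the global method table.
Base.Experimental.@overlay ContractTable function Base.:+(a::Float32, b::Float32)
    Base.llvmcall("""
        %x = fadd contract float %0, %1
        ret float %x
        """, Float32, Tuple{Float32, Float32}, a, b)
end

# Ordinary code is unaffected: regular dispatch does not consult the overlay.
regular_sum = 1.0f0 + 2.0f0
```

Because the overlay is applied during method lookup rather than by rewriting syntax, a compiler that consults the table would pick up the flagged definition inside called functions too.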

@maleadt
Member

maleadt commented Aug 14, 2023

I think it would be better to prototype this in an external package, and have CUDA.jl use that package's overlay table. That way the functionality wouldn't be locked into the CUDA.jl ecosystem either.

@maleadt maleadt added the upstream Somebody else's problem. label Aug 14, 2023
@lcw
Contributor Author

lcw commented Aug 16, 2023

Something akin to https://github.com/JuliaSIMD/LLVMLoopInfo.jl? That would be great. Is the idea to use CassetteOverlay to create some standard passes for each fast-math flag and then use those in the kernels via macros? I am not sure how to stack these to combine fast-math flags.

@vchuravy
Member

vchuravy commented Aug 16, 2023

No, more like https://github.com/vchuravy/FastmathOverlay.jl

I don't have a good solution for combining flags... yet.

@vchuravy
Member

Okay, #2037 is a prototype of that idea. Now that we know it is feasible, we have to decide if we like it.

Composition of certain things is possible, and for other things it is tedious.

As an example, say you want to opt into :contract on all floating-point ops and we add a speculative :fast_trig. That should work fine, since we can form StackedMethodTable(FastTrig, StackedMethodTable(Contract, CUDA)).

We sadly can't use the same method for composing :contract and :reassoc, since the definition in the outer table will shadow the definition in the inner one. This also means that for :FastTrig we may want something like :FastTrigCUDA, since we would otherwise shadow the CUDA definitions.

Right now the only idea I have for :contract & :reassoc is the tedious solution of manually (or through meta-programming) creating a method table Contract×Reassoc that defines the combinations we want.
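The shadowing problem and the product-table workaround can be illustrated with a toy model. This is only a sketch: a "table" here is a plain `Dict` from an operation name to the fast-math flags its definition carries, and `lookup` and `product` are hypothetical helpers, not the real method-table machinery.

```julia
# Toy model: operation name => flags carried by that table's definition.
const Table = Dict{Symbol, Vector{Symbol}}

cuda     = Table(:add => Symbol[], :mul => Symbol[], :sin => Symbol[])
contract = Table(:add => [:contract], :mul => [:contract])
reassoc  = Table(:add => [:reassoc], :mul => [:reassoc])
fasttrig = Table(:sin => [:fast_trig])

# Stacked lookup: the outermost table that defines `op` wins.
function lookup(stack::Vector{Table}, op::Symbol)
    for table in stack
        haskey(table, op) && return table[op]
    end
    error("no definition for $op")
end

# Tables that define disjoint operations compose by stacking.
stack1 = [fasttrig, contract, cuda]
lookup(stack1, :sin)   # [:fast_trig]
lookup(stack1, :add)   # [:contract]

# Tables that define the same operations do not: the outer one shadows.
stack2 = [reassoc, contract, cuda]
lookup(stack2, :add)   # [:reassoc], the contract flag is lost

# Product table: for each operation both tables define, merge the flags.
function product(a::Table, b::Table)
    merged = Table()
    for op in intersect(keys(a), keys(b))
        merged[op] = union(a[op], b[op])
    end
    return merged
end

contract_reassoc = product(contract, reassoc)
lookup([contract_reassoc, cuda], :add)   # [:contract, :reassoc]
```

In the real setting each merged entry would have to be an actual method definition with the combined flags, which is why generating the product table is tedious without meta-programming.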
