
Switch Float16 to LLVM's half #37510

Merged · 6 commits · Oct 15, 2020
Conversation

@maleadt (Member) commented Sep 10, 2020

This PR builds on @vchuravy's many experiments switching to LLVM's half type for Float16 instead of our current i16 software implementation, and the accompanying integration with (a Julia version of) compiler-rt to implement the necessary intrinsics.
My motivation for this is to improve GPU codegen, where the i16 software implementation hurts us quite a bit.
Ref #26381 #18734 #18927

In short:

  • switch codegen of Float16 over to half from i16
  • move the software implementation to a Runtime module
  • use @ccallable to expose those functions under the expected intrinsic names

I went with a Julia runtime library, and not the Real Thing (tm), such that these changes are minimally invasive and just move the current software implementation around. We can always decide later to use LLVM's version. On OSX however, xcode already links LLVM's compiler-rt, so I disabled registration of our runtime functions there (alternatively, @vchuravy mentions we could customize the library names in the TLI so that we use the same software implementation across platforms).
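
As a rough illustration of that pattern, here is a minimal sketch (the module name, the bit-twiddling helper, and the exact registration details are made up for this example, and the runtime functions were later ported to C anyway): a pure-Julia Float16-to-Float32 conversion exposed via Base.@ccallable under one of the soft-float libcall names LLVM can emit for fpext half to float (assumed here to be __gnu_h2f_ieee).

# Hypothetical sketch, not the PR's actual code.
module Float16Runtime

# Software extension of a Float16 bit pattern to Float32.
function half_to_float(h::UInt16)
    sgn  = UInt32(h & 0x8000) << 16          # sign bit, moved to position 31
    expo = (h >> 10) & 0x1f                  # 5-bit exponent
    frac = UInt32(h & 0x03ff)                # 10-bit significand
    if expo == 0x1f                          # Inf or NaN
        bits = sgn | 0x7f800000 | (frac << 13)
    elseif expo == 0                         # signed zero or subnormal
        mag = ldexp(Float32(frac), -24)      # subnormal value is frac * 2^-24
        return sgn == 0 ? mag : -mag
    else                                     # normal: rebias exponent 15 -> 127
        bits = sgn | ((UInt32(expo) + 0x00000070) << 23) | (frac << 13)
    end
    return reinterpret(Float32, bits)
end

# Expose the helper under the libcall name assumed above, so codegen can
# resolve the symbol at runtime when half is not a legal type.
Base.@ccallable function __gnu_h2f_ieee(h::UInt16)::Float32
    return half_to_float(h)
end

end # module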

TODO:

range tests fail locally due to this:

function muladd(x, y, z)
    x*y + z
end
muladd(Float16(0.5283), Float16(0.584), Float16(-6.294e-5))
# Float16(0.3086) on master
# Float16(0.3083) on this PR

0.3083 is more accurate (Float16(0.5283 * 0.584 + -6.294e-5) == Float16(0.3083)), but I'd think that shouldn't happen as the result differs from computing incrementally:

julia> Float16(0.5283) * Float16(0.584)
Float16(0.3086)

julia> ans + Float16(-6.294e-5)
Float16(0.3086)

I don't see any muladd combination happening at the LLVM level, but the generated assembly performs the arithmetic in single instead of half precision (vmulss + vaddss).

Thoughts? Is this legal, and should the tests be fixed? @vchuravy mentions that the same happens with FP32 operations benefiting from x87's extended 80-bit precision, and the same reasoning could be applied here.
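
To make the two behaviours concrete, here is a small sketch using only the values quoted above; the explicit conversions are illustrative, not code from the PR:

x, y, z = Float16(0.5283), Float16(0.584), Float16(-6.294e-5)

# Round back to Float16 after each operation (the master / i16 semantics);
# this matches the incremental REPL computation above and gives Float16(0.3086):
p = Float16(Float32(x) * Float32(y))
step_by_step = Float16(Float32(p) + Float32(z))

# Widen once, do both operations in Float32, and round a single time at the end,
# which is effectively what the vmulss/vaddss code generated by this PR does;
# per the report above this gives Float16(0.3083):
round_once = Float16(Float32(x) * Float32(y) + Float32(z))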

@maleadt added the compiler:codegen (Generation of LLVM IR and native code) and gpu (Affects running Julia on a GPU) labels on Sep 10, 2020
@maleadt (Member, Author) commented Sep 10, 2020

Ah, this already fails bootstrap at float.jl with JULIA_CPU_TARGET="generic"; I'll have to shuffle some code around.

@maleadt (Member, Author) commented Sep 13, 2020

Looking into the range test failure, it seems like parts of TwicePrecision are broken with this PR:

julia> x,y = Float16(0.5635), Float16(0.6133)
(Float16(0.5635), Float16(0.6133))

julia> Base.mul12(x,y)
(Float16(0.3457), 0.0)

Whereas before:

julia> x,y = Float16(0.5635), Float16(0.6133)
(Float16(0.5635), Float16(0.6133))

julia> Base.mul12(x,y)
(Float16(0.3455), Float16(0.0001106))

... which is more correct (BigFloat(x) * BigFloat(y) = 0.345569610595703125). The error seems to originate from canonicalize2:

julia> fma(x,y,-(x*y))
Float16(0.0001106)

# before
julia> Base.canonicalize2(x*y, Float16(0.0001106))
(Float16(0.3455), Float16(0.0001106))

# after
julia> Base.canonicalize2(x*y, Float16(0.0001106))
(Float16(0.3455), 0.0)

# after, but in a function (changing intermediate precision)
julia> f(x,y) = Base.canonicalize2(x*y, Float16(0.0001106))
f (generic function with 2 methods)

julia> f(x,y)
(Float16(0.3457), 0.0)

I need to look at this some more, but @timholy, maybe you already have some thoughts, since you wrote those utilities and range tests.
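
For reference, canonicalize2 is a Fast2Sum-style error-free transformation; a sketch of the idea (close in spirit to, though not necessarily verbatim, the Base definition) is:

function canonicalize2_sketch(big::T, little::T) where {T<:AbstractFloat}
    h  = big + little         # must round to T for the next line to be meaningful
    lo = (big - h) + little   # recovers the rounding error of big + little
    return h, lo
end

If the addition and subtractions are instead carried out in Float32 and only truncated at the end, the sum in this example is exact at the wider precision, so (big - h) + little cancels to exactly zero, which is the 0.0 low word seen in the outputs above.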

@yuyichao (Contributor) commented:

Note that, aside from the multiversioning issue, this is really going to throw off the GC root analysis. We assume normal LLVM intrinsics do not call managed code, and this can easily break that (though it'll certainly be fine most of the time). None of these seem too complex for a C implementation (and I assume they are usually implemented in C anyway).

@maleadt force-pushed the tb/half branch 3 times, most recently from bf1c999 to f9c0c1e on October 6, 2020 12:42
Two review threads on src/intrinsics.cpp (outdated, resolved)
@maleadt (Member, Author) commented Oct 6, 2020

Ported the Float16 software intrinsics to C per @yuyichao's request.

The remaining failures are in the ranges test suite, caused by the twice-precision arithmetic breaking down because Float16 calculations are actually performed in Float32 (so canonicalize2 and friends don't really work anymore). I couldn't find a way to convince LLVM to preserve 16-bit accuracy, or a way to get the current implementation working (except by forcing everything @noinline so that values are returned in 16-bit registers, but that isn't an option, of course). Maybe @timholy or @simonbyrne have some thoughts? FWIW, the range mentioned at the top of twiceprecision.jl works: (Float16(0.1):Float16(0.1):Float16(0.3))[3] == Float16(0.3).

@maleadt (Member, Author) commented Oct 7, 2020

@nanosoldier runtests(ALL, vs = ":master")

@maleadt (Member, Author) commented Oct 7, 2020

Looking a bit closer at the twice precision code used by ranges:

# Use TwicePrecision only for Float64; use Float64 for T<:Union{Float16,Float32}

So IIUC Float16 ranges don't even use TwicePrecision (all of the constructors I tested indeed use Float64 to increase precision), so I've disabled the (now) broken tests of those utilities on Float16 values.

@nanosoldier (Collaborator) commented:

Your package evaluation job has completed - possible new issues were detected. A full report can be found here. cc @maleadt

@maleadt (Member, Author) commented Oct 7, 2020

@nanosoldier runtests(["BitIntegers", "BoltzmannMachines", "DIVAnd", "FastTransforms", "Graph500", "ImageFeatures", "IntervalTrees", "MIToS", "QuadGK", "RandomizedPropertyTest", "Revise", "Sherlogs", "ThreadPools", "YAActL"], vs = ":master")

@nanosoldier (Collaborator) commented:

Your package evaluation job has completed - possible new issues were detected. A full report can be found here. cc @maleadt

@maleadt (Member, Author) commented Oct 7, 2020

Looking into test failures:

  • BitIntegers: has special-casing for Float16 that can just be removed. It relied on the Float16(x::Integer) constructor we used to have, whereas for other floating-point types we only have conversions from Base's integer types. Will fix.
  • FastTransforms: legit issue, reduced to sinpi(Float16(-0.5)) returning different results. Will fix.
  • ImageFeatures: FREAK algorithm issues because results are now more accurate: Float16(743.5) - Float16(0.9165) = Float16(742.5), which used to be rounded to 742, but now the intermediate computation happens in Float32 which yields 742.5835 and rounds to 743. That seems to cause significant differences in final results, though (cc @timholy).
  • MIToS: small accuracy difference in final result of a test, 15.2 vs 15.21 (cc @diegozea)
  • QuadGK: small accuracy differences in final result of a test, 0.1047 vs 0.105, 0.4504 vs 0.451.

Debugging that sinpi difference, it turns out we have other TwicePrecision-like operations:

rx = x-((x+t)-t) # zeros may be incorrectly signed

# expected result:
x = Float16(0.5)
s = maxintfloat(T) / 2 = Float16(1.024e3)
t = 3s = Float16(3.072e3)
rx = x-((x+t)-t) = Float16(0.5)

# with Float16 as half (which performs operations as Float32 on x86):
x = Float16(0.5)
s = maxintfloat(T) / 2 = Float16(1.024e3)
t = 3s = Float16(3.072e3)
rx = x-((x+t)-t) = 0.0
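
To spell that out with the values above, here is a sketch in which the explicit conversions only serve to force where rounding happens (they are not code from sinpi or this PR):

x = Float16(0.5)
t = Float16(3.072e3)

# Per-operation Float16 rounding: x + t rounds back down to Float16(3.072e3),
# so (x + t) - t == 0 and rx recovers Float16(0.5), the expected result above.
rx_strict = x - (Float16(Float32(x) + Float32(t)) - t)

# Float32 intermediates: 0.5f0 + 3072f0 == 3072.5f0 survives exactly, the add
# and subtract cancel, and rx degenerates to 0.0, as reported for this PR.
rx_widened = Float16(Float32(x) - ((Float32(x) + Float32(t)) - Float32(t)))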

@simonbyrne (Contributor) commented:

So LLVM uses Float32 intermediate precision? I wonder what Swift does here, since they now offer Float16 support as well.

If that is the case, I think the easiest solution would be to promote to Float32 wherever we use these extended precision tricks.

@simonbyrne (Contributor) commented:

Alternatively we could convert to Float32 and back again for all functions (which will give correctly rounded results for all basic arithmetic ops except fma). I'll try to explore this a bit tonight.
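
A sketch of that strategy (the helper names below are made up for illustration, not an API from Base or this PR): widen each operand to Float32, compute there, and truncate the result back to Float16.

add16(a::Float16, b::Float16) = Float16(Float32(a) + Float32(b))
mul16(a::Float16, b::Float16) = Float16(Float32(a) * Float32(b))

# A single multiply done this way is correctly rounded: the exact product of two
# 11-bit significands needs at most 22 bits, which fits in Float32's 24-bit
# significand, so the only rounding is the final conversion back to Float16.
mul16(Float16(0.5635), Float16(0.6133)) == Float16(0.5635) * Float16(0.6133)

As noted above, the exception is fma, where the widened computation can round differently from a true fused Float16 fma.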

@vchuravy (Member) commented Oct 8, 2020

Looking at the PRs that added Float16, they generally seem to promote Float16 to Float32 (apple/swift-numerics#136). CUDA has a similar strategy: it uses faster, less precise Float32 routines and then converts to Float16, fixing up some of the bit patterns. I haven't seen anyone really attempt to define Float16 math generically.

But yes LLVM performs Float16 operations in extended precision (since there are no Float16 registers), and if my understanding is correct that is indeed legal...

@simonbyrne (Contributor) commented:

> But yes LLVM performs Float16 operations in extended precision (since there are no Float16 registers), and if my understanding is correct that is indeed legal...

I'm not sure what you mean by legal here: we currently define all operations to return results of the same type as their inputs. Switching to allowing higher intermediate precision would be a significant breaking change as it fundamentally breaks the semantics of the language (as this PR shows).

Can you provide any links on how CUDA handles this? If there is no way to obtain performant code on GPUs without weakening these guarantees, then it's worth considering, but it is worth noting that Swift also explicitly provides the same strict behaviour.

vchuravy and others added 3 commits October 14, 2020 11:59
The Julia runtime wasn't safe wrt. GC operations.
I left the commit in for archival purposes, in case
we want to revisit this again later.
@maleadt force-pushed the tb/half branch 3 times, most recently from 34daf18 to 2c77780 on October 14, 2020 11:56
@maleadt (Member, Author) commented Oct 14, 2020

This should be complete now, and passes all tests locally. Let's do another PkgEval run:

@nanosoldier runtests(ALL, vs = ":master")

 #define fpext(pr, a) \
-    if (!(osize > 8 * sizeof(a))) \
-        jl_error("fpext: output bitsize must be > input bitsize"); \
+    if (!(osize >= 8 * sizeof(a))) \
@maleadt (Member, Author) commented:

This is the only behavioral change: a no-op fpext is now allowed, which simplifies the implementation of the Float16 intrinsics. It shouldn't matter in practice.

A reviewer (Member) commented:

It’s invalid for LLVM IR, so I’d just apply the same restriction here.

@maleadt (Member, Author) commented Oct 14, 2020

And FWIW, moving the pass that truncates/extends Float16 operations earlier in the pipeline (to let LLVM optimize) gives the following for mul12:

@@ -1,15 +1,9 @@
 julia> @code_llvm Base.mul12(x,y)
 define [2 x half] @julia_mul12(half %0, half %1) #0 {
 top:
-    %0 = fpext half %0 to float
-    %1 = fpext half %1 to float
-    %2 = fmul float %0, %1
-    %2 = fptrunc float %2 to half
+    %2 = fmul half %0, %1
     %3 = fcmp une half %2, 0xH0000
-    %2 = fpext half %2 to float
-    %2 = fpext half %2 to float
-    %4 = fsub float %2, %2
-    %4 = fptrunc float %4 to half
+    %4 = fsub half %2, %2
     %5 = fcmp oeq half %4, 0xH0000
     %6 = fneg half %2
     %7 = fpext half %0 to float
@@ -17,18 +11,9 @@
     %9 = fpext half %6 to float
     %10 = call float @llvm.fma.f32(float %7, float %8, float %9)
     %11 = fptrunc float %10 to half
-    %2 = fpext half %2 to float
-    %11 = fpext half %11 to float
-    %12 = fadd float %2, %11
-    %12 = fptrunc float %12 to half
-    %2 = fpext half %2 to float
-    %12 = fpext half %12 to float
-    %13 = fsub float %2, %12
-    %13 = fptrunc float %13 to half
-    %13 = fpext half %13 to float
-    %11 = fpext half %11 to float
-    %14 = fadd float %13, %11
-    %14 = fptrunc float %14 to half
+    %12 = fadd half %2, %11
+    %13 = fsub half %2, %12
+    %14 = fadd half %13, %11
     %15 = and i1 %3, %5
     %.fca.0.insert7 = insertvalue [2 x half] undef, half %2, 0

Cleans up a lot, and keeps the ranges TwicePrecision tests working, but I don't think this is legal wrt. our Float16 semantics. For example, the initial mul + cmp might behave differently under Float32 extended precision. I think I'll just add a GVN pass to clean up identical conversions (instcombine is too aggressive, and is responsible for most of the changes seen above).

@vchuravy (Member) left a comment:

This is looking great! Fantastic work Tim.

Should we try turning on native Float16 support on AArch64/ARM?

In https://reviews.llvm.org/D57188 they are turned on for all AArch64 targets:

// All AArch64 implementations support ARMv8 FP, which makes half a legal type.
HasLegalHalfType = true;
HasFloat16 = true;

CC: @yuyichao

@yuyichao (Contributor) commented:

AFAICT half is a native type on AArch64, though arithmetic on it is an optional feature. That's why the ccall ABI for AArch64 already uses the native half type, IIRC. I don't know what the desired property is here.

@maleadt (Member, Author) commented Oct 14, 2020

Happy to take a look at enabling native half on ARM, but I'd prefer to get this merged first before the 1.6 branch.

Also, this already greatly improves performance on CPUs that don't need to call into the runtime, so I guess we can close #29889:

julia> @benchmark $(zeros(Float16 , 10^8)) .*= $(1)
  median time:      97.775 ms (0.00% GC)

julia> @benchmark $(zeros(Float32 , 10^8)) .*= $(1)
  median time:      30.530 ms (0.00% GC)

julia> @benchmark $(zeros(Float64 , 10^8)) .*= $(1)
  median time:      62.900 ms (0.00% GC)

julia> @benchmark $(zeros(Float16 , 10^8)) .*= $(Float16(1.0))
  median time:      97.970 ms (0.00% GC)

julia> @benchmark $(zeros(Float16 , 10^8)) .*= $(1.0)
  median time:      240.721 ms (0.00% GC)

vs. before

julia> @benchmark $(zeros(Float16 , 10^8)) .*= $(1)
BenchmarkTools.Trial: 
  median time:      947.835 ms (0.00% GC)

julia> @benchmark $(zeros(Float32 , 10^8)) .*= $(1)
BenchmarkTools.Trial: 
  median time:      30.465 ms (0.00% GC)

julia> @benchmark $(zeros(Float64 , 10^8)) .*= $(1)
BenchmarkTools.Trial: 
  median time:      61.471 ms (0.00% GC)

julia> @benchmark $(zeros(Float16 , 10^8)) .*= $(Float16(1.0))
BenchmarkTools.Trial: 
  median time:      653.172 ms (0.00% GC)

julia> @benchmark $(zeros(Float16 , 10^8)) .*= $(1.0)
BenchmarkTools.Trial: 
  median time:      428.816 ms (0.00% GC)

@vchuravy (Member) commented:

https://www.keil.com/support/man/docs/armclang_ref/armclang_ref_sex1519040854421.htm

> Also, _Float16 arithmetic operations directly map to Armv8.2-A half-precision floating-point instructions when they are enabled on Armv8.2-A and later architectures. This avoids the need for conversions to and from single-precision floating-point, and therefore results in more performant code. If the Armv8.2-A half-precision floating-point instructions are not available, _Float16 values are automatically promoted to single-precision, similar to the semantics of __fp16 except that the results continue to be stored in single-precision floating-point format instead of being converted back to half-precision floating-point format.

So we would need to test for Armv8.2-A and +fp16, i.e. HWCAP_FPHP? Would it be sufficient to check the target-features function attribute, given that the demoteFloat16 pass runs after multi-versioning? Or does that attribute only include the explicitly requested features?

@nanosoldier (Collaborator) commented:

Your package evaluation job has completed - possible new issues were detected. A full report can be found here. cc @maleadt

@maleadt (Member, Author) commented Oct 14, 2020

@nanosoldier runtests(["DynamicalBilliards", "ExcelReaders", "FameSVD", "FlashWeave", "Genie", "ITensors", "IntervalTrees", "LoopThrottle", "Manifolds", "MemPool", "OptimKit", "Pidfile"], vs = ":master")

@nanosoldier (Collaborator) commented:

Your package evaluation job has completed - no new issues were detected. A full report can be found here. cc @maleadt

@vchuravy (Member) commented:

> Happy to take a look at enabling native half on ARM, but I'd prefer to get this merged first before the 1.6 branch.

Merging this first is okay by me. I am excited to see this land.

@maleadt (Member, Author) commented Oct 15, 2020

CI and PkgEval are clean, so let's merge this! Happy to do some follow-up development if anybody still wants to review post-merge, but I'd like to get this in before the 1.6 branch.

@vchuravy (Member) commented:

GCC 12 now also supports this, with _Float16 and -fexcess-precision=16: https://godbolt.org/z/jq65f1Mo3
