
Switch Float16 to LLVM's half #37510

Merged · 6 commits · Oct 15, 2020
Conversation

@maleadt (Member) commented Sep 10, 2020

This PR builds on @vchuravy's many experiments switching to LLVM's half type for Float16 instead of our current i16 software implementation, and the accompanying integration with (a Julia version of) compiler-rt to implement the necessary intrinsics.
My motivation for this is to improve GPU codegen, where the i16 software implementation hurts us quite a bit.
Ref #26381 #18734 #18927

In short:

  • switch codegen of Float16 over to half from i16
  • move the software implementation to a Runtime module
  • use @ccallable to expose those functions under the expected intrinsic names

I went with a Julia runtime library, and not the Real Thing (tm), such that these changes are minimally invasive and just move the current software implementation around. We can always decide later to use LLVM's version. On OSX however, xcode already links LLVM's compiler-rt, so I disabled registration of our runtime functions there (alternatively, @vchuravy mentions we could customize the library names in the TLI so that we use the same software implementation across platforms).
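
As a rough illustration of that pattern, here is a minimal sketch (the module name, the bit-twiddling helper, and the exact registration details are made up for this example, and the runtime functions were later ported to C anyway): a pure-Julia Float16-to-Float32 conversion exposed via Base.@ccallable under one of the soft-float libcall names LLVM can emit for fpext half to float (assumed here to be __gnu_h2f_ieee).

# Hypothetical sketch, not the PR's actual code.
module Float16Runtime

# Software extension of a Float16 bit pattern to Float32.
function half_to_float(h::UInt16)
    sgn  = UInt32(h & 0x8000) << 16          # sign bit, moved to position 31
    expo = (h >> 10) & 0x1f                  # 5-bit exponent
    frac = UInt32(h & 0x03ff)                # 10-bit significand
    if expo == 0x1f                          # Inf or NaN
        bits = sgn | 0x7f800000 | (frac << 13)
    elseif expo == 0                         # signed zero or subnormal
        mag = ldexp(Float32(frac), -24)      # subnormal value is frac * 2^-24
        return sgn == 0 ? mag : -mag
    else                                     # normal: rebias exponent 15 -> 127
        bits = sgn | ((UInt32(expo) + 0x00000070) << 23) | (frac << 13)
    end
    return reinterpret(Float32, bits)
end

# Expose the helper under the libcall name assumed above, so codegen can
# resolve the symbol at runtime when half is not a legal type.
Base.@ccallable function __gnu_h2f_ieee(h::UInt16)::Float32
    return half_to_float(h)
end

end # module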

TODO:

range tests fail locally due to this:

function muladd(x, y, z)
    x*y + z
end
muladd(Float16(0.5283), Float16(0.584), Float16(-6.294e-5))
# Float16(0.3086) on master
# Float16(0.3083) on this PR

0.3083 is more accurate (Float16(0.5283 * 0.584 + -6.294e-5) == Float16(0.3083)), but I'd think that shouldn't happen as the result differs from computing incrementally:

julia> Float16(0.5283) * Float16(0.584)
Float16(0.3086)

julia> ans + Float16(-6.294e-5)
Float16(0.3086)

I don't see any muladd combination happening at the LLVM level, but the generated assembly performs the arithmetic in single instead of half precision (vmulss + vaddss).

Thoughts? Is this legal, and should the tests be fixed? @vchuravy mentions that the same happens with FP32 operations benefiting from x87's extended 80-bit precision, and the same reasoning could be applied here.
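
To make the two behaviours concrete, here is a small sketch using only the values quoted above; the explicit conversions are illustrative, not code from the PR:

x, y, z = Float16(0.5283), Float16(0.584), Float16(-6.294e-5)

# Round back to Float16 after each operation (the master / i16 semantics);
# this matches the incremental REPL computation above and gives Float16(0.3086):
p = Float16(Float32(x) * Float32(y))
step_by_step = Float16(Float32(p) + Float32(z))

# Widen once, do both operations in Float32, and round a single time at the end,
# which is effectively what the vmulss/vaddss code generated by this PR does;
# per the report above this gives Float16(0.3083):
round_once = Float16(Float32(x) * Float32(y) + Float32(z))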

@maleadt added the compiler:codegen (Generation of LLVM IR and native code) and gpu (Affects running Julia on a GPU) labels on Sep 10, 2020
@maleadt (Member, Author) commented Sep 10, 2020

Ah, this already fails bootstrap at float.jl with JULIA_CPU_TARGET="generic"; I'll have to shuffle some code around.

@maleadt (Member, Author) commented Sep 13, 2020

Looking into the range test failure, it seems like parts of TwicePrecision are broken with this PR:

julia> x,y = Float16(0.5635), Float16(0.6133)
(Float16(0.5635), Float16(0.6133))

julia> Base.mul12(x,y)
(Float16(0.3457), 0.0)

Whereas before:

julia> x,y = Float16(0.5635), Float16(0.6133)
(Float16(0.5635), Float16(0.6133))

julia> Base.mul12(x,y)
(Float16(0.3455), Float16(0.0001106))

... which is more correct (BigFloat(x) * BigFloat(y) = 0.345569610595703125). The error seems to originate from canonicalize2:

julia> fma(x,y,-(x*y))
Float16(0.0001106)

# before
julia> Base.canonicalize2(x*y, Float16(0.0001106))
(Float16(0.3455), Float16(0.0001106))

# after
julia> Base.canonicalize2(x*y, Float16(0.0001106))
(Float16(0.3455), 0.0)

# after, but in a function (changing intermediate precision)
julia> f(x,y) = Base.canonicalize2(x*y, Float16(0.0001106))
f (generic function with 2 methods)

julia> f(x,y)
(Float16(0.3457), 0.0)

I need to look at this some more, but @timholy, maybe you already have some thoughts, since you wrote those utilities and range tests.
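
For reference, canonicalize2 is a Fast2Sum-style error-free transformation; a sketch of the idea (close in spirit to, though not necessarily verbatim, the Base definition) is:

function canonicalize2_sketch(big::T, little::T) where {T<:AbstractFloat}
    h  = big + little         # must round to T for the next line to be meaningful
    lo = (big - h) + little   # recovers the rounding error of big + little
    return h, lo
end

If the addition and subtractions are instead carried out in Float32 and only truncated at the end, the sum in this example is exact at the wider precision, so (big - h) + little cancels to exactly zero, which is the 0.0 low word seen in the outputs above.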

@yuyichao (Contributor) commented:

Note that, aside from the multiversioning issue, this is really going to throw off the GC root analysis. We assume normal LLVM intrinsics do not call managed code, and this can easily break that (though it'll certainly be fine most of the time). None of these seem too complex for a C implementation (and I assume they are usually implemented in C anyway).

@maleadt force-pushed the tb/half branch 3 times, most recently from bf1c999 to f9c0c1e on October 6, 2020 12:42
Two review threads on src/intrinsics.cpp (outdated, resolved)
@maleadt (Member, Author) commented Oct 6, 2020

Ported the Float16 software intrinsics to C per @yuyichao's request.

The remaining failures are in the ranges test suite, caused by the twice-precision arithmetic breaking down because Float16 calculations are actually performed in Float32 (so canonicalize2 and friends don't really work anymore). I couldn't find a way to convince LLVM to preserve 16-bit accuracy, or a way to get the current implementation working (except by forcing everything @noinline so that values are returned in 16-bit registers, but that isn't an option, of course). Maybe @timholy or @simonbyrne have some thoughts? FWIW, the range mentioned at the top of twiceprecision.jl works: (Float16(0.1):Float16(0.1):Float16(0.3))[3] == Float16(0.3).

@maleadt (Member, Author) commented Oct 7, 2020

@nanosoldier runtests(ALL, vs = ":master")

@maleadt (Member, Author) commented Oct 7, 2020

Looking a bit closer at the twice precision code used by ranges:

# Use TwicePrecision only for Float64; use Float64 for T<:Union{Float16,Float32}

So IIUC Float16 ranges don't even use TwicePrecision (all of the constructors I tested indeed use Float64 to increase precision), so I've disabled the (now) broken tests of those utilities on Float16 values.

@nanosoldier (Collaborator) commented:

Your package evaluation job has completed - possible new issues were detected. A full report can be found here. cc @maleadt

@maleadt (Member, Author) commented Oct 7, 2020

@nanosoldier runtests(["BitIntegers", "BoltzmannMachines", "DIVAnd", "FastTransforms", "Graph500", "ImageFeatures", "IntervalTrees", "MIToS", "QuadGK", "RandomizedPropertyTest", "Revise", "Sherlogs", "ThreadPools", "YAActL"], vs = ":master")

@nanosoldier (Collaborator) commented:

Your package evaluation job has completed - possible new issues were detected. A full report can be found here. cc @maleadt

@maleadt (Member, Author) commented Oct 7, 2020

Looking into test failures:

  • BitIntegers: has special-casing for Float16 that can just be removed. It relied on the Float16(x::Integer) constructor we used to have, whereas for other floating-point types we only have conversions from Base's integer types. Will fix.
  • FastTransforms: legit issue, reduced to sinpi(Float16(-0.5)) returning different results. Will fix.
  • ImageFeatures: FREAK algorithm issues because results are now more accurate: Float16(743.5) - Float16(0.9165) = Float16(742.5), which used to be rounded to 742, but now the intermediate computation happens in Float32 which yields 742.5835 and rounds to 743. That seems to cause significant differences in final results, though (cc @timholy).
  • MIToS: small accuracy difference in final result of a test, 15.2 vs 15.21 (cc @diegozea)
  • QuadGK: small accuracy differences in final result of a test, 0.1047 vs 0.105, 0.4504 vs 0.451.

Debugging that sinpi difference, it turns out we have other TwicePrecision-like operations:

rx = x-((x+t)-t) # zeros may be incorrectly signed

# expected result:
x = Float16(0.5)
s = maxintfloat(T) / 2 = Float16(1.024e3)
t = 3s = Float16(3.072e3)
rx = x-((x+t)-t) = Float16(0.5)

# with Float16 as half (which performs operations as Float32 on x86):
x = Float16(0.5)
s = maxintfloat(T) / 2 = Float16(1.024e3)
t = 3s = Float16(3.072e3)
rx = x-((x+t)-t) = 0.0
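
To spell that out with the values above, here is a sketch in which the explicit conversions only serve to force where rounding happens (they are not code from sinpi or this PR):

x = Float16(0.5)
t = Float16(3.072e3)

# Per-operation Float16 rounding: x + t rounds back down to Float16(3.072e3),
# so (x + t) - t == 0 and rx recovers Float16(0.5), the expected result above.
rx_strict = x - (Float16(Float32(x) + Float32(t)) - t)

# Float32 intermediates: 0.5f0 + 3072f0 == 3072.5f0 survives exactly, the add
# and subtract cancel, and rx degenerates to 0.0, as reported for this PR.
rx_widened = Float16(Float32(x) - ((Float32(x) + Float32(t)) - Float32(t)))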

@simonbyrne (Contributor) commented:

So LLVM uses Float32 intermediate precision? I wonder what Swift does here, since they now offer Float16 support as well.

If that is the case, I think the easiest solution would be to promote to Float32 wherever we use these extended precision tricks.

@simonbyrne (Contributor) commented:

Alternatively we could convert to Float32 and back again for all functions (which will give correctly rounded results for all basic arithmetic ops except fma). I'll try to explore this a bit tonight.
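
A sketch of that strategy (the helper names below are made up for illustration, not an API from Base or this PR): widen each operand to Float32, compute there, and truncate the result back to Float16.

add16(a::Float16, b::Float16) = Float16(Float32(a) + Float32(b))
mul16(a::Float16, b::Float16) = Float16(Float32(a) * Float32(b))

# A single multiply done this way is correctly rounded: the exact product of two
# 11-bit significands needs at most 22 bits, which fits in Float32's 24-bit
# significand, so the only rounding is the final conversion back to Float16.
mul16(Float16(0.5635), Float16(0.6133)) == Float16(0.5635) * Float16(0.6133)

As noted above, the exception is fma, where the widened computation can round differently from a true fused Float16 fma.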

@vchuravy (Member) commented Oct 8, 2020

Looking at the PRs that added Float16, they generally seem to promote Float16 to Float32 (apple/swift-numerics#136). CUDA has a similar strategy: it uses faster, less precise Float32 routines and then converts to Float16, fixing up some of the bit patterns. I haven't seen anyone really attempt to define Float16 math generically.

But yes LLVM performs Float16 operations in extended precision (since there are no Float16 registers), and if my understanding is correct that is indeed legal...

@simonbyrne (Contributor) commented:

> But yes LLVM performs Float16 operations in extended precision (since there are no Float16 registers), and if my understanding is correct that is indeed legal...

I'm not sure what you mean by legal here: we currently define all operations to return results of the same type as their inputs. Switching to allowing higher intermediate precision would be a significant breaking change as it fundamentally breaks the semantics of the language (as this PR shows).

Can you provide any links on how CUDA handles this? If there is no way to obtain performant code on GPUs without weakening these guarantees, then it's worth considering, but it is worth noting that Swift also explicitly provides the same strict behaviour.

vchuravy and others added 3 commits October 14, 2020 11:59
The Julia runtime wasn't safe wrt. GC operations.
I left the commit in for archival purposes, in case
we want to revisit this again later.
@maleadt force-pushed the tb/half branch 3 times, most recently from 34daf18 to 2c77780 on October 14, 2020 11:56
@maleadt (Member, Author) commented Oct 14, 2020

This should be complete now, and passes all tests locally. Let's do another PkgEval run:

@nanosoldier runtests(ALL, vs = ":master")

 #define fpext(pr, a) \
-    if (!(osize > 8 * sizeof(a))) \
-        jl_error("fpext: output bitsize must be > input bitsize"); \
+    if (!(osize >= 8 * sizeof(a))) \
@maleadt (Member, Author) commented:

This is the only behavioral change: a no-op fpext is now allowed, which simplifies the implementation of the Float16 intrinsics. It shouldn't matter in practice.

A reviewer (Member) commented:

It’s invalid for LLVM IR, so I’d just apply the same restriction here.

@maleadt (Member, Author) commented Oct 14, 2020

And FWIW, moving the pass that truncates/extends Float16 operations earlier in the pipeline (to let LLVM optimize) gives the following for mul12:

@@ -1,15 +1,9 @@
 julia> @code_llvm Base.mul12(x,y)
 define [2 x half] @julia_mul12(half %0, half %1) #0 {
 top:
-    %0 = fpext half %0 to float
-    %1 = fpext half %1 to float
-    %2 = fmul float %0, %1
-    %2 = fptrunc float %2 to half
+    %2 = fmul half %0, %1
     %3 = fcmp une half %2, 0xH0000
-    %2 = fpext half %2 to float
-    %2 = fpext half %2 to float
-    %4 = fsub float %2, %2
-    %4 = fptrunc float %4 to half
+    %4 = fsub half %2, %2
     %5 = fcmp oeq half %4, 0xH0000
     %6 = fneg half %2
     %7 = fpext half %0 to float
@@ -17,18 +11,9 @@
     %9 = fpext half %6 to float
     %10 = call float @llvm.fma.f32(float %7, float %8, float %9)
     %11 = fptrunc float %10 to half
-    %2 = fpext half %2 to float
-    %11 = fpext half %11 to float
-    %12 = fadd float %2, %11
-    %12 = fptrunc float %12 to half
-    %2 = fpext half %2 to float
-    %12 = fpext half %12 to float
-    %13 = fsub float %2, %12
-    %13 = fptrunc float %13 to half
-    %13 = fpext half %13 to float
-    %11 = fpext half %11 to float
-    %14 = fadd float %13, %11
-    %14 = fptrunc float %14 to half
+    %12 = fadd half %2, %11
+    %13 = fsub half %2, %12
+    %14 = fadd half %13, %11
     %15 = and i1 %3, %5
     %.fca.0.insert7 = insertvalue [2 x half] undef, half %2, 0

Cleans up a lot, and keeps the ranges TwicePrecision tests working, but I don't think this is legal wrt. our Float16 semantics. For example, the initial mul + cmp might behave differently under Float32 extended precision. I think I'll just add a GVN pass to clean up identical conversions (instcombine is too aggressive, and is responsible for most of the changes seen above).

@vchuravy (Member) left a comment:

This is looking great! Fantastic work Tim.

Should we try turning on native Float16 support on AArch64/ARM?

In https://reviews.llvm.org/D57188 they are turned on for all AArch64 targets:

// All AArch64 implementations support ARMv8 FP, which makes half a legal type.
HasLegalHalfType = true;
HasFloat16 = true;

CC: @yuyichao

@yuyichao (Contributor) commented:

AFAICT half is a native type on AArch64, though arithmetic on it is an optional feature. That's why the ccall ABI for AArch64 already uses the native half type, IIRC. I don't know what the desired property is here.

@maleadt (Member, Author) commented Oct 14, 2020

Happy to take a look at enabling native half on ARM, but I'd prefer to get this merged first before the 1.6 branch.

Also, this already greatly improves performance on CPUs that don't need to call into the runtime, so I guess we can close #29889:

julia> @benchmark $(zeros(Float16 , 10^8)) .*= $(1)
  median time:      97.775 ms (0.00% GC)

julia> @benchmark $(zeros(Float32 , 10^8)) .*= $(1)
  median time:      30.530 ms (0.00% GC)

julia> @benchmark $(zeros(Float64 , 10^8)) .*= $(1)
  median time:      62.900 ms (0.00% GC)

julia> @benchmark $(zeros(Float16 , 10^8)) .*= $(Float16(1.0))
  median time:      97.970 ms (0.00% GC)

julia> @benchmark $(zeros(Float16 , 10^8)) .*= $(1.0)
  median time:      240.721 ms (0.00% GC)

vs. before

julia> @benchmark $(zeros(Float16 , 10^8)) .*= $(1)
BenchmarkTools.Trial: 
  median time:      947.835 ms (0.00% GC)

julia> @benchmark $(zeros(Float32 , 10^8)) .*= $(1)
BenchmarkTools.Trial: 
  median time:      30.465 ms (0.00% GC)

julia> @benchmark $(zeros(Float64 , 10^8)) .*= $(1)
BenchmarkTools.Trial: 
  median time:      61.471 ms (0.00% GC)

julia> @benchmark $(zeros(Float16 , 10^8)) .*= $(Float16(1.0))
BenchmarkTools.Trial: 
  median time:      653.172 ms (0.00% GC)

julia> @benchmark $(zeros(Float16 , 10^8)) .*= $(1.0)
BenchmarkTools.Trial: 
  median time:      428.816 ms (0.00% GC)

@vchuravy (Member) commented:

https://www.keil.com/support/man/docs/armclang_ref/armclang_ref_sex1519040854421.htm

> Also, _Float16 arithmetic operations directly map to Armv8.2-A half-precision floating-point instructions when they are enabled on Armv8.2-A and later architectures. This avoids the need for conversions to and from single-precision floating-point, and therefore results in more performant code. If the Armv8.2-A half-precision floating-point instructions are not available, _Float16 values are automatically promoted to single-precision, similar to the semantics of __fp16 except that the results continue to be stored in single-precision floating-point format instead of being converted back to half-precision floating-point format.

So we would need to test for Armv8.2-A and +fp16, i.e. HWCAP_FPHP? Would it be sufficient to check the target-features function attribute, given that the demoteFloat16 pass runs after multi-versioning? Or does that attribute only include the explicitly requested features?

@nanosoldier (Collaborator) commented:

Your package evaluation job has completed - possible new issues were detected. A full report can be found here. cc @maleadt

@maleadt (Member, Author) commented Oct 14, 2020

@nanosoldier runtests(["DynamicalBilliards", "ExcelReaders", "FameSVD", "FlashWeave", "Genie", "ITensors", "IntervalTrees", "LoopThrottle", "Manifolds", "MemPool", "OptimKit", "Pidfile"], vs = ":master")

@nanosoldier (Collaborator) commented:

Your package evaluation job has completed - no new issues were detected. A full report can be found here. cc @maleadt

@vchuravy (Member) commented:

> Happy to take a look at enabling native half on ARM, but I'd prefer to get this merged first before the 1.6 branch.

Merging this first is okay by me. I am excited to see this land.

@maleadt (Member, Author) commented Oct 15, 2020

CI and PkgEval are clean, so let's merge this! Happy to do some follow-up development if anybody still wants to review post-merge, but I'd like to get this in before the 1.6 branch.

@vchuravy (Member) commented:

GCC 12 now also supports this, with _Float16 and -fexcess-precision=16: https://godbolt.org/z/jq65f1Mo3
