Switch Float16 to LLVM's half #37510
Conversation
Ah, this fails bootstrap at ...
Force-pushed from f001564 to c2db424.
Looking into the following:

```julia
julia> x, y = Float16(0.5635), Float16(0.6133)
(Float16(0.5635), Float16(0.6133))

julia> Base.mul12(x, y)
(Float16(0.3457), 0.0)
```

Whereas before:

```julia
julia> x, y = Float16(0.5635), Float16(0.6133)
(Float16(0.5635), Float16(0.6133))

julia> Base.mul12(x, y)
(Float16(0.3455), Float16(0.0001106))
```

... which is more correct:

```julia
julia> fma(x, y, -(x*y))
Float16(0.0001106)

# before
julia> Base.canonicalize2(x*y, Float16(0.0001106))
(Float16(0.3455), Float16(0.0001106))

# after
julia> Base.canonicalize2(x*y, Float16(0.0001106))
(Float16(0.3455), 0.0)

# after, but in a function (changing intermediate precision)
julia> f(x,y) = Base.canonicalize2(x*y, Float16(0.0001106))
f (generic function with 2 methods)

julia> f(x,y)
(Float16(0.3457), 0.0)
```

I need to look at this some more, but @timholy maybe you have some thoughts already, as you wrote those utilities & range tests.
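For reference, a minimal sketch (not Base's exact code) of the fma-based two-product that `Base.mul12` builds on; the splitting only works if every operation really rounds to Float16, which is why wider intermediate precision perturbs the results above:

```julia
# Dekker/fma-style two-product: split x*y into the Float16-rounded high part plus the
# exact rounding error.  This is roughly what Base.mul12 computes before canonicalization.
function two_prod_sketch(x::Float16, y::Float16)
    hi = x * y              # must be the Float16-rounded product
    lo = fma(x, y, -hi)     # exact error of that rounding
    return hi, lo
end

two_prod_sketch(Float16(0.5635), Float16(0.6133))
# expected (Float16(0.3455), Float16(0.0001106)), matching the "before" output above
```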
Note that other than the multiversioning issue, this is really going to throw off the GC root analysis. We assume normal LLVM intrinsics do not call managed code, and this can easily break that (though it'll certainly be fine most of the time). It doesn't seem that any of these are too complex for a C implementation (and I assume they usually are implemented in C anyway).
Force-pushed from bf1c999 to f9c0c1e.
Ported the Float16 software intrinsics to C per @yuyichao's request. The remaining failures are in the `range` tests.
@nanosoldier
Looking a bit closer at the twice-precision code used by ranges (line 317 at commit 8987f7a): so IIUC, Float16 ranges don't even use TwicePrecision (all of the constructors I tested indeed used Float64 to increase precision), so I've disabled the (now) broken tests of those utilities on Float16 values.
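As a quick way to see what the comment above describes (a sketch; the exact type parameters vary across Julia versions), one can inspect the range types directly:

```julia
# Float16 (and Float32) ranges store their high-precision reference/step as Float64,
# whereas Float64 ranges use TwicePrecision.
typeof(Float16(0.1):Float16(0.1):Float16(1.0))  # StepRangeLen{Float16, Float64, Float64, ...}
typeof(0.1:0.1:1.0)                             # StepRangeLen{Float64, Base.TwicePrecision{Float64}, ...}
```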
Your package evaluation job has completed - possible new issues were detected. A full report can be found here. cc @maleadt

@nanosoldier

Your package evaluation job has completed - possible new issues were detected. A full report can be found here. cc @maleadt
Looking into the test failures. Debugging the one at line 764 of 8987f7a:

```
# expected result:
x = Float16(0.5)
s = maxintfloat(T) / 2 = Float16(1.024e3)
t = 3s = Float16(3.072e3)
rx = x-((x+t)-t) = Float16(0.5)

# with Float16 as half (which performs operations as Float32 on x86):
x = Float16(0.5)
s = maxintfloat(T) / 2 = Float16(1.024e3)
t = 3s = Float16(3.072e3)
rx = x-((x+t)-t) = 0.0
```
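For clarity, the two behaviours can be emulated explicitly (a worked sketch using the values above, with `Float32` conversions standing in for what the hardware does):

```julia
x = Float16(0.5)
s = maxintfloat(Float16) / 2          # Float16(1.024e3)
t = Float16(3) * s                    # Float16(3.072e3)

# honoring Float16 rounding at every step: x + t = 3072.5 rounds to 3072 in Float16,
# so the "high part" is 0 and rx recovers x
rx_f16 = x - (Float16(Float32(x) + Float32(t)) - t)           # Float16(0.5)

# keeping intermediates in Float32: nothing rounds away, the high part equals x,
# and rx collapses to 0
rx_f32 = x - Float16((Float32(x) + Float32(t)) - Float32(t))  # Float16(0.0)
```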
So LLVM uses Float32 intermediate precision? I wonder what Swift does here, since they now offer Float16 support as well. If that is the case, I think the easiest solution would be to promote to Float32 wherever we use these extended precision tricks.
Alternatively, we could convert to Float32 and back again for all functions (which will give correctly rounded results for all basic arithmetic ops except fma). I'll try to explore this a bit tonight.
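For what it's worth, that claim can be checked exhaustively; a brute-force sketch (my own, not part of the PR) using Float64 as the reference rounding:

```julia
# Check that computing a Float16 binary op by widening to Float32 and rounding back
# matches the correctly rounded result.  Float64 serves as the reference since it has
# ample precision for any single op on Float16 inputs.  All 2^32 pairs, so it's slow.
function check_via_float32(op)
    allf16 = reinterpret.(Float16, UInt16(0):UInt16(0xffff))
    for a in allf16, b in allf16
        ref = Float16(op(Float64(a), Float64(b)))   # reference rounding
        got = Float16(op(Float32(a), Float32(b)))   # widen, operate, round back
        isequal(ref, got) || return (a, b)          # report the first mismatch
    end
    return nothing                                  # nothing means the op round-trips exactly
end

check_via_float32(*)   # expected: nothing (likewise for +, -, /; fma is the known exception)
```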
Looking at the PRs that added ... But yes, LLVM performs Float16 operations in extended precision (since there are no Float16 registers), and if my understanding is correct, that is indeed legal...
I'm not sure what you mean by legal here: we currently define all operations to return results of the same type as their inputs. Switching to allowing higher intermediate precision would be a significant breaking change, as it fundamentally breaks the semantics of the language (as this PR shows). Can you provide any links on how CUDA handles this? If there is no way to obtain performant code on GPUs without weakening these guarantees then it's worth considering, but it is worth noting that Swift explicitly also provides the same strict behaviour.
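Concretely, the guarantee under discussion (a small illustration, not from the PR):

```julia
# A Float16 operation must return the Float16 rounding of the exact result,
# no matter what wider precision the hardware uses internally.
x, y = Float16(0.5635), Float16(0.6133)
typeof(x * y)                               # Float16
x * y === Float16(Float64(x) * Float64(y))  # true under the strict semantics
```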
The Julia runtime wasn't safe wrt. GC operations. I left the commit in for archival purposes, in case we want to revisit this again later.
Force-pushed from 34daf18 to 2c77780.
This should be complete now, and passes all tests locally. Let's do another PkgEval run: @nanosoldier
```diff
 #define fpext(pr, a) \
-    if (!(osize > 8 * sizeof(a))) \
+    if (!(osize >= 8 * sizeof(a))) \
         jl_error("fpext: output bitsize must be > input bitsize"); \
```
This is the only behavioral change, now allowing no-op `fpext` to simplify the implementation of the Float16 intrinsics. Shouldn't matter in practice.
It's invalid for LLVM IR, so I'd just keep the same restriction here.
And FWIW, moving the pass that truncates/extends Float16 operations earlier in the pipeline (to let LLVM optimize) gives the following diff for `Base.mul12`:

```diff
@@ -1,15 +1,9 @@
 julia> @code_llvm Base.mul12(x,y)
 define [2 x half] @julia_mul12(half %0, half %1) #0 {
 top:
-  %0 = fpext half %0 to float
-  %1 = fpext half %1 to float
-  %2 = fmul float %0, %1
-  %2 = fptrunc float %2 to half
+  %2 = fmul half %0, %1
   %3 = fcmp une half %2, 0xH0000
-  %2 = fpext half %2 to float
-  %2 = fpext half %2 to float
-  %4 = fsub float %2, %2
-  %4 = fptrunc float %4 to half
+  %4 = fsub half %2, %2
   %5 = fcmp oeq half %4, 0xH0000
   %6 = fneg half %2
   %7 = fpext half %0 to float
@@ -17,18 +11,9 @@
   %9 = fpext half %6 to float
   %10 = call float @llvm.fma.f32(float %7, float %8, float %9)
   %11 = fptrunc float %10 to half
-  %2 = fpext half %2 to float
-  %11 = fpext half %11 to float
-  %12 = fadd float %2, %11
-  %12 = fptrunc float %12 to half
-  %2 = fpext half %2 to float
-  %12 = fpext half %12 to float
-  %13 = fsub float %2, %12
-  %13 = fptrunc float %13 to half
-  %13 = fpext half %13 to float
-  %11 = fpext half %11 to float
-  %14 = fadd float %13, %11
-  %14 = fptrunc float %14 to half
+  %12 = fadd half %2, %11
+  %13 = fsub half %2, %12
+  %14 = fadd half %13, %11
   %15 = and i1 %3, %5
   %.fca.0.insert7 = insertvalue [2 x half] undef, half %2, 0
```

Cleans up a lot, and keeps the ...
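To reproduce this locally (using the values from earlier in the thread), the IR can be inspected with `@code_llvm`:

```julia
using InteractiveUtils  # provides @code_llvm (loaded automatically in the REPL)

x, y = Float16(0.5635), Float16(0.6133)
@code_llvm debuginfo=:none Base.mul12(x, y)
```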
This is looking great! Fantastic work Tim.

Should we try turning on native Float16 support on AArch64/ARM? In https://reviews.llvm.org/D57188 they are turned on for all AArch64 targets:

```cpp
// All AArch64 implementations support ARMv8 FP, which makes half a legal type.
HasLegalHalfType = true;
HasFloat16 = true;
```

CC: @yuyichao

AFAICT ...
Happy to take a look at enabling native Float16 support. Also, this greatly improves performance already on CPUs that don't need to call into the runtime, so I guess we can close #29889 (benchmark output, new vs. before, omitted).
https://www.keil.com/support/man/docs/armclang_ref/armclang_ref_sex1519040854421.htm

So we would need to test for Armv8.2-A and ...
Your package evaluation job has completed - possible new issues were detected. A full report can be found here. cc @maleadt

@nanosoldier

Your package evaluation job has completed - no new issues were detected. A full report can be found here. cc @maleadt

Merging this first is okay by me. I am excited to see this land.

CI and PkgEval are clean, so let's merge this! Happy to do some follow-up development if anybody still wants to review post-merge, but I'd like to get this in before the 1.6 branch.
GCC 12 now also supports this: https://godbolt.org/z/jq65f1Mo3
This PR builds on @vchuravy's many experiments switching to LLVM's `half` type for Float16, instead of our current `i16` software implementation, and the accompanying integration with (a Julia version of) compiler-rt to implement the necessary intrinsics. My motivation for this is to improve GPU codegen, where the `i16` software implementation hurts us quite a bit.

Ref #26381 #18734 #18927
In short:
- switch codegen to LLVM's `half` from `i16`
- implement the Float16 software intrinsics in a Julia runtime library
- use `@ccallable` to expose those functions under the expected intrinsic names (see the toy sketch below)

I went with a Julia runtime library, and not the Real Thing (tm), such that these changes are minimally invasive and just move the current software implementation around. We can always decide later to use LLVM's version. On macOS however, Xcode already links LLVM's compiler-rt, so I disabled registration of our runtime functions there (alternatively, @vchuravy mentions we could customize the library names in the TLI so that we use the same software implementation across platforms).
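For readers unfamiliar with the mechanism, here is a toy sketch of the `@ccallable` approach; the name, signature, and bit-level ABI below are illustrative assumptions, not this PR's actual code:

```julia
# Export a Julia function under a fixed C-callable symbol, so that calls emitted by
# LLVM for a compiler-rt-style routine can resolve to a Julia implementation.
Base.@ccallable function demo_truncdfhf2(x::Float64)::UInt16
    # round to Float16 and hand back the raw IEEE binary16 bits
    # (a real runtime routine must avoid re-entering the Float16 codegen it implements)
    return reinterpret(UInt16, Float16(x))
end
```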
TODO:
- the `range` tests fail locally due to this (test output omitted): `0.3083` is more accurate (`Float16(0.5283 * 0.584 + -6.294e-5) == Float16(0.3083)`), but I'd think that shouldn't happen, as the result differs from computing the operations incrementally (worked out below). I don't see any `muladd` combination happening at the LLVM level, but the generated assembly performs the arithmetic in single instead of half precision (`vmulss` + `vaddss`).

Thoughts? Is this legal, and should the tests be fixed? @vchuravy mentions that the same happens with FP32 operations benefiting from x87's extended 80-bit precision, and the same reasoning could be applied here.
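To make the TODO item concrete, here is the discrepancy worked out with explicit conversions (my own illustration; `Float32` stands in for the wider precision the generated code uses):

```julia
x, y, z = Float16(0.5283), Float16(0.584), Float16(-6.294e-5)

step = Float16(Float32(x) * Float32(y))         # Float16(0.3086): product rounded to Float16
Float16(Float32(step) + Float32(z))             # Float16(0.3086): rounding after every op
Float16(Float32(x) * Float32(y) + Float32(z))   # Float16(0.3083): a single rounding at the end,
                                                # which is what the single-precision codegen yields
```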