Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversion from Float64 to Float16 not always rounded to nearest #40315

Open
goualard-f opened this issue Apr 2, 2021 · 12 comments
Open

Conversion from Float64 to Float16 not always rounded to nearest #40315

goualard-f opened this issue Apr 2, 2021 · 12 comments
Labels

Comments

@goualard-f
Copy link

Numbers 63328.0 and 63360.0 are perfectly representable in the Float16 format (with nextfloat(Float16(63328.0)) == Float16(63360.0)). The middle of the interval [63328.0,63360.0] is 63344.0. All Float64 in [63344.0, 63360.0] should round to Float16(63360.0); all Float64 in [63328.0,63344.0) should round to Float16(63328.0).

Yet with Julia 1.6.0, we have:

julia> convert(Float16,63343.99805) == 63328.0
false
julia> convert(Float16,63343.99805) == 63360.0
true

@oscardssmith
Copy link
Member

Can you test the following to see if you can find any issues?

function gentables()
	BASE  = zeros(UInt16, 512)
	SHIFT = zeros(UInt8,  512)
	for i in 1:256
	    e=i-128
	    if e<-24
			BASE[i|0x000]=0x0000
			BASE[i|0x100]=0x8000
			SHIFT[i|0x000]=24
			SHIFT[i|0x100]=24;
		elseif e<-14
			BASE[i|0x000]=(0x0400>>(-e-14))
			BASE[i|0x100]=(0x0400>>(-e-14)) | 0x8000
			SHIFT[i|0x000]=-e-1
			SHIFT[i|0x100]=-e-1
		elseif e<=15
			BASE[i|0x000]=((e+15)<<10)
			BASE[i|0x100]=((e+15)<<10) | 0x8000
			SHIFT[i|0x000]=13
			SHIFT[i|0x100]=13
		elseif(e<128)
			BASE[i|0x000]=0x7C00
			BASE[i|0x100]=0xFC00
			SHIFT[i|0x000]=24
			SHIFT[i|0x100]=24
		else
			BASE[i|0x000]=0x7C00
			BASE[i|0x100]=0xFC00
			SHIFT[i|0x000]=13
			SHIFT[i|0x100]=13
		end
	end
	return Tuple(BASE), Tuple(SHIFT)
end
const BASE, const SHIFT = gentables()

function Float16(x::Float32)
	xu = reinterpret(UInt32, x)
	e = (xu>>23)&0x1ff+1
	ans  = @inbounds BASE[e]
	ans += unsafe_trunc(Int16, ((xu&0x7fffff)>>SHIFT[e]))
	return reinterpret(Float16, ans)
end

function Float16(x::Float64)
	xu = reinterpret(UInt64, x)
	e = xu>>52
	e = ifelse(e<2048, clamp(e-895,  1,   256), 
					   clamp(e-2687, 257, 512))
	ans  = @inbounds BASE[e]
	ans += @inbounds unsafe_trunc(Int16, ((xu&0xfffffffffffff)>>(SHIFT[e]+29)))
	return reinterpret(Float16, ans)
end

@kimikage
Copy link
Contributor

kimikage commented Apr 3, 2021

xref: #40245, #37510

@goualard-f
Copy link
Author

goualard-f commented Apr 3, 2021

@oscardssmith I have not reviewed your code yet but I have tested it. It cures the problem shown in the example above. However, it fails more often than the current code in Julia 1.6.0 to round correctly from Float64 to Float16.

Example: the conversion of Float64(51227.0) to a Float16 gives Float16(51232.0) with Julia 1.6.0 and Float16(51200.0) with the code above, which is patently wrong.

@milankl
Copy link

milankl commented Jun 21, 2021

Yet with Julia 1.6.0, we have:

julia> convert(Float16,63343.99805) == 63328.0
false
julia> convert(Float16,63343.99805) == 63360.0
true

I get

julia> convert(Float16,63343.99805) == 63328.0
true
julia> convert(Float16,63343.99805) == 63360.0
false
julia> convert(Float16,63344.0) == 63360.0
true
julia> convert(Float16,prevfloat(63344.0)) == 63360.0
false

on 1.6.0, which seems to be correct. Which also follows the tie-to-even rule as 63344 as the tie is round up, which is the even float

julia> bitstring(Float16(63328))
"0111101110111011"    # odd

julia> bitstring(Float16(63360))
"0111101110111100"    # even

@milankl
Copy link

milankl commented Jun 21, 2021

@oscardssmith I believe gentables() is not used anymore in 1.6?

julia>  @code_llvm Float16(63344.0)
;  @ float.jl:180 within `Float16'
define half @julia_Float16_185(double %0) {
top:
  %1 = fptrunc double %0 to half
  ret half %1
}

However, in Julia 1.5.2

julia> @code_llvm Float16(63344.0)

;  @ float.jl:252 within `Float16'
define i16 @julia_Float16_77(double) {
top:
; ┌ @ float.jl:251 within `Float32'
   %1 = fptrunc double %0 to float
; └
;  @ float.jl:252 within `Float16' @ float.jl:182
; ┌ @ essentials.jl:414 within `reinterpret'
   %2 = bitcast float %1 to i32
; └
;  @ float.jl:252 within `Float16' @ float.jl:183
; ┌ @ float.jl:536 within `isnan'
; │┌ @ float.jl:455 within `!='
    %3 = fcmp ord float %1, 0.000000e+00
; └└
  br i1 %3, label %L14, label %L5

L5:                                               ; preds = %top
;  @ float.jl:252 within `Float16' @ float.jl:184
; ┌ @ int.jl:455 within `>>'
   %4 = lshr i32 %2, 16
...

@goualard-f
Copy link
Author

@milankl I stand by what I said previously. On Julia 1.6.1, I get:

julia> convert(Float16,63343.99805) == 63328.0
false

julia> convert(Float16,63343.99805) == 63360.0
true

julia> convert(Float16,63344.0) == 63360.0
true

julia> convert(Float16,prevfloat(63344.0)) == 63360.0
true

I cannot explain the difference with what you obtain, though.

@milankl
Copy link

milankl commented Jun 21, 2021

Just checked on 1.6.1 too, where I reproduce my results from 1.6.0 - as it should be. However, I see that with <1.6 the look-up table from gentables() is used, which, I can confirm produces the wrong results as you are posting them. What's your @code_llvm output (see above)? I guess for some reason your Julia version doesn't use LLVM's half type, i.e. the fptrunc function.

@goualard-f
Copy link
Author

@milankl Here you go:

julia> @code_llvm Float16(63344.0)
;  @ float.jl:180 within `Float16'
define half @julia_Float16_1735(double %0) {
top:
  %1 = fptrunc double %0 to half
  ret half %1
}

Seems identical to your output. FWIW, I am running the 64 bits "Generic Linux on x86" version.

@milankl
Copy link

milankl commented Jun 21, 2021

Okay, I get the correct results on macOS (x86), linux (arm, A64FX), but can confirm that there's a bug on linux x86.

@maleadt
Copy link
Member

maleadt commented Jun 22, 2021

You should look at code_native; half is used at the LLVM level, but not supported by most back-ends, so we demote Float16 operations to Float32 arithmetic using an LLVM pass. This does rely on table-based conversions, but implemented in C here:

julia/src/intrinsics.cpp

Lines 1301 to 1493 in 9f32653

// float16 intrinsics
// TODO: use LLVM's compiler-rt
static inline float half_to_float(uint16_t ival)
{
uint32_t sign = (ival & 0x8000) >> 15;
uint32_t exp = (ival & 0x7c00) >> 10;
uint32_t sig = (ival & 0x3ff) >> 0;
uint32_t ret;
if (exp == 0) {
if (sig == 0) {
sign = sign << 31;
ret = sign | exp | sig;
}
else {
int n_bit = 1;
uint16_t bit = 0x0200;
while ((bit & sig) == 0) {
n_bit = n_bit + 1;
bit = bit >> 1;
}
sign = sign << 31;
exp = ((-14 - n_bit + 127) << 23);
sig = ((sig & (~bit)) << n_bit) << (23 - 10);
ret = sign | exp | sig;
}
}
else if (exp == 0x1f) {
if (sig == 0) { // Inf
if (sign == 0)
ret = 0x7f800000;
else
ret = 0xff800000;
}
else // NaN
ret = 0x7fc00000 | (sign << 31) | (sig << (23 - 10));
}
else {
sign = sign << 31;
exp = ((exp - 15 + 127) << 23);
sig = sig << (23 - 10);
ret = sign | exp | sig;
}
float fret;
memcpy(&fret, &ret, sizeof(float));
return fret;
}
// float to half algorithm from:
// "Fast Half Float Conversion" by Jeroen van der Zijp
// ftp://ftp.fox-toolkit.org/pub/fasthalffloatconversion.pdf
//
// With adjustments for round-to-nearest, ties to even.
static uint16_t basetable[512] = {
0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
0x0000, 0x0000, 0x0000, 0x0400, 0x0800, 0x0c00, 0x1000, 0x1400, 0x1800, 0x1c00, 0x2000,
0x2400, 0x2800, 0x2c00, 0x3000, 0x3400, 0x3800, 0x3c00, 0x4000, 0x4400, 0x4800, 0x4c00,
0x5000, 0x5400, 0x5800, 0x5c00, 0x6000, 0x6400, 0x6800, 0x6c00, 0x7000, 0x7400, 0x7800,
0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00,
0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00,
0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00,
0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00,
0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00,
0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00,
0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00,
0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00,
0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00,
0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00,
0x7c00, 0x7c00, 0x7c00, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000,
0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000,
0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000,
0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000,
0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000,
0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000,
0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000,
0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000,
0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000,
0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000,
0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8400, 0x8800, 0x8c00, 0x9000, 0x9400,
0x9800, 0x9c00, 0xa000, 0xa400, 0xa800, 0xac00, 0xb000, 0xb400, 0xb800, 0xbc00, 0xc000,
0xc400, 0xc800, 0xcc00, 0xd000, 0xd400, 0xd800, 0xdc00, 0xe000, 0xe400, 0xe800, 0xec00,
0xf000, 0xf400, 0xf800, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00,
0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00,
0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00,
0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00,
0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00,
0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00,
0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00,
0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00,
0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00,
0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00,
0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00};
static uint8_t shifttable[512] = {
0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19,
0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19,
0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19,
0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19,
0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19,
0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19,
0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19,
0x19, 0x19, 0x19, 0x19, 0x18, 0x17, 0x16, 0x15, 0x14, 0x13, 0x12, 0x11, 0x10, 0x0f,
0x0e, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d,
0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d,
0x0d, 0x0d, 0x0d, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18,
0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18,
0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18,
0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18,
0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18,
0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18,
0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18,
0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18,
0x18, 0x18, 0x18, 0x0d, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19,
0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19,
0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19,
0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19,
0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19,
0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19,
0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19,
0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x18, 0x17, 0x16, 0x15, 0x14, 0x13,
0x12, 0x11, 0x10, 0x0f, 0x0e, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d,
0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d,
0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18,
0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18,
0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18,
0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18,
0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18,
0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18,
0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18,
0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18,
0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x0d};
static inline uint16_t float_to_half(float param)
{
uint32_t f;
memcpy(&f, &param, sizeof(float));
if (isnan(param)) {
uint32_t t = 0x8000 ^ (0x8000 & ((uint16_t)(f >> 0x10)));
return t ^ ((uint16_t)(f >> 0xd));
}
int i = ((f & ~0x007fffff) >> 23);
uint8_t sh = shifttable[i];
f &= 0x007fffff;
// If `val` is subnormal, the tables are set up to force the
// result to 0, so the significand has an implicit `1` in the
// cases we care about.
f |= 0x007fffff + 0x1;
uint16_t h = (uint16_t)(basetable[i] + ((f >> sh) & 0x03ff));
// round
// NOTE: we maybe should ignore NaNs here, but the payload is
// getting truncated anyway so "rounding" it might not matter
int nextbit = (f >> (sh - 1)) & 1;
if (nextbit != 0 && (h & 0x7C00) != 0x7C00) {
// Round halfway to even or check lower bits
if ((h & 1) == 1 || (f & ((1 << (sh - 1)) - 1)) != 0)
h += UINT16_C(1);
}
return h;
}
#if !defined(_OS_DARWIN_) // xcode already links compiler-rt
extern "C" JL_DLLEXPORT float __gnu_h2f_ieee(uint16_t param)
{
return half_to_float(param);
}
extern "C" JL_DLLEXPORT float __extendhfsf2(uint16_t param)
{
return half_to_float(param);
}
extern "C" JL_DLLEXPORT uint16_t __gnu_f2h_ieee(float param)
{
return float_to_half(param);
}
extern "C" JL_DLLEXPORT uint16_t __truncdfhf2(double param)
{
return float_to_half((float)param);
}

@milankl
Copy link

milankl commented Jun 22, 2021

On 1.5.2, macOS, x86 I get (result is incorrect)

julia> @code_native Float16(63343.99805)
	.section	__TEXT,__text,regular,pure_instructions
; ┌ @ float.jl:252 within `Float16'
; │┌ @ float.jl:251 within `Float32'
	vcvtsd2ss	%xmm0, %xmm0, %xmm0
; │└
; │ @ float.jl:252 within `Float16' @ float.jl:182
; │┌ @ essentials.jl:414 within `reinterpret'
	vmovd	%xmm0, %ecx
; │└
; │ @ float.jl:252 within `Float16' @ float.jl:183
; │┌ @ float.jl:536 within `isnan'
; ││┌ @ float.jl:455 within `!='
	vucomiss	%xmm0, %xmm0
; │└└
...

On 1.6.1, macOS, x86 this is (result is correct)

julia> @code_native Float16(63343.99805)
	.section	__TEXT,__text,regular,pure_instructions
; ┌ @ float.jl:180 within `Float16'
	pushq	%rax
	movabsq	$__truncdfhf2, %rax
	callq	*%rax
	popq	%rcx
	retq
	nop
; └

which looks similar to 1.6.0, linux, x86 (but result is incorrect)

julia> @code_native Float16(63343.99805)
	.text
; ┌ @ float.jl:180 within `Float16'
	pushq	%rax
	movabsq	$__truncdfhf2, %rax
	callq	*%rax
	popq	%rcx
	retq
	nop
; └

and this is 1.6.0, linux, arm, A64FX (i.e. with Float16 hardware, result is correct)

julia> @code_native Float16(63343.99805)
	.text
; ┌ @ float.jl:180 within `Float16'
	fcvt	h0, d0
	ret
	nop
	nop
	nop
	nop
	nop
	nop
; └

@maleadt Does that make sense to you? Why would 1.6 on macOS produce correct results, but 1.5 doesn't?

@maleadt
Copy link
Member

maleadt commented Jun 22, 2021

Why would 1.6 on macOS produce correct results, but 1.5 doesn't?

Probably some platform-dependent error in that C implementation of truncdfhf2. That implementation is different from what we used to have in 1.5:

julia/base/float.jl

Lines 140 to 248 in 69fcb57

# Float32 -> Float16 algorithm from:
# "Fast Half Float Conversion" by Jeroen van der Zijp
# ftp://ftp.fox-toolkit.org/pub/fasthalffloatconversion.pdf
#
# With adjustments for round-to-nearest, ties to even.
#
let _basetable = Vector{UInt16}(undef, 512),
_shifttable = Vector{UInt8}(undef, 512)
for i = 0:255
e = i - 127
if e < -25 # Very small numbers map to zero
_basetable[i|0x000+1] = 0x0000
_basetable[i|0x100+1] = 0x8000
_shifttable[i|0x000+1] = 25
_shifttable[i|0x100+1] = 25
elseif e < -14 # Small numbers map to denorms
_basetable[i|0x000+1] = 0x0000
_basetable[i|0x100+1] = 0x8000
_shifttable[i|0x000+1] = -e-1
_shifttable[i|0x100+1] = -e-1
elseif e <= 15 # Normal numbers just lose precision
_basetable[i|0x000+1] = ((e+15)<<10)
_basetable[i|0x100+1] = ((e+15)<<10) | 0x8000
_shifttable[i|0x000+1] = 13
_shifttable[i|0x100+1] = 13
elseif e < 128 # Large numbers map to Infinity
_basetable[i|0x000+1] = 0x7C00
_basetable[i|0x100+1] = 0xFC00
_shifttable[i|0x000+1] = 24
_shifttable[i|0x100+1] = 24
else # Infinity and NaN's stay Infinity and NaN's
_basetable[i|0x000+1] = 0x7C00
_basetable[i|0x100+1] = 0xFC00
_shifttable[i|0x000+1] = 13
_shifttable[i|0x100+1] = 13
end
end
global const shifttable = (_shifttable...,)
global const basetable = (_basetable...,)
end
function Float16(val::Float32)
f = reinterpret(UInt32, val)
if isnan(val)
t = 0x8000 (0x8000 & ((f >> 0x10) % UInt16))
return reinterpret(Float16, t ((f >> 0xd) % UInt16))
end
i = ((f & ~significand_mask(Float32)) >> significand_bits(Float32)) + 1
@inbounds sh = shifttable[i]
f &= significand_mask(Float32)
# If `val` is subnormal, the tables are set up to force the
# result to 0, so the significand has an implicit `1` in the
# cases we care about.
f |= significand_mask(Float32) + 0x1
@inbounds h = (basetable[i] + (f >> sh) & significand_mask(Float16)) % UInt16
# round
# NOTE: we maybe should ignore NaNs here, but the payload is
# getting truncated anyway so "rounding" it might not matter
nextbit = (f >> (sh-1)) & 1
if nextbit != 0 && (h & 0x7C00) != 0x7C00
# Round halfway to even or check lower bits
if h&1 == 1 || (f & ((1<<(sh-1))-1)) != 0
h += UInt16(1)
end
end
reinterpret(Float16, h)
end
function Float32(val::Float16)
local ival::UInt32 = reinterpret(UInt16, val)
local sign::UInt32 = (ival & 0x8000) >> 15
local exp::UInt32 = (ival & 0x7c00) >> 10
local sig::UInt32 = (ival & 0x3ff) >> 0
local ret::UInt32
if exp == 0
if sig == 0
sign = sign << 31
ret = sign | exp | sig
else
n_bit = 1
bit = 0x0200
while (bit & sig) == 0
n_bit = n_bit + 1
bit = bit >> 1
end
sign = sign << 31
exp = ((-14 - n_bit + 127) << 23) % UInt32
sig = ((sig & (~bit)) << n_bit) << (23 - 10)
ret = sign | exp | sig
end
elseif exp == 0x1f
if sig == 0 # Inf
if sign == 0
ret = 0x7f800000
else
ret = 0xff800000
end
else # NaN
ret = 0x7fc00000 | (sign<<31) | (sig<<(23-10))
end
else
sign = sign << 31
exp = ((exp - 15 + 127) << 23) % UInt32
sig = sig << (23 - 10)
ret = sign | exp | sig
end
return reinterpret(Float32, ret)
end

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants