Conversion from Float64 to Float16 not always rounded to nearest #40315

goualard-f · 2021-04-02T14:47:42Z

Numbers 63328.0 and 63360.0 are perfectly representable in the Float16 format (with nextfloat(Float16(63328.0)) == Float16(63360.0)). The middle of the interval [63328.0,63360.0] is 63344.0. All Float64 in [63344.0, 63360.0] should round to Float16(63360.0); all Float64 in [63328.0,63344.0) should round to Float16(63328.0).

Yet with Julia 1.6.0, we have:

julia> convert(Float16,63343.99805) == 63328.0
false
julia> convert(Float16,63343.99805) == 63360.0
true

oscardssmith · 2021-04-03T04:17:30Z

Can you test the following to see if you can find any issues?

function gentables()
	BASE  = zeros(UInt16, 512)
	SHIFT = zeros(UInt8,  512)
	for i in 1:256
	    e=i-128
	    if e<-24
			BASE[i|0x000]=0x0000
			BASE[i|0x100]=0x8000
			SHIFT[i|0x000]=24
			SHIFT[i|0x100]=24;
		elseif e<-14
			BASE[i|0x000]=(0x0400>>(-e-14))
			BASE[i|0x100]=(0x0400>>(-e-14)) | 0x8000
			SHIFT[i|0x000]=-e-1
			SHIFT[i|0x100]=-e-1
		elseif e<=15
			BASE[i|0x000]=((e+15)<<10)
			BASE[i|0x100]=((e+15)<<10) | 0x8000
			SHIFT[i|0x000]=13
			SHIFT[i|0x100]=13
		elseif(e<128)
			BASE[i|0x000]=0x7C00
			BASE[i|0x100]=0xFC00
			SHIFT[i|0x000]=24
			SHIFT[i|0x100]=24
		else
			BASE[i|0x000]=0x7C00
			BASE[i|0x100]=0xFC00
			SHIFT[i|0x000]=13
			SHIFT[i|0x100]=13
		end
	end
	return Tuple(BASE), Tuple(SHIFT)
end
const BASE, const SHIFT = gentables()

function Float16(x::Float32)
	xu = reinterpret(UInt32, x)
	e = (xu>>23)&0x1ff+1
	ans  = @inbounds BASE[e]
	ans += unsafe_trunc(Int16, ((xu&0x7fffff)>>SHIFT[e]))
	return reinterpret(Float16, ans)
end

function Float16(x::Float64)
	xu = reinterpret(UInt64, x)
	e = xu>>52
	e = ifelse(e<2048, clamp(e-895,  1,   256), 
					   clamp(e-2687, 257, 512))
	ans  = @inbounds BASE[e]
	ans += @inbounds unsafe_trunc(Int16, ((xu&0xfffffffffffff)>>(SHIFT[e]+29)))
	return reinterpret(Float16, ans)
end

kimikage · 2021-04-03T14:39:01Z

xref: #40245, #37510

goualard-f · 2021-04-03T15:13:30Z

@oscardssmith I have not reviewed your code yet but I have tested it. It cures the problem shown in the example above. However, it fails more often than the current code in Julia 1.6.0 to round correctly from Float64 to Float16.

Example: the conversion of Float64(51227.0) to a Float16 gives Float16(51232.0) with Julia 1.6.0 and Float16(51200.0) with the code above, which is patently wrong.

milankl · 2021-06-21T15:55:13Z

Yet with Julia 1.6.0, we have:

julia> convert(Float16,63343.99805) == 63328.0
false
julia> convert(Float16,63343.99805) == 63360.0
true

I get

julia> convert(Float16,63343.99805) == 63328.0
true
julia> convert(Float16,63343.99805) == 63360.0
false
julia> convert(Float16,63344.0) == 63360.0
true
julia> convert(Float16,prevfloat(63344.0)) == 63360.0
false

on 1.6.0, which seems to be correct. Which also follows the tie-to-even rule as 63344 as the tie is round up, which is the even float

julia> bitstring(Float16(63328))
"0111101110111011"    # odd

julia> bitstring(Float16(63360))
"0111101110111100"    # even

milankl · 2021-06-21T16:04:19Z

@oscardssmith I believe gentables() is not used anymore in 1.6?

julia>  @code_llvm Float16(63344.0)
;  @ float.jl:180 within `Float16'
define half @julia_Float16_185(double %0) {
top:
  %1 = fptrunc double %0 to half
  ret half %1
}

However, in Julia 1.5.2

julia> @code_llvm Float16(63344.0)

;  @ float.jl:252 within `Float16'
define i16 @julia_Float16_77(double) {
top:
; ┌ @ float.jl:251 within `Float32'
   %1 = fptrunc double %0 to float
; └
;  @ float.jl:252 within `Float16' @ float.jl:182
; ┌ @ essentials.jl:414 within `reinterpret'
   %2 = bitcast float %1 to i32
; └
;  @ float.jl:252 within `Float16' @ float.jl:183
; ┌ @ float.jl:536 within `isnan'
; │┌ @ float.jl:455 within `!='
    %3 = fcmp ord float %1, 0.000000e+00
; └└
  br i1 %3, label %L14, label %L5

L5:                                               ; preds = %top
;  @ float.jl:252 within `Float16' @ float.jl:184
; ┌ @ int.jl:455 within `>>'
   %4 = lshr i32 %2, 16
...

goualard-f · 2021-06-21T16:10:20Z

@milankl I stand by what I said previously. On Julia 1.6.1, I get:

julia> convert(Float16,63343.99805) == 63328.0
false

julia> convert(Float16,63343.99805) == 63360.0
true

julia> convert(Float16,63344.0) == 63360.0
true

julia> convert(Float16,prevfloat(63344.0)) == 63360.0
true

I cannot explain the difference with what you obtain, though.

milankl · 2021-06-21T16:17:49Z

Just checked on 1.6.1 too, where I reproduce my results from 1.6.0 - as it should be. However, I see that with <1.6 the look-up table from gentables() is used, which, I can confirm produces the wrong results as you are posting them. What's your @code_llvm output (see above)? I guess for some reason your Julia version doesn't use LLVM's half type, i.e. the fptrunc function.

goualard-f · 2021-06-21T16:36:46Z

@milankl Here you go:

julia> @code_llvm Float16(63344.0)
;  @ float.jl:180 within `Float16'
define half @julia_Float16_1735(double %0) {
top:
  %1 = fptrunc double %0 to half
  ret half %1
}

Seems identical to your output. FWIW, I am running the 64 bits "Generic Linux on x86" version.

milankl · 2021-06-21T16:43:07Z

Okay, I get the correct results on macOS (x86), linux (arm, A64FX), but can confirm that there's a bug on linux x86.

maleadt · 2021-06-22T14:28:12Z

You should look at code_native; half is used at the LLVM level, but not supported by most back-ends, so we demote Float16 operations to Float32 arithmetic using an LLVM pass. This does rely on table-based conversions, but implemented in C here:

julia/src/intrinsics.cpp

Lines 1301 to 1493 in 9f32653

    
           // float16 intrinsics 
        
           // TODO: use LLVM's compiler-rt 
        
           static inline float half_to_float(uint16_t ival) 
        
           { 
        
               uint32_t sign = (ival & 0x8000) >> 15; 
        
               uint32_t exp = (ival & 0x7c00) >> 10; 
        
               uint32_t sig = (ival & 0x3ff) >> 0; 
        
               uint32_t ret; 
        
               if (exp == 0) { 
        
                   if (sig == 0) { 
        
                       sign = sign << 31; 
        
                       ret = sign | exp | sig; 
        
                   } 
        
                   else { 
        
                       int n_bit = 1; 
        
                       uint16_t bit = 0x0200; 
        
                       while ((bit & sig) == 0) { 
        
                           n_bit = n_bit + 1; 
        
                           bit = bit >> 1; 
        
                       } 
        
                       sign = sign << 31; 
        
                       exp = ((-14 - n_bit + 127) << 23); 
        
                       sig = ((sig & (~bit)) << n_bit) << (23 - 10); 
        
                       ret = sign | exp | sig; 
        
                   } 
        
               } 
        
               else if (exp == 0x1f) { 
        
                   if (sig == 0) { // Inf 
        
                       if (sign == 0) 
        
                           ret = 0x7f800000; 
        
                       else 
        
                           ret = 0xff800000; 
        
                   } 
        
                   else // NaN 
        
                       ret = 0x7fc00000 | (sign << 31) | (sig << (23 - 10)); 
        
               } 
        
               else { 
        
                   sign = sign << 31; 
        
                   exp = ((exp - 15 + 127) << 23); 
        
                   sig = sig << (23 - 10); 
        
                   ret = sign | exp | sig; 
        
               } 
        
               float fret; 
        
               memcpy(&fret, &ret, sizeof(float)); 
        
               return fret; 
        
           } 
        
           // float to half algorithm from: 
        
           //   "Fast Half Float Conversion" by Jeroen van der Zijp 
        
           //   ftp://ftp.fox-toolkit.org/pub/fasthalffloatconversion.pdf 
        
           // 
        
           // With adjustments for round-to-nearest, ties to even. 
        
           static uint16_t basetable[512] = { 
        
               0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 
        
               0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 
        
               0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 
        
               0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 
        
               0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 
        
               0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 
        
               0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 
        
               0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 
        
               0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 
        
               0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 
        
               0x0000, 0x0000, 0x0000, 0x0400, 0x0800, 0x0c00, 0x1000, 0x1400, 0x1800, 0x1c00, 0x2000, 
        
               0x2400, 0x2800, 0x2c00, 0x3000, 0x3400, 0x3800, 0x3c00, 0x4000, 0x4400, 0x4800, 0x4c00, 
        
               0x5000, 0x5400, 0x5800, 0x5c00, 0x6000, 0x6400, 0x6800, 0x6c00, 0x7000, 0x7400, 0x7800, 
        
               0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 
        
               0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 
        
               0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 
        
               0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 
        
               0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 
        
               0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 
        
               0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 
        
               0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 
        
               0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 
        
               0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 0x7c00, 
        
               0x7c00, 0x7c00, 0x7c00, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 
        
               0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 
        
               0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 
        
               0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 
        
               0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 
        
               0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 
        
               0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 
        
               0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 
        
               0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 
        
               0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 
        
               0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8400, 0x8800, 0x8c00, 0x9000, 0x9400, 
        
               0x9800, 0x9c00, 0xa000, 0xa400, 0xa800, 0xac00, 0xb000, 0xb400, 0xb800, 0xbc00, 0xc000, 
        
               0xc400, 0xc800, 0xcc00, 0xd000, 0xd400, 0xd800, 0xdc00, 0xe000, 0xe400, 0xe800, 0xec00, 
        
               0xf000, 0xf400, 0xf800, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 
        
               0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 
        
               0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 
        
               0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 
        
               0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 
        
               0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 
        
               0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 
        
               0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 
        
               0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 
        
               0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 
        
               0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00, 0xfc00}; 
        
           static uint8_t shifttable[512] = { 
        
               0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 
        
               0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 
        
               0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 
        
               0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 
        
               0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 
        
               0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 
        
               0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 
        
               0x19, 0x19, 0x19, 0x19, 0x18, 0x17, 0x16, 0x15, 0x14, 0x13, 0x12, 0x11, 0x10, 0x0f, 
        
               0x0e, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 
        
               0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 
        
               0x0d, 0x0d, 0x0d, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 
        
               0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 
        
               0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 
        
               0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 
        
               0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 
        
               0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 
        
               0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 
        
               0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 
        
               0x18, 0x18, 0x18, 0x0d, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 
        
               0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 
        
               0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 
        
               0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 
        
               0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 
        
               0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 
        
               0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 
        
               0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x19, 0x18, 0x17, 0x16, 0x15, 0x14, 0x13, 
        
               0x12, 0x11, 0x10, 0x0f, 0x0e, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 
        
               0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 
        
               0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x0d, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 
        
               0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 
        
               0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 
        
               0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 
        
               0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 
        
               0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 
        
               0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 
        
               0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 
        
               0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x0d}; 
        
           static inline uint16_t float_to_half(float param) 
        
           { 
        
               uint32_t f; 
        
               memcpy(&f, &param, sizeof(float)); 
        
               if (isnan(param)) { 
        
                   uint32_t t = 0x8000 ^ (0x8000 & ((uint16_t)(f >> 0x10))); 
        
                   return t ^ ((uint16_t)(f >> 0xd)); 
        
               } 
        
               int i = ((f & ~0x007fffff) >> 23); 
        
               uint8_t sh = shifttable[i]; 
        
               f &= 0x007fffff; 
        
               // If `val` is subnormal, the tables are set up to force the 
        
               // result to 0, so the significand has an implicit `1` in the 
        
               // cases we care about. 
        
               f |= 0x007fffff + 0x1; 
        
               uint16_t h = (uint16_t)(basetable[i] + ((f >> sh) & 0x03ff)); 
        
               // round 
        
               // NOTE: we maybe should ignore NaNs here, but the payload is 
        
               // getting truncated anyway so "rounding" it might not matter 
        
               int nextbit = (f >> (sh - 1)) & 1; 
        
               if (nextbit != 0 && (h & 0x7C00) != 0x7C00) { 
        
                   // Round halfway to even or check lower bits 
        
                   if ((h & 1) == 1 || (f & ((1 << (sh - 1)) - 1)) != 0) 
        
                       h += UINT16_C(1); 
        
               } 
        
               return h; 
        
           } 
        
           #if !defined(_OS_DARWIN_)   // xcode already links compiler-rt 
        
           extern "C" JL_DLLEXPORT float __gnu_h2f_ieee(uint16_t param) 
        
           { 
        
               return half_to_float(param); 
        
           } 
        
           extern "C" JL_DLLEXPORT float __extendhfsf2(uint16_t param) 
        
           { 
        
               return half_to_float(param); 
        
           } 
        
           extern "C" JL_DLLEXPORT uint16_t __gnu_f2h_ieee(float param) 
        
           { 
        
               return float_to_half(param); 
        
           } 
        
           extern "C" JL_DLLEXPORT uint16_t __truncdfhf2(double param) 
        
           { 
        
               return float_to_half((float)param); 
        
           }

milankl · 2021-06-22T16:07:33Z

On 1.5.2, macOS, x86 I get (result is incorrect)

julia> @code_native Float16(63343.99805)
	.section	__TEXT,__text,regular,pure_instructions
; ┌ @ float.jl:252 within `Float16'
; │┌ @ float.jl:251 within `Float32'
	vcvtsd2ss	%xmm0, %xmm0, %xmm0
; │└
; │ @ float.jl:252 within `Float16' @ float.jl:182
; │┌ @ essentials.jl:414 within `reinterpret'
	vmovd	%xmm0, %ecx
; │└
; │ @ float.jl:252 within `Float16' @ float.jl:183
; │┌ @ float.jl:536 within `isnan'
; ││┌ @ float.jl:455 within `!='
	vucomiss	%xmm0, %xmm0
; │└└
...

On 1.6.1, macOS, x86 this is (result is correct)

julia> @code_native Float16(63343.99805)
	.section	__TEXT,__text,regular,pure_instructions
; ┌ @ float.jl:180 within `Float16'
	pushq	%rax
	movabsq	$__truncdfhf2, %rax
	callq	*%rax
	popq	%rcx
	retq
	nop
; └

which looks similar to 1.6.0, linux, x86 (but result is incorrect)

julia> @code_native Float16(63343.99805)
	.text
; ┌ @ float.jl:180 within `Float16'
	pushq	%rax
	movabsq	$__truncdfhf2, %rax
	callq	*%rax
	popq	%rcx
	retq
	nop
; └

and this is 1.6.0, linux, arm, A64FX (i.e. with Float16 hardware, result is correct)

julia> @code_native Float16(63343.99805)
	.text
; ┌ @ float.jl:180 within `Float16'
	fcvt	h0, d0
	ret
	nop
	nop
	nop
	nop
	nop
	nop
; └

@maleadt Does that make sense to you? Why would 1.6 on macOS produce correct results, but 1.5 doesn't?

maleadt · 2021-06-22T16:12:21Z

Why would 1.6 on macOS produce correct results, but 1.5 doesn't?

Probably some platform-dependent error in that C implementation of truncdfhf2. That implementation is different from what we used to have in 1.5:

julia/base/float.jl

Lines 140 to 248 in 69fcb57

    
           # Float32 -> Float16 algorithm from: 
        
           #   "Fast Half Float Conversion" by Jeroen van der Zijp 
        
           #   ftp://ftp.fox-toolkit.org/pub/fasthalffloatconversion.pdf 
        
           # 
        
           # With adjustments for round-to-nearest, ties to even. 
        
           # 
        
           let _basetable = Vector{UInt16}(undef, 512), 
        
               _shifttable = Vector{UInt8}(undef, 512) 
        
               for i = 0:255 
        
                   e = i - 127 
        
                   if e < -25  # Very small numbers map to zero 
        
                       _basetable[i|0x000+1] = 0x0000 
        
                       _basetable[i|0x100+1] = 0x8000 
        
                       _shifttable[i|0x000+1] = 25 
        
                       _shifttable[i|0x100+1] = 25 
        
                   elseif e < -14  # Small numbers map to denorms 
        
                       _basetable[i|0x000+1] = 0x0000 
        
                       _basetable[i|0x100+1] = 0x8000 
        
                       _shifttable[i|0x000+1] = -e-1 
        
                       _shifttable[i|0x100+1] = -e-1 
        
                   elseif e <= 15  # Normal numbers just lose precision 
        
                       _basetable[i|0x000+1] = ((e+15)<<10) 
        
                       _basetable[i|0x100+1] = ((e+15)<<10) | 0x8000 
        
                       _shifttable[i|0x000+1] = 13 
        
                       _shifttable[i|0x100+1] = 13 
        
                   elseif e < 128  # Large numbers map to Infinity 
        
                       _basetable[i|0x000+1] = 0x7C00 
        
                       _basetable[i|0x100+1] = 0xFC00 
        
                       _shifttable[i|0x000+1] = 24 
        
                       _shifttable[i|0x100+1] = 24 
        
                   else  # Infinity and NaN's stay Infinity and NaN's 
        
                       _basetable[i|0x000+1] = 0x7C00 
        
                       _basetable[i|0x100+1] = 0xFC00 
        
                       _shifttable[i|0x000+1] = 13 
        
                       _shifttable[i|0x100+1] = 13 
        
                   end 
        
               end 
        
               global const shifttable = (_shifttable...,) 
        
               global const basetable = (_basetable...,) 
        
           end 
        
           function Float16(val::Float32) 
        
               f = reinterpret(UInt32, val) 
        
               if isnan(val) 
        
                   t = 0x8000 ⊻ (0x8000 & ((f >> 0x10) % UInt16)) 
        
                   return reinterpret(Float16, t ⊻ ((f >> 0xd) % UInt16)) 
        
               end 
        
               i = ((f & ~significand_mask(Float32)) >> significand_bits(Float32)) + 1 
        
               @inbounds sh = shifttable[i] 
        
               f &= significand_mask(Float32) 
        
               # If `val` is subnormal, the tables are set up to force the 
        
               # result to 0, so the significand has an implicit `1` in the 
        
               # cases we care about. 
        
               f |= significand_mask(Float32) + 0x1 
        
               @inbounds h = (basetable[i] + (f >> sh) & significand_mask(Float16)) % UInt16 
        
               # round 
        
               # NOTE: we maybe should ignore NaNs here, but the payload is 
        
               # getting truncated anyway so "rounding" it might not matter 
        
               nextbit = (f >> (sh-1)) & 1 
        
               if nextbit != 0 && (h & 0x7C00) != 0x7C00 
        
                   # Round halfway to even or check lower bits 
        
                   if h&1 == 1 || (f & ((1<<(sh-1))-1)) != 0 
        
                       h += UInt16(1) 
        
                   end 
        
               end 
        
               reinterpret(Float16, h) 
        
           end 
        
           function Float32(val::Float16) 
        
               local ival::UInt32 = reinterpret(UInt16, val) 
        
               local sign::UInt32 = (ival & 0x8000) >> 15 
        
               local exp::UInt32  = (ival & 0x7c00) >> 10 
        
               local sig::UInt32  = (ival & 0x3ff) >> 0 
        
               local ret::UInt32 
        
               if exp == 0 
        
                   if sig == 0 
        
                       sign = sign << 31 
        
                       ret = sign | exp | sig 
        
                   else 
        
                       n_bit = 1 
        
                       bit = 0x0200 
        
                       while (bit & sig) == 0 
        
                           n_bit = n_bit + 1 
        
                           bit = bit >> 1 
        
                       end 
        
                       sign = sign << 31 
        
                       exp = ((-14 - n_bit + 127) << 23) % UInt32 
        
                       sig = ((sig & (~bit)) << n_bit) << (23 - 10) 
        
                       ret = sign | exp | sig 
        
                   end 
        
               elseif exp == 0x1f 
        
                   if sig == 0  # Inf 
        
                       if sign == 0 
        
                           ret = 0x7f800000 
        
                       else 
        
                           ret = 0xff800000 
        
                       end 
        
                   else  # NaN 
        
                       ret = 0x7fc00000 | (sign<<31) | (sig<<(23-10)) 
        
                   end 
        
               else 
        
                   sign = sign << 31 
        
                   exp  = ((exp - 15 + 127) << 23) % UInt32 
        
                   sig  = sig << (23 - 10) 
        
                   ret = sign | exp | sig 
        
               end 
        
               return reinterpret(Float32, ret) 
        
           end

oscardssmith added the float16 label Apr 2, 2021

oscardssmith mentioned this issue Apr 16, 2021

float16 cbrt, 50% faster #39441

Merged

JeffBezanson mentioned this issue Jun 9, 2021

Float64 to Float16 conversion is slow #41161

Open

giordano mentioned this issue Jul 23, 2023

BigFloat incorrectly rounded to Float16 (subnormals) #50642

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Conversion from Float64 to Float16 not always rounded to nearest #40315

Conversion from Float64 to Float16 not always rounded to nearest #40315

goualard-f commented Apr 2, 2021

oscardssmith commented Apr 3, 2021

kimikage commented Apr 3, 2021 •

edited

Loading

goualard-f commented Apr 3, 2021 •

edited

Loading

milankl commented Jun 21, 2021 •

edited

Loading

milankl commented Jun 21, 2021

goualard-f commented Jun 21, 2021

milankl commented Jun 21, 2021

goualard-f commented Jun 21, 2021

milankl commented Jun 21, 2021

maleadt commented Jun 22, 2021

milankl commented Jun 22, 2021

maleadt commented Jun 22, 2021

Conversion from Float64 to Float16 not always rounded to nearest #40315

Conversion from Float64 to Float16 not always rounded to nearest #40315

Comments

goualard-f commented Apr 2, 2021

oscardssmith commented Apr 3, 2021

kimikage commented Apr 3, 2021 • edited Loading

goualard-f commented Apr 3, 2021 • edited Loading

milankl commented Jun 21, 2021 • edited Loading

milankl commented Jun 21, 2021

goualard-f commented Jun 21, 2021

milankl commented Jun 21, 2021

goualard-f commented Jun 21, 2021

milankl commented Jun 21, 2021

maleadt commented Jun 22, 2021

milankl commented Jun 22, 2021

maleadt commented Jun 22, 2021

kimikage commented Apr 3, 2021 •

edited

Loading

goualard-f commented Apr 3, 2021 •

edited

Loading

milankl commented Jun 21, 2021 •

edited

Loading