Use native Float16 #69

maleadt · 2021-01-27T14:12:39Z

I had to use some ugly workarounds to get around the fact that Float16 was mapped to i16 instead of half, but that shouldn't be a problem in Julia 1.6, so that code can be cleaned up significantly.

Which other workarounds are there?

thomasfaingnaert · 2021-01-27T15:28:22Z

Off the top of my head:

GemmKernels.jl/src/blas.jl

Lines 65 to 66 in abd9cba

    
    transform_shared_to_regs_c = Transform.Elementwise(x -> x * (beta / alpha)),
    transform_regs_to_shared_d = Transform.Elementwise(x -> x * alpha),

This avoids FP16 multiplication of WMMA fragments by calculating D = alpha * (A * B + beta / alpha * C) instead of D = alpha * A * B + beta * C (ref. "Add workaround for FP16 multiplication", #27).
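The algebra behind that workaround can be checked on the CPU. The following is only a sketch with made-up 2×2 matrices and scalars (none of these values come from GemmKernels.jl), and it uses Float64 rather than WMMA fragments:

```julia
using LinearAlgebra

# Hypothetical inputs, chosen only for illustration.
A = [1.0 2.0; 3.0 4.0]
B = [0.5 -1.0; 2.0 0.25]
C = [1.0 0.0; 0.0 1.0]
alpha = 2.0
beta = 0.5

# Textbook GEMM epilogue: scale the product and the accumulator separately.
D_direct = alpha * A * B + beta * C

# Workaround form: pre-scale C by beta / alpha, then scale the sum once by alpha.
D_workaround = alpha * (A * B + (beta / alpha) * C)
```

Both forms agree mathematically whenever alpha is nonzero. The rewritten form divides by alpha, so alpha = 0 would need separate handling.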

GemmKernels.jl/src/layout.jl

Lines 21 to 39 in abd9cba

    
    @inline @generated function vloada(::Type{Vec{N, T}}, ptr::Core.LLVMPtr{T, AS}, i::Integer = 1) where {N, T, AS}
        alignment = sizeof(T) * N
        vec_len = (sizeof(T) * N) ÷ sizeof(Float32)
        return quote
            vec_ptr = Base.bitcast(Core.LLVMPtr{NTuple{$vec_len, VecElement{Float32}}, AS}, ptr)
            return unsafe_load(vec_ptr, (i-1) ÷ N + 1, Val($alignment))
        end
    end

    @inline @generated function vstorea!(::Type{Vec{N, T}}, ptr::Core.LLVMPtr{T, AS}, x, i::Integer = 1) where {N, T, AS}
        alignment = sizeof(T) * N
        vec_len = (sizeof(T) * N) ÷ sizeof(Float32)
        return quote
            vec_ptr = Base.bitcast(Core.LLVMPtr{NTuple{$vec_len, VecElement{Float32}}, AS}, ptr)
            return unsafe_store!(vec_ptr, x, (i-1) ÷ N + 1, Val($alignment))
        end
    end

Here the explicit vectorisation functions need to load/store using an NTuple{4, VecElement{Float32}} instead of an NTuple{8, VecElement{Float16}}, because the latter was converted to <8 x i16>, which NVPTX refuses to vectorise completely. Unfortunately, that meant this "wrong" type was propagated up the entire call hierarchy.
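The size arithmetic in those functions can be reproduced on the CPU. This sketch uses plain arrays and `reinterpret` as a stand-in for the pointer bitcast, so it only illustrates the byte layout and not the actual GPU load/store path:

```julia
# Same arithmetic as in vloada/vstorea!, for eight Float16 values.
N = 8
T = Float16
alignment = sizeof(T) * N                    # bytes covered by one vector access
vec_len = (sizeof(T) * N) ÷ sizeof(Float32)  # number of Float32 lanes with the same byte size

# Hypothetical data: eight half-precision values occupying 16 bytes.
halves = Float16[1, 2, 3, 4, 5, 6, 7, 8]

# View the same 16 bytes as Float32 lanes. The numeric values are meaningless;
# only the bit pattern is carried along.
as_floats = reinterpret(Float32, halves)

# Viewing the bytes as Float16 again recovers the original values.
roundtrip = reinterpret(Float16, as_floats)
```

Because the Float32 view carries no numeric meaning, every consumer has to cast back before doing arithmetic, which is the propagation problem described above.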

GemmKernels.jl/src/layout.jl

Lines 100 to 122 in abd9cba

    
    if VERSION < v"1.6.0-DEV.1236"
        @inline bitcast_helper(x::NTuple{8, VecElement{Float16}}) = Base.llvmcall(
            "
            %ret = bitcast <8 x i16> %0 to <4 x float>
            ret <4 x float> %ret
            ", NTuple{4, VecElement{Float32}}, Tuple{NTuple{8, VecElement{Float16}}}, x)
    else
        @inline bitcast_helper(x::NTuple{8, VecElement{Float16}}) = Base.llvmcall(
            "
            %ret = bitcast <8 x half> %0 to <4 x float>
            ret <4 x float> %ret
            ", NTuple{4, VecElement{Float32}}, Tuple{NTuple{8, VecElement{Float16}}}, x)
    end

    @inline function load(::Type{Diagonal{T}}, workspace, tile::Tile{size}) where {T, size}
        N = 16 ÷ sizeof(T)

        # The row index is given by t.index[1] + (k - 1), the column index is given by t.index[2] (0-based).
        # Only load on the diagonal, i.e. if row and column are equal.
        # Note that t.index[2] is 0-based, so we need to add 1 before loading from workspace.
        # TODO: Remove the <4 x float> everywhere, so we don't have to do this ugly casting all over the place.
        return bitcast_helper(ntuple(k -> VecElement{Float16}(tile.index[1] + k - 1 == tile.index[2] ? @inbounds(workspace[tile.index[2] + 1]) : 0), Val(8)))
    end

The bitcast_helper is no longer needed, because the vectorisation functions no longer expect an NTuple{4, VecElement{Float32}}.
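Without the cast, the diagonal load reduces to building the Float16 tuple directly. The sketch below is a CPU-only illustration: `diag_vec`, `row0`, and `col` are hypothetical stand-ins for the `load` method and the `tile.index` fields, and the workspace is an ordinary array.

```julia
# Hypothetical helper mirroring the indexing in `load`: `row0` and `col` are
# 0-based, like tile.index[1] and tile.index[2] in the snippet above.
function diag_vec(row0::Integer, col::Integer, workspace::AbstractVector{Float16})
    return ntuple(Val(8)) do k
        # Row of lane k is row0 + (k - 1); only the diagonal element is nonzero.
        on_diagonal = row0 + k - 1 == col
        VecElement{Float16}(on_diagonal ? workspace[col + 1] : zero(Float16))
    end
end

# Made-up workspace holding the diagonal values.
workspace = Float16[10, 20, 30, 40, 50, 60, 70, 80]

# Rows 2..9 (0-based) against column 4: only lane k = 3 hits the diagonal.
v = diag_vec(2, 4, workspace)
lanes = map(e -> e.value, v)
```

The result stays an NTuple{8, VecElement{Float16}}, so no reinterpretation to Float32 lanes is involved.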

DilumAluthge

You should also change the Julia compat entry in the Project.toml file to be julia = "1.6".

DilumAluthge

Also, one of the jobs should submit coverage. Now that the 1.5 job is gone, maybe the 1.6-nightly job should submit coverage.


DilumAluthge suggested changes Jan 27, 2021


maleadt force-pushed the tb/half branch 2 times, most recently from e8089f2 to 630308e on January 28, 2021 11:40

maleadt marked this pull request as ready for review January 28, 2021 11:46

thomasfaingnaert reviewed Jan 28, 2021


maleadt added 3 commits February 2, 2021 12:41

Use native Float16 multiplication.

deeb718

Revert "Add workaround for FP16 multiplication (#27)"

c0d53a0

This reverts commit 2b94068.

Don't cast to Float32 for vectorization.

aca1aad

maleadt force-pushed the tb/half branch from 630308e to aca1aad on February 2, 2021 12:02

maleadt merged commit 40c1dac into master Feb 2, 2021

maleadt deleted the tb/half branch February 2, 2021 16:21
