Hardware Float16 on AVX512-FP16 for SPR #46498
Comments
#46499 should eventually fix it. I can't test the changes and there might be some ABI issues with Float16 but hopefully not. |
I can test the changes on a pre-release SPR. Let me know if there are any specific tests to focus on. |
I would like to see if versioninfo detects the CPU as Sapphire Rapids, and if it emits native Float16 instructions as expected:

versioninfo()
foo(a,b) = a + b
a = Float16(1)
b = Float16(2)
@code_llvm foo(a,b)
@code_native foo(a,b)

This is with the PR checked out and built. |
And maybe a test to see if it vectorizes, like:

function vecmap(arr)
out = similar(arr)
@simd for i in eachindex(arr)
out[i] = Float16(1.5)* arr[i]
end
out
end
arr = rand(Float16, 10000)
@code_llvm vecmap(arr) |
The CPU detection of sapphirerapids is correct and the native Float16 instructions are correctly emitted. However, it doesn't seem to vectorize.

julia> versioninfo()
Julia Version 1.9.0-DEV.1130
Commit d64b9d8ef4 (2022-08-26 18:36 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 224 × Genuine Intel(R) CPU 0000%@
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.5 (ORCJIT, sapphirerapids)
julia> @code_llvm f(a,b)
; @ REPL[2]:1 within `f`
define half @julia_f_238(half %0, half %1) #0 {
top:
; ┌ @ float.jl:383 within `+`
%2 = fadd half %0, %1
; └
ret half %2
}

julia> @code_native f(a,b)
.text
.file "f"
.globl julia_f_262 # -- Begin function julia_f_262
.p2align 4, 0x90
.type julia_f_262,@function
julia_f_262: # @julia_f_262
; ┌ @ REPL[2]:1 within `f`
.cfi_startproc
# %bb.0: # %top
; │┌ @ float.jl:383 within `+`
vaddsh %xmm1, %xmm0, %xmm0
; │└
retq
.Lfunc_end0:
.size julia_f_262, .Lfunc_end0-julia_f_262
.cfi_endproc
; └
# -- End function
.section ".note.GNU-stack","",@progbits julia> function vecmap(arr)
out = similar(arr)
@simd for i in eachindex(arr)
out[i] = Float16(1.5)* arr[i]
end
out
end
vecmap (generic function with 1 method)
julia> arr = rand(Float16, 10000)
10000-element Vector{Float16}:
0.03564
0.4238
0.2988
0.0454
0.9985
0.4468
0.4312
⋮
0.0801
0.4556
0.562
0.435
0.708
0.6074
0.5283
julia> @code_llvm vecmap(arr)
; @ REPL[8]:1 within `vecmap`
define nonnull {}* @julia_vecmap_306({}* noundef nonnull align 16 dereferenceable(40) %0) #0 {
top:
; @ REPL[8]:2 within `vecmap`
; ┌ @ array.jl:369 within `similar`
; │┌ @ array.jl:148 within `size`
%1 = bitcast {}* %0 to { i8*, i64, i16, i16, i32 }*
%2 = getelementptr inbounds { i8*, i64, i16, i16, i32 }, { i8*, i64, i16, i16, i32 }* %1, i64 0, i32 1
%3 = load i64, i64* %2, align 8
; │└
; │┌ @ boot.jl:475 within `Array`
%4 = call nonnull {}* inttoptr (i64 140397938503504 to {}* ({}*, i64)*)({}* inttoptr (i64 140397806285568 to {}*), i64 %3)
; └└
; @ REPL[8]:3 within `vecmap`
; ┌ @ simdloop.jl:69 within `macro expansion`
; │┌ @ abstractarray.jl:297 within `eachindex`
; ││┌ @ abstractarray.jl:128 within `axes1`
; │││┌ @ abstractarray.jl:95 within `axes`
; ││││┌ @ array.jl:149 within `size`
%5 = load i64, i64* %2, align 8
; │└└└└
; │ @ simdloop.jl:72 within `macro expansion`
; │┌ @ int.jl:83 within `<`
%.not = icmp eq i64 %5, 0
; │└
br i1 %.not, label %L33, label %L15.preheader
L15.preheader: ; preds = %top
%6 = bitcast {}* %4 to { i8*, i64, i16, i16, i32 }*
%7 = getelementptr inbounds { i8*, i64, i16, i16, i32 }, { i8*, i64, i16, i16, i32 }* %6, i64 0, i32 1
%8 = bitcast {}* %0 to half**
%9 = load half*, half** %8, align 8
%10 = bitcast {}* %4 to half**
; │ @ simdloop.jl:77 within `macro expansion` @ REPL[8]:4
; │┌ @ array.jl:957 within `setindex!`
%.pre = load i64, i64* %7, align 8
; │└
; │┌ @ essentials.jl:13 within `getindex`
br label %idxend
L33: ; preds = %idxend2, %top
; └└
; @ REPL[8]:6 within `vecmap`
ret {}* %4
idxend: ; preds = %idxend2, %L15.preheader
%value_phi8 = phi i64 [ %11, %idxend2 ], [ 0, %L15.preheader ]
; @ REPL[8]:3 within `vecmap`
; ┌ @ simdloop.jl:76 within `macro expansion`
; │┌ @ simdloop.jl:54 within `simd_index`
; ││┌ @ int.jl:87 within `+`
%11 = add nuw nsw i64 %value_phi8, 1
; │└└
; │ @ simdloop.jl:77 within `macro expansion` @ REPL[8]:4
; │┌ @ array.jl:957 within `setindex!`
%12 = icmp ult i64 %value_phi8, %.pre
br i1 %12, label %idxend2, label %oob1
oob1: ; preds = %idxend
%13 = alloca i64, align 8
store i64 %11, i64* %13, align 8
call void @ijl_bounds_error_ints({}* %4, i64* nonnull %13, i64 1)
unreachable
idxend2: ; preds = %idxend
; │└
; │┌ @ essentials.jl:13 within `getindex`
%14 = getelementptr inbounds half, half* %9, i64 %value_phi8
%15 = load half, half* %14, align 2
; │└
; │┌ @ float.jl:385 within `*`
%16 = fmul half %15, 0xH3E00
; │└
; │┌ @ array.jl:957 within `setindex!`
%17 = load half*, half** %10, align 8
%18 = getelementptr inbounds half, half* %17, i64 %value_phi8
store half %16, half* %18, align 2
; │└
; │ @ simdloop.jl:75 within `macro expansion`
%exitcond.not = icmp eq i64 %11, %5
br i1 %exitcond.not, label %L33, label %idxend
; └
}

julia> @code_native vecmap(arr)
.text
.file "vecmap"
.section .rodata,"a",@progbits
.p2align 1 # -- Begin function julia_vecmap_309
.LCPI0_0:
.short 0x3e00 # half 1.5
.text
.globl julia_vecmap_309
.p2align 4, 0x90
.type julia_vecmap_309,@function
julia_vecmap_309: # @julia_vecmap_309
; ┌ @ REPL[8]:1 within `vecmap`
.cfi_startproc
# %bb.0: # %top
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset %rbp, -16
movq %rsp, %rbp
.cfi_def_cfa_register %rbp
pushq %rbx
subq $8, %rsp
.cfi_offset %rbx, -24
movq %rdi, %rbx
; │ @ REPL[8]:2 within `vecmap`
; │┌ @ array.jl:369 within `similar`
; ││┌ @ array.jl:148 within `size`
movq 8(%rdi), %rsi
movabsq $140397806285568, %rdi # imm = 0x7FB0E95EF300
movabsq $140397938503504, %rax # imm = 0x7FB0F1406F50
; ││└
; ││┌ @ boot.jl:475 within `Array`
callq *%rax
; │└└
; │ @ REPL[8]:3 within `vecmap`
; │┌ @ simdloop.jl:69 within `macro expansion`
; ││┌ @ abstractarray.jl:297 within `eachindex`
; │││┌ @ abstractarray.jl:128 within `axes1`
; ││││┌ @ abstractarray.jl:95 within `axes`
; │││││┌ @ array.jl:149 within `size`
movq 8(%rbx), %rdx
; ││└└└└
; ││ @ simdloop.jl:72 within `macro expansion`
; ││┌ @ int.jl:83 within `<`
testq %rdx, %rdx
; ││└
je .LBB0_8
# %bb.1: # %L15.preheader
movq (%rbx), %rsi
; ││ @ simdloop.jl:77 within `macro expansion` @ REPL[8]:4
; ││┌ @ array.jl:957 within `setindex!`
movq 8(%rax), %rdi
xorl %ecx, %ecx
movabsq $.LCPI0_0, %rbx
vmovsh (%rbx), %xmm0
.p2align 4, 0x90
.LBB0_2: # %idxend
# =>This Inner Loop Header: Depth=1
cmpq %rdi, %rcx
jae .LBB0_3
# %bb.7: # %idxend2
# in Loop: Header=BB0_2 Depth=1
; ││└
; ││┌ @ float.jl:385 within `*`
vmulsh (%rsi,%rcx,2), %xmm0, %xmm1
; ││└
; ││┌ @ array.jl:957 within `setindex!`
movq (%rax), %rbx
vmovsh %xmm1, (%rbx,%rcx,2)
; ││└
; ││ @ simdloop.jl within `macro expansion`
leaq 1(%rcx), %rbx
movq %rbx, %rcx
; ││ @ simdloop.jl:75 within `macro expansion`
cmpq %rbx, %rdx
jne .LBB0_2
.LBB0_8: # %L33
; │└
; │ @ REPL[8]:6 within `vecmap`
leaq -8(%rbp), %rsp
popq %rbx
popq %rbp
.cfi_def_cfa %rsp, 8
retq
.LBB0_3: # %oob1
; │ @ REPL[8]:3 within `vecmap`
; │┌ @ simdloop.jl:77 within `macro expansion` @ REPL[8]:4
; ││┌ @ array.jl:957 within `setindex!`
.cfi_def_cfa %rbp, 16
movq %rsp, %rsi
movl $16, %edx
subq %rdx, %rsi
cmpq %rsp, %rsi
jge .LBB0_6
.LBB0_5: # %oob1
# =>This Inner Loop Header: Depth=1
xorq $0, (%rsp)
subq $4096, %rsp # imm = 0x1000
cmpq %rsp, %rsi
jl .LBB0_5
.LBB0_6: # %oob1
movq %rsi, %rsp
incq %rcx
movq %rcx, (%rsi)
movabsq $ijl_bounds_error_ints, %rcx
movl $1, %edx
movq %rax, %rdi
callq *%rcx
.Lfunc_end0:
.size julia_vecmap_309, .Lfunc_end0-julia_vecmap_309
.cfi_endproc
; └└└
# -- End function
.section ".note.GNU-stack","",@progbits |
My vectorizable function wasn't vectorizable at all 😄.

function vecmap(arr)
out = similar(arr)
@inbounds @simd for i in eachindex(arr)
out[i] = Float16(1.5)* arr[i]
end
out
end
arr = rand(Float16, 10000)
@code_llvm vecmap(arr)

This vectorizes on an M1 mac. |
Yes, this one is vectorized. Look at this beauty 😊

julia> @code_llvm vecmap(arr)
; @ REPL[1]:1 within `vecmap`
define nonnull {}* @julia_vecmap_140({}* noundef nonnull align 16 dereferenceable(40) %0) #0 {
top:
; @ REPL[1]:2 within `vecmap`
; ┌ @ array.jl:369 within `similar`
; │┌ @ array.jl:148 within `size`
%1 = bitcast {}* %0 to { i8*, i64, i16, i16, i32 }*
%2 = getelementptr inbounds { i8*, i64, i16, i16, i32 }, { i8*, i64, i16, i16, i32 }* %1, i64 0, i32 1
%3 = load i64, i64* %2, align 8
; │└
; │┌ @ boot.jl:475 within `Array`
%4 = call nonnull {}* inttoptr (i64 139892987219792 to {}* ({}*, i64)*)({}* inttoptr (i64 139892815207312 to {}*), i64 %3)
; └└
; @ REPL[1]:3 within `vecmap`
; ┌ @ simdloop.jl:69 within `macro expansion`
; │┌ @ abstractarray.jl:297 within `eachindex`
; ││┌ @ abstractarray.jl:128 within `axes1`
; │││┌ @ abstractarray.jl:95 within `axes`
; ││││┌ @ array.jl:149 within `size`
%5 = load i64, i64* %2, align 8
; │└└└└
; │ @ simdloop.jl:72 within `macro expansion`
; │┌ @ int.jl:83 within `<`
%.not = icmp eq i64 %5, 0
; │└
br i1 %.not, label %L33, label %iter.check
iter.check: ; preds = %top
%6 = bitcast {}* %0 to half**
%7 = load half*, half** %6, align 8
%8 = bitcast {}* %4 to half**
%9 = load half*, half** %8, align 8
; │ @ simdloop.jl:75 within `macro expansion`
%min.iters.check = icmp ult i64 %5, 8
br i1 %min.iters.check, label %vec.epilog.scalar.ph, label %vector.memcheck
vector.memcheck: ; preds = %iter.check
%scevgep = getelementptr half, half* %9, i64 %5
%scevgep7 = getelementptr half, half* %7, i64 %5
%bound0 = icmp ult half* %9, %scevgep7
%bound1 = icmp ult half* %7, %scevgep
%found.conflict = and i1 %bound0, %bound1
br i1 %found.conflict, label %vec.epilog.scalar.ph, label %vector.main.loop.iter.check
vector.main.loop.iter.check: ; preds = %vector.memcheck
%min.iters.check9 = icmp ult i64 %5, 64
br i1 %min.iters.check9, label %vec.epilog.ph, label %vector.ph
vector.ph: ; preds = %vector.main.loop.iter.check
%n.vec = and i64 %5, 9223372036854775744
br label %vector.body
vector.body: ; preds = %vector.body, %vector.ph
; │ @ simdloop.jl:78 within `macro expansion`
; │┌ @ int.jl:87 within `+`
%index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%10 = getelementptr inbounds half, half* %7, i64 %index
; │└
; │ @ simdloop.jl:77 within `macro expansion` @ REPL[1]:4
; │┌ @ essentials.jl:13 within `getindex`
%11 = bitcast half* %10 to <16 x half>*
%wide.load = load <16 x half>, <16 x half>* %11, align 2
%12 = getelementptr inbounds half, half* %10, i64 16
%13 = bitcast half* %12 to <16 x half>*
%wide.load10 = load <16 x half>, <16 x half>* %13, align 2
%14 = getelementptr inbounds half, half* %10, i64 32
%15 = bitcast half* %14 to <16 x half>*
%wide.load11 = load <16 x half>, <16 x half>* %15, align 2
%16 = getelementptr inbounds half, half* %10, i64 48
%17 = bitcast half* %16 to <16 x half>*
%wide.load12 = load <16 x half>, <16 x half>* %17, align 2
; │└
; │┌ @ float.jl:385 within `*`
%18 = fmul <16 x half> %wide.load, <half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00>
%19 = fmul <16 x half> %wide.load10, <half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00>
%20 = fmul <16 x half> %wide.load11, <half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00>
%21 = fmul <16 x half> %wide.load12, <half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00>
; │└
; │ @ simdloop.jl:78 within `macro expansion`
; │┌ @ int.jl:87 within `+`
%22 = getelementptr inbounds half, half* %9, i64 %index
; │└
; │ @ simdloop.jl:77 within `macro expansion` @ REPL[1]:4
; │┌ @ array.jl:957 within `setindex!`
%23 = bitcast half* %22 to <16 x half>*
store <16 x half> %18, <16 x half>* %23, align 2
%24 = getelementptr inbounds half, half* %22, i64 16
%25 = bitcast half* %24 to <16 x half>*
store <16 x half> %19, <16 x half>* %25, align 2
%26 = getelementptr inbounds half, half* %22, i64 32
%27 = bitcast half* %26 to <16 x half>*
store <16 x half> %20, <16 x half>* %27, align 2
%28 = getelementptr inbounds half, half* %22, i64 48
%29 = bitcast half* %28 to <16 x half>*
store <16 x half> %21, <16 x half>* %29, align 2
; │└
; │ @ simdloop.jl:78 within `macro expansion`
; │┌ @ int.jl:87 within `+`
%index.next = add nuw i64 %index, 64
%30 = icmp eq i64 %index.next, %n.vec
br i1 %30, label %middle.block, label %vector.body
middle.block: ; preds = %vector.body
; │└
; │ @ simdloop.jl:75 within `macro expansion`
%cmp.n = icmp eq i64 %5, %n.vec
br i1 %cmp.n, label %L33, label %vec.epilog.iter.check
vec.epilog.iter.check: ; preds = %middle.block
%n.vec.remaining = and i64 %5, 56
%min.epilog.iters.check = icmp eq i64 %n.vec.remaining, 0
br i1 %min.epilog.iters.check, label %vec.epilog.scalar.ph, label %vec.epilog.ph
vec.epilog.ph: ; preds = %vec.epilog.iter.check, %vector.main.loop.iter.check
%vec.epilog.resume.val = phi i64 [ %n.vec, %vec.epilog.iter.check ], [ 0, %vector.main.loop.iter.check ]
%n.vec14 = and i64 %5, 9223372036854775800
br label %vec.epilog.vector.body
vec.epilog.vector.body: ; preds = %vec.epilog.vector.body, %vec.epilog.ph
; │ @ simdloop.jl:78 within `macro expansion`
; │┌ @ int.jl:87 within `+`
%index16 = phi i64 [ %vec.epilog.resume.val, %vec.epilog.ph ], [ %index.next18, %vec.epilog.vector.body ]
%31 = getelementptr inbounds half, half* %7, i64 %index16
; │└
; │ @ simdloop.jl:77 within `macro expansion` @ REPL[1]:4
; │┌ @ essentials.jl:13 within `getindex`
%32 = bitcast half* %31 to <8 x half>*
%wide.load17 = load <8 x half>, <8 x half>* %32, align 2
; │└
; │┌ @ float.jl:385 within `*`
%33 = fmul <8 x half> %wide.load17, <half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00>
; │└
; │ @ simdloop.jl:78 within `macro expansion`
; │┌ @ int.jl:87 within `+`
%34 = getelementptr inbounds half, half* %9, i64 %index16
; │└
; │ @ simdloop.jl:77 within `macro expansion` @ REPL[1]:4
; │┌ @ array.jl:957 within `setindex!`
%35 = bitcast half* %34 to <8 x half>*
store <8 x half> %33, <8 x half>* %35, align 2
; │└
; │ @ simdloop.jl:78 within `macro expansion`
; │┌ @ int.jl:87 within `+`
%index.next18 = add nuw i64 %index16, 8
%36 = icmp eq i64 %index.next18, %n.vec14
br i1 %36, label %vec.epilog.middle.block, label %vec.epilog.vector.body
vec.epilog.middle.block: ; preds = %vec.epilog.vector.body
; │└
; │ @ simdloop.jl:75 within `macro expansion`
%cmp.n15 = icmp eq i64 %5, %n.vec14
br i1 %cmp.n15, label %L33, label %vec.epilog.scalar.ph
vec.epilog.scalar.ph: ; preds = %vec.epilog.middle.block, %vec.epilog.iter.check, %vector.memcheck, %iter.check
%bc.resume.val = phi i64 [ %n.vec14, %vec.epilog.middle.block ], [ %n.vec, %vec.epilog.iter.check ], [ 0, %vector.memcheck ], [ 0, %iter.check ]
br label %L15
L15: ; preds = %L15, %vec.epilog.scalar.ph
%value_phi5 = phi i64 [ %bc.resume.val, %vec.epilog.scalar.ph ], [ %41, %L15 ]
; │ @ simdloop.jl:77 within `macro expansion` @ REPL[1]:4
; │┌ @ essentials.jl:13 within `getindex`
%37 = getelementptr inbounds half, half* %7, i64 %value_phi5
%38 = load half, half* %37, align 2
; │└
; │┌ @ float.jl:385 within `*`
%39 = fmul half %38, 0xH3E00
; │└
; │┌ @ array.jl:957 within `setindex!`
%40 = getelementptr inbounds half, half* %9, i64 %value_phi5
store half %39, half* %40, align 2
; │└
; │ @ simdloop.jl:78 within `macro expansion`
; │┌ @ int.jl:87 within `+`
%41 = add nuw nsw i64 %value_phi5, 1
; │└
; │ @ simdloop.jl:75 within `macro expansion`
; │┌ @ int.jl:83 within `<`
%exitcond.not = icmp eq i64 %41, %5
; │└
br i1 %exitcond.not, label %L33, label %L15
L33: ; preds = %L15, %vec.epilog.middle.block, %middle.block, %top
; └
; @ REPL[1]:6 within `vecmap`
ret {}* %4
}

Thanks! |
Last time I played with julia nightly on a64fx a couple of months ago I was able to get vscale instructions even without |
Also note that the vector length is <16 x half> while AVX512-FP16 can do <32 x half>. |
That is odd that it chose to generate <16 x half> vectors. Does clang generate 32-element vectors for similar code in C/C++? |
LLVM 14 vectorized to <32 x half>:

void foo(__fp16 *a)
{
__fp16 c = 1.5;
for (int i = 0; i < 10000; i++) {
a[i] = c*a[i];
}
}

.text
.file "test.c"
.section .rodata,"a",@progbits
.p2align 1 # -- Begin function foo
.LCPI0_0:
.short 0x3e00 # half 1.5
.text
.globl foo
.p2align 4, 0x90
.type foo,@function
foo: # @foo
.cfi_startproc
# %bb.0: # %iter.check
movl $224, %eax
vpbroadcastw .LCPI0_0(%rip), %zmm0 # zmm0 = [1.5E+0,1.5E+0,1.5E+0,1.5E+0,1.5E+0,1.5E+0,1.5E+0,1.5E+0,1.5E+0,1.5E+0,1.5E+
0,1.5E+0,1.5E+0,1.5E+0,1.5E+0,1.5E+0,1.5E+0,1.5E+0,1.5E+0,1.5E+0,1.5E+0,1.5E+0,1.5E+0,1.5E+0,1.5E+0,1.5E+0,1.5E+0,1.5E+0,1.5E+0,1.5E+
0,1.5E+0,1.5E+0]
.p2align 4, 0x90
.LBB0_1: # %vector.body
# =>This Inner Loop Header: Depth=1
vmulph -448(%rdi,%rax,2), %zmm0, %zmm1
vmulph -384(%rdi,%rax,2), %zmm0, %zmm2
vmulph -320(%rdi,%rax,2), %zmm0, %zmm3
vmulph -256(%rdi,%rax,2), %zmm0, %zmm4
vmovups %zmm1, -448(%rdi,%rax,2)
vmovups %zmm2, -384(%rdi,%rax,2)
vmovups %zmm3, -320(%rdi,%rax,2)
vmovups %zmm4, -256(%rdi,%rax,2)
vmulph -192(%rdi,%rax,2), %zmm0, %zmm1
vmulph -128(%rdi,%rax,2), %zmm0, %zmm2
vmulph -64(%rdi,%rax,2), %zmm0, %zmm3
vmulph (%rdi,%rax,2), %zmm0, %zmm4
vmovups %zmm1, -192(%rdi,%rax,2)
vmovups %zmm2, -128(%rdi,%rax,2)
vmovups %zmm3, -64(%rdi,%rax,2)
vmovups %zmm4, (%rdi,%rax,2)
addq $256, %rax # imm = 0x100
cmpq $10208, %rax # imm = 0x27E0
jne .LBB0_1
# %bb.2: # %vec.epilog.vector.body
vmovups 19968(%rdi), %ymm0
vmulph .LCPI0_0(%rip){1to16}, %ymm0, %ymm0
vmovups %ymm0, 19968(%rdi)
vzeroupper
retq
.Lfunc_end0:
.size foo, .Lfunc_end0-foo
.cfi_endproc
# -- End function
.ident "clang version 14.0.6 (https://github.com/llvm/llvm-project.git f28c006a5895fc0e329fe15fead81e37457cb1d1)"
.section ".note.GNU-stack","",@progbits
.addrsig

*** IR Dump After LoopVectorizePass on foo ***
; Function Attrs: nofree norecurse nosync nounwind uwtable
define dso_local void @foo(half* nocapture noundef %0) local_unnamed_addr #0 {
iter.check:
br i1 false, label %vec.epilog.scalar.ph, label %vector.main.loop.iter.check
vector.main.loop.iter.check: ; preds = %iter.check
br i1 false, label %vec.epilog.ph, label %vector.ph
vector.ph: ; preds = %vector.main.loop.iter.check
br label %vector.body
vector.body: ; preds = %vector.body, %vector.ph
%index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%1 = add i64 %index, 0
%2 = add i64 %index, 32
%3 = add i64 %index, 64
%4 = add i64 %index, 96
%5 = getelementptr inbounds half, half* %0, i64 %1
%6 = getelementptr inbounds half, half* %0, i64 %2
%7 = getelementptr inbounds half, half* %0, i64 %3
%8 = getelementptr inbounds half, half* %0, i64 %4
%9 = getelementptr inbounds half, half* %5, i32 0
%10 = bitcast half* %9 to <32 x half>*
%wide.load = load <32 x half>, <32 x half>* %10, align 2, !tbaa !3
%11 = getelementptr inbounds half, half* %5, i32 32
%12 = bitcast half* %11 to <32 x half>*
%wide.load11 = load <32 x half>, <32 x half>* %12, align 2, !tbaa !3
%13 = getelementptr inbounds half, half* %5, i32 64
%14 = bitcast half* %13 to <32 x half>*
%wide.load12 = load <32 x half>, <32 x half>* %14, align 2, !tbaa !3
%15 = getelementptr inbounds half, half* %5, i32 96
%16 = bitcast half* %15 to <32 x half>*
%wide.load13 = load <32 x half>, <32 x half>* %16, align 2, !tbaa !3
%17 = fmul <32 x half> %wide.load, <half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00>
%18 = fmul <32 x half> %wide.load11, <half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00>
%19 = fmul <32 x half> %wide.load12, <half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00>
%20 = fmul <32 x half> %wide.load13, <half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00>
%21 = bitcast half* %9 to <32 x half>*
store <32 x half> %17, <32 x half>* %21, align 2, !tbaa !3
%22 = bitcast half* %11 to <32 x half>*
store <32 x half> %18, <32 x half>* %22, align 2, !tbaa !3
%23 = bitcast half* %13 to <32 x half>*
store <32 x half> %19, <32 x half>* %23, align 2, !tbaa !3
%24 = bitcast half* %15 to <32 x half>*
store <32 x half> %20, <32 x half>* %24, align 2, !tbaa !3
%index.next = add nuw i64 %index, 128
%25 = icmp eq i64 %index.next, 9984
br i1 %25, label %middle.block, label %vector.body, !llvm.loop !7
middle.block: ; preds = %vector.body
%cmp.n = icmp eq i64 10000, 9984
br i1 %cmp.n, label %33, label %vec.epilog.iter.check
vec.epilog.iter.check: ; preds = %middle.block
br i1 false, label %vec.epilog.scalar.ph, label %vec.epilog.ph
vec.epilog.ph: ; preds = %vector.main.loop.iter.check, %vec.epilog.iter.check
%vec.epilog.resume.val = phi i64 [ 9984, %vec.epilog.iter.check ], [ 0, %vector.main.loop.iter.check ]
br label %vec.epilog.vector.body
vec.epilog.vector.body: ; preds = %vec.epilog.vector.body, %vec.epilog.ph
%index15 = phi i64 [ %vec.epilog.resume.val, %vec.epilog.ph ], [ %index.next17, %vec.epilog.vector.body ]
%26 = add i64 %index15, 0
%27 = getelementptr inbounds half, half* %0, i64 %26
%28 = getelementptr inbounds half, half* %27, i32 0
%29 = bitcast half* %28 to <16 x half>*
%wide.load16 = load <16 x half>, <16 x half>* %29, align 2, !tbaa !3
%30 = fmul <16 x half> %wide.load16, <half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00>
%31 = bitcast half* %28 to <16 x half>*
store <16 x half> %30, <16 x half>* %31, align 2, !tbaa !3
%index.next17 = add nuw i64 %index15, 16
%32 = icmp eq i64 %index.next17, 10000
br i1 %32, label %vec.epilog.middle.block, label %vec.epilog.vector.body, !llvm.loop !10
vec.epilog.middle.block: ; preds = %vec.epilog.vector.body
%cmp.n14 = icmp eq i64 10000, 10000
br i1 %cmp.n14, label %.loopexit, label %vec.epilog.scalar.ph
vec.epilog.scalar.ph: ; preds = %iter.check, %vec.epilog.iter.check, %vec.epilog.middle.block
%bc.resume.val = phi i64 [ 10000, %vec.epilog.middle.block ], [ 9984, %vec.epilog.iter.check ], [ 0, %iter.check ]
br label %34
.loopexit: ; preds = %vec.epilog.middle.block, %34
br label %33
33: ; preds = %.loopexit, %middle.block
ret void
34: ; preds = %vec.epilog.scalar.ph, %34
%indvars.iv = phi i64 [ %bc.resume.val, %vec.epilog.scalar.ph ], [ %indvars.iv.next, %34 ]
%35 = getelementptr inbounds half, half* %0, i64 %indvars.iv
%36 = load half, half* %35, align 2, !tbaa !3
%37 = fmul half %36, 0xH3E00
store half %37, half* %35, align 2, !tbaa !3
%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
%exitcond.not = icmp eq i64 %indvars.iv.next, 10000
br i1 %exitcond.not, label %.loopexit, label %34, !llvm.loop !12
} |
@chriselrod Any idea why it would do that? |
You need to start Julia with -C"native,-prefer-256-bit" (disable 256 bit preference, letting it use 512 bit) or, if running C/C++/Fortran using gcc/clang, compile with -mprefer-vector-width=512 (enable 512 bit preference, letting it use 512 bit). E.g., I tend to start Julia with something like julia -C"native,-prefer-256-bit".
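For illustration, a minimal sketch of how one might verify the effect of that flag on an AVX-512 machine (the scale! function and the array size are just examples, not from this thread; the exact IR depends on the CPU and Julia version):

using InteractiveUtils

function scale!(y, x)
    # trivially vectorizable loop: LLVM's LoopVectorize picks the vector width
    @inbounds @simd for i in eachindex(x, y)
        y[i] = 2f0 * x[i]
    end
    return y
end

x = rand(Float32, 4096); y = similar(x)
# Under default settings an AVX-512 CPU typically gets <8 x float> blocks here;
# started with -C"native,-prefer-256-bit", the same loop should use <16 x float>.
@code_llvm debuginfo=:none scale!(y, x)

The same comparison can be made with the Float16 vecmap example above once the PR is in.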
|
Why do we not default to the max size? |
My flippant response would be "because LLVM is bad at SIMD with large vectors". For example, you can try this script.

using LoopVectorization, Random, BenchmarkTools, LinuxPerf
@inline function squared_norm_simd(x)
s = zero(eltype(x))
@simd for i = eachindex(x)
@inbounds s += x[i]^2
end
s
end
@inline function squared_norm_turbo(x)
s = zero(eltype(x))
@turbo for i = eachindex(x)
s += x[i]^2
end
s
end
function test_all_sizes!(f::F, y, x, Ns) where {F}
for n = eachindex(Ns)
y[n] = f(@view(x[1:Ns[n]]))
end
end
foreachf(f::F, N::Int, arg1::A, args::Vararg{Any,K}) where {F,A,K} = foreach(_ -> f(arg1, args...), 1:N)
x = rand(256); # specify the max problem size (and number of problems) here
y = similar(x);
Ns = randperm(Random.Xoshiro(123), length(x));
@benchmark test_all_sizes!(squared_norm_simd, $y, $x, $Ns)
@benchmark test_all_sizes!(squared_norm_turbo, $y, $x, $Ns)
@pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
@time foreachf(test_all_sizes!, 10, squared_norm_simd, y, x, Ns)
end
@pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
@time foreachf(test_all_sizes!, 10, squared_norm_turbo, y, x, Ns)
end
@pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
@time foreachf(test_all_sizes!, 1_000_000, squared_norm_simd, y, x, Ns)
end
@pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
@time foreachf(test_all_sizes!, 1_000_000, squared_norm_turbo, y, x, Ns)
end

What this does is benchmark a squared norm (or dot product of a vector with itself) of all sizes 1-256. Note that if you adjust the max problem size, you'll probably also want to adjust the number of iterations used in the foreachf calls.

Results using a 10980XE CPU, under default settings (which prefer 256 bit vectors):

julia> @benchmark test_all_sizes!(squared_norm_simd, $y, $x, $Ns)
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
Range (min … max): 3.192 μs … 4.551 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 3.208 μs ┊ GC (median): 0.00%
Time (mean ± σ): 3.215 μs ± 32.882 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▅█▇
▂▃████▅▃▂▃▃▃▃▃▃▃▂▂▂▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂ ▃
3.19 μs Histogram: frequency by time 3.38 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark test_all_sizes!(squared_norm_turbo, $y, $x, $Ns)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
Range (min … max): 1.753 μs … 5.131 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.758 μs ┊ GC (median): 0.00%
Time (mean ± σ): 1.762 μs ± 47.027 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█
▃█▆▃▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂▂ ▂
1.75 μs Histogram: frequency by time 1.92 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
@time foreachf(test_all_sizes!, 1_000_000, squared_norm_simd, y, x, Ns)
end
3.239151 seconds (2 allocations: 64 bytes)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles 1.36e+10 33.3% # 4.2 cycles per ns
└ task-clock 3.24e+09 33.3% # 3.2 s
┌ instructions 4.26e+10 66.7% # 3.1 insns per cycle
│ branch-instructions 5.75e+09 66.7% # 13.5% of insns
└ branch-misses 9.62e+06 66.7% # 0.2% of branch insns
┌ L1-dcache-load-misses 8.52e+04 33.3% # 0.0% of dcache loads
│ L1-dcache-loads 1.01e+10 33.3%
│ cache-misses 4.80e+01 33.3% # 94.1% of cache refs
└ cache-references 5.10e+01 33.3%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
@time foreachf(test_all_sizes!, 1_000_000, squared_norm_turbo, y, x, Ns)
end
1.798680 seconds (2 allocations: 64 bytes)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles 7.17e+09 33.3% # 4.0 cycles per ns
└ task-clock 1.80e+09 33.3% # 1.8 s
┌ instructions 2.76e+10 66.7% # 3.8 insns per cycle
│ branch-instructions 2.87e+09 66.7% # 10.4% of insns
└ branch-misses 3.09e+04 66.7% # 0.0% of branch insns
┌ L1-dcache-load-misses 8.26e+04 33.3% # 0.0% of dcache loads
│ L1-dcache-loads 4.97e+09 33.3%
│ cache-misses 2.30e+03 33.3% # 23.3% of cache refs
└ cache-references 9.87e+03 33.3%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Using 512 bit vectors, @turbo runs nearly twice as fast here. So, let's try Julia with -C"native,-prefer-256-bit":

julia> @benchmark test_all_sizes!(squared_norm_simd, $y, $x, $Ns)
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
Range (min … max): 4.292 μs … 8.629 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 4.329 μs ┊ GC (median): 0.00%
Time (mean ± σ): 4.334 μs ± 61.638 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▂▁▂▅▅▇█▃
▂▂▃▆█████████▇▄▃▂▂▁▁▁▂▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
4.29 μs Histogram: frequency by time 4.52 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark test_all_sizes!(squared_norm_turbo, $y, $x, $Ns)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
Range (min … max): 1.884 μs … 5.338 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.889 μs ┊ GC (median): 0.00%
Time (mean ± σ): 1.892 μs ± 39.159 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▂▇█▆ ▂
████▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▅▆▇██ █
1.88 μs Histogram: log(frequency) by time 2.02 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
@time foreachf(test_all_sizes!, 1_000_000, squared_norm_simd, y, x, Ns)
end
4.359129 seconds (2 allocations: 64 bytes)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles 1.74e+10 33.3% # 4.0 cycles per ns
└ task-clock 4.36e+09 33.3% # 4.4 s
┌ instructions 4.17e+10 66.7% # 2.4 insns per cycle
│ branch-instructions 6.78e+09 66.7% # 16.3% of insns
└ branch-misses 4.92e+07 66.7% # 0.7% of branch insns
┌ L1-dcache-load-misses 3.64e+05 33.3% # 0.0% of dcache loads
│ L1-dcache-loads 8.07e+09 33.3%
│ cache-misses 1.20e+01 33.3% # 13.8% of cache refs
└ cache-references 8.70e+01 33.3%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
@time foreachf(test_all_sizes!, 1_000_000, squared_norm_turbo, y, x, Ns)
end
1.931807 seconds (2 allocations: 64 bytes)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles 7.70e+09 33.3% # 4.0 cycles per ns
└ task-clock 1.93e+09 33.3% # 1.9 s
┌ instructions 2.76e+10 66.7% # 3.6 insns per cycle
│ branch-instructions 2.87e+09 66.7% # 10.4% of insns
└ branch-misses 2.77e+04 66.7% # 0.0% of branch insns
┌ L1-dcache-load-misses 1.09e+05 33.3% # 0.0% of dcache loads
│ L1-dcache-loads 4.97e+09 33.3%
│ cache-misses 6.90e+01 33.3% # 40.4% of cache refs
└ cache-references 1.71e+02 33.3%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

The @simd version actually got slower with 512 bit vectors, while @turbo barely changed. So, my response to "why does LLVM use 256 bit vectors by default?" is "because LLVM is so bad at generating code for 512 bit vectors even in trivial cases that it actually runs faster generating 256 bit code instead, even though if you actually do a good job, you'll get the expected performance boost from bigger vectors".

Of course, LoopVectorization.jl has tons of issues, which is why I've been working on LoopModels. LoopModels will work at the LLVM level (side-stepping most of those issues) and include dependence analysis to actually generate correct code in the face of dependencies, instead of optimistically assuming code is free of dependencies aside from obvious reductions. The dependence analysis has already been implemented; I'm starting on cost modeling now, which is of course a part LoopVectorization.jl does reasonably well, but now with the benefit of access to LLVM's much more accurate costs instead of the table LoopVectorization.jl uses. |
I see. That's annoying, is that a scheduler issue or just bad codegen overall? |
Bad codegen overall. EDIT: |
Digging a little bit more into the LLVM vectorizer: on 512 bit vectors it generates 3 loops (per the IR dump below, a <16 x float> main loop unrolled by 4, an <8 x float> epilog vector loop, and a scalar remainder loop). On 256 bit vectors it generates the analogous, narrower loops.
The number of iterations of the final scalar loop is the same in both cases; the performance differences come from two differences in how the vector loops are structured.
To reproduce:
clang -O2 -ffast-math -mavx512f --save-temps -v -mllvm -print-after-all -c squared_norm_simd.c 2> vec512.log

The LLVM and assembly dumps:

*** IR Dump After LoopVectorizePass on squared_norm_simd ***
; Function Attrs: nofree norecurse nosync nounwind readonly uwtable
define dso_local float @squared_norm_simd(float* nocapture noundef readonly %0, i32 noundef %1) local_unnamed_addr #0 {
%3 = icmp sgt i32 %1, 0
br i1 %3, label %iter.check, label %._crit_edge
iter.check: ; preds = %2
%wide.trip.count = zext i32 %1 to i64
%min.iters.check = icmp ult i64 %wide.trip.count, 8
br i1 %min.iters.check, label %vec.epilog.scalar.ph, label %vector.main.loop.iter.check
vector.main.loop.iter.check: ; preds = %iter.check
%min.iters.check15 = icmp ult i64 %wide.trip.count, 64
br i1 %min.iters.check15, label %vec.epilog.ph, label %vector.ph
vector.ph: ; preds = %vector.main.loop.iter.check
%n.mod.vf = urem i64 %wide.trip.count, 64
%n.vec = sub i64 %wide.trip.count, %n.mod.vf
br label %vector.body
vector.body: ; preds = %vector.body, %vector.ph
%index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%vec.phi = phi <16 x float> [ zeroinitializer, %vector.ph ], [ %24, %vector.body ]
%vec.phi16 = phi <16 x float> [ zeroinitializer, %vector.ph ], [ %25, %vector.body ]
%vec.phi17 = phi <16 x float> [ zeroinitializer, %vector.ph ], [ %26, %vector.body ]
%vec.phi18 = phi <16 x float> [ zeroinitializer, %vector.ph ], [ %27, %vector.body ]
%4 = add i64 %index, 0
%5 = add i64 %index, 16
%6 = add i64 %index, 32
%7 = add i64 %index, 48
%8 = getelementptr inbounds float, float* %0, i64 %4
%9 = getelementptr inbounds float, float* %0, i64 %5
%10 = getelementptr inbounds float, float* %0, i64 %6
%11 = getelementptr inbounds float, float* %0, i64 %7
%12 = getelementptr inbounds float, float* %8, i32 0
%13 = bitcast float* %12 to <16 x float>*
%wide.load = load <16 x float>, <16 x float>* %13, align 4, !tbaa !3
%14 = getelementptr inbounds float, float* %8, i32 16
%15 = bitcast float* %14 to <16 x float>*
%wide.load19 = load <16 x float>, <16 x float>* %15, align 4, !tbaa !3
%16 = getelementptr inbounds float, float* %8, i32 32
%17 = bitcast float* %16 to <16 x float>*
%wide.load20 = load <16 x float>, <16 x float>* %17, align 4, !tbaa !3
%18 = getelementptr inbounds float, float* %8, i32 48
%19 = bitcast float* %18 to <16 x float>*
%wide.load21 = load <16 x float>, <16 x float>* %19, align 4, !tbaa !3
%20 = fmul fast <16 x float> %wide.load, %wide.load
%21 = fmul fast <16 x float> %wide.load19, %wide.load19
%22 = fmul fast <16 x float> %wide.load20, %wide.load20
%23 = fmul fast <16 x float> %wide.load21, %wide.load21
%24 = fadd fast <16 x float> %20, %vec.phi
%25 = fadd fast <16 x float> %21, %vec.phi16
%26 = fadd fast <16 x float> %22, %vec.phi17
%27 = fadd fast <16 x float> %23, %vec.phi18
%index.next = add nuw i64 %index, 64
%28 = icmp eq i64 %index.next, %n.vec
br i1 %28, label %middle.block, label %vector.body, !llvm.loop !7
middle.block: ; preds = %vector.body
%bin.rdx = fadd fast <16 x float> %25, %24
%bin.rdx22 = fadd fast <16 x float> %26, %bin.rdx
%bin.rdx23 = fadd fast <16 x float> %27, %bin.rdx22
%29 = call fast float @llvm.vector.reduce.fadd.v16f32(float -0.000000e+00, <16 x float> %bin.rdx23)
%cmp.n = icmp eq i64 %wide.trip.count, %n.vec
br i1 %cmp.n, label %._crit_edge.loopexit, label %vec.epilog.iter.check
vec.epilog.iter.check: ; preds = %middle.block
%n.vec.remaining = sub i64 %wide.trip.count, %n.vec
%min.epilog.iters.check = icmp ult i64 %n.vec.remaining, 8
br i1 %min.epilog.iters.check, label %vec.epilog.scalar.ph, label %vec.epilog.ph
vec.epilog.ph: ; preds = %vector.main.loop.iter.check, %vec.epilog.iter.check
%bc.merge.rdx = phi float [ 0.000000e+00, %vector.main.loop.iter.check ], [ %29, %vec.epilog.iter.check ]
%vec.epilog.resume.val = phi i64 [ %n.vec, %vec.epilog.iter.check ], [ 0, %vector.main.loop.iter.check ]
%n.mod.vf25 = urem i64 %wide.trip.count, 8
%n.vec26 = sub i64 %wide.trip.count, %n.mod.vf25
%30 = insertelement <8 x float> zeroinitializer, float %bc.merge.rdx, i32 0
br label %vec.epilog.vector.body
vec.epilog.vector.body: ; preds = %vec.epilog.vector.body, %vec.epilog.ph
%index28 = phi i64 [ %vec.epilog.resume.val, %vec.epilog.ph ], [ %index.next31, %vec.epilog.vector.body ]
%vec.phi29 = phi <8 x float> [ %30, %vec.epilog.ph ], [ %36, %vec.epilog.vector.body ]
%31 = add i64 %index28, 0
%32 = getelementptr inbounds float, float* %0, i64 %31
%33 = getelementptr inbounds float, float* %32, i32 0
%34 = bitcast float* %33 to <8 x float>*
%wide.load30 = load <8 x float>, <8 x float>* %34, align 4, !tbaa !3
%35 = fmul fast <8 x float> %wide.load30, %wide.load30
%36 = fadd fast <8 x float> %35, %vec.phi29
%index.next31 = add nuw i64 %index28, 8
%37 = icmp eq i64 %index.next31, %n.vec26
br i1 %37, label %vec.epilog.middle.block, label %vec.epilog.vector.body, !llvm.loop !10
vec.epilog.middle.block: ; preds = %vec.epilog.vector.body
%38 = call fast float @llvm.vector.reduce.fadd.v8f32(float -0.000000e+00, <8 x float> %36)
%cmp.n27 = icmp eq i64 %wide.trip.count, %n.vec26
br i1 %cmp.n27, label %._crit_edge.loopexit.loopexit, label %vec.epilog.scalar.ph
vec.epilog.scalar.ph: ; preds = %iter.check, %vec.epilog.iter.check, %vec.epilog.middle.block
%bc.resume.val = phi i64 [ %n.vec26, %vec.epilog.middle.block ], [ %n.vec, %vec.epilog.iter.check ], [ 0, %iter.check ]
%bc.merge.rdx32 = phi float [ 0.000000e+00, %iter.check ], [ %29, %vec.epilog.iter.check ], [ %38, %vec.epilog.middle.block ]
br label %.lr.ph
._crit_edge.loopexit.loopexit: ; preds = %vec.epilog.middle.block, %.lr.ph
%.lcssa24 = phi float [ %42, %.lr.ph ], [ %38, %vec.epilog.middle.block ]
br label %._crit_edge.loopexit
._crit_edge.loopexit: ; preds = %._crit_edge.loopexit.loopexit, %middle.block
%.lcssa = phi float [ %29, %middle.block ], [ %.lcssa24, %._crit_edge.loopexit.loopexit ]
br label %._crit_edge
._crit_edge: ; preds = %._crit_edge.loopexit, %2
%.011.lcssa = phi float [ 0.000000e+00, %2 ], [ %.lcssa, %._crit_edge.loopexit ]
ret float %.011.lcssa
.lr.ph: ; preds = %vec.epilog.scalar.ph, %.lr.ph
%indvars.iv = phi i64 [ %bc.resume.val, %vec.epilog.scalar.ph ], [ %indvars.iv.next, %.lr.ph ]
%.01112 = phi float [ %bc.merge.rdx32, %vec.epilog.scalar.ph ], [ %42, %.lr.ph ]
%39 = getelementptr inbounds float, float* %0, i64 %indvars.iv
%40 = load float, float* %39, align 4, !tbaa !3
%41 = fmul fast float %40, %40
%42 = fadd fast float %41, %.01112
%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
%exitcond.not = icmp eq i64 %indvars.iv.next, %wide.trip.count
br i1 %exitcond.not, label %._crit_edge.loopexit.loopexit, label %.lr.ph, !llvm.loop !12
}
|
FWIW, here is the asm @turbo generates for the Float64 version:

.text
.file "squared_norm_turbo"
.globl julia_squared_norm_turbo_4421 # -- Begin function julia_squared_norm_turbo_4421
.p2align 4, 0x90
.type julia_squared_norm_turbo_4421,@function
julia_squared_norm_turbo_4421: # @julia_squared_norm_turbo_4421
.cfi_startproc
# %bb.0: # %top
movabs rsi, 9223372036854775744
mov rax, qword ptr [rdi]
mov rdx, qword ptr [rdi + 8]
cmp rdx, 64
jae .LBB0_2
# %bb.1:
vxorpd xmm0, xmm0, xmm0
xor ecx, ecx
vxorpd xmm1, xmm1, xmm1
add rsi, 48
and rsi, rdx
cmp rcx, rsi
je .LBB0_8
jmp .LBB0_6
.LBB0_2: # %L30.preheader
mov rcx, rdx
and rcx, rsi
vxorpd xmm1, xmm1, xmm1
xor edi, edi
vxorpd xmm0, xmm0, xmm0
vxorpd xmm3, xmm3, xmm3
vxorpd xmm4, xmm4, xmm4
vxorpd xmm2, xmm2, xmm2
vxorpd xmm5, xmm5, xmm5
vxorpd xmm6, xmm6, xmm6
vxorpd xmm7, xmm7, xmm7
.p2align 4, 0x90
.LBB0_3: # %L49
# =>This Inner Loop Header: Depth=1
vmovupd zmm8, zmmword ptr [rax + 8*rdi]
vmovupd zmm9, zmmword ptr [rax + 8*rdi + 64]
vmovupd zmm10, zmmword ptr [rax + 8*rdi + 128]
vmovupd zmm11, zmmword ptr [rax + 8*rdi + 192]
vmovupd zmm12, zmmword ptr [rax + 8*rdi + 256]
vmovupd zmm13, zmmword ptr [rax + 8*rdi + 320]
vmovupd zmm14, zmmword ptr [rax + 8*rdi + 384]
vmovupd zmm15, zmmword ptr [rax + 8*rdi + 448]
vfmadd231pd zmm7, zmm8, zmm8 # zmm7 = (zmm8 * zmm8) + zmm7
vfmadd231pd zmm6, zmm9, zmm9 # zmm6 = (zmm9 * zmm9) + zmm6
vfmadd231pd zmm5, zmm10, zmm10 # zmm5 = (zmm10 * zmm10) + zmm5
vfmadd231pd zmm2, zmm11, zmm11 # zmm2 = (zmm11 * zmm11) + zmm2
vfmadd231pd zmm4, zmm12, zmm12 # zmm4 = (zmm12 * zmm12) + zmm4
vfmadd231pd zmm3, zmm13, zmm13 # zmm3 = (zmm13 * zmm13) + zmm3
vfmadd231pd zmm0, zmm14, zmm14 # zmm0 = (zmm14 * zmm14) + zmm0
vfmadd231pd zmm1, zmm15, zmm15 # zmm1 = (zmm15 * zmm15) + zmm1
add rdi, 64
cmp rcx, rdi
jne .LBB0_3
# %bb.4: # %L79
vaddpd zmm4, zmm7, zmm4
vaddpd zmm3, zmm6, zmm3
vaddpd zmm0, zmm5, zmm0
vaddpd zmm0, zmm4, zmm0
vaddpd zmm1, zmm2, zmm1
vaddpd zmm1, zmm3, zmm1
add rsi, 48
and rsi, rdx
cmp rcx, rsi
je .LBB0_8
.p2align 4, 0x90
.LBB0_6: # %L114
# =>This Inner Loop Header: Depth=1
vmovupd zmm2, zmmword ptr [rax + 8*rcx]
vmovupd zmm3, zmmword ptr [rax + 8*rcx + 64]
vfmadd231pd zmm0, zmm2, zmm2 # zmm0 = (zmm2 * zmm2) + zmm0
vfmadd231pd zmm1, zmm3, zmm3 # zmm1 = (zmm3 * zmm3) + zmm1
add rcx, 16
cmp rsi, rcx
jne .LBB0_6
# %bb.7:
mov rcx, rsi
.LBB0_8: # %L123
cmp rcx, rdx
jae .LBB0_12
# %bb.9: # %L125
mov esi, edx
and esi, 7
mov edi, 8
cmovne edi, esi
mov esi, -1
bzhi esi, esi, edi
add rdx, -9
cmp rdx, rcx
jge .LBB0_11
# %bb.10: # %L136
kmovd k1, esi
vmovupd zmm2 {k1} {z}, zmmword ptr [rax + 8*rcx]
vfmadd231pd zmm0 {k1}, zmm2, zmm2 # zmm0 {%k1} = (zmm2 * zmm2) + zmm0
jmp .LBB0_12
.LBB0_11: # %L143
vmovupd zmm2, zmmword ptr [rax + 8*rcx]
kmovd k1, esi
vmovupd zmm3 {k1} {z}, zmmword ptr [rax + 8*rcx + 64]
vfmadd231pd zmm0, zmm2, zmm2 # zmm0 = (zmm2 * zmm2) + zmm0
vfmadd231pd zmm1 {k1}, zmm3, zmm3 # zmm1 {%k1} = (zmm3 * zmm3) + zmm1
.LBB0_12: # %L152
vaddpd zmm0, zmm0, zmm1
vextractf64x4 ymm1, zmm0, 1
vaddpd zmm0, zmm0, zmm1
vextractf128 xmm1, ymm0, 1
vaddpd xmm0, xmm0, xmm1
vpermilpd xmm1, xmm0, 1 # xmm1 = xmm0[1,0]
vaddsd xmm0, xmm0, xmm1
vzeroupper
ret
.Lfunc_end0:
.size julia_squared_norm_turbo_4421, .Lfunc_end0-julia_squared_norm_turbo_4421
.cfi_endproc
# -- End function
.section ".note.GNU-stack","",@progbits For .text
.file "squared_norm_turbo"
.globl julia_squared_norm_turbo_4284 # -- Begin function julia_squared_norm_turbo_4284
.p2align 4, 0x90
.type julia_squared_norm_turbo_4284,@function
julia_squared_norm_turbo_4284: # @julia_squared_norm_turbo_4284
.cfi_startproc
# %bb.0: # %top
movabs rsi, 9223372036854775680
mov rax, qword ptr [rdi]
mov rdx, qword ptr [rdi + 8]
cmp rdx, 128
jae .LBB0_2
# %bb.1:
vxorps xmm0, xmm0, xmm0
xor ecx, ecx
vxorps xmm1, xmm1, xmm1
add rsi, 96
and rsi, rdx
cmp rcx, rsi
je .LBB0_8
jmp .LBB0_6
.LBB0_2: # %L30.preheader
mov rcx, rdx
and rcx, rsi
vxorps xmm1, xmm1, xmm1
xor edi, edi
vxorps xmm0, xmm0, xmm0
vxorps xmm3, xmm3, xmm3
vxorps xmm4, xmm4, xmm4
vxorps xmm2, xmm2, xmm2
vxorps xmm5, xmm5, xmm5
vxorps xmm6, xmm6, xmm6
vxorps xmm7, xmm7, xmm7
.p2align 4, 0x90
.LBB0_3: # %L49
# =>This Inner Loop Header: Depth=1
vmovups zmm8, zmmword ptr [rax + 4*rdi]
vmovups zmm9, zmmword ptr [rax + 4*rdi + 64]
vmovups zmm10, zmmword ptr [rax + 4*rdi + 128]
vmovups zmm11, zmmword ptr [rax + 4*rdi + 192]
vmovups zmm12, zmmword ptr [rax + 4*rdi + 256]
vmovups zmm13, zmmword ptr [rax + 4*rdi + 320]
vmovups zmm14, zmmword ptr [rax + 4*rdi + 384]
vmovups zmm15, zmmword ptr [rax + 4*rdi + 448]
vfmadd231ps zmm7, zmm8, zmm8 # zmm7 = (zmm8 * zmm8) + zmm7
vfmadd231ps zmm6, zmm9, zmm9 # zmm6 = (zmm9 * zmm9) + zmm6
vfmadd231ps zmm5, zmm10, zmm10 # zmm5 = (zmm10 * zmm10) + zmm5
vfmadd231ps zmm2, zmm11, zmm11 # zmm2 = (zmm11 * zmm11) + zmm2
vfmadd231ps zmm4, zmm12, zmm12 # zmm4 = (zmm12 * zmm12) + zmm4
vfmadd231ps zmm3, zmm13, zmm13 # zmm3 = (zmm13 * zmm13) + zmm3
vfmadd231ps zmm0, zmm14, zmm14 # zmm0 = (zmm14 * zmm14) + zmm0
vfmadd231ps zmm1, zmm15, zmm15 # zmm1 = (zmm15 * zmm15) + zmm1
sub rdi, -128
cmp rcx, rdi
jne .LBB0_3
# %bb.4: # %L79
vaddps zmm4, zmm7, zmm4
vaddps zmm3, zmm6, zmm3
vaddps zmm0, zmm5, zmm0
vaddps zmm0, zmm4, zmm0
vaddps zmm1, zmm2, zmm1
vaddps zmm1, zmm3, zmm1
add rsi, 96
and rsi, rdx
cmp rcx, rsi
je .LBB0_8
.p2align 4, 0x90
.LBB0_6: # %L114
# =>This Inner Loop Header: Depth=1
vmovups zmm2, zmmword ptr [rax + 4*rcx]
vmovups zmm3, zmmword ptr [rax + 4*rcx + 64]
vfmadd231ps zmm0, zmm2, zmm2 # zmm0 = (zmm2 * zmm2) + zmm0
vfmadd231ps zmm1, zmm3, zmm3 # zmm1 = (zmm3 * zmm3) + zmm1
add rcx, 32
cmp rsi, rcx
jne .LBB0_6
# %bb.7:
mov rcx, rsi
.LBB0_8: # %L123
cmp rcx, rdx
jae .LBB0_12
# %bb.9: # %L125
mov esi, edx
and esi, 15
mov edi, 16
cmovne edi, esi
mov esi, -1
bzhi esi, esi, edi
add rdx, -17
cmp rdx, rcx
jge .LBB0_11
# %bb.10: # %L136
kmovd k1, esi
vmovups zmm2 {k1} {z}, zmmword ptr [rax + 4*rcx]
vfmadd231ps zmm0 {k1}, zmm2, zmm2 # zmm0 {%k1} = (zmm2 * zmm2) + zmm0
jmp .LBB0_12
.LBB0_11: # %L143
vmovups zmm2, zmmword ptr [rax + 4*rcx]
kmovd k1, esi
vmovups zmm3 {k1} {z}, zmmword ptr [rax + 4*rcx + 64]
vfmadd231ps zmm0, zmm2, zmm2 # zmm0 = (zmm2 * zmm2) + zmm0
vfmadd231ps zmm1 {k1}, zmm3, zmm3 # zmm1 {%k1} = (zmm3 * zmm3) + zmm1
.LBB0_12: # %L152
vaddps zmm0, zmm0, zmm1
vextractf64x4 ymm1, zmm0, 1
vaddps zmm0, zmm0, zmm1
vextractf128 xmm1, ymm0, 1
vaddps xmm0, xmm0, xmm1
vpermilpd xmm1, xmm0, 1 # xmm1 = xmm0[1,0]
vaddps xmm0, xmm0, xmm1
vmovshdup xmm1, xmm0 # xmm1 = xmm0[1,1,3,3]
vaddss xmm0, xmm0, xmm1
vzeroupper
ret
.Lfunc_end0:
.size julia_squared_norm_turbo_4284, .Lfunc_end0-julia_squared_norm_turbo_4284
.cfi_endproc
# -- End function
.section ".note.GNU-stack","",@progbits It unrolls by 8 and loops, reduces to 2 vectors and loops, and then handles the rest with branches and masking. |
I will ask Intel's LLVM team to take a look at whether it is beneficial to also unroll the AVX512 remainder vector loop by 4 so that it matches AVX2 for small inputs. |
https://github.com/JuliaSIMD/LoopVectorization.jl
If LoopVectorization.jl generates the SIMD code, then the unroll amount of the <16 x f32> loop, whether to generate an <8 x f32> remainder loop and how much to unroll it, and so on are all decided by LoopVectorization.jl, and LLVM's LoopVectorizePass is not used by Julia. Then what is the effect of -mprefer-vector-width=512 on the LLVM compiler in the following quote?

"You need to start Julia with -C"native,-prefer-256-bit" (disable 256 bit preference, letting it use 512 bit) or if running C/C++/Fortran using gcc/clang compile with -mprefer-vector-width=512 (enable 512 bit preference, letting it use 512 bit)."

My original impression from reading this quote was that LoopVectorization.jl generates a scalar LLVM loop and leaves the vectorization to LLVM. Now, it seems that we only need to change LoopVectorization.jl such that the 512 bit vectorization code is no slower than 256 bit vectorization in general, if we ignore the effect of frequency throttling and the array length test for entering the 512 bit loop. |
There are two different pieces of code here. Pure Julia, as is, does use LLVM's loop vectorizer as part of our LLVM pipeline (I recommend taking a look at that if you are interested, since we don't re-use Clang's pipeline). There are various Julia packages, LoopVectorization.jl being the most prominent, that allow users to short-circuit the default pipeline to some extent and implement their own scheduling. So for all the code that doesn't use LV.jl, the LLVM options you mentioned are still necessary, and improving LLVM's vectorization support will also be greatly beneficial for Julia. |
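To make that distinction concrete, here is a small sketch mirroring the squared-norm functions earlier in the thread (assumes LoopVectorization.jl is installed; sumsq_simd/sumsq_turbo are illustrative names):

using InteractiveUtils, LoopVectorization

function sumsq_simd(x)
    # vectorized by LLVM's LoopVectorize pass inside Julia's pipeline
    s = zero(eltype(x))
    @inbounds @simd for i = eachindex(x)
        s += x[i]^2
    end
    return s
end

function sumsq_turbo(x)
    # vectorized by LoopVectorization.jl itself, before LLVM sees the loop
    s = zero(eltype(x))
    @turbo for i = eachindex(x)
        s += x[i]^2
    end
    return s
end

x = rand(Float32, 1024)
@code_llvm debuginfo=:none sumsq_simd(x)   # vector width here follows the -C target flags
@code_llvm debuginfo=:none sumsq_turbo(x)  # vector width here is chosen by LoopVectorization.jl

Both print vectorized IR, but only the @simd version's vector width follows the -C target flags, which is why the LLVM options still matter for all code that doesn't use LV.jl.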
That was the first thing I fixed in the rewrite. LoopVectorization.jl also causes excessive latency because of the way it's designed (also fixed), and it doesn't do any real dependence analysis. LoopModels will ultimately replace LoopVectorization (though dependence analysis is actually the only thing it does so far); it'll work more like Enzyme + Enzyme.jl, by creating a custom LLVM pass pipeline that includes the LoopModels pass. Most vectorized Julia code today relies on LLVM's LoopVectorize, as vchuravy said.
LoopVectorization.jl is already generating 512 bit code that is (almost?) always faster than the 256 bit versions.

julia> @benchmark test_all_sizes!(squared_norm_simd, $y, $x, $Ns)
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
Range (min … max): 4.649 μs … 6.464 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 4.679 μs ┊ GC (median): 0.00%
Time (mean ± σ): 4.696 μs ± 94.129 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▂▅▇█▅▄▅▇▃▁▁ ▁ ▂ ▂
███████████▅▁▁▁▁▁▁▄▅██▆▃▁▁▁▁▃▁▁▃▃▁▁▃▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▅██▇▆▇ █
4.65 μs Histogram: log(frequency) by time 5.09 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark test_all_sizes!(squared_norm_turbo, $y, $x, $Ns)
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
Range (min … max): 2.218 μs … 3.514 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.231 μs ┊ GC (median): 0.00%
Time (mean ± σ): 2.235 μs ± 43.030 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▄ █▁
▂▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁█▁▁▁▁▁▁▁██▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▂ ▂
2.22 μs Histogram: frequency by time 2.25 μs <
Memory estimate: 0 bytes, allocs estimate: 0.

However, that's probably mostly because LLVM does not unroll aggressively on the M1 at all, despite the M1 having tons of execution resources. Of course, an average placing uniform weight on all sizes 1-256 isn't a perfect representation of average speed, and scalar loops on fallback would have the advantage of avoiding SIMD instructions altogether, which may be nice if they're called from largely non-SIMD code so that the units could potentially be turned off. I will, in a few months, start writing new SIMD code generation for LoopModels. |
Thank you! Let's take this offline to find a way toward the goal of "always default to the max vector register size" in vectorized code. |
Fixed by #46499 |
LLVM enabled AVX512FP16 instructions for native _Float16 support on SPR (https://reviews.llvm.org/D105263) and for SSE2 and above (https://reviews.llvm.org/D105263). Julia still converts Float16 to Float32 for X86 targets.
There has been similar Julia enablement work for A64FX (#40216). Most of that work can be generalized to support AVX512FP16.
The intention of this issue is to track the changes needed for _Float16 to work natively on AVX512FP16-enabled X86 targets.
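A minimal, illustrative way to check the behaviour described above (the add16 name is just an example; the exact output depends on the Julia build and target):

using InteractiveUtils

# On targets without native Float16 arithmetic, the IR widens the operands to
# float (fpext), adds, and truncates back (fptrunc); with AVX512-FP16 (or A64FX)
# support the addition should stay as a single `fadd half` / vaddsh, as in the
# dumps earlier in this thread.
add16(a, b) = a + b
@code_llvm debuginfo=:none add16(Float16(1), Float16(2))
@code_native debuginfo=:none add16(Float16(1), Float16(2))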