Hardware Float16 on AVX512-FP16 for SPR #46498

Closed
pengtu opened this issue Aug 26, 2022 · 30 comments · Fixed by #46499
Labels
compiler:simd instruction-level vectorization float16

Comments

@pengtu

pengtu commented Aug 26, 2022

LLVM enabled AVX512FP16 instructions for native _Float16 support on SPR (https://reviews.llvm.org/D105263) and for SSE2 and above (https://reviews.llvm.org/D105263), but Julia still converts Float16 to Float32 on x86 targets.

There has been similar Julia enablement work for A64FX (#40216); most of it can be generalized to support AVX512FP16.

This issue tracks the changes needed for _Float16 to work natively on AVX512FP16-enabled x86 targets.
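
For reference, the demotion is easy to observe from the REPL (a minimal sketch; the exact IR will vary by build):

# On an x86 build without native Float16 support, the addition is widened:
# the IR is expected to show fpext half -> float, an fadd on float, and an
# fptrunc back to half, rather than a single fadd half.
f(a, b) = a + b
@code_llvm f(Float16(1), Float16(2))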

@gbaraldi
Member

#46499 should eventually fix it. I can't test the changes and there might be some ABI issues with Float16 but hopefully not.
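
A minimal cross-language sanity check for the ABI could look something like the sketch below (the half_add shim and the libfp16shim library name are hypothetical, not part of the PR):

# Hypothetical C shim, compiled separately with an FP16-capable clang:
#   _Float16 half_add(_Float16 a, _Float16 b) { return a + b; }
# If Julia's Float16 calling convention matches C's _Float16, the value
# round-trips exactly.
a, b = Float16(1.5), Float16(0.25)
c = ccall((:half_add, "libfp16shim"), Float16, (Float16, Float16), a, b)
@assert c == Float16(1.75)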

@pengtu
Author

pengtu commented Aug 26, 2022

I can test the changes on a pre-release SPR. Let me know if there are any specific tests to focus on.

@gbaraldi
Member

I would like to see whether versioninfo detects the CPU as Sapphire Rapids, and whether it emits native Float16 instructions as expected, i.e.:

versioninfo()

foo(a,b) = a + b
a = Float16(1)
b = Float16(2)


@code_llvm foo(a,b)
@code_native foo(a,b)

This is with the PR checked out and built.

@gbaraldi
Member

gbaraldi commented Aug 26, 2022

And maybe a test to see whether it vectorizes, like:

function vecmap(arr)
    out = similar(arr)
    @simd for i in eachindex(arr)
        out[i] = Float16(1.5)* arr[i]
    end
    out
end 

arr = rand(Float16, 10000)

@code_llvm vecmap(arr)

@pengtu
Author

pengtu commented Aug 26, 2022

The CPU detection of sapphirerapids is correct and the native Float16 instructions are correctly emitted. However, it doesn't seem to vectorize.

julia> versioninfo()
Julia Version 1.9.0-DEV.1130
Commit d64b9d8ef4 (2022-08-26 18:36 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 224 × Genuine Intel(R) CPU 0000%@
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.5 (ORCJIT, sapphirerapids)

julia> @code_llvm f(a,b)
;  @ REPL[2]:1 within `f`
define half @julia_f_238(half %0, half %1) #0 {
top:
; ┌ @ float.jl:383 within `+`
   %2 = fadd half %0, %1
; └
  ret half %2
}
julia> @code_native f(a,b)
        .text
        .file   "f"
        .globl  julia_f_262                     # -- Begin function julia_f_262
        .p2align        4, 0x90
        .type   julia_f_262,@function
julia_f_262:                            # @julia_f_262
; ┌ @ REPL[2]:1 within `f`
        .cfi_startproc
# %bb.0:                                # %top
; │┌ @ float.jl:383 within `+`
        vaddsh  %xmm1, %xmm0, %xmm0
; │└
        retq
.Lfunc_end0:
        .size   julia_f_262, .Lfunc_end0-julia_f_262
        .cfi_endproc
; └
                                        # -- End function
        .section        ".note.GNU-stack","",@progbits
julia> function vecmap(arr)
           out = similar(arr)
           @simd for i in eachindex(arr)
               out[i] = Float16(1.5)* arr[i]
           end
           out
       end
vecmap (generic function with 1 method)

julia> arr = rand(Float16, 10000)
10000-element Vector{Float16}:
 0.03564
 0.4238
 0.2988
 0.0454
 0.9985
 0.4468
 0.4312
 
 0.0801
 0.4556
 0.562
 0.435
 0.708
 0.6074
 0.5283

julia> @code_llvm vecmap(arr)
;  @ REPL[8]:1 within `vecmap`
define nonnull {}* @julia_vecmap_306({}* noundef nonnull align 16 dereferenceable(40) %0) #0 {
top:
;  @ REPL[8]:2 within `vecmap`
; ┌ @ array.jl:369 within `similar`
; │┌ @ array.jl:148 within `size`
    %1 = bitcast {}* %0 to { i8*, i64, i16, i16, i32 }*
    %2 = getelementptr inbounds { i8*, i64, i16, i16, i32 }, { i8*, i64, i16, i16, i32 }* %1, i64 0, i32 1
    %3 = load i64, i64* %2, align 8
; │└
; │┌ @ boot.jl:475 within `Array`
    %4 = call nonnull {}* inttoptr (i64 140397938503504 to {}* ({}*, i64)*)({}* inttoptr (i64 140397806285568 to {}*), i64 %3)
; └└
;  @ REPL[8]:3 within `vecmap`
; ┌ @ simdloop.jl:69 within `macro expansion`
; │┌ @ abstractarray.jl:297 within `eachindex`
; ││┌ @ abstractarray.jl:128 within `axes1`
; │││┌ @ abstractarray.jl:95 within `axes`
; ││││┌ @ array.jl:149 within `size`
       %5 = load i64, i64* %2, align 8
; │└└└└
; │ @ simdloop.jl:72 within `macro expansion`
; │┌ @ int.jl:83 within `<`
    %.not = icmp eq i64 %5, 0
; │└
   br i1 %.not, label %L33, label %L15.preheader

L15.preheader:                                    ; preds = %top
   %6 = bitcast {}* %4 to { i8*, i64, i16, i16, i32 }*
   %7 = getelementptr inbounds { i8*, i64, i16, i16, i32 }, { i8*, i64, i16, i16, i32 }* %6, i64 0, i32 1
   %8 = bitcast {}* %0 to half**
   %9 = load half*, half** %8, align 8
   %10 = bitcast {}* %4 to half**
; │ @ simdloop.jl:77 within `macro expansion` @ REPL[8]:4
; │┌ @ array.jl:957 within `setindex!`
    %.pre = load i64, i64* %7, align 8
; │└
; │┌ @ essentials.jl:13 within `getindex`
    br label %idxend

L33:                                              ; preds = %idxend2, %top
; └└
;  @ REPL[8]:6 within `vecmap`
  ret {}* %4

idxend:                                           ; preds = %idxend2, %L15.preheader
  %value_phi8 = phi i64 [ %11, %idxend2 ], [ 0, %L15.preheader ]
;  @ REPL[8]:3 within `vecmap`
; ┌ @ simdloop.jl:76 within `macro expansion`
; │┌ @ simdloop.jl:54 within `simd_index`
; ││┌ @ int.jl:87 within `+`
     %11 = add nuw nsw i64 %value_phi8, 1
; │└└
; │ @ simdloop.jl:77 within `macro expansion` @ REPL[8]:4
; │┌ @ array.jl:957 within `setindex!`
    %12 = icmp ult i64 %value_phi8, %.pre
    br i1 %12, label %idxend2, label %oob1

oob1:                                             ; preds = %idxend
    %13 = alloca i64, align 8
    store i64 %11, i64* %13, align 8
    call void @ijl_bounds_error_ints({}* %4, i64* nonnull %13, i64 1)
    unreachable

idxend2:                                          ; preds = %idxend
; │└
; │┌ @ essentials.jl:13 within `getindex`
    %14 = getelementptr inbounds half, half* %9, i64 %value_phi8
    %15 = load half, half* %14, align 2
; │└
; │┌ @ float.jl:385 within `*`
    %16 = fmul half %15, 0xH3E00
; │└
; │┌ @ array.jl:957 within `setindex!`
    %17 = load half*, half** %10, align 8
    %18 = getelementptr inbounds half, half* %17, i64 %value_phi8
    store half %16, half* %18, align 2
; │└
; │ @ simdloop.jl:75 within `macro expansion`
   %exitcond.not = icmp eq i64 %11, %5
   br i1 %exitcond.not, label %L33, label %idxend
; └
}
julia> @code_native vecmap(arr)
        .text
        .file   "vecmap"
        .section        .rodata,"a",@progbits
        .p2align        1                               # -- Begin function julia_vecmap_309
.LCPI0_0:
        .short  0x3e00                          # half 1.5
        .text
        .globl  julia_vecmap_309
        .p2align        4, 0x90
        .type   julia_vecmap_309,@function
julia_vecmap_309:                       # @julia_vecmap_309
; ┌ @ REPL[8]:1 within `vecmap`
        .cfi_startproc
# %bb.0:                                # %top
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset %rbp, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register %rbp
        pushq   %rbx
        subq    $8, %rsp
        .cfi_offset %rbx, -24
        movq    %rdi, %rbx
; │ @ REPL[8]:2 within `vecmap`
; │┌ @ array.jl:369 within `similar`
; ││┌ @ array.jl:148 within `size`
        movq    8(%rdi), %rsi
        movabsq $140397806285568, %rdi          # imm = 0x7FB0E95EF300
        movabsq $140397938503504, %rax          # imm = 0x7FB0F1406F50
; ││└
; ││┌ @ boot.jl:475 within `Array`
        callq   *%rax
; │└└
; │ @ REPL[8]:3 within `vecmap`
; │┌ @ simdloop.jl:69 within `macro expansion`
; ││┌ @ abstractarray.jl:297 within `eachindex`
; │││┌ @ abstractarray.jl:128 within `axes1`
; ││││┌ @ abstractarray.jl:95 within `axes`
; │││││┌ @ array.jl:149 within `size`
        movq    8(%rbx), %rdx
; ││└└└└
; ││ @ simdloop.jl:72 within `macro expansion`
; ││┌ @ int.jl:83 within `<`
        testq   %rdx, %rdx
; ││└
        je      .LBB0_8
# %bb.1:                                # %L15.preheader
        movq    (%rbx), %rsi
; ││ @ simdloop.jl:77 within `macro expansion` @ REPL[8]:4
; ││┌ @ array.jl:957 within `setindex!`
        movq    8(%rax), %rdi
        xorl    %ecx, %ecx
        movabsq $.LCPI0_0, %rbx
        vmovsh  (%rbx), %xmm0
        .p2align        4, 0x90
.LBB0_2:                                # %idxend
                                        # =>This Inner Loop Header: Depth=1
        cmpq    %rdi, %rcx
        jae     .LBB0_3
# %bb.7:                                # %idxend2
                                        #   in Loop: Header=BB0_2 Depth=1
; ││└
; ││┌ @ float.jl:385 within `*`
        vmulsh  (%rsi,%rcx,2), %xmm0, %xmm1
; ││└
; ││┌ @ array.jl:957 within `setindex!`
        movq    (%rax), %rbx
        vmovsh  %xmm1, (%rbx,%rcx,2)
; ││└
; ││ @ simdloop.jl within `macro expansion`
        leaq    1(%rcx), %rbx
        movq    %rbx, %rcx
; ││ @ simdloop.jl:75 within `macro expansion`
        cmpq    %rbx, %rdx
        jne     .LBB0_2
.LBB0_8:                                # %L33
; │└
; │ @ REPL[8]:6 within `vecmap`
        leaq    -8(%rbp), %rsp
        popq    %rbx
        popq    %rbp
        .cfi_def_cfa %rsp, 8
        retq
.LBB0_3:                                # %oob1
; │ @ REPL[8]:3 within `vecmap`
; │┌ @ simdloop.jl:77 within `macro expansion` @ REPL[8]:4
; ││┌ @ array.jl:957 within `setindex!`
        .cfi_def_cfa %rbp, 16
        movq    %rsp, %rsi
        movl    $16, %edx
        subq    %rdx, %rsi
        cmpq    %rsp, %rsi
        jge     .LBB0_6
.LBB0_5:                                # %oob1
                                        # =>This Inner Loop Header: Depth=1
        xorq    $0, (%rsp)
        subq    $4096, %rsp                     # imm = 0x1000
        cmpq    %rsp, %rsi
        jl      .LBB0_5
.LBB0_6:                                # %oob1
        movq    %rsi, %rsp
        incq    %rcx
        movq    %rcx, (%rsi)
        movabsq $ijl_bounds_error_ints, %rcx
        movl    $1, %edx
        movq    %rax, %rdi
        callq   *%rcx
.Lfunc_end0:
        .size   julia_vecmap_309, .Lfunc_end0-julia_vecmap_309
        .cfi_endproc
; └└└
                                        # -- End function
        .section        ".note.GNU-stack","",@progbits

@gbaraldi
Member

My vectorizable function wasn't vectorizable at all 😄.
Could you try with:

function vecmap(arr)
           out = similar(arr)
           @inbounds @simd for i in eachindex(arr)
               out[i] = Float16(1.5)* arr[i]
           end
           out
       end

arr = rand(Float16, 10000)

@code_llvm vecmap(arr)

This vectorizes on an M1 Mac.

@pengtu
Author

pengtu commented Aug 27, 2022

Yes, this one is vectorized. Look at this beauty 😊

julia> @code_llvm vecmap(arr)
;  @ REPL[1]:1 within `vecmap`

define nonnull {}* @julia_vecmap_140({}* noundef nonnull align 16 dereferenceable(40) %0) #0 {
top:
;  @ REPL[1]:2 within `vecmap`
; ┌ @ array.jl:369 within `similar`
; │┌ @ array.jl:148 within `size`
    %1 = bitcast {}* %0 to { i8*, i64, i16, i16, i32 }*
    %2 = getelementptr inbounds { i8*, i64, i16, i16, i32 }, { i8*, i64, i16, i16, i32 }* %1, i64 0, i32 1
    %3 = load i64, i64* %2, align 8
; │└
; │┌ @ boot.jl:475 within `Array`
    %4 = call nonnull {}* inttoptr (i64 139892987219792 to {}* ({}*, i64)*)({}* inttoptr (i64 139892815207312 to {}*), i64 %3)
; └└
;  @ REPL[1]:3 within `vecmap`
; ┌ @ simdloop.jl:69 within `macro expansion`
; │┌ @ abstractarray.jl:297 within `eachindex`
; ││┌ @ abstractarray.jl:128 within `axes1`
; │││┌ @ abstractarray.jl:95 within `axes`
; ││││┌ @ array.jl:149 within `size`
       %5 = load i64, i64* %2, align 8
; │└└└└
; │ @ simdloop.jl:72 within `macro expansion`
; │┌ @ int.jl:83 within `<`
    %.not = icmp eq i64 %5, 0
; │└
   br i1 %.not, label %L33, label %iter.check

iter.check:                                       ; preds = %top
   %6 = bitcast {}* %0 to half**
   %7 = load half*, half** %6, align 8
   %8 = bitcast {}* %4 to half**
   %9 = load half*, half** %8, align 8
; │ @ simdloop.jl:75 within `macro expansion`
   %min.iters.check = icmp ult i64 %5, 8
   br i1 %min.iters.check, label %vec.epilog.scalar.ph, label %vector.memcheck

vector.memcheck:                                  ; preds = %iter.check
   %scevgep = getelementptr half, half* %9, i64 %5
   %scevgep7 = getelementptr half, half* %7, i64 %5
   %bound0 = icmp ult half* %9, %scevgep7
   %bound1 = icmp ult half* %7, %scevgep
   %found.conflict = and i1 %bound0, %bound1
   br i1 %found.conflict, label %vec.epilog.scalar.ph, label %vector.main.loop.iter.check

vector.main.loop.iter.check:                      ; preds = %vector.memcheck
   %min.iters.check9 = icmp ult i64 %5, 64
   br i1 %min.iters.check9, label %vec.epilog.ph, label %vector.ph

vector.ph:                                        ; preds = %vector.main.loop.iter.check
   %n.vec = and i64 %5, 9223372036854775744
   br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
; │ @ simdloop.jl:78 within `macro expansion`
; │┌ @ int.jl:87 within `+`
    %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
    %10 = getelementptr inbounds half, half* %7, i64 %index
; │└
; │ @ simdloop.jl:77 within `macro expansion` @ REPL[1]:4
; │┌ @ essentials.jl:13 within `getindex`
    %11 = bitcast half* %10 to <16 x half>*
    %wide.load = load <16 x half>, <16 x half>* %11, align 2
    %12 = getelementptr inbounds half, half* %10, i64 16
    %13 = bitcast half* %12 to <16 x half>*
    %wide.load10 = load <16 x half>, <16 x half>* %13, align 2
    %14 = getelementptr inbounds half, half* %10, i64 32
    %15 = bitcast half* %14 to <16 x half>*
    %wide.load11 = load <16 x half>, <16 x half>* %15, align 2
    %16 = getelementptr inbounds half, half* %10, i64 48
    %17 = bitcast half* %16 to <16 x half>*
    %wide.load12 = load <16 x half>, <16 x half>* %17, align 2
; │└
; │┌ @ float.jl:385 within `*`
    %18 = fmul <16 x half> %wide.load, <half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00>
    %19 = fmul <16 x half> %wide.load10, <half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00>
    %20 = fmul <16 x half> %wide.load11, <half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00>
    %21 = fmul <16 x half> %wide.load12, <half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00>
; │└
; │ @ simdloop.jl:78 within `macro expansion`
; │┌ @ int.jl:87 within `+`
    %22 = getelementptr inbounds half, half* %9, i64 %index
; │└
; │ @ simdloop.jl:77 within `macro expansion` @ REPL[1]:4
; │┌ @ array.jl:957 within `setindex!`
    %23 = bitcast half* %22 to <16 x half>*
    store <16 x half> %18, <16 x half>* %23, align 2
    %24 = getelementptr inbounds half, half* %22, i64 16
    %25 = bitcast half* %24 to <16 x half>*
    store <16 x half> %19, <16 x half>* %25, align 2
    %26 = getelementptr inbounds half, half* %22, i64 32
    %27 = bitcast half* %26 to <16 x half>*
    store <16 x half> %20, <16 x half>* %27, align 2
    %28 = getelementptr inbounds half, half* %22, i64 48
    %29 = bitcast half* %28 to <16 x half>*
    store <16 x half> %21, <16 x half>* %29, align 2
; │└
; │ @ simdloop.jl:78 within `macro expansion`
; │┌ @ int.jl:87 within `+`
    %index.next = add nuw i64 %index, 64
    %30 = icmp eq i64 %index.next, %n.vec
    br i1 %30, label %middle.block, label %vector.body

middle.block:                                     ; preds = %vector.body
; │└
; │ @ simdloop.jl:75 within `macro expansion`
   %cmp.n = icmp eq i64 %5, %n.vec
   br i1 %cmp.n, label %L33, label %vec.epilog.iter.check

vec.epilog.iter.check:                            ; preds = %middle.block
   %n.vec.remaining = and i64 %5, 56
   %min.epilog.iters.check = icmp eq i64 %n.vec.remaining, 0
   br i1 %min.epilog.iters.check, label %vec.epilog.scalar.ph, label %vec.epilog.ph

vec.epilog.ph:                                    ; preds = %vec.epilog.iter.check, %vector.main.loop.iter.check
   %vec.epilog.resume.val = phi i64 [ %n.vec, %vec.epilog.iter.check ], [ 0, %vector.main.loop.iter.check ]
   %n.vec14 = and i64 %5, 9223372036854775800
   br label %vec.epilog.vector.body

vec.epilog.vector.body:                           ; preds = %vec.epilog.vector.body, %vec.epilog.ph
; │ @ simdloop.jl:78 within `macro expansion`
; │┌ @ int.jl:87 within `+`
    %index16 = phi i64 [ %vec.epilog.resume.val, %vec.epilog.ph ], [ %index.next18, %vec.epilog.vector.body ]
    %31 = getelementptr inbounds half, half* %7, i64 %index16
; │└
; │ @ simdloop.jl:77 within `macro expansion` @ REPL[1]:4
; │┌ @ essentials.jl:13 within `getindex`
    %32 = bitcast half* %31 to <8 x half>*
    %wide.load17 = load <8 x half>, <8 x half>* %32, align 2
; │└
; │┌ @ float.jl:385 within `*`
    %33 = fmul <8 x half> %wide.load17, <half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00>
; │└
; │ @ simdloop.jl:78 within `macro expansion`
; │┌ @ int.jl:87 within `+`
    %34 = getelementptr inbounds half, half* %9, i64 %index16
; │└
; │ @ simdloop.jl:77 within `macro expansion` @ REPL[1]:4
; │┌ @ array.jl:957 within `setindex!`
    %35 = bitcast half* %34 to <8 x half>*
    store <8 x half> %33, <8 x half>* %35, align 2
; │└
; │ @ simdloop.jl:78 within `macro expansion`
; │┌ @ int.jl:87 within `+`
    %index.next18 = add nuw i64 %index16, 8
    %36 = icmp eq i64 %index.next18, %n.vec14
    br i1 %36, label %vec.epilog.middle.block, label %vec.epilog.vector.body

vec.epilog.middle.block:                          ; preds = %vec.epilog.vector.body
; │└
; │ @ simdloop.jl:75 within `macro expansion`
   %cmp.n15 = icmp eq i64 %5, %n.vec14
   br i1 %cmp.n15, label %L33, label %vec.epilog.scalar.ph

vec.epilog.scalar.ph:                             ; preds = %vec.epilog.middle.block, %vec.epilog.iter.check, %vector.memcheck, %iter.check
   %bc.resume.val = phi i64 [ %n.vec14, %vec.epilog.middle.block ], [ %n.vec, %vec.epilog.iter.check ], [ 0, %vector.memcheck ], [ 0, %iter.check ]
   br label %L15

L15:                                              ; preds = %L15, %vec.epilog.scalar.ph
   %value_phi5 = phi i64 [ %bc.resume.val, %vec.epilog.scalar.ph ], [ %41, %L15 ]
; │ @ simdloop.jl:77 within `macro expansion` @ REPL[1]:4
; │┌ @ essentials.jl:13 within `getindex`
    %37 = getelementptr inbounds half, half* %7, i64 %value_phi5
    %38 = load half, half* %37, align 2
; │└
; │┌ @ float.jl:385 within `*`
    %39 = fmul half %38, 0xH3E00
; │└
; │┌ @ array.jl:957 within `setindex!`
    %40 = getelementptr inbounds half, half* %9, i64 %value_phi5
    store half %39, half* %40, align 2
; │└
; │ @ simdloop.jl:78 within `macro expansion`
; │┌ @ int.jl:87 within `+`
    %41 = add nuw nsw i64 %value_phi5, 1
; │└
; │ @ simdloop.jl:75 within `macro expansion`
; │┌ @ int.jl:83 within `<`
    %exitcond.not = icmp eq i64 %41, %5
; │└
   br i1 %exitcond.not, label %L33, label %L15

L33:                                              ; preds = %L15, %vec.epilog.middle.block, %middle.block, %top
; └
;  @ REPL[1]:6 within `vecmap`
  ret {}* %4
}

Thanks!

@giordano
Contributor

Last time I played with a Julia nightly on A64FX a couple of months ago, I was able to get vscale instructions even without @inbounds @simd, which was quite impressive. I wonder whether LLVM autovectorisation on ARM is more aggressive than on x86_64?

@pengtu
Author

pengtu commented Aug 27, 2022

Also note that the vector length is <16 x half> while AVX512-FP16 can do <32 x half>.

@gbaraldi
Member

It is odd that it chose to generate <16 x half> vectors. Does clang generate 32-wide vectors for similar code in C/C++?

@pengtu
Author

pengtu commented Aug 29, 2022

LLVM 14 vectorized to <32 x half>:

void foo(__fp16 *a)
{
        __fp16 c = 1.5;
        for (int i = 0; i < 10000; i++) {
                a[i] = c*a[i];
        }
}
        .text
        .file   "test.c"
        .section        .rodata,"a",@progbits
        .p2align        1                               # -- Begin function foo
.LCPI0_0:
        .short  0x3e00                          # half 1.5
        .text
        .globl  foo
        .p2align        4, 0x90
        .type   foo,@function
foo:                                    # @foo
        .cfi_startproc
# %bb.0:                                # %iter.check
        movl    $224, %eax
        vpbroadcastw    .LCPI0_0(%rip), %zmm0   # zmm0 = [1.5E+0,1.5E+0,1.5E+0,1.5E+0,1.5E+0,1.5E+0,1.5E+0,1.5E+0,1.5E+0,1.5E+0,1.5E+
0,1.5E+0,1.5E+0,1.5E+0,1.5E+0,1.5E+0,1.5E+0,1.5E+0,1.5E+0,1.5E+0,1.5E+0,1.5E+0,1.5E+0,1.5E+0,1.5E+0,1.5E+0,1.5E+0,1.5E+0,1.5E+0,1.5E+
0,1.5E+0,1.5E+0]
        .p2align        4, 0x90
.LBB0_1:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
        vmulph  -448(%rdi,%rax,2), %zmm0, %zmm1
        vmulph  -384(%rdi,%rax,2), %zmm0, %zmm2
        vmulph  -320(%rdi,%rax,2), %zmm0, %zmm3
        vmulph  -256(%rdi,%rax,2), %zmm0, %zmm4
        vmovups %zmm1, -448(%rdi,%rax,2)
        vmovups %zmm2, -384(%rdi,%rax,2)
        vmovups %zmm3, -320(%rdi,%rax,2)
        vmovups %zmm4, -256(%rdi,%rax,2)
        vmulph  -192(%rdi,%rax,2), %zmm0, %zmm1
        vmulph  -128(%rdi,%rax,2), %zmm0, %zmm2
        vmulph  -64(%rdi,%rax,2), %zmm0, %zmm3
        vmulph  (%rdi,%rax,2), %zmm0, %zmm4
        vmovups %zmm1, -192(%rdi,%rax,2)
        vmovups %zmm2, -128(%rdi,%rax,2)
        vmovups %zmm3, -64(%rdi,%rax,2)
        vmovups %zmm4, (%rdi,%rax,2)
        addq    $256, %rax                      # imm = 0x100
        cmpq    $10208, %rax                    # imm = 0x27E0
        jne     .LBB0_1
# %bb.2:                                # %vec.epilog.vector.body
        vmovups 19968(%rdi), %ymm0
        vmulph  .LCPI0_0(%rip){1to16}, %ymm0, %ymm0
        vmovups %ymm0, 19968(%rdi)
        vzeroupper
        retq
.Lfunc_end0:
        .size   foo, .Lfunc_end0-foo
        .cfi_endproc
                                        # -- End function
        .ident  "clang version 14.0.6 (https://github.com/llvm/llvm-project.git f28c006a5895fc0e329fe15fead81e37457cb1d1)"
        .section        ".note.GNU-stack","",@progbits
        .addrsig
*** IR Dump After LoopVectorizePass on foo ***
; Function Attrs: nofree norecurse nosync nounwind uwtable
define dso_local void @foo(half* nocapture noundef %0) local_unnamed_addr #0 {
iter.check:
  br i1 false, label %vec.epilog.scalar.ph, label %vector.main.loop.iter.check

vector.main.loop.iter.check:                      ; preds = %iter.check
  br i1 false, label %vec.epilog.ph, label %vector.ph

vector.ph:                                        ; preds = %vector.main.loop.iter.check
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %1 = add i64 %index, 0
  %2 = add i64 %index, 32
  %3 = add i64 %index, 64
  %4 = add i64 %index, 96
  %5 = getelementptr inbounds half, half* %0, i64 %1
  %6 = getelementptr inbounds half, half* %0, i64 %2
  %7 = getelementptr inbounds half, half* %0, i64 %3
  %8 = getelementptr inbounds half, half* %0, i64 %4
  %9 = getelementptr inbounds half, half* %5, i32 0
  %10 = bitcast half* %9 to <32 x half>*
  %wide.load = load <32 x half>, <32 x half>* %10, align 2, !tbaa !3
  %11 = getelementptr inbounds half, half* %5, i32 32
  %12 = bitcast half* %11 to <32 x half>*
  %wide.load11 = load <32 x half>, <32 x half>* %12, align 2, !tbaa !3
  %13 = getelementptr inbounds half, half* %5, i32 64
  %14 = bitcast half* %13 to <32 x half>*
  %wide.load12 = load <32 x half>, <32 x half>* %14, align 2, !tbaa !3
  %15 = getelementptr inbounds half, half* %5, i32 96
  %16 = bitcast half* %15 to <32 x half>*
  %wide.load13 = load <32 x half>, <32 x half>* %16, align 2, !tbaa !3
  %17 = fmul <32 x half> %wide.load, <half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00>
  %18 = fmul <32 x half> %wide.load11, <half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00>
  %19 = fmul <32 x half> %wide.load12, <half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00>
  %20 = fmul <32 x half> %wide.load13, <half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00>
  %21 = bitcast half* %9 to <32 x half>*
  store <32 x half> %17, <32 x half>* %21, align 2, !tbaa !3
  %22 = bitcast half* %11 to <32 x half>*
  store <32 x half> %18, <32 x half>* %22, align 2, !tbaa !3
  %23 = bitcast half* %13 to <32 x half>*
  store <32 x half> %19, <32 x half>* %23, align 2, !tbaa !3
  %24 = bitcast half* %15 to <32 x half>*
  store <32 x half> %20, <32 x half>* %24, align 2, !tbaa !3
  %index.next = add nuw i64 %index, 128
  %25 = icmp eq i64 %index.next, 9984
  br i1 %25, label %middle.block, label %vector.body, !llvm.loop !7

middle.block:                                     ; preds = %vector.body
  %cmp.n = icmp eq i64 10000, 9984
  br i1 %cmp.n, label %33, label %vec.epilog.iter.check

vec.epilog.iter.check:                            ; preds = %middle.block
  br i1 false, label %vec.epilog.scalar.ph, label %vec.epilog.ph

vec.epilog.ph:                                    ; preds = %vector.main.loop.iter.check, %vec.epilog.iter.check
  %vec.epilog.resume.val = phi i64 [ 9984, %vec.epilog.iter.check ], [ 0, %vector.main.loop.iter.check ]
  br label %vec.epilog.vector.body

vec.epilog.vector.body:                           ; preds = %vec.epilog.vector.body, %vec.epilog.ph
  %index15 = phi i64 [ %vec.epilog.resume.val, %vec.epilog.ph ], [ %index.next17, %vec.epilog.vector.body ]
  %26 = add i64 %index15, 0
  %27 = getelementptr inbounds half, half* %0, i64 %26
  %28 = getelementptr inbounds half, half* %27, i32 0
  %29 = bitcast half* %28 to <16 x half>*
  %wide.load16 = load <16 x half>, <16 x half>* %29, align 2, !tbaa !3
  %30 = fmul <16 x half> %wide.load16, <half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00, half 0xH3E00>
  %31 = bitcast half* %28 to <16 x half>*
  store <16 x half> %30, <16 x half>* %31, align 2, !tbaa !3
  %index.next17 = add nuw i64 %index15, 16
  %32 = icmp eq i64 %index.next17, 10000
  br i1 %32, label %vec.epilog.middle.block, label %vec.epilog.vector.body, !llvm.loop !10

vec.epilog.middle.block:                          ; preds = %vec.epilog.vector.body
  %cmp.n14 = icmp eq i64 10000, 10000
  br i1 %cmp.n14, label %.loopexit, label %vec.epilog.scalar.ph

vec.epilog.scalar.ph:                             ; preds = %iter.check, %vec.epilog.iter.check, %vec.epilog.middle.block
  %bc.resume.val = phi i64 [ 10000, %vec.epilog.middle.block ], [ 9984, %vec.epilog.iter.check ], [ 0, %iter.check ]
  br label %34

.loopexit:                                        ; preds = %vec.epilog.middle.block, %34
  br label %33

33:                                               ; preds = %.loopexit, %middle.block
  ret void

34:                                               ; preds = %vec.epilog.scalar.ph, %34
  %indvars.iv = phi i64 [ %bc.resume.val, %vec.epilog.scalar.ph ], [ %indvars.iv.next, %34 ]
  %35 = getelementptr inbounds half, half* %0, i64 %indvars.iv
  %36 = load half, half* %35, align 2, !tbaa !3
  %37 = fmul half %36, 0xH3E00
  store half %37, half* %35, align 2, !tbaa !3
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond.not = icmp eq i64 %indvars.iv.next, 10000
  br i1 %exitcond.not, label %.loopexit, label %34, !llvm.loop !12
}

@gbaraldi
Member

@chriselrod Any idea why it would do that?

@chriselrod
Contributor

chriselrod commented Aug 29, 2022

@chriselrod Any idea why it would do that?

You need to start Julia with -C"native,-prefer-256-bit" (disable the 256-bit preference, letting it use 512-bit vectors), or, if compiling C/C++/Fortran with gcc/clang, use -mprefer-vector-width=512 (enable a 512-bit preference, letting it use 512-bit vectors).

E.g., I tend to start Julia with something like

julia -O3 -C"native,-prefer-256-bit" -q --project=~/path/to/project -t18
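
With that flag, the earlier vecmap example should pick the wider vectors; a quick way to confirm (a sketch, re-using the function from above):

# Started as: julia -C"native,-prefer-256-bit"
function vecmap(arr)
    out = similar(arr)
    @inbounds @simd for i in eachindex(arr)
        out[i] = Float16(1.5) * arr[i]
    end
    out
end

arr = rand(Float16, 10000)
# Look for fmul <32 x half> in the vector body (zmm registers in @code_native)
# instead of the <16 x half> seen with the default 256-bit preference.
@code_llvm vecmap(arr)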

@gbaraldi
Member

Why do we not default to the max size?

@chriselrod
Contributor

chriselrod commented Aug 30, 2022

Why do we not default to the max size?

My flippant response would be "because LLVM is bad at SIMD with large vectors".

For example, you can try this script.

using LoopVectorization, Random, BenchmarkTools, LinuxPerf

@inline function squared_norm_simd(x)
  s = zero(eltype(x))
  @simd for i = eachindex(x)
    @inbounds s += x[i]^2
  end
  s
end
@inline function squared_norm_turbo(x)
  s = zero(eltype(x))
  @turbo for i = eachindex(x)
    s += x[i]^2
  end
  s
end

function test_all_sizes!(f::F, y, x, Ns) where {F}
  for n = eachindex(Ns)
    y[n] = f(@view(x[1:Ns[n]]))
  end
end

foreachf(f::F, N::Int, arg1::A, args::Vararg{Any,K}) where {F,A,K} = foreach(_ -> f(arg1, args...), 1:N)

x = rand(256); # specify the max problem size (and number of problems) here
y = similar(x);
Ns = randperm(Random.Xoshiro(123), length(x));

@benchmark test_all_sizes!(squared_norm_simd, $y, $x, $Ns)
@benchmark test_all_sizes!(squared_norm_turbo, $y, $x, $Ns)

@pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
  @time foreachf(test_all_sizes!, 10, squared_norm_simd, y, x, Ns)
end
@pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
  @time foreachf(test_all_sizes!, 10, squared_norm_turbo, y, x, Ns)
end

@pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
  @time foreachf(test_all_sizes!, 1_000_000, squared_norm_simd, y, x, Ns)
end
@pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
  @time foreachf(test_all_sizes!, 1_000_000, squared_norm_turbo, y, x, Ns)
end

This benchmarks a squared norm (or a dot product of a vector with itself) over all sizes 1,...,N in a random order, to give an idea of how the function performs across a wide range of problem sizes.

Note that if you adjust the max problem size, you'll probably also want to adjust the number of iterations used in foreachf. I normally aim for a few seconds or so of runtime.
You do want to keep the number of iterations equal across your comparisons, so that you're comparing stats for the same number of evaluations. Therefore, I do not recommend @benchmark inside of @pstats.

Results using a 10980XE CPU, under default settings (note that LoopVectorization.jl always prefers the maximum vector width):

julia> @benchmark test_all_sizes!(squared_norm_simd, $y, $x, $Ns)
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
 Range (min  max):  3.192 μs   4.551 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     3.208 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.215 μs ± 32.882 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▅█▇
  ▂▃████▅▃▂▃▃▃▃▃▃▃▂▂▂▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂ ▃
  3.19 μs        Histogram: frequency by time        3.38 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark test_all_sizes!(squared_norm_turbo, $y, $x, $Ns)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min  max):  1.753 μs   5.131 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     1.758 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.762 μs ± 47.027 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   █
  ▃█▆▃▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂▂ ▂
  1.75 μs        Histogram: frequency by time        1.92 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
         @time foreachf(test_all_sizes!, 1_000_000, squared_norm_simd, y, x, Ns)
       end
  3.239151 seconds (2 allocations: 64 bytes)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               1.36e+10   33.3%  #  4.2 cycles per ns
└ task-clock               3.24e+09   33.3%  #  3.2 s
┌ instructions             4.26e+10   66.7%  #  3.1 insns per cycle
│ branch-instructions      5.75e+09   66.7%  # 13.5% of insns
└ branch-misses            9.62e+06   66.7%  #  0.2% of branch insns
┌ L1-dcache-load-misses    8.52e+04   33.3%  #  0.0% of dcache loads
│ L1-dcache-loads          1.01e+10   33.3%
│ cache-misses             4.80e+01   33.3%  # 94.1% of cache refs
└ cache-references         5.10e+01   33.3%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
         @time foreachf(test_all_sizes!, 1_000_000, squared_norm_turbo, y, x, Ns)
       end
  1.798680 seconds (2 allocations: 64 bytes)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               7.17e+09   33.3%  #  4.0 cycles per ns
└ task-clock               1.80e+09   33.3%  #  1.8 s
┌ instructions             2.76e+10   66.7%  #  3.8 insns per cycle
│ branch-instructions      2.87e+09   66.7%  # 10.4% of insns
└ branch-misses            3.09e+04   66.7%  #  0.0% of branch insns
┌ L1-dcache-load-misses    8.26e+04   33.3%  #  0.0% of dcache loads
│ L1-dcache-loads          4.97e+09   33.3%
│ cache-misses             2.30e+03   33.3%  # 23.3% of cache refs
└ cache-references         9.87e+03   33.3%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Using 512 bit vectors, LoopVectorization.jl was around 80% faster!
A couple of things to note from the @pstats output: the @simd version ran at 4.2 GHz, while the @turbo version down-clocked to 4 GHz, because they were using 256-bit vs 512-bit instructions, respectively.
@simd required about 50% more instructions than @turbo to evaluate the loop, and also achieved about 20% fewer instructions per clock cycle.

So, let's try Julia with -C"native,-prefer-256-bit" so the @simd version can also use 512 bit vectors...

julia> @benchmark test_all_sizes!(squared_norm_simd, $y, $x, $Ns)
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
 Range (min  max):  4.292 μs   8.629 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     4.329 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.334 μs ± 61.638 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

       ▂▁▂▅▅▇█▃
  ▂▂▃▆█████████▇▄▃▂▂▁▁▁▂▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  4.29 μs        Histogram: frequency by time        4.52 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark test_all_sizes!(squared_norm_turbo, $y, $x, $Ns)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min  max):  1.884 μs   5.338 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     1.889 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.892 μs ± 39.159 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂▇█▆                                                       ▂
  ████▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▅▆▇██ █
  1.88 μs      Histogram: log(frequency) by time     2.02 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
         @time foreachf(test_all_sizes!, 1_000_000, squared_norm_simd, y, x, Ns)
       end
  4.359129 seconds (2 allocations: 64 bytes)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               1.74e+10   33.3%  #  4.0 cycles per ns
└ task-clock               4.36e+09   33.3%  #  4.4 s
┌ instructions             4.17e+10   66.7%  #  2.4 insns per cycle
│ branch-instructions      6.78e+09   66.7%  # 16.3% of insns
└ branch-misses            4.92e+07   66.7%  #  0.7% of branch insns
┌ L1-dcache-load-misses    3.64e+05   33.3%  #  0.0% of dcache loads
│ L1-dcache-loads          8.07e+09   33.3%
│ cache-misses             1.20e+01   33.3%  # 13.8% of cache refs
└ cache-references         8.70e+01   33.3%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
         @time foreachf(test_all_sizes!, 1_000_000, squared_norm_turbo, y, x, Ns)
       end
  1.931807 seconds (2 allocations: 64 bytes)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               7.70e+09   33.3%  #  4.0 cycles per ns
└ task-clock               1.93e+09   33.3%  #  1.9 s
┌ instructions             2.76e+10   66.7%  #  3.6 insns per cycle
│ branch-instructions      2.87e+09   66.7%  # 10.4% of insns
└ branch-misses            2.77e+04   66.7%  #  0.0% of branch insns
┌ L1-dcache-load-misses    1.09e+05   33.3%  #  0.0% of dcache loads
│ L1-dcache-loads          4.97e+09   33.3%
│ cache-misses             6.90e+01   33.3%  # 40.4% of cache refs
└ cache-references         1.71e+02   33.3%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

The @simd version is now actually slower with 512 bit vectors than it was with 256 bit!!!
The clock speed dropped to 4 GHz, matching @turbo because both are using 512 bit instructions.
However, the @simd version barely had a reduction in the number of instructions it actually required to execute the code, while its instructions per clock dropped precipitously.

So, my response to "why does LLVM use 256 bit vectors by default?" is "because LLVM is so bad at generating code for 512 bit vectors even in trivial cases that it actually runs faster generating 256 bit code instead, even though if you actually do a good job, you'll get the expected performance boost from bigger vectors".

Of course, LoopVectorization.jl has tons of issues, which is why I've been working on LoopModels. That will work at the LLVM level (side-stepping most of those issues) and include dependence analysis to actually generate correct code in the face of dependencies, instead of optimistically assuming the code is free of dependencies aside from obvious reductions. The dependence analysis has already been implemented; I'm starting on cost modeling now, which is of course a part LoopVectorization.jl does reasonably well, but now with the benefit of access to LLVM's much more accurate costs instead of the table LoopVectorization.jl uses.

@gbaraldi
Member

I see. That's annoying; is that a scheduler issue or just bad codegen overall?

@chriselrod
Contributor

chriselrod commented Aug 30, 2022

I see. That's annoying, is that a scheduler issue or just bad codegen overall?

Bad codegen overall.
Specifically, it generates a SIMD loop unrolled 4x, and then a scalar loop to do the cleanup, regardless of vector width on x86.
length(x) % 32 is, on average, larger than length(x) % 16, so larger vectors result in more unoptimized scalar iterations that suffer from a dependency chain, which is why the IPC is so bad.
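
As a quick back-of-the-envelope check of that remainder effect (a sketch, assuming the 4x unroll and Float64 element counts above):

# Average number of leftover scalar iterations when the unrolled vector body
# consumes 32 vs 16 elements per iteration, over the sizes 1:256 used in the
# benchmark above.
using Statistics
mean(n % 32 for n in 1:256)   # 15.5 scalar iterations on average
mean(n % 16 for n in 1:256)   # 7.5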

EDIT:
To expand a little more, note that the unrolled SIMD loop has the same throughput as the scalar loop. Thus, the total number of unrolled vector iterations + number of scalar iterations is a decent proxy for how long the loop takes to execute; having a lot of scalar iterations is bad.
(Note, though, that if there are > 0 vector iterations, you also need to reduce the vectors to a scalar, which adds some overhead.)

@pengtu
Author

pengtu commented Aug 30, 2022

Digging a little more into the LLVM vectorizer: with a 512-bit vector width it generates 3 loops:

  1. <16 x f32> vector loop unrolled by 4
  2. <8 x f32> vector loop not unrolled
  3. scalar loop

With a 256-bit vector width, it generates:

  1. <8 x f32> vector loop unrolled by 4
  2. scalar loop

The number of iterations of the final scalar loop is the same in both cases. The performance difference comes from the following two things (a rough sketch of the resulting loop split follows this list):

  1. The <8 x f32> loop is not unrolled by 4 in the 512-bit case: for input array lengths less than 64 (16 wide × 4 unroll), all elements fall into the <8 x f32> loop; for lengths 65-127, 64 elements go through the <16 x f32> loop and the remaining 1-63 elements through the <8 x f32> loop.
  2. The reduction of <16 x f32> to a scalar in the 512-bit case. The difference is minor here; still, it would probably be better to add the upper and lower halves of the 512-bit vector into a 256-bit vector and jump to the <8 x f32> reduction code to reduce that to a scalar.
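
As a rough illustration of how a length-n input splits across those three loops (assuming the widths and unroll factors above; purely illustrative):

# Element counts for the 512-bit strategy: a <16 x f32> main loop unrolled
# by 4 (64 elements per iteration), an <8 x f32> epilogue, and a scalar tail.
function split512(n)
    main = (n ÷ 64) * 64          # handled by the unrolled <16 x f32> loop
    epi  = ((n - main) ÷ 8) * 8   # handled by the <8 x f32> epilogue
    (main = main, epilogue = epi, scalar = n - main - epi)
end

split512(100)   # (main = 64, epilogue = 32, scalar = 4)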

To reproduce:

float squared_norm_simd(float *a, int n)
{
	float s = 0.0f;
	for (int i = 0; i < n; i++) {
		s += a[i]*a[i];
	}
	return s;
}

clang -O2 -ffast-math -mavx512f --save-temps -v -mllvm -print-after-all -c squared_norm_simd.c 2> vec512.log

The LLVM and assembly dumps:

*** IR Dump After LoopVectorizePass on squared_norm_simd ***
; Function Attrs: nofree norecurse nosync nounwind readonly uwtable
define dso_local float @squared_norm_simd(float* nocapture noundef readonly %0, i32 noundef %1) local_unnamed_addr #0 {
  %3 = icmp sgt i32 %1, 0
  br i1 %3, label %iter.check, label %._crit_edge

iter.check:                                       ; preds = %2
  %wide.trip.count = zext i32 %1 to i64
  %min.iters.check = icmp ult i64 %wide.trip.count, 8
  br i1 %min.iters.check, label %vec.epilog.scalar.ph, label %vector.main.loop.iter.check

vector.main.loop.iter.check:                      ; preds = %iter.check
  %min.iters.check15 = icmp ult i64 %wide.trip.count, 64
  br i1 %min.iters.check15, label %vec.epilog.ph, label %vector.ph

vector.ph:                                        ; preds = %vector.main.loop.iter.check
  %n.mod.vf = urem i64 %wide.trip.count, 64
  %n.vec = sub i64 %wide.trip.count, %n.mod.vf
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %vec.phi = phi <16 x float> [ zeroinitializer, %vector.ph ], [ %24, %vector.body ]
  %vec.phi16 = phi <16 x float> [ zeroinitializer, %vector.ph ], [ %25, %vector.body ]
  %vec.phi17 = phi <16 x float> [ zeroinitializer, %vector.ph ], [ %26, %vector.body ]
  %vec.phi18 = phi <16 x float> [ zeroinitializer, %vector.ph ], [ %27, %vector.body ]
  %4 = add i64 %index, 0
  %5 = add i64 %index, 16
  %6 = add i64 %index, 32
  %7 = add i64 %index, 48
  %8 = getelementptr inbounds float, float* %0, i64 %4
  %9 = getelementptr inbounds float, float* %0, i64 %5
  %10 = getelementptr inbounds float, float* %0, i64 %6
  %11 = getelementptr inbounds float, float* %0, i64 %7
  %12 = getelementptr inbounds float, float* %8, i32 0
  %13 = bitcast float* %12 to <16 x float>*
  %wide.load = load <16 x float>, <16 x float>* %13, align 4, !tbaa !3
  %14 = getelementptr inbounds float, float* %8, i32 16
  %15 = bitcast float* %14 to <16 x float>*
  %wide.load19 = load <16 x float>, <16 x float>* %15, align 4, !tbaa !3
  %16 = getelementptr inbounds float, float* %8, i32 32
  %17 = bitcast float* %16 to <16 x float>*
  %wide.load20 = load <16 x float>, <16 x float>* %17, align 4, !tbaa !3
  %18 = getelementptr inbounds float, float* %8, i32 48
  %19 = bitcast float* %18 to <16 x float>*
  %wide.load21 = load <16 x float>, <16 x float>* %19, align 4, !tbaa !3
  %20 = fmul fast <16 x float> %wide.load, %wide.load
  %21 = fmul fast <16 x float> %wide.load19, %wide.load19
  %22 = fmul fast <16 x float> %wide.load20, %wide.load20
  %23 = fmul fast <16 x float> %wide.load21, %wide.load21
  %24 = fadd fast <16 x float> %20, %vec.phi
  %25 = fadd fast <16 x float> %21, %vec.phi16
  %26 = fadd fast <16 x float> %22, %vec.phi17
  %27 = fadd fast <16 x float> %23, %vec.phi18
  %index.next = add nuw i64 %index, 64
  %28 = icmp eq i64 %index.next, %n.vec
  br i1 %28, label %middle.block, label %vector.body, !llvm.loop !7

middle.block:                                     ; preds = %vector.body
  %bin.rdx = fadd fast <16 x float> %25, %24
  %bin.rdx22 = fadd fast <16 x float> %26, %bin.rdx
  %bin.rdx23 = fadd fast <16 x float> %27, %bin.rdx22
  %29 = call fast float @llvm.vector.reduce.fadd.v16f32(float -0.000000e+00, <16 x float> %bin.rdx23)
  %cmp.n = icmp eq i64 %wide.trip.count, %n.vec
  br i1 %cmp.n, label %._crit_edge.loopexit, label %vec.epilog.iter.check

vec.epilog.iter.check:                            ; preds = %middle.block
  %n.vec.remaining = sub i64 %wide.trip.count, %n.vec
  %min.epilog.iters.check = icmp ult i64 %n.vec.remaining, 8
  br i1 %min.epilog.iters.check, label %vec.epilog.scalar.ph, label %vec.epilog.ph

vec.epilog.ph:                                    ; preds = %vector.main.loop.iter.check, %vec.epilog.iter.check
  %bc.merge.rdx = phi float [ 0.000000e+00, %vector.main.loop.iter.check ], [ %29, %vec.epilog.iter.check ]
  %vec.epilog.resume.val = phi i64 [ %n.vec, %vec.epilog.iter.check ], [ 0, %vector.main.loop.iter.check ]
  %n.mod.vf25 = urem i64 %wide.trip.count, 8
  %n.vec26 = sub i64 %wide.trip.count, %n.mod.vf25
  %30 = insertelement <8 x float> zeroinitializer, float %bc.merge.rdx, i32 0
  br label %vec.epilog.vector.body

vec.epilog.vector.body:                           ; preds = %vec.epilog.vector.body, %vec.epilog.ph
  %index28 = phi i64 [ %vec.epilog.resume.val, %vec.epilog.ph ], [ %index.next31, %vec.epilog.vector.body ]
  %vec.phi29 = phi <8 x float> [ %30, %vec.epilog.ph ], [ %36, %vec.epilog.vector.body ]
  %31 = add i64 %index28, 0
  %32 = getelementptr inbounds float, float* %0, i64 %31
  %33 = getelementptr inbounds float, float* %32, i32 0
  %34 = bitcast float* %33 to <8 x float>*
  %wide.load30 = load <8 x float>, <8 x float>* %34, align 4, !tbaa !3
  %35 = fmul fast <8 x float> %wide.load30, %wide.load30
  %36 = fadd fast <8 x float> %35, %vec.phi29
  %index.next31 = add nuw i64 %index28, 8
  %37 = icmp eq i64 %index.next31, %n.vec26
  br i1 %37, label %vec.epilog.middle.block, label %vec.epilog.vector.body, !llvm.loop !10

vec.epilog.middle.block:                          ; preds = %vec.epilog.vector.body
  %38 = call fast float @llvm.vector.reduce.fadd.v8f32(float -0.000000e+00, <8 x float> %36)
  %cmp.n27 = icmp eq i64 %wide.trip.count, %n.vec26
  br i1 %cmp.n27, label %._crit_edge.loopexit.loopexit, label %vec.epilog.scalar.ph

vec.epilog.scalar.ph:                             ; preds = %iter.check, %vec.epilog.iter.check, %vec.epilog.middle.block
  %bc.resume.val = phi i64 [ %n.vec26, %vec.epilog.middle.block ], [ %n.vec, %vec.epilog.iter.check ], [ 0, %iter.check ]
  %bc.merge.rdx32 = phi float [ 0.000000e+00, %iter.check ], [ %29, %vec.epilog.iter.check ], [ %38, %vec.epilog.middle.block ]
  br label %.lr.ph

._crit_edge.loopexit.loopexit:                    ; preds = %vec.epilog.middle.block, %.lr.ph
  %.lcssa24 = phi float [ %42, %.lr.ph ], [ %38, %vec.epilog.middle.block ]
  br label %._crit_edge.loopexit

._crit_edge.loopexit:                             ; preds = %._crit_edge.loopexit.loopexit, %middle.block
  %.lcssa = phi float [ %29, %middle.block ], [ %.lcssa24, %._crit_edge.loopexit.loopexit ]
  br label %._crit_edge

._crit_edge:                                      ; preds = %._crit_edge.loopexit, %2
  %.011.lcssa = phi float [ 0.000000e+00, %2 ], [ %.lcssa, %._crit_edge.loopexit ]
  ret float %.011.lcssa

.lr.ph:                                           ; preds = %vec.epilog.scalar.ph, %.lr.ph
  %indvars.iv = phi i64 [ %bc.resume.val, %vec.epilog.scalar.ph ], [ %indvars.iv.next, %.lr.ph ]
  %.01112 = phi float [ %bc.merge.rdx32, %vec.epilog.scalar.ph ], [ %42, %.lr.ph ]
  %39 = getelementptr inbounds float, float* %0, i64 %indvars.iv
  %40 = load float, float* %39, align 4, !tbaa !3
  %41 = fmul fast float %40, %40
  %42 = fadd fast float %41, %.01112
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond.not = icmp eq i64 %indvars.iv.next, %wide.trip.count
  br i1 %exitcond.not, label %._crit_edge.loopexit.loopexit, label %.lr.ph, !llvm.loop !12
}
	.text
	.file	"squared_norm_simd.c"
	.globl	squared_norm_simd               # -- Begin function squared_norm_simd
	.p2align	4, 0x90
	.type	squared_norm_simd,@function
squared_norm_simd:                      # @squared_norm_simd
	.cfi_startproc
# %bb.0:
	testl	%esi, %esi
	jle	.LBB0_1
# %bb.2:                                # %iter.check
	movl	%esi, %eax
	cmpl	$8, %esi
	jae	.LBB0_4
# %bb.3:
	vxorps	%xmm0, %xmm0, %xmm0
	xorl	%ecx, %ecx
	jmp	.LBB0_17
.LBB0_1:
	vxorps	%xmm0, %xmm0, %xmm0
	retq
.LBB0_4:                                # %vector.main.loop.iter.check
	cmpl	$64, %esi
	jae	.LBB0_6
# %bb.5:
	vxorps	%xmm0, %xmm0, %xmm0
	xorl	%ecx, %ecx
	jmp	.LBB0_14
.LBB0_6:                                # %vector.ph
	movl	%eax, %ecx
	andl	$-64, %ecx
	leaq	-64(%rcx), %rdx
	movq	%rdx, %r8
	shrq	$6, %r8
	addq	$1, %r8
	testq	%rdx, %rdx
	je	.LBB0_7
# %bb.8:                                # %vector.ph.new
	movq	%r8, %rdx
	andq	$-2, %rdx
	vxorps	%xmm0, %xmm0, %xmm0
	xorl	%esi, %esi
	vxorps	%xmm1, %xmm1, %xmm1
	vxorps	%xmm2, %xmm2, %xmm2
	vxorps	%xmm3, %xmm3, %xmm3
	.p2align	4, 0x90
.LBB0_9:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
	vmovups	(%rdi,%rsi,4), %zmm4
	vmovups	64(%rdi,%rsi,4), %zmm5
	vmovups	128(%rdi,%rsi,4), %zmm6
	vmovups	192(%rdi,%rsi,4), %zmm7
	vfmadd213ps	%zmm0, %zmm4, %zmm4     # zmm4 = (zmm4 * zmm4) + zmm0
	vfmadd213ps	%zmm1, %zmm5, %zmm5     # zmm5 = (zmm5 * zmm5) + zmm1
	vfmadd213ps	%zmm2, %zmm6, %zmm6     # zmm6 = (zmm6 * zmm6) + zmm2
	vfmadd213ps	%zmm3, %zmm7, %zmm7     # zmm7 = (zmm7 * zmm7) + zmm3
	vmovups	256(%rdi,%rsi,4), %zmm0
	vmovups	320(%rdi,%rsi,4), %zmm1
	vmovups	384(%rdi,%rsi,4), %zmm2
	vmovups	448(%rdi,%rsi,4), %zmm3
	vfmadd213ps	%zmm4, %zmm0, %zmm0     # zmm0 = (zmm0 * zmm0) + zmm4
	vfmadd213ps	%zmm5, %zmm1, %zmm1     # zmm1 = (zmm1 * zmm1) + zmm5
	vfmadd213ps	%zmm6, %zmm2, %zmm2     # zmm2 = (zmm2 * zmm2) + zmm6
	vfmadd213ps	%zmm7, %zmm3, %zmm3     # zmm3 = (zmm3 * zmm3) + zmm7
	subq	$-128, %rsi
	addq	$-2, %rdx
	jne	.LBB0_9
# %bb.10:                               # %middle.block.unr-lcssa
	testb	$1, %r8b
	je	.LBB0_12
.LBB0_11:                               # %vector.body.epil
	vmovups	(%rdi,%rsi,4), %zmm4
	vmovups	64(%rdi,%rsi,4), %zmm5
	vmovups	128(%rdi,%rsi,4), %zmm6
	vmovups	192(%rdi,%rsi,4), %zmm7
	vfmadd231ps	%zmm4, %zmm4, %zmm0     # zmm0 = (zmm4 * zmm4) + zmm0
	vfmadd231ps	%zmm5, %zmm5, %zmm1     # zmm1 = (zmm5 * zmm5) + zmm1
	vfmadd231ps	%zmm6, %zmm6, %zmm2     # zmm2 = (zmm6 * zmm6) + zmm2
	vfmadd231ps	%zmm7, %zmm7, %zmm3     # zmm3 = (zmm7 * zmm7) + zmm3
.LBB0_12:                               # %middle.block
	vaddps	%zmm3, %zmm1, %zmm1
	vaddps	%zmm2, %zmm0, %zmm0
	vaddps	%zmm1, %zmm0, %zmm0
	vextractf64x4	$1, %zmm0, %ymm1
	vaddps	%zmm1, %zmm0, %zmm0
	vextractf128	$1, %ymm0, %xmm1
	vaddps	%xmm1, %xmm0, %xmm0
	vpermilpd	$1, %xmm0, %xmm1        # xmm1 = xmm0[1,0]
	vaddps	%xmm1, %xmm0, %xmm0
	vmovshdup	%xmm0, %xmm1            # xmm1 = xmm0[1,1,3,3]
	vaddss	%xmm1, %xmm0, %xmm0
	cmpq	%rax, %rcx
	je	.LBB0_18
# %bb.13:                               # %vec.epilog.iter.check
	testb	$56, %al
	je	.LBB0_17
.LBB0_14:                               # %vec.epilog.ph
	movq	%rcx, %rdx
	movl	%eax, %ecx
	andl	$-8, %ecx
	vxorps	%xmm1, %xmm1, %xmm1
	vblendps	$1, %xmm0, %xmm1, %xmm0         # xmm0 = xmm0[0],xmm1[1,2,3]
	.p2align	4, 0x90
.LBB0_15:                               # %vec.epilog.vector.body
                                        # =>This Inner Loop Header: Depth=1
	vmovups	(%rdi,%rdx,4), %ymm1
	vfmadd231ps	%ymm1, %ymm1, %ymm0     # ymm0 = (ymm1 * ymm1) + ymm0
	addq	$8, %rdx
	cmpq	%rdx, %rcx
	jne	.LBB0_15
# %bb.16:                               # %vec.epilog.middle.block
	vextractf128	$1, %ymm0, %xmm1
	vaddps	%xmm1, %xmm0, %xmm0
	vpermilpd	$1, %xmm0, %xmm1        # xmm1 = xmm0[1,0]
	vaddps	%xmm1, %xmm0, %xmm0
	vmovshdup	%xmm0, %xmm1            # xmm1 = xmm0[1,1,3,3]
	vaddss	%xmm1, %xmm0, %xmm0
	cmpq	%rax, %rcx
	je	.LBB0_18
	.p2align	4, 0x90
.LBB0_17:                               # %.lr.ph
                                        # =>This Inner Loop Header: Depth=1
	vmovss	(%rdi,%rcx,4), %xmm1            # xmm1 = mem[0],zero,zero,zero
	vfmadd231ss	%xmm1, %xmm1, %xmm0     # xmm0 = (xmm1 * xmm1) + xmm0
	addq	$1, %rcx
	cmpq	%rcx, %rax
	jne	.LBB0_17
.LBB0_18:                               # %._crit_edge
	vzeroupper
	retq
.LBB0_7:
	vxorps	%xmm0, %xmm0, %xmm0
	xorl	%esi, %esi
	vxorps	%xmm1, %xmm1, %xmm1
	vxorps	%xmm2, %xmm2, %xmm2
	vxorps	%xmm3, %xmm3, %xmm3
	testb	$1, %r8b
	jne	.LBB0_11
	jmp	.LBB0_12
.Lfunc_end0:
	.size	squared_norm_simd, .Lfunc_end0-squared_norm_simd
	.cfi_endproc
                                        # -- End function
	.ident	"clang version 14.0.6 (https://github.com/llvm/llvm-project.git f28c006a5895fc0e329fe15fead81e37457cb1d1)"
	.section	".note.GNU-stack","",@progbits
	.addrsig

@chriselrod
Contributor

FWIW, @turbo code for Float64:

        .text
        .file   "squared_norm_turbo"
        .globl  julia_squared_norm_turbo_4421   # -- Begin function julia_squared_norm_turbo_4421
        .p2align        4, 0x90
        .type   julia_squared_norm_turbo_4421,@function
julia_squared_norm_turbo_4421:          # @julia_squared_norm_turbo_4421
        .cfi_startproc
# %bb.0:                                # %top
        movabs  rsi, 9223372036854775744
        mov     rax, qword ptr [rdi]
        mov     rdx, qword ptr [rdi + 8]
        cmp     rdx, 64
        jae     .LBB0_2
# %bb.1:
        vxorpd  xmm0, xmm0, xmm0
        xor     ecx, ecx
        vxorpd  xmm1, xmm1, xmm1
        add     rsi, 48
        and     rsi, rdx
        cmp     rcx, rsi
        je      .LBB0_8
        jmp     .LBB0_6
.LBB0_2:                                # %L30.preheader
        mov     rcx, rdx
        and     rcx, rsi
        vxorpd  xmm1, xmm1, xmm1
        xor     edi, edi
        vxorpd  xmm0, xmm0, xmm0
        vxorpd  xmm3, xmm3, xmm3
        vxorpd  xmm4, xmm4, xmm4
        vxorpd  xmm2, xmm2, xmm2
        vxorpd  xmm5, xmm5, xmm5
        vxorpd  xmm6, xmm6, xmm6
        vxorpd  xmm7, xmm7, xmm7
        .p2align        4, 0x90
.LBB0_3:                                # %L49
                                        # =>This Inner Loop Header: Depth=1
        vmovupd zmm8, zmmword ptr [rax + 8*rdi]
        vmovupd zmm9, zmmword ptr [rax + 8*rdi + 64]
        vmovupd zmm10, zmmword ptr [rax + 8*rdi + 128]
        vmovupd zmm11, zmmword ptr [rax + 8*rdi + 192]
        vmovupd zmm12, zmmword ptr [rax + 8*rdi + 256]
        vmovupd zmm13, zmmword ptr [rax + 8*rdi + 320]
        vmovupd zmm14, zmmword ptr [rax + 8*rdi + 384]
        vmovupd zmm15, zmmword ptr [rax + 8*rdi + 448]
        vfmadd231pd     zmm7, zmm8, zmm8        # zmm7 = (zmm8 * zmm8) + zmm7
        vfmadd231pd     zmm6, zmm9, zmm9        # zmm6 = (zmm9 * zmm9) + zmm6
        vfmadd231pd     zmm5, zmm10, zmm10      # zmm5 = (zmm10 * zmm10) + zmm5
        vfmadd231pd     zmm2, zmm11, zmm11      # zmm2 = (zmm11 * zmm11) + zmm2
        vfmadd231pd     zmm4, zmm12, zmm12      # zmm4 = (zmm12 * zmm12) + zmm4
        vfmadd231pd     zmm3, zmm13, zmm13      # zmm3 = (zmm13 * zmm13) + zmm3
        vfmadd231pd     zmm0, zmm14, zmm14      # zmm0 = (zmm14 * zmm14) + zmm0
        vfmadd231pd     zmm1, zmm15, zmm15      # zmm1 = (zmm15 * zmm15) + zmm1
        add     rdi, 64
        cmp     rcx, rdi
        jne     .LBB0_3
# %bb.4:                                # %L79
        vaddpd  zmm4, zmm7, zmm4
        vaddpd  zmm3, zmm6, zmm3
        vaddpd  zmm0, zmm5, zmm0
        vaddpd  zmm0, zmm4, zmm0
        vaddpd  zmm1, zmm2, zmm1
        vaddpd  zmm1, zmm3, zmm1
        add     rsi, 48
        and     rsi, rdx
        cmp     rcx, rsi
        je      .LBB0_8
        .p2align        4, 0x90
.LBB0_6:                                # %L114
                                        # =>This Inner Loop Header: Depth=1
        vmovupd zmm2, zmmword ptr [rax + 8*rcx]
        vmovupd zmm3, zmmword ptr [rax + 8*rcx + 64]
        vfmadd231pd     zmm0, zmm2, zmm2        # zmm0 = (zmm2 * zmm2) + zmm0
        vfmadd231pd     zmm1, zmm3, zmm3        # zmm1 = (zmm3 * zmm3) + zmm1
        add     rcx, 16
        cmp     rsi, rcx
        jne     .LBB0_6
# %bb.7:
        mov     rcx, rsi
.LBB0_8:                                # %L123
        cmp     rcx, rdx
        jae     .LBB0_12
# %bb.9:                                # %L125
        mov     esi, edx
        and     esi, 7
        mov     edi, 8
        cmovne  edi, esi
        mov     esi, -1
        bzhi    esi, esi, edi
        add     rdx, -9
        cmp     rdx, rcx
        jge     .LBB0_11
# %bb.10:                               # %L136
        kmovd   k1, esi
        vmovupd zmm2 {k1} {z}, zmmword ptr [rax + 8*rcx]
        vfmadd231pd     zmm0 {k1}, zmm2, zmm2   # zmm0 {%k1} = (zmm2 * zmm2) + zmm0
        jmp     .LBB0_12
.LBB0_11:                               # %L143
        vmovupd zmm2, zmmword ptr [rax + 8*rcx]
        kmovd   k1, esi
        vmovupd zmm3 {k1} {z}, zmmword ptr [rax + 8*rcx + 64]
        vfmadd231pd     zmm0, zmm2, zmm2        # zmm0 = (zmm2 * zmm2) + zmm0
        vfmadd231pd     zmm1 {k1}, zmm3, zmm3   # zmm1 {%k1} = (zmm3 * zmm3) + zmm1
.LBB0_12:                               # %L152
        vaddpd  zmm0, zmm0, zmm1
        vextractf64x4   ymm1, zmm0, 1
        vaddpd  zmm0, zmm0, zmm1
        vextractf128    xmm1, ymm0, 1
        vaddpd  xmm0, xmm0, xmm1
        vpermilpd       xmm1, xmm0, 1           # xmm1 = xmm0[1,0]
        vaddsd  xmm0, xmm0, xmm1
        vzeroupper
        ret
.Lfunc_end0:
        .size   julia_squared_norm_turbo_4421, .Lfunc_end0-julia_squared_norm_turbo_4421
        .cfi_endproc
                                        # -- End function
        .section        ".note.GNU-stack","",@progbits

For Float32:

        .text
        .file   "squared_norm_turbo"
        .globl  julia_squared_norm_turbo_4284   # -- Begin function julia_squared_norm_turbo_4284
        .p2align        4, 0x90
        .type   julia_squared_norm_turbo_4284,@function
julia_squared_norm_turbo_4284:          # @julia_squared_norm_turbo_4284
        .cfi_startproc
# %bb.0:                                # %top
        movabs  rsi, 9223372036854775680
        mov     rax, qword ptr [rdi]
        mov     rdx, qword ptr [rdi + 8]
        cmp     rdx, 128
        jae     .LBB0_2
# %bb.1:
        vxorps  xmm0, xmm0, xmm0
        xor     ecx, ecx
        vxorps  xmm1, xmm1, xmm1
        add     rsi, 96
        and     rsi, rdx
        cmp     rcx, rsi
        je      .LBB0_8
        jmp     .LBB0_6
.LBB0_2:                                # %L30.preheader
        mov     rcx, rdx
        and     rcx, rsi
        vxorps  xmm1, xmm1, xmm1
        xor     edi, edi
        vxorps  xmm0, xmm0, xmm0
        vxorps  xmm3, xmm3, xmm3
        vxorps  xmm4, xmm4, xmm4
        vxorps  xmm2, xmm2, xmm2
        vxorps  xmm5, xmm5, xmm5
        vxorps  xmm6, xmm6, xmm6
        vxorps  xmm7, xmm7, xmm7
        .p2align        4, 0x90
.LBB0_3:                                # %L49
                                        # =>This Inner Loop Header: Depth=1
        vmovups zmm8, zmmword ptr [rax + 4*rdi]
        vmovups zmm9, zmmword ptr [rax + 4*rdi + 64]
        vmovups zmm10, zmmword ptr [rax + 4*rdi + 128]
        vmovups zmm11, zmmword ptr [rax + 4*rdi + 192]
        vmovups zmm12, zmmword ptr [rax + 4*rdi + 256]
        vmovups zmm13, zmmword ptr [rax + 4*rdi + 320]
        vmovups zmm14, zmmword ptr [rax + 4*rdi + 384]
        vmovups zmm15, zmmword ptr [rax + 4*rdi + 448]
        vfmadd231ps     zmm7, zmm8, zmm8        # zmm7 = (zmm8 * zmm8) + zmm7
        vfmadd231ps     zmm6, zmm9, zmm9        # zmm6 = (zmm9 * zmm9) + zmm6
        vfmadd231ps     zmm5, zmm10, zmm10      # zmm5 = (zmm10 * zmm10) + zmm5
        vfmadd231ps     zmm2, zmm11, zmm11      # zmm2 = (zmm11 * zmm11) + zmm2
        vfmadd231ps     zmm4, zmm12, zmm12      # zmm4 = (zmm12 * zmm12) + zmm4
        vfmadd231ps     zmm3, zmm13, zmm13      # zmm3 = (zmm13 * zmm13) + zmm3
        vfmadd231ps     zmm0, zmm14, zmm14      # zmm0 = (zmm14 * zmm14) + zmm0
        vfmadd231ps     zmm1, zmm15, zmm15      # zmm1 = (zmm15 * zmm15) + zmm1
        sub     rdi, -128
        cmp     rcx, rdi
        jne     .LBB0_3
# %bb.4:                                # %L79
        vaddps  zmm4, zmm7, zmm4
        vaddps  zmm3, zmm6, zmm3
        vaddps  zmm0, zmm5, zmm0
        vaddps  zmm0, zmm4, zmm0
        vaddps  zmm1, zmm2, zmm1
        vaddps  zmm1, zmm3, zmm1
        add     rsi, 96
        and     rsi, rdx
        cmp     rcx, rsi
        je      .LBB0_8
        .p2align        4, 0x90
.LBB0_6:                                # %L114
                                        # =>This Inner Loop Header: Depth=1
        vmovups zmm2, zmmword ptr [rax + 4*rcx]
        vmovups zmm3, zmmword ptr [rax + 4*rcx + 64]
        vfmadd231ps     zmm0, zmm2, zmm2        # zmm0 = (zmm2 * zmm2) + zmm0
        vfmadd231ps     zmm1, zmm3, zmm3        # zmm1 = (zmm3 * zmm3) + zmm1
        add     rcx, 32
        cmp     rsi, rcx
        jne     .LBB0_6
# %bb.7:
        mov     rcx, rsi
.LBB0_8:                                # %L123
        cmp     rcx, rdx
        jae     .LBB0_12
# %bb.9:                                # %L125
        mov     esi, edx
        and     esi, 15
        mov     edi, 16
        cmovne  edi, esi
        mov     esi, -1
        bzhi    esi, esi, edi
        add     rdx, -17
        cmp     rdx, rcx
        jge     .LBB0_11
# %bb.10:                               # %L136
        kmovd   k1, esi
        vmovups zmm2 {k1} {z}, zmmword ptr [rax + 4*rcx]
        vfmadd231ps     zmm0 {k1}, zmm2, zmm2   # zmm0 {%k1} = (zmm2 * zmm2) + zmm0
        jmp     .LBB0_12
.LBB0_11:                               # %L143
        vmovups zmm2, zmmword ptr [rax + 4*rcx]
        kmovd   k1, esi
        vmovups zmm3 {k1} {z}, zmmword ptr [rax + 4*rcx + 64]
        vfmadd231ps     zmm0, zmm2, zmm2        # zmm0 = (zmm2 * zmm2) + zmm0
        vfmadd231ps     zmm1 {k1}, zmm3, zmm3   # zmm1 {%k1} = (zmm3 * zmm3) + zmm1
.LBB0_12:                               # %L152
        vaddps  zmm0, zmm0, zmm1
        vextractf64x4   ymm1, zmm0, 1
        vaddps  zmm0, zmm0, zmm1
        vextractf128    xmm1, ymm0, 1
        vaddps  xmm0, xmm0, xmm1
        vpermilpd       xmm1, xmm0, 1           # xmm1 = xmm0[1,0]
        vaddps  xmm0, xmm0, xmm1
        vmovshdup       xmm1, xmm0              # xmm1 = xmm0[1,1,3,3]
        vaddss  xmm0, xmm0, xmm1
        vzeroupper
        ret
.Lfunc_end0:
        .size   julia_squared_norm_turbo_4284, .Lfunc_end0-julia_squared_norm_turbo_4284
        .cfi_endproc
                                        # -- End function
        .section        ".note.GNU-stack","",@progbits

It unrolls by 8 and loops, reduces to 2 vectors and loops, and then handles the rest with branches and masking.
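
For reference, a rough sketch of the kind of kernels being compared here; these are guesses based on the julia_squared_norm_simd / julia_squared_norm_turbo symbol names in the assembly above and may not match the exact definitions used, and the @turbo version assumes the LoopVectorization.jl package is installed:

using LoopVectorization  # provides @turbo

# Hypothetical sum-of-squares kernel left to LLVM's LoopVectorize pass.
function squared_norm_simd(x)
    s = zero(eltype(x))
    @inbounds @simd for i in eachindex(x)
        s += x[i] * x[i]
    end
    return s
end

# Hypothetical equivalent where LoopVectorization.jl generates the SIMD code itself.
function squared_norm_turbo(x)
    s = zero(eltype(x))
    @turbo for i in eachindex(x)
        s += x[i] * x[i]
    end
    return s
end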

@oscardssmith
Member

oscardssmith commented Aug 30, 2022

@turbo is what LoopVectorization.jl provides (it used to be called @avx).

@pengtu
Author

pengtu commented Aug 30, 2022

I will ask Intel's LLVM team to take a look at whether it is beneficial to also unroll the AVX512 remainder vector loop by 4, so that it matches AVX2 for small inputs.

@chriselrod
Contributor

"@chriselrod Is @turbo the new vectorizer for Julia?"

https://github.com/JuliaSIMD/LoopVectorization.jl
does cost modeling of some loops where all arrays and scalars have primitive types, and then generates code using SIMD intrinsics.
https://github.com/JuliaSIMD/LoopModels
is an ongoing rewrite that addresses many of the problems with LoopVectorization.jl. It will work as an LLVM function pass.

@pengtu
Author

pengtu commented Aug 31, 2022

If LoopVectorization.jl generates the SIMD code, then the unroll factor of the <16 x f32> loop, whether to generate an <8 x f32> remainder loop and how much to unroll it, and so on are all decided by LoopVectorization.jl, and LLVM's LoopVectorizePass is not used by Julia.

Then, what is the effect of -mprefer-vector-width=512 on the LLVM compiler in the following quote?

"You need to start Julia with -C"native,-prefer-256-bit" (disable 256 bit preference, letting it use 512 bit) or if running C/C++/Fortran using gcc/clang compile with -mprefer-vector-width=512 (enable 512 bit preference, letting it use 512 bit)."

My original impression from reading this quote was that LoopVectorization.jl generates a scalar LLVM loop with
#pragma clang loop vectorize(enable), sets prefer-vector-width, and lets the LLVM LoopVectorizePass vectorize the loop. Hence the fix for this issue would be in the LLVM LoopVectorizePass.

Now, it seems that we only need to change LoopVectorization.jl so that the 512-bit vectorized code is, in general, no slower than the 256-bit vectorized code, ignoring the effects of frequency throttling and the array-length test for entering the 512-bit loop.
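
For reference, the two invocations described in that quote would look roughly like this (the file names are placeholders):

# Julia: drop the 256-bit preference so the JIT may use full 512-bit vectors
julia -C"native,-prefer-256-bit" script.jl

# C/C++/Fortran with gcc/clang: prefer 512-bit vectors
clang -O3 -march=native -mprefer-vector-width=512 kernel.c -o kernel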

@vchuravy
Member

There are two different pieces of code here: pure Julia as-is does use LLVM's loop vectorizer as part of our LLVM pipeline (I recommend taking a look at that if you are interested, since we don't re-use Clang's pipeline).

There are various Julia packages, LoopVectorization.jl being the most prominent, that allow users to short-circuit the default pipeline to some extent and implement their own scheduling.

So for all the code that doesn't use LV.jl, the LLVM options you mentioned are still necessary, and improving LLVM vectorization support will also be greatly beneficial for Julia.

@chriselrod
Contributor

chriselrod commented Aug 31, 2022

LoopVectorization.jl doesn't modify Julia's compilation pipeline in that sense.
It works by generating Julia code that uses SIMD intrinsics: it operates first on the expression level (it is a macro) and then emits a call to an @generated function, which can process some type information to generate code containing SIMD intrinsics.
It also causes excessive latency because of the way it's designed (one of the first things fixed in the rewrite).
It does a fair bit of processing, but it would be much faster, and be far more general, if it were a compiler pass, e.g. if it were an LLVM pass that we used by changing the compilation pipeline.
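
As a purely illustrative sketch of that macro-plus-@generated-function pattern (this is not LoopVectorization.jl's actual code; every name below is made up):

# Toy only: the macro rewrites the loop expression into a call to a generated
# function; the generated function then sees the argument types and can return
# type-specialized code (LoopVectorization.jl emits SIMD intrinsics at this point).
macro toy_vectorize(loop)
    Meta.isexpr(loop, :for) || error("expected a for loop")
    arr = loop.args[1].args[2]          # the iterable in `for x in arr ... end`
    # The toy ignores the loop body and always computes a sum of squares.
    return :(_toy_sum_of_squares($(esc(arr))))
end

@generated function _toy_sum_of_squares(x::AbstractVector{T}) where {T}
    # The element type T is known here, so the returned expression can differ per type.
    quote
        s = zero($T)
        @inbounds @simd for i in eachindex(x)
            s += x[i] * x[i]
        end
        s
    end
end

v = rand(Float32, 1000)
@toy_vectorize for xi in v end          # expands to a call to _toy_sum_of_squares(v)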

LoopVectorization.jl also doesn't do any real dependence analysis. LoopModels will ultimately replace LoopVectorization (but dependence analysis is actually the only thing it does so far); it'll work more like Enzyme + Enzyme.jl, by creating a custom LLVM pass pipeline that includes the LoopModels pass.
This should be much better for latency, make it much more general (e.g., support many more data types), play better with system image creation and multiversioning, etc.

Most vectorized Julia code today relies on LLVM's LoopVectorize, as vchuravy said.
This will probably continue into the future. I'm hoping LoopModels' latency will be much better than LoopVectorization's (it should be, as it won't push extra code through most of the pipeline), but it'd still require a dependency, and still do much heavier analysis than LoopVectorize, so I'm expecting most people to still rely on the default LoopVectorize pass.

"Now, it seems that we only need to change LoopVectorization.jl so that the 512-bit vectorized code is, in general, no slower than the 256-bit vectorized code, ignoring the effects of frequency throttling and the array-length test for entering the 512-bit loop."

LoopVectorization.jl is already generating 512 bit code that is (almost?) always faster than 256 bit versions.
There are some cases where the code it generates is slower (e.g., loops of length 3 or less) as it always takes vector paths and masks, which adds the overhead of reducing a vector.
But I'm fairly happy with the code generation there, and in cases like the one I demonstrated here, it performs significantly better than LLVM's LoopVectorizePass when averaged over all sizes 1,...,256 for this example.
This approach doesn't seem particularly advantageous with 256 bit, and it may often be worse with 128 bit, e.g. on the M1 Mac. But I did just try the benchmark on my Mac Mini, and again saw that LoopVectorization.jl's code generation yielded much better results for this particular benchmark:

julia> @benchmark test_all_sizes!(squared_norm_simd, $y, $x, $Ns)
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
 Range (min … max):  4.649 μs … 6.464 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.679 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.696 μs ± 94.129 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂▅▇█▅▄▅▇▃▁▁          ▁                                ▂    ▂
  ███████████▅▁▁▁▁▁▁▄▅██▆▃▁▁▁▁▃▁▁▃▃▁▁▃▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▅██▇▆▇ █
  4.65 μs      Histogram: log(frequency) by time     5.09 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark test_all_sizes!(squared_norm_turbo, $y, $x, $Ns)
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  2.218 μs … 3.514 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.231 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.235 μs ± 43.030 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                  ▄       █▁                                  
  ▂▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁█▁▁▁▁▁▁▁██▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▂ ▂
  2.22 μs        Histogram: frequency by time        2.25 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

However, that's probably mostly because LLVM does not unroll aggressively on the M1 at all, despite the M1 having tons of execution resources.

Of course, an average placing uniform weight on all sizes 1-256 isn't a perfect representation of average speed, and scalar fallback loops would have the advantage of avoiding SIMD instructions altogether, which may be nice if they're called from largely non-SIMD code, so that the SIMD units could potentially be turned off.
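
(For context, a hypothetical stand-in for the benchmark driver used above; the real test_all_sizes! is defined earlier in the thread and may differ:)

# Hypothetical: run the kernel once per problem size in Ns, storing each result in y,
# so that timing the whole call weights every size equally.
function test_all_sizes!(f::F, y, x, Ns) where {F}
    for (i, n) in enumerate(Ns)
        y[i] = f(view(x, 1:n))
    end
    return y
end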

I will, in a few months, start writing new SIMD code generation for LoopModels.
This will probably broadly follow the approaches I've taken in LoopVectorization.jl, but I'm interested in the merits of alternatives in case it may be better to take a different approach, at least in some situations.

@pengtu
Author

pengtu commented Aug 31, 2022

Thank you! Let's take this offline to find a way toward the goal of "always default to the max vector register size" in the vectorized code.

@brenhinkeller added the compiler:simd instruction-level vectorization label Nov 19, 2022
@giordano
Contributor

giordano commented Feb 9, 2023

Fixed by #46499
