
reduction operations like sum(::Vec) generate scalar code #91

Closed
mchristianl opened this issue Jan 9, 2022 · 5 comments


@mchristianl

Hi,

I am learning about SIMD implementation. In the following example, an eight-component dot product is implemented:

dot1(A::Vec{8,Float32},B::Vec{8,Float32}) = sum(A * B)

Here, the multiplication generates a single AVX256 vmulps instruction, and a reduction needs to be performed for the "horizontal" sum. I would expect a hierarchical approach, but looking at the generated assembly, one can see that the sum is basically performed in a scalar manner:

vmovups  (%rdi), %ymm0        ; ymm0  = A[0,…,7]
vmulps   (%rsi), %ymm0, %ymm0 ; ymm0 *= B[0,…,7]      ⇒ C[0,…,7]
vxorps    %xmm1, %xmm1, %xmm1 ; xmm1  = 0 (break the dependency on xmm1's previous value)
vaddss    %xmm1, %xmm0, %xmm1 ; xmm1  = xmm0 + 0      ⇒ C[0,1,2,3]
vmovshdup %xmm0, %xmm2        ; xmm2  = xmm0[1,1,3,3] ⇒ C[1,1,3,3]
vaddss    %xmm2, %xmm1, %xmm1 ; xmm1 += xmm2          ⇒ C[0+1,…,…,…]
vpermilpd    $1, %xmm0, %xmm2 ; xmm2  = xmm0[1,0]     ⇒ C[2,3,0,1]
vaddss    %xmm2, %xmm1, %xmm1 ; xmm1 += xmm2          ⇒ C[0+1+2,…,…,…]
vpermilps  $255, %xmm0, %xmm2 ; xmm2  = xmm0[3,3,3,3] ⇒ C[3,3,3,3]
vaddss    %xmm2, %xmm1, %xmm1 ; xmm1 += xmm2          ⇒ C[0+1+2+3,…,…,…]
vextractf128 $1, %ymm0, %xmm0 ; xmm0  = ymm0[4,5,6,7] ⇒ C[4,5,6,7]
vaddss    %xmm0, %xmm1, %xmm1 ; xmm1 += xmm0          ⇒ C[0+1+2+3+4,…,…,…]
vmovshdup %xmm0, %xmm2        ; xmm2  = xmm0[1,1,3,3] ⇒ C[5,5,7,7]
vaddss    %xmm2, %xmm1, %xmm1 ; xmm1 += xmm2          ⇒ C[0+1+2+3+4+5,…,…,…]
vpermilpd    $1, %xmm0, %xmm2 ; xmm2  = xmm0[1,0]     ⇒ C[6,7,4,5]
vaddss    %xmm2, %xmm1, %xmm1 ; xmm1 += xmm2          ⇒ C[0+1+2+3+4+5+6,…,…,…]
vpermilps  $255, %xmm0, %xmm0 ; xmm0  = xmm0[3,3,3,3] ⇒ C[7,7,7,7]
vaddss    %xmm0, %xmm1, %xmm0 ; xmm0 += xmm1          ⇒ C[0+1+2+3+4+5+6+7,…,…,…]
vzeroupper                    ; ymm*[4,5,6,7] = 0
retq                          ; result is the lowest part of xmm0
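
For reference, a listing like the one above can be reproduced along the following lines (a sketch; the debuginfo/syntax options and the example inputs are my choice, not part of the original report):

using SIMD, InteractiveUtils

dot1(A::Vec{8,Float32}, B::Vec{8,Float32}) = sum(A * B)

A = Vec{8,Float32}(ntuple(Float32, 8))  # (1.0f0, 2.0f0, …, 8.0f0)
B = Vec{8,Float32}(ntuple(Float32, 8))

@code_native debuginfo=:none syntax=:att dot1(A, B)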

however, when the reduction is typed out manually

function dot2(A::Vec{8,Float32},B::Vec{8,Float32})
  v0 = A*B
  l0 = Vec(v0[1],v0[3],v0[5],v0[7])
  r0 = Vec(v0[2],v0[4],v0[6],v0[8])
  v1 = l0+r0
  l1 = Vec(v1[1],v1[3])
  r1 = Vec(v1[2],v1[4])
  v2 = l1+r1
  return v2[1]+v2[2]
end

then two fancy vhaddps instructions are generated and the performance increases a lot:

vmovups  (%rdi), %ymm0        ; ymm0  = A[0,…,7]
vmulps   (%rsi), %ymm0, %ymm0 ; ymm0 *= B[0,…,7]      ⇒ C[0,…,7]
vextractf128 $1, %ymm0, %xmm1 ; xmm1  = ymm0[4,5,6,7] ⇒ C[4,5,6,7]
vhaddps   %xmm1, %xmm0, %xmm0 ; xmm0  = ...           ⇒ C[5+4,7+6,1+0,3+2]
vhaddps   %xmm0, %xmm0, %xmm0 ; xmm0  = ...           ⇒ C[7+6+5+4,3+2+1+0,7+6+5+4,3+2+1+0]
vmovshdup %xmm0, %xmm1        ; xmm1  = xmm0[1,1,3,3] ⇒ C[3+2+1+0,3+2+1+0,3+2+1+0,3+2+1+0]
vaddss    %xmm1, %xmm0, %xmm0 ; xmm0 += xmm1          ⇒ C[7+6+5+4+3+2+1+0,…,…,…]
vzeroupper                    ; ymm*[4,5,6,7] = 0
retq                          ; result is the lowest part of xmm0
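
The manual version above is hard-coded for width 8. For what it's worth, a width-generic variant of the same pairwise scheme could look like this (a sketch of mine built on SIMD.jl's shufflevector, assuming a power-of-two width; the dot3 name is hypothetical, and the sketch relies on constant propagation of M to stay type-stable):

using SIMD

# Base case: a single lane remains.
hsum(v::Vec{1,T}) where {T} = v[1]

# Split into halves, add lane-wise, and recurse on the half-width vector.
function hsum(v::Vec{N,T}) where {N,T}
    M = N ÷ 2
    lo = shufflevector(v, Val(ntuple(i -> i - 1, M)))      # lanes 1:M   (masks are 0-based)
    hi = shufflevector(v, Val(ntuple(i -> M + i - 1, M)))  # lanes M+1:N
    return hsum(lo + hi)
end

dot3(A::Vec{8,Float32}, B::Vec{8,Float32}) = hsum(A * B)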

Have I misused the sum operation, or is it not (yet?) intended to generate this kind of "hierarchical" reduction?
Is there some other way to get the vhaddps (or similar) instructions here? (Is this a missing-@fastmath problem?)

Kind regards,
Christian

Julia Version 1.7.1
Commit ac5cc99908* (2021-12-22 19:35 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.0 (ORCJIT, skylake)
Environment:
  JULIA_LOAD_PATH = /usr/share/gmsh/api/julia/:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 4
@KristofferC
Collaborator

KristofferC commented Jan 9, 2022

Right now, we simply call the respective vector reduction intrinsic from LLVM: https://llvm.org/docs/LangRef.html#vector-reduction-intrinsics. However, there is this note:

If the intrinsic call has the ‘reassoc’ flag set, then the reduction will not preserve the associativity of an equivalent scalarized counterpart. Otherwise the reduction will be sequential, thus implying that the operation respects the associativity of a scalarized reduction.

So maybe we should set that. Would be a nice first PR to add it.
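
For illustration, a rough sketch (mine, not how this package is actually implemented) of calling that intrinsic with the flag set via llvmcall; the intrinsic name assumes LLVM ≥ 12, and NTuple{8,VecElement{Float32}} is the layout Julia uses to pass vectors to LLVM:

const V8F32 = NTuple{8,Core.VecElement{Float32}}

# `reassoc` permits LLVM to re-associate the adds, enabling the tree-shaped reduction.
sum_reassoc(v::V8F32) = Base.llvmcall((
    "declare float @llvm.vector.reduce.fadd.v8f32(float, <8 x float>)",
    """
    %s = call reassoc float @llvm.vector.reduce.fadd.v8f32(float 0.000000e+00, <8 x float> %0)
    ret float %s
    """), Float32, Tuple{V8F32}, v)

sum_reassoc(ntuple(i -> Core.VecElement(Float32(i)), 8))  # 36.0f0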

@mchristianl
Author

For the PR: I wonder why this reassoc behavior is not triggered by adding @fastmath to the dot1 function?

@KristofferC
Collaborator

KristofferC commented Jan 10, 2022

@fastmath just rewrites certain calls to their _fast-suffixed counterparts:

julia> @macroexpand @fastmath sum(a*b+c)
:(sum(Base.FastMath.add_fast(Base.FastMath.mul_fast(a, b), c)))

But sum is not one of the symbols it rewrites, and even if it were, we haven't defined a sum_fast in this package.
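
One can check this against the lookup table @fastmath consults (true at least as of Julia 1.7):

julia> haskey(Base.FastMath.fast_op, :sum)
false

julia> Base.FastMath.fast_op[:+]
:add_fast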

@KristofferC
Collaborator

I would say this is closed by #92.

@mchristianl
Copy link
Author

Thank you!
