
reduction operations like sum(::Vec) generate scalar code #91

Closed
mchristianl opened this issue Jan 9, 2022 · 5 comments


@mchristianl

Hi,

I am learning about SIMD implementation. In the following example, an eight-component dot product is implemented:

dot1(A::Vec{8,Float32},B::Vec{8,Float32}) = sum(A * B)

Here, the multiplication generates a single AVX256 vmulps instruction, and a reduction needs to be performed for the "horizontal" sum. I would expect a hierarchical approach, but looking at the generated assembly, one can see that the sum is basically performed in a scalar manner:

vmovups  (%rdi), %ymm0        ; ymm0  = A[0,…,7]
vmulps   (%rsi), %ymm0, %ymm0 ; ymm0 *= B[0,…,7]      ⇒ C[0,…,7]
vxorps    %xmm1, %xmm1, %xmm1 ; xmm1  = 0 (break the dependency on xmm1's previous value)
vaddss    %xmm1, %xmm0, %xmm1 ; xmm1  = xmm0 + 0      ⇒ C[0,1,2,3]
vmovshdup %xmm0, %xmm2        ; xmm2  = xmm0[1,1,3,3] ⇒ C[1,1,3,3]
vaddss    %xmm2, %xmm1, %xmm1 ; xmm1 += xmm2          ⇒ C[0+1,…,…,…]
vpermilpd    $1, %xmm0, %xmm2 ; xmm2  = xmm0[1,0]     ⇒ C[2,3,0,1]
vaddss    %xmm2, %xmm1, %xmm1 ; xmm1 += xmm2          ⇒ C[0+1+2,…,…,…]
vpermilps  $255, %xmm0, %xmm2 ; xmm2  = xmm0[3,3,3,3] ⇒ C[3,3,3,3]
vaddss    %xmm2, %xmm1, %xmm1 ; xmm1 += xmm2          ⇒ C[0+1+2+3,…,…,…]
vextractf128 $1, %ymm0, %xmm0 ; xmm0  = ymm0[4,5,6,7] ⇒ C[4,5,6,7]
vaddss    %xmm0, %xmm1, %xmm1 ; xmm1 += xmm0          ⇒ C[0+1+2+3+4,…,…,…]
vmovshdup %xmm0, %xmm2        ; xmm2  = xmm0[1,1,3,3] ⇒ C[5,5,7,7]
vaddss    %xmm2, %xmm1, %xmm1 ; xmm1 += xmm2          ⇒ C[0+1+2+3+4+5,…,…,…]
vpermilpd    $1, %xmm0, %xmm2 ; xmm2  = xmm0[1,0]     ⇒ C[6,7,4,5]
vaddss    %xmm2, %xmm1, %xmm1 ; xmm1 += xmm2          ⇒ C[0+1+2+3+4+5+6,…,…,…]
vpermilps  $255, %xmm0, %xmm0 ; xmm0  = xmm0[3,3,3,3] ⇒ C[7,7,7,7]
vaddss    %xmm0, %xmm1, %xmm0 ; xmm0 += xmm1          ⇒ C[0+1+2+3+4+5+6+7,…,…,…]
vzeroupper                    ; ymm*[4,5,6,7] = 0
retq                          ; result is the lowest part of xmm0
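
For reference, a listing like the one above can be reproduced along the following lines (a sketch; the debuginfo/syntax options and the example inputs are my choice, not part of the original report):

using SIMD, InteractiveUtils

dot1(A::Vec{8,Float32}, B::Vec{8,Float32}) = sum(A * B)

A = Vec{8,Float32}(ntuple(Float32, 8))  # (1.0f0, 2.0f0, …, 8.0f0)
B = Vec{8,Float32}(ntuple(Float32, 8))

@code_native debuginfo=:none syntax=:att dot1(A, B)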

however, when the reduction is typed out manually

function dot2(A::Vec{8,Float32},B::Vec{8,Float32})
  v0 = A*B
  l0 = Vec(v0[1],v0[3],v0[5],v0[7])
  r0 = Vec(v0[2],v0[4],v0[6],v0[8])
  v1 = l0+r0
  l1 = Vec(v1[1],v1[3])
  r1 = Vec(v1[2],v1[4])
  v2 = l1+r1
  return v2[1]+v2[2]
end

then two fancy vhaddps instructions are generated and the performance increases a lot:

vmovups  (%rdi), %ymm0        ; ymm0  = A[0,…,7]
vmulps   (%rsi), %ymm0, %ymm0 ; ymm0 *= B[0,…,7]      ⇒ C[0,…,7]
vextractf128 $1, %ymm0, %xmm1 ; xmm1  = ymm0[4,5,6,7] ⇒ C[4,5,6,7]
vhaddps   %xmm1, %xmm0, %xmm0 ; xmm0  = ...           ⇒ C[5+4,7+6,1+0,3+2]
vhaddps   %xmm0, %xmm0, %xmm0 ; xmm0  = ...           ⇒ C[7+6+5+4,3+2+1+0,7+6+5+4,3+2+1+0]
vmovshdup %xmm0, %xmm1        ; xmm1  = xmm0[1,1,3,3] ⇒ C[3+2+1+0,3+2+1+0,3+2+1+0,3+2+1+0]
vaddss    %xmm1, %xmm0, %xmm0 ; xmm0 += xmm1          ⇒ C[7+6+5+4+3+2+1+0,…,…,…]
vzeroupper                    ; ymm*[4,5,6,7] = 0
retq                          ; result is the lowest part of xmm0
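
The manual version above is hard-coded for width 8. For what it's worth, a width-generic variant of the same pairwise scheme could look like this (a sketch of mine built on SIMD.jl's shufflevector, assuming a power-of-two width; the dot3 name is hypothetical, and the sketch relies on constant propagation of M to stay type-stable):

using SIMD

# Base case: a single lane remains.
hsum(v::Vec{1,T}) where {T} = v[1]

# Split into halves, add lane-wise, and recurse on the half-width vector.
function hsum(v::Vec{N,T}) where {N,T}
    M = N ÷ 2
    lo = shufflevector(v, Val(ntuple(i -> i - 1, M)))      # lanes 1:M   (masks are 0-based)
    hi = shufflevector(v, Val(ntuple(i -> M + i - 1, M)))  # lanes M+1:N
    return hsum(lo + hi)
end

dot3(A::Vec{8,Float32}, B::Vec{8,Float32}) = hsum(A * B)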

Have I misused the sum operation, or is it not (yet?) intended to generate this kind of "hierarchical" reduction?
Is there some other way to get the vhaddps (or similar) instructions here? (Is this a missing-@fastmath problem?)

Kind regards,
Christian

Julia Version 1.7.1
Commit ac5cc99908* (2021-12-22 19:35 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.0 (ORCJIT, skylake)
Environment:
  JULIA_LOAD_PATH = /usr/share/gmsh/api/julia/:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 4
@KristofferC
Collaborator

KristofferC commented Jan 9, 2022

Right now, we simply call the respective vector reduction intrinsic from LLVM: https://llvm.org/docs/LangRef.html#vector-reduction-intrinsics. However, there is this note:

If the intrinsic call has the ‘reassoc’ flag set, then the reduction will not preserve the associativity of an equivalent scalarized counterpart. Otherwise the reduction will be sequential, thus implying that the operation respects the associativity of a scalarized reduction.

So maybe we should set that. Would be a nice first PR to add it.
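
For illustration, a rough sketch (mine, not how this package is actually implemented) of calling that intrinsic with the flag set via llvmcall; the intrinsic name assumes LLVM ≥ 12, and NTuple{8,VecElement{Float32}} is the layout Julia uses to pass vectors to LLVM:

const V8F32 = NTuple{8,Core.VecElement{Float32}}

# `reassoc` permits LLVM to re-associate the adds, enabling the tree-shaped reduction.
sum_reassoc(v::V8F32) = Base.llvmcall((
    "declare float @llvm.vector.reduce.fadd.v8f32(float, <8 x float>)",
    """
    %s = call reassoc float @llvm.vector.reduce.fadd.v8f32(float 0.000000e+00, <8 x float> %0)
    ret float %s
    """), Float32, Tuple{V8F32}, v)

sum_reassoc(ntuple(i -> Core.VecElement(Float32(i)), 8))  # 36.0f0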

@mchristianl
Author

For the PR: I wonder why this reassoc behavior is not triggered by adding @fastmath to the dot1 function?

@KristofferC
Collaborator

KristofferC commented Jan 10, 2022

@fastmath just rewrites certain calls to their _fast-suffixed counterparts:

julia> @macroexpand @fastmath sum(a*b+c)
:(sum(Base.FastMath.add_fast(Base.FastMath.mul_fast(a, b), c)))

But sum is not one of the symbols it rewrites, and even if it were, we haven't defined a sum_fast in this package.
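
One can check this against the lookup table @fastmath consults (true at least as of Julia 1.7):

julia> haskey(Base.FastMath.fast_op, :sum)
false

julia> Base.FastMath.fast_op[:+]
:add_fast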

@KristofferC
Collaborator

I would say this is closed by #92.

@mchristianl
Copy link
Author

Thank you!
