
Regression in code quality for horizontal add after 70a54bca6f #94546

Closed
dyung opened this issue Jun 5, 2024 · 9 comments · Fixed by #114101

Comments

@dyung
Collaborator

dyung commented Jun 5, 2024

We have an internal test which checks whether the compiler generates horizontal add instructions for certain cases. We recently noticed that for one of the cases, the generated code seems to have gotten worse after a recent change, 70a54bc.

Consider the following code:

#include <immintrin.h>

__attribute__((noinline))
__m256d add_pd_004(__m256d a, __m256d b) {
  __m256d r = (__m256d){ a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3] };
  return __builtin_shufflevector(r, a, 0, -1, -1, 3);
}

If compiled with optimizations targeting btver2 (-S -O2 -march=btver2), the compiler previously generated the following code:

        vhaddpd ymm0, ymm0, ymm1
        ret

But after 70a54bc, the compiler now generates worse code:

        vextractf128    xmm1, ymm1, 1
        vhaddpd xmm0, xmm0, xmm1
        vinsertf128     ymm0, ymm0, xmm0, 1
        ret

@alexey-bataev, this was your change, can you take a look to see if there is a way we can avoid the regression in code quality in this case?

@dyung
Collaborator Author

dyung commented Jun 5, 2024

Godbolt link: https://godbolt.org/z/71Pnv5T7T

@alexey-bataev
Member

LLVM IR is better:

18.1.0:

define dso_local noundef <4 x double> @add_pd_004(<4 x double> noundef %a, <4 x double> noundef %b) local_unnamed_addr {
entry:
  %vecext = extractelement <4 x double> %a, i64 0
  %vecext1 = extractelement <4 x double> %a, i64 1
  %add = fadd double %vecext, %vecext1
  %0 = insertelement <4 x double> poison, double %add, i64 0
  %vecext10 = extractelement <4 x double> %b, i64 2
  %vecext11 = extractelement <4 x double> %b, i64 3
  %add12 = fadd double %vecext10, %vecext11
  %shuffle = insertelement <4 x double> %0, double %add12, i64 3
  ret <4 x double> %shuffle
}

Trunk:

define dso_local noundef <4 x double> @add_pd_004(<4 x double> noundef %a, <4 x double> noundef %b) local_unnamed_addr {
entry:
  %0 = shufflevector <4 x double> %a, <4 x double> %b, <2 x i32> <i32 0, i32 6>
  %1 = shufflevector <4 x double> %a, <4 x double> %b, <2 x i32> <i32 1, i32 7>
  %2 = fadd <2 x double> %0, %1
  %3 = shufflevector <2 x double> %2, <2 x double> poison, <4 x i32> <i32 0, i32 poison, i32 poison, i32 1>
  ret <4 x double> %3
}

declare void @llvm.dbg.value(metadata, metadata, metadata) #1

Looks like codegen previously recognized the pattern, but currently does not.

@RKSimon
Collaborator

RKSimon commented Jun 10, 2024

But this would have been even simpler:

define <4 x double> @add_pd_004(<4 x double> noundef %a, <4 x double> noundef %b)  {
entry:
  %0 = shufflevector <4 x double> %a, <4 x double> %b, <4 x i32> <i32 0, i32 poison, i32 poison, i32 6>
  %1 = shufflevector <4 x double> %a, <4 x double> %b, <4 x i32> <i32 1, i32 poison, i32 poison, i32 7>
  %2 = fadd <4 x double> %0, %1
  ret <4 x double> %2
}

@RKSimon RKSimon self-assigned this Jun 10, 2024
@alexey-bataev
Member

Still, it looks like some kind of cost issue; I will double-check in a couple of weeks.

@RKSimon
Collaborator

RKSimon commented Jun 10, 2024

Yes, it's purely down to the shuffle costs not recognizing that the v4f64 shuffle mask doesn't cross 128-bit lanes, so the worst-case cost is much higher than necessary.

@RKSimon
Collaborator

RKSimon commented Jun 18, 2024

@alexey-bataev SLP is only calling getShuffleCost 3 times:

CostA <4 x double> <i32 0, i32 undef, i32 undef, i32 1> = 2
CostB <2 x double> <i32 0, i32 2> = 1
CostC <2 x double> <i32 1, i32 3> = 1

But then creates:

  %0 = shufflevector <4 x double> %a, <4 x double> %b, <2 x i32> <i32 0, i32 6>
  %1 = shufflevector <4 x double> %a, <4 x double> %b, <2 x i32> <i32 1, i32 7>
  %2 = fadd <2 x double> %0, %1
  %3 = shufflevector <2 x double> %2, <2 x double> poison, <4 x i32> <i32 0, i32 poison, i32 poison, i32 1>

The costs of %0 and %1 will be different from CostB / CostC. Have we lost the cost of an additional extract_subvector someplace?

@alexey-bataev
Member

I assume so, will check next week, after PTO

@alexey-bataev
Member

Ok, I prepared a fix, but it looks like even with the fixed version of the code this new vectorized form still looks preferable. The problem is that the SLP vectorizer does not know that this scalar form can be converted to just a vhaddpd. So it calculates the cost of the scalars and considers them as removed from the code (along with the insertelement instructions), and after that it still considers the vectorized version the more profitable one.

But this would have been even simpler:

define <4 x double> @add_pd_004(<4 x double> noundef %a, <4 x double> noundef %b)  {
entry:
  %0 = shufflevector <4 x double> %a, <4 x double> %b, <4 x i32> <i32 0, i32 poison, i32 poison, i32 6>
  %1 = shufflevector <4 x double> %a, <4 x double> %b, <4 x i32> <i32 1, i32 poison, i32 poison, i32 7>
  %2 = fadd <4 x double> %0, %1
  ret <4 x double> %2
}

The vectorizer currently cannot generate this. It sees 2 insertelement instructions and therefore operates on 2-element vectors.

@alexey-bataev
Member

The cost before the fix:
--- !Passed
Pass: slp-vectorizer
Name: VectorizedList
Function: test
Args:

  • String: 'SLP vectorized with cost '
  • Cost: '-4'
  • String: ' and with tree size '
  • TreeSize: '4'
    ...

After the fix:
Pass: slp-vectorizer
Name: VectorizedList
Function: test
Args:

  • String: 'SLP vectorized with cost '
  • Cost: '-2'
  • String: ' and with tree size '
  • TreeSize: '4'

Because there are just 2 vector extracts, they each add 1 to the cost.

RKSimon added a commit to RKSimon/llvm-project that referenced this issue Oct 29, 2024
… --> "binop (shuffle), (shuffle)"

Add foldPermuteOfBinops - to fold a permute (single source shuffle) through a binary op that is being fed by other shuffles.

WIP - still need to add additional test coverage.

Fixes llvm#94546
smallp-o-p pushed a commit to smallp-o-p/llvm-project that referenced this issue Nov 3, 2024
…"binop (shuffle), (shuffle)" (llvm#114101)

Add foldPermuteOfBinops - to fold a permute (single source shuffle) through a binary op that is being fed by other shuffles.

Fixes llvm#94546
Fixes llvm#49736