
Regression in code quality for horizontal add after 70a54bca6f #94546

Closed
dyung opened this issue Jun 5, 2024 · 9 comments · Fixed by #114101

Comments

@dyung
Collaborator

dyung commented Jun 5, 2024

We have an internal test which checks whether the compiler generates horizontal add instructions for certain cases. We recently noticed that for one of the cases, the generated code seems to have gotten worse after a recent change, 70a54bc.

Consider the following code:

#include <immintrin.h>

__attribute__((noinline))
__m256d add_pd_004(__m256d a, __m256d b) {
  __m256d r = (__m256d){ a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3] };
  return __builtin_shufflevector(r, a, 0, -1, -1, 3);
}

If compiled with optimizations targeting btver2 (-S -O2 -march=btver2), the compiler previously generated the following code:

        vhaddpd ymm0, ymm0, ymm1
        ret

But after 70a54bc, the compiler now generates worse code:

        vextractf128    xmm1, ymm1, 1
        vhaddpd xmm0, xmm0, xmm1
        vinsertf128     ymm0, ymm0, xmm0, 1
        ret

@alexey-bataev, this was your change, can you take a look to see if there is a way we can avoid the regression in code quality in this case?

@dyung
Collaborator Author

dyung commented Jun 5, 2024

Godbolt link: https://godbolt.org/z/71Pnv5T7T

@alexey-bataev
Member

LLVM IR is better:

18.1.0:

define dso_local noundef <4 x double> @add_pd_004(<4 x double> noundef %a, <4 x double> noundef %b) local_unnamed_addr {
entry:
  %vecext = extractelement <4 x double> %a, i64 0
  %vecext1 = extractelement <4 x double> %a, i64 1
  %add = fadd double %vecext, %vecext1
  %0 = insertelement <4 x double> poison, double %add, i64 0
  %vecext10 = extractelement <4 x double> %b, i64 2
  %vecext11 = extractelement <4 x double> %b, i64 3
  %add12 = fadd double %vecext10, %vecext11
  %shuffle = insertelement <4 x double> %0, double %add12, i64 3
  ret <4 x double> %shuffle
}

Trunk:

define dso_local noundef <4 x double> @add_pd_004(<4 x double> noundef %a, <4 x double> noundef %b) local_unnamed_addr {
entry:
  %0 = shufflevector <4 x double> %a, <4 x double> %b, <2 x i32> <i32 0, i32 6>
  %1 = shufflevector <4 x double> %a, <4 x double> %b, <2 x i32> <i32 1, i32 7>
  %2 = fadd <2 x double> %0, %1
  %3 = shufflevector <2 x double> %2, <2 x double> poison, <4 x i32> <i32 0, i32 poison, i32 poison, i32 1>
  ret <4 x double> %3
}

declare void @llvm.dbg.value(metadata, metadata, metadata) #1

Looks like codegen previously recognized the pattern, but currently does not.

@RKSimon
Collaborator

RKSimon commented Jun 10, 2024

But this would have been even simpler:

define <4 x double> @add_pd_004(<4 x double> noundef %a, <4 x double> noundef %b)  {
entry:
  %0 = shufflevector <4 x double> %a, <4 x double> %b, <4 x i32> <i32 0, i32 poison, i32 poison, i32 6>
  %1 = shufflevector <4 x double> %a, <4 x double> %b, <4 x i32> <i32 1, i32 poison, i32 poison, i32 7>
  %2 = fadd <4 x double> %0, %1
  ret <4 x double> %2
}

@RKSimon RKSimon self-assigned this Jun 10, 2024
@alexey-bataev
Member

Still, it looks like some kind of cost issue; I will double-check in a couple of weeks.

@RKSimon
Collaborator

RKSimon commented Jun 10, 2024

Yes, it's purely down to the shuffle costs not recognizing that the v4f64 shuffle mask doesn't cross 128-bit lanes, so the worst-case cost is much higher than necessary.

@RKSimon
Collaborator

RKSimon commented Jun 18, 2024

@alexey-bataev SLP is only calling getShuffleCost 3 times:

CostA <4 x double> <i32 0, i32 undef, i32 undef, i32 1> = 2
CostB <2 x double> <i32 0, i32 2> = 1
CostC <2 x double> <i32 1, i32 3> = 1

But then creates:

  %0 = shufflevector <4 x double> %a, <4 x double> %b, <2 x i32> <i32 0, i32 6>
  %1 = shufflevector <4 x double> %a, <4 x double> %b, <2 x i32> <i32 1, i32 7>
  %2 = fadd <2 x double> %0, %1
  %3 = shufflevector <2 x double> %2, <2 x double> poison, <4 x i32> <i32 0, i32 poison, i32 poison, i32 1>

The costs of %0 and %1 will be different from CostB / CostC. Have we lost the cost of an additional extract_subvector someplace?

@alexey-bataev
Member

I assume so, will check next week, after PTO

@alexey-bataev
Member

Ok, I prepared a fix, but it looks like even with the fixed version of the code this new vectorized form still looks preferable. The problem is that the SLP vectorizer does not know that this scalar form can be converted to just a vhaddpd. So it calculates the cost of the scalars and considers them as removed from the code (along with the insertelement instructions), and after that it still considers the vectorized version the more profitable one.

But this would have been even simpler:

define <4 x double> @add_pd_004(<4 x double> noundef %a, <4 x double> noundef %b)  {
entry:
  %0 = shufflevector <4 x double> %a, <4 x double> %b, <4 x i32> <i32 0, i32 poison, i32 poison, i32 6>
  %1 = shufflevector <4 x double> %a, <4 x double> %b, <4 x i32> <i32 1, i32 poison, i32 poison, i32 7>
  %2 = fadd <4 x double> %0, %1
  ret <4 x double> %2
}

The vectorizer currently cannot generate this. It sees 2 insertelement instructions and therefore operates on 2-element vectors.

@alexey-bataev
Member

The cost before the fix:
--- !Passed
Pass: slp-vectorizer
Name: VectorizedList
Function: test
Args:

  • String: 'SLP vectorized with cost '
  • Cost: '-4'
  • String: ' and with tree size '
  • TreeSize: '4'
    ...

After the fix:
Pass: slp-vectorizer
Name: VectorizedList
Function: test
Args:

  • String: 'SLP vectorized with cost '
  • Cost: '-2'
  • String: ' and with tree size '
  • TreeSize: '4'

Because there are just 2 vector extracts, they each add 1 to the cost.

RKSimon added a commit to RKSimon/llvm-project that referenced this issue Oct 29, 2024
… --> "binop (shuffle), (shuffle)"

Add foldPermuteOfBinops - to fold a permute (single source shuffle) through a binary op that is being fed by other shuffles.

WIP - still need to add additional test coverage.

Fixes llvm#94546
smallp-o-p pushed a commit to smallp-o-p/llvm-project that referenced this issue Nov 3, 2024
…"binop (shuffle), (shuffle)" (llvm#114101)

Add foldPermuteOfBinops - to fold a permute (single source shuffle) through a binary op that is being fed by other shuffles.

Fixes llvm#94546
Fixes llvm#49736