Regression in code quality for horizontal add after 70a54bca6f #94546
Godbolt link: https://godbolt.org/z/71Pnv5T7T
LLVM IR is better:
Trunk:
Looks like codegen previously recognized the pattern, but currently does not.
But this would have been even simpler:
define <4 x double> @add_pd_004(<4 x double> noundef %a, <4 x double> noundef %b) {
entry:
%0 = shufflevector <4 x double> %a, <4 x double> %b, <4 x i32> <i32 0, i32 poison, i32 poison, i32 6>
%1 = shufflevector <4 x double> %a, <4 x double> %b, <4 x i32> <i32 1, i32 poison, i32 poison, i32 7>
%2 = fadd <4 x double> %0, %1
ret <4 x double> %2
}
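For reference - this is my reading of the IR above rather than anything stated explicitly in the thread - that form maps directly onto vhaddpd: with %a in ymm0 and %b in ymm1, vhaddpd %ymm1, %ymm0, %ymm0 produces [a0+a1, b0+b1, a2+a3, b2+b3], so lanes 0 and 3 already match and the two poison lanes are don't-cares, making a single horizontal add a valid lowering.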
Still, looks like some kind of cost issue, will double check in a couple of weeks.
Yes, it's purely down to the shuffle costs not recognizing that the v4f64 shuffle mask doesn't cross 128-bit lanes, so the worst-case cost is much higher than necessary.
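To make the lane-crossing point concrete, here is a small illustration (mine, not from the thread): on AVX a v4f64 value splits into 128-bit lanes holding elements {0,1} and {2,3}, and the masks in the simpler IR above keep every element inside its lane.

; In-lane two-source shuffle: dst[0] takes a[0] (low lane to low lane) and dst[3]
; takes b[2] (high lane to high lane), so nothing crosses a 128-bit boundary and
; this stays in the vshufpd/vblendpd cost class.
%cheap = shufflevector <4 x double> %a, <4 x double> %b, <4 x i32> <i32 0, i32 poison, i32 poison, i32 6>

; Lane-crossing shuffle for contrast: dst[0] takes a[2] from the high lane, which
; needs a cross-lane permute such as vperm2f128 - the worst case the cost model
; was assuming for every v4f64 shuffle.
%costly = shufflevector <4 x double> %a, <4 x double> %b, <4 x i32> <i32 2, i32 poison, i32 poison, i32 4>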
@alexey-bataev SLP is only calling getShuffleCost 3 times:
But then creates:
The costs of %0 and %1 will be different from CostB / CostC - have we lost the cost of an additional extract_subvector someplace?
I assume so, will check next week, after PTO.
Ok, I prepared a fix, but it looks like even with the fixed version of the code this new vectorized form still looks preferable. The problem is that the SLP vectorizer does not know that this scalar form can be converted to just a vhaddpd. So it calculates the cost of the scalars and considers them as removed in the code (along with the insertelement instructions), and after that it still considers the vectorized version the more profitable one.
The vectorizer currently cannot generate this. It sees 2 insertelement instructions and therefore operates on 2-element vectors.
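For readers following along, the 2-wide form being described has roughly this shape (my reconstruction from the comments above, using a hypothetical @add_pd_004_slp name - it is not copied verbatim from the issue):

define <4 x double> @add_pd_004_slp(<4 x double> noundef %a, <4 x double> noundef %b) {
entry:
  ; SLP vectorizes the two scalar adds as a single <2 x double> add...
  %0 = shufflevector <4 x double> %a, <4 x double> %b, <2 x i32> <i32 0, i32 6>
  %1 = shufflevector <4 x double> %a, <4 x double> %b, <2 x i32> <i32 1, i32 7>
  %2 = fadd <2 x double> %0, %1
  ; ...and then needs an extra single-source permute to scatter the two results
  ; back into lanes 0 and 3 of the <4 x double> return value.
  %3 = shufflevector <2 x double> %2, <2 x double> poison, <4 x i32> <i32 0, i32 poison, i32 poison, i32 1>
  ret <4 x double> %3
}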
The cost before the fix:
After the fix:
Because there are just 2 vector extracts, they add just 1 and 1 to the cost.
… --> "binop (shuffle), (shuffle)" Add foldPermuteOfBinops - to fold a permute (single source shuffle) through a binary op that is being fed by other shuffles. WIP - still need to add additional test coverage. Fixes llvm#94546
… --> "binop (shuffle), (shuffle)" Add foldPermuteOfBinops - to fold a permute (single source shuffle) through a binary op that is being fed by other shuffles. WIP - still need to add additional test coverage. Fixes llvm#94546
… --> "binop (shuffle), (shuffle)" Add foldPermuteOfBinops - to fold a permute (single source shuffle) through a binary op that is being fed by other shuffles. WIP - still need to add additional test coverage. Fixes llvm#94546
… --> "binop (shuffle), (shuffle)" Add foldPermuteOfBinops - to fold a permute (single source shuffle) through a binary op that is being fed by other shuffles. WIP - still need to add additional test coverage. Fixes llvm#94546
… --> "binop (shuffle), (shuffle)" Add foldPermuteOfBinops - to fold a permute (single source shuffle) through a binary op that is being fed by other shuffles. WIP - still need to add additional test coverage. Fixes llvm#94546
…"binop (shuffle), (shuffle)" Add foldPermuteOfBinops - to fold a permute (single source shuffle) through a binary op that is being fed by other shuffles. Fixes llvm#94546
…"binop (shuffle), (shuffle)" Add foldPermuteOfBinops - to fold a permute (single source shuffle) through a binary op that is being fed by other shuffles. Fixes llvm#94546
…"binop (shuffle), (shuffle)" (llvm#114101) Add foldPermuteOfBinops - to fold a permute (single source shuffle) through a binary op that is being fed by other shuffles. Fixes llvm#94546 Fixes llvm#49736
…"binop (shuffle), (shuffle)" (llvm#114101) Add foldPermuteOfBinops - to fold a permute (single source shuffle) through a binary op that is being fed by other shuffles. Fixes llvm#94546 Fixes llvm#49736
We have an internal test which checks whether the compiler generates horizontal add instructions for certain cases. Recently we noticed that for one of these cases the generated code seems to have gotten worse after a recent change, 70a54bc.
Consider the following code:
If compiled with optimizations targeting btver2 (-S -O2 -march=btver2), the compiler previously generated the following code:
But after 70a54bc, the compiler is now generating worse code:
@alexey-bataev, this was your change; can you take a look to see if there is a way we can avoid the regression in code quality in this case?