Reduce repeated variable add operations to a single multiply. #61663
Conversation
Tagging subscribers to this area: @JulieLeeMSFT
Nice! Can you make sure this code is correctly folded:

int Test(int x)
{
    for (int i = 0; i < Vector<int>.Count; i++)
        x++;
    return x;
}

Also, this case: #46257 (comment)

UPD: Ah, it's a bit different - these are separate statements.
Yes, though I am interested in the combination of strength reduction with loop unrolling in general. However, this is definitely a peephole optimization for a single add expression.
Will it also reduce …?
No, it will only address replicated addition, at least as a first pass, per the discussion on #34938.
The pattern pops up a little in the libraries tests (please see above); for the regression, please see #61662. @kunalspathak @EgorBo, what do you think? Is this peep worth keeping to match the behavior of LLVM/MSVC (https://godbolt.org/z/4ednM7WaE)? It may also be a start for further strength-reduction patterns, e.g., @kasperk81's pattern above.
It is true that there are not many patterns in .NET libraries that will benefit from this. I would still keep it, provided we understand the regressions happening in some benchmark methods and try to eliminate them. They happen on both x64 and arm64.
src/coreclr/jit/gentree.cpp
Outdated
if (vnStore != nullptr)
{
    fgValueNumberTreeConst(consTree);
    fgValueNumberTree(morphed);
The morphed tree should have the same VN as the original: morphed->SetVNsFromNode(tree). It is not a good idea, in general, to call fgValueNumberTree outside of VN; it has not been tested to work that way.

Also, there is a helper you can use here to update the VN of the constant tree: fgUpdateConstTreeValueNumber.

Also, if you don't intend to relax the fgGlobalMorph check under which this method runs right now, you don't need to maintain VNs at all.
Thanks for pointing this out. This is mostly my unfamiliarity with the code --- I didn't realize the difference between the fgGlobalMorph check and the CSE check, i.e., when the optimizations are valid in which phases.

Looking at the surrounding code in fgMorphSmpOp, I see cases where the optimizations are restricted to global morph, and cases where they are not. Is there a recommended pattern here, or do you have any tips?
> Looking at the surrounding code in fgMorphSmpOp, I see cases where the optimizations are restricted to global morph, and cases where they are not. Is there a recommended pattern here, or do you have any tips?

It is mostly a cost/benefit question.

But first, legality:

- Only optimizations that preserve value numbers are allowed outside of fgGlobalMorph. Allowable VN manipulation is transferring an existing VN from an existing tree, and updating VNs for constants. Anything more complex is probably better left under fgGlobalMorph.
- This optimization can trivially preserve value numbers (the VN of the tree stays the same, as does the VN of the local; the VN of the constant gets updated).

Allowing optimizations outside of global morph is beneficial because the optimization phases discover new facts that help peepholes, primarily via constant propagation. If your optimization is unlikely to benefit from that, it is likely better to leave it under fgGlobalMorph and not worry about VN maintenance / CSEs. Otherwise, it is up to the judgement of the implementer and reviewers what decision is "better" CQ-vs-risk-of-bugs-wise (which, I note, we discover on a somewhat regular basis...).

I personally think it is fine to leave this under fgGlobalMorph and delete the VN code. The only new fact that optimizations can discover and which can help the transformation is a local being substituted via copy propagation, but it seems like a rather rare event, and we do not remorph in copy propagation, so it would take a random CSE/assertion to force the remorph and realize the benefit.

(Heh, I should add this to my guide.)
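For concreteness, a minimal sketch of VN-preserving maintenance along the lines described above, using only the helpers named in this thread; this is an illustration under those assumptions, not the actual patch:

// Hypothetical sketch: fold ADD(ADD(..., V0), V0) into MUL(V0, foldCount) while
// keeping value numbers consistent.
GenTree* consTree = tree->AsOp()->gtOp1;
consTree->BashToConst(foldCount, lclVarTree->TypeGet());
GenTree* morphed = gtNewOperNode(GT_MUL, tree->TypeGet(), lclVarTree, consTree);
if (vnStore != nullptr)
{
    fgUpdateConstTreeValueNumber(consTree); // constants get their VNs updated in place
    morphed->SetVNsFromNode(tree);          // the new MUL keeps the original ADD's VN
}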
src/coreclr/jit/gentree.cpp
Outdated
GenTree* consTree = tree->AsOp()->gtOp1;
consTree->BashToConst(foldCount, lclVarTree->TypeGet());

/* GT_MUL operators can later be transformed into 'GT_CALL' */
- /* GT_MUL operators can later be transformed into 'GT_CALL' */
+ // GT_MUL operators can later be transformed into 'GT_CALL'.

Nit: we prefer //-style comments.
This suggests to me that this is actually a pessimization in this case. Probably we should bail if we see 64-bit ADDs on a 32-bit target (the only case where we use helpers). We may miss some power-of-two cases, but this is the price of doing this in pre-order, I suppose.
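One possible shape for that bail-out, sketched here as an assumption rather than the actual change:

#ifndef TARGET_64BIT
    // Hypothetical early-out: on 32-bit targets a 64-bit multiply may be lowered to a
    // helper call, so folding repeated 64-bit ADDs into a MUL could be a pessimization.
    if (varTypeIsLong(tree))
    {
        return tree;
    }
#endif // !TARGET_64BIT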
src/coreclr/jit/gentree.cpp
Outdated
GenTree* op1 = tree->AsOp()->gtOp1;
GenTree* op2 = tree->AsOp()->gtOp2;

if (!op2->OperIs(GT_LCL_VAR) || !varTypeIsIntegralOrI(op2))
- if (!op2->OperIs(GT_LCL_VAR) || !varTypeIsIntegralOrI(op2))
+ if (!op2->OperIs(GT_LCL_VAR) || !varTypeIsIntegral(op2))

It is not valid to multiply byrefs/refs.
src/coreclr/jit/gentree.cpp
Outdated
return tree;
}

genTreeOps oper = tree->OperGet();
Unused?
src/coreclr/jit/gentree.cpp
Outdated
GenTree* Compiler::gtReduceStrength(GenTree* tree)
{
    // ADD(_, V0) starts the pattern match
    if (tree->OperGet() != GT_ADD || tree->gtOverflow() || optValnumCSE_phase)
- if (tree->OperGet() != GT_ADD || tree->gtOverflow() || optValnumCSE_phase)
+ if (!tree->OperIs(GT_ADD) || tree->gtOverflow() || optValnumCSE_phase)

Nit: why not use OperIs everywhere?
The optValnumCSE_phase check is redundant with the fgGlobalMorph guard under which the method is called. I suggest removing it and instead asserting that we are in global morph, if you don't intend to relax that restriction. If you do, then optValnumCSE_phase should, obviously, stay.
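If the fgGlobalMorph-only restriction is kept, the guard could plausibly look like this (a sketch, not the actual patch):

    // Hypothetical: assert the caller's fgGlobalMorph guard instead of re-checking
    // optValnumCSE_phase here.
    assert(fgGlobalMorph);
    if (!tree->OperIs(GT_ADD) || tree->gtOverflow())
    {
        return tree;
    }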
src/coreclr/jit/gentree.cpp
Outdated
// Return Value:
//    reduced tree if pattern matches, original tree otherwise
//
GenTree* Compiler::gtReduceStrength(GenTree* tree)
Nit: the method's name is very general - is this intended to be extended? If it is to stay as it is, it should probably be renamed to something more specific (there is a pattern of fgOptimizeThing in morph) and moved out of gentree.cpp into morph.cpp (not much of a difference these days, I know, but let's not make it worse 😄).
src/coreclr/jit/gentree.cpp
Outdated
}

int foldCount = 0;
unsigned int lclNum = op2->AsLclVarCommon()->GetLclNum();
- unsigned int lclNum = op2->AsLclVarCommon()->GetLclNum();
+ unsigned lclNum = op2->AsLclVarCommon()->GetLclNum();

Nit: we prefer bare unsigned for lclNums.
For the regression (at least in Benchstone.BenchI.CSieve), the repeated-add pattern becomes a multiply for which the range check is no longer eliminated; I created a separate PR, #61662, to address that range-checking limitation.

I'll investigate the asm diffs generated through the CI superpmi and report back to double-check the AMD case.
src/coreclr/jit/morph.cpp
Outdated
// GT_MUL operators can later be transformed into 'GT_CALL'.
GenTree* morphed =
    new (this, GT_CALL) GenTreeOp(GT_MUL, tree->TypeGet(), lclVarTree, consTree DEBUGARG(/*largeNode*/ true));
No longer needed (gtNewOperNode can be used instead)?

(This will need a comment update above, where we give up on 64-bit MULs on 32-bit targets, to note that it is now for both CQ and correctness reasons.)
src/coreclr/jit/compiler.h
Outdated
@@ -6388,6 +6388,9 @@ class Compiler
GenTree* fgMorphCommutative(GenTreeOp* tree);
GenTree* fgMorphCastedBitwiseOp(GenTreeOp* tree);

// Reduce successive add operations to a single multiply, i + i + i + i => i * 4
- // Reduce successive add operations to a single multiply, i + i + i + i => i * 4
Nit: I don't think it is worth repeating the method header. Such duplication tends to lead to stale comments (as people naturally forget to update both places).
src/coreclr/jit/morph.cpp
Outdated
@@ -11581,6 +11581,14 @@ GenTree* Compiler::fgMorphSmpOp(GenTree* tree, MorphAddrContext* mac)
break;
}

// Perform strength reduction prior to preorder op1, op2 operand processing
Nit: I am not convinced adding the comment here has a lot of value; it is fairly obvious from the code what is going on.
32b6a63 to 3aa1847
@SingleAccretion I've addressed your comments in the last commits. Should have covered everything.

@kunalspathak I spent some time digging into the regressions to better understand why they happen with this peep. The following is a detailed description of the regressions.

Results

Windows x64 Benchmarks
Detail diffs

Windows x86 Benchmarks
Detail diffs

Linux x64 Benchmarks
Detail diffs
Explanations

Benchstone.BenchI.CSieve:Test():bool:this

Due to the introduction of a range check that was otherwise eliminated (#61662 addresses this case).

Benchstone.MDBenchF.MDLLoops:Init():this

Init code has this pattern...

which the peep will transform to (essentially)

which effectively transforms

to

so in this case, the strength reduction interferes with CSE.

Benchstone.BenchF.LLoops:Init():this

Similar to MDLLoops, with the relevant code as

except that the way code is generated on x86, an extra

to

StringSort:strsift(System.String[],int,int)

The generated code is effectively the same, even when the

The reason for the change in bytes is that the

BitOps:DoBitfieldIteration(System.Int32[],System.Int32[],int,byref):long

The relevant code is

where the transformed code,

to

Note that, if the index expression is written as
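As a purely hypothetical illustration of the CSE interaction described for MDLLoops (the actual benchmark code is not reproduced here), consider two array indices that share a repeated-add subexpression:

// Hypothetical example only; not the MDLLoops code.
// Before the peep, both indices contain the common subexpression i + i,
// which CSE can compute once and reuse:
int BeforePeep(const int* a, const int* b, int i)
{
    return a[i + i] + b[i + i + i];
}

// After the peep, the indices become i * 2 and i * 3 and no longer share a
// subexpression, so that CSE opportunity is gone:
int AfterPeep(const int* a, const int* b, int i)
{
    return a[i * 2] + b[i * 3];
}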
@kunalspathak can I get an update on this? I'm happy to discuss and make further changes.

Sorry @anthonycanino for not getting to this. I will go through it in 1-2 days and let you know my feedback. Thanks for your patience.
Minor change suggested.
Also, I went through the asmdiffs and they look as expected. There is scope to improve the arm64 code (not in this PR though) that we generate for i * 3. Today we generate:

mov w7, #3
mul w7, w1, w7

but it can be converted to a single instruction:

add w0, w8, w8, lsl #1
src/coreclr/jit/morph.cpp
Outdated
// V0 + V0 ... + V0 becomes V0 * foldCount, where postorder transform will optimize
// accordingly
GenTree* lclVarTree = tree->AsOp()->gtOp2;
Nit: Can you move these definitions before the while loop and then set op1, op2 from them?
GenTree* lclVarTree = tree->AsOp()->gtOp2;
GenTree* consTree = tree->AsOp()->gtOp1;
GenTree *op1 = consTree, *op2 = lclVarTree;
while ()
{
}
`gtReduceStrength` will reduce `i + i + i + i` to `i * 4`, which will be optimized to `i << 2`. The reduction is not limited to powers of two, as `i + i + i + i + i` will reduce to `i * 5`, which will be optimized to `lea reg1, [reg2+4*reg2]` etc. on xarch.

A `GT_MUL` with `TYP_LONG` must go through the GT_MUL preorder processing on non-64-bit architectures. The change forces that to happen by creating a new GenTree node if the GT_ADD can be reduced to GT_MUL, and then running fgMorphTree on the new node.

Also cleaned up the code per suggestions regarding structure (move to morph.cpp) and the fgGlobalMorph check.
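As a rough illustration of the folding this enables (a hypothetical snippet, not code or output taken from this PR; the expected lowerings follow the description above and the LLVM/MSVC godbolt link):

// Hypothetical example of the pattern this peep targets.
int Times4(int i)
{
    return i + i + i + i;     // expected to become i * 4, then i << 2
}

int Times5(int i)
{
    return i + i + i + i + i; // expected to become i * 5, then an lea-style multiply on xarch
}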
3aa1847 to 363a812
@kunalspathak thanks for the feedback. I've made the change as requested.
LGTM
This PR addresses #34938.

Changes

gtReduceStrength will reduce i + i + i + i to i * 4, which will be optimized to i << 2. The reduction is not limited to powers of two, as i + i + i + i + i will reduce to i * 5, which will be optimized to lea reg1, [reg2+4*reg2] etc. on xarch.

Discussion

My gtReduceStrength function is invoked inside fgMorphSmpOp, and is kept as its own function for now per the discussion on the public issue, but it can be inlined inside the preorder processing block in fgMorphSmpOp.

Regarding the regression below, esp. with Benchstone.BenchI.CSieve:Test():bool:this, another PR I have opened, #61662, will fix that case due to the range-checking limitation.

Overall, the improvements are pretty minor here, which isn't surprising, as I wouldn't expect programmers writing benchmarks or libraries to do something like i + i + i + i. However, LLVM/MSVC performs similar optimizations (https://godbolt.org/z/4ednM7WaE).

Results
benchmarks.run.windows.x64.checked.mch:
Detail diffs
coreclr_tests.pmi.windows.x64.checked.mch:
Detail diffs
libraries.pmi.windows.x64.checked.mch:
Detail diffs
libraries_tests.pmi.windows.x64.checked.mch:
Detail diffs