-
Notifications
You must be signed in to change notification settings - Fork 2.7k
Simplify some relop/jtrue related optimizations #14027
Conversation
018cc67
to
bbc477f
Compare
@CarolEidt @pgavlin I keep toying with the idea of lowering more RELOP/JTRUE nodes to CMP/TEST/SETCC/JCC. The main advantage is that various pieces of logic end up concentrated in one place (
The impact on JIT throughput seems to be very small, below 0.05% instructions retired. This is basically below the noise level I get with ETW profiling. And it seems that there are fewer branch mispredictions, but that too is below the noise level. Opinions? Even if this does have impact on JIT throughput would you consider this a worthwhile change due to code simplification? |
It would be interesting to instrument TGU s.t. we can figure out the average search length before it returns. We might be able to get away with limiting the window in which it searches, especially for throughput-oriented scenarios (e.g. minopts/debuggable code).
Can you try getting an instructions retired count using |
Ha ha, I'm still trying to build pin. |
@pgavlin Can you tell me what make/mingw did you use to build pin? It would seem that those makefiles do not work well with mingw at least. They pass compiler parameters using / instead of - and the / gets treated by the mingw shell as if it is a path. I tried fixing the makefile and now building doesn't show any errors anymore, it just hangs. Sheesh. |
Figured it out, I converted one too many I used the icount pin tool and I've got numbers that are similar to what ETW reports but with a smaller (~order of magnitude) standard deviation. Now to analyze the numbers... |
ETW and PIN data here: https://1drv.ms/x/s!Av4baJYSo5pjgrkusSKacdbhZjDttg Both show a 0.03-0.04% increase in instructions retired. Let me see if I can improve this. |
Hrm, the increase is so small that it makes ETW/PIN number completely irrelevant. These additional 135 calls can't possibly turn into a few million instructions. And it's not like I only added code, I also removed code. |
523e66d
to
911f621
Compare
I've improved a few things and now ETW/PIN show a 0.01-0.02% improvement. I'd take that with a grain of salt. Let's just say that it's as fast as the old code. It's probably more useful to look at various counts reported by manual instrumentation (crossgen corelib numbers):
Improvements since the my first attempt:
There are some additional improvements that may be made in the codegen code for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks good overall, though it should be squashed & merged.
Have you run diffs?
src/jit/lower.cpp
Outdated
@@ -2153,7 +2142,7 @@ GenTree* Lowering::LowerTailCallViaHelper(GenTreeCall* call, GenTree* callTarget | |||
// be used for ARM as well if support for GT_TEST_EQ/GT_TEST_NE is added). | |||
// - Transform TEST(x, LSH(1, y)) into BT(x, y) (XARCH specific) | |||
|
|||
void Lowering::LowerCompare(GenTree* cmp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function header needs to be updated to describe the return value.
src/jit/lowerxarch.cpp
Outdated
{ | ||
LIR::Use simdUse; | ||
|
||
if (BlockRange().TryGetUse(simdNode, &simdUse) && simdUse.User()->OperIs(GT_EQ, GT_NE) && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code needs comments explaining what is being done. The code removed from lower.cpp was pretty well described, and I think we need similar level of detail here.
src/jit/simdcodegenxarch.cpp
Outdated
@@ -2148,8 +2148,10 @@ void CodeGen::genSIMDIntrinsicRelOp(GenTreeSIMD* simdNode) | |||
getEmitter()->emitIns_R_I(INS_cmp, EA_4BYTE, intReg, mask); | |||
} | |||
|
|||
if (targetReg != REG_NA) | |||
if ((simdNode->gtFlags & GTF_SET_FLAGS) == 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would add an else clause here, asserting that targetReg == REG_NA
. I would actually be more inclined to reverse the sense of this. That is, I would check whether targetReg != REG_NA
, and then assert that GTF_SET_FLAGS
is not set/set in the if and else clause, but I guess they are basically equivalent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the existing if
should stay as is but I have problems getting this to work because the RA insists on allocating a register even if dstCount
is 0. Need to look into this a bit more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is caused by this piece of code in lsra.cpp
Lines 4821 to 4828 in 10c320c
TreeNodeInfoInit(node); | |
// If the node produces an unused value, mark it as a local def-use | |
if (node->IsValue() && node->IsUnusedValue()) | |
{ | |
node->gtLsraInfo.isLocalDefUse = true; | |
node->gtLsraInfo.dstCount = 0; | |
} |
It forces isLocalDefUse
to true
for unused value nodes. That's probably intended to cover the common case of x86 instructions (e.g. add eax, ebx
) but it's not suitable in this situation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I think this is similar to compares, where I had to add an isNoRegCompare
bit to the TreeNodeInfo
to handle the case where you've got a compare that you don't want to allocate a register for.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked into isNoRegCompare
before but it's purpose seems to be rather different, it's used by "Contain" code to communicate to "TreeNodeInfoInit" code that dstCount
should be 0. But LSRA itself doesn't appear to use isNoRegCompare
in any way so setting it on the SIMD node has no effect.
What's not clear to me is why LSRA forces isLocalDefUse
to true based on IsValue
/IsUnusedValue
instead of relying on the information provided by TreeNodeInfoInit
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also note that the TreeNodeInfoInit
function LSRA calls ends with the following piece of code:
Lines 719 to 725 in 43cf34f
if (tree->IsUnusedValue() && (info->dstCount != 0)) | |
{ | |
info->isLocalDefUse = true; | |
} | |
// We need to be sure that we've set info->srcCount and info->dstCount appropriately | |
assert((info->dstCount < 2) || (tree->IsMultiRegCall() && info->dstCount == MAX_RET_REG_COUNT)); | |
} |
This is very similar to the code I quoted above yet slightly different. Are these 2 pieces of code correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's not clear to me is why LSRA forces isLocalDefUse to true based on IsValue/IsUnusedValue instead of relying on the information provided by TreeNodeInfoInit.
Ultimately, the setting of dstCount
is/should be based on:
- If the node is contained it is 0 (though that's largely irrelevant because the value won't be used)
- If
!IsValue()
it is 0. - If
IsNoRegCompare()
it is 0 - If
IsValue()
it is 1, or more if it is a node that defines multiple registers
There are places outside of LSRA that need to know the number of registers defined by a node. So the plan is that one should be able to determine the dstCount with gtLsraInfo
, which is being eliminated. And in my next round of changes, I'm adding an assert at the end of LinearScan::TreeNodeInfoInit()
:
assert(info->dstCount == tree->GetRegisterDstCount());
Where GetRegisterDstCount()
is a new method that does the above checks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes look great overall. I had a couple of requests for additional comments.
I would also like to see the instrumenting of TGU in a separate PR, if that's not too much trouble.
(And it would be great to squash the remaining commits).
Thanks, the instrumentation code should have not been committed, looks like forgot to unstage a file. I also need to fix conflicts and merge with ARM64 work. I'm not feeling so well at the moment so I'm not sure I'll be able to finish this in the next couple of days. Besides, you have your own changes that will like conflict with this (especially isNoRegCompare). |
src/jit/lowerxarch.cpp
Outdated
|
||
if (BlockRange().TryGetUse(simdNode, &simdUse) && simdUse.User()->OperIs(GT_EQ, GT_NE) && | ||
simdUse.User()->gtGetOp2()->IsCnsIntOrI()) | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The above condition mirrors the original code but it appears to be incomplete because it doesn't check that there are no nodes between JTRUE and SIMD (except the relop and its second op) that may change flags.
I wonder if adding more uses of node->ChangeOper(GT_JCC);
GenTreeCC* cc = node->AsCC();
cc->gtCondition = node->OperGet();
cc->gtFlags |= (node->gtFlags & GTF_UNSIGNED);
node->SetOper(GT_CMP); More expensive, yes. But then this code runs only for JTRUE nodes while A better solution would be to add |
e793359
to
a6371be
Compare
@sdmaclea I moved all the code from |
The 3rd commit (Extend flag reuse optimization to all relops) generates diffs:
I was after better throughput and simpler code but this is also a CQ improvement, we have #7566 for this. Sample diffs: sub ecx, dword ptr [rsi+56]
- test ecx, ecx
jle SHORT G_M41791_IG61 - dec eax
+ add eax, -1
jne SHORT G_M51025_IG12 - dec ecx
- test ecx, ecx
+ add ecx, -1
jle SHORT G_M59796_IG05 As seen above this sometimes results in a regression because it prevents |
I squashed the original changes down to 3 commits and added another one for JCMP. They're pretty much independent and I don't think squashing to a single commit is helpful. |
Unfortunately it's not that simple, SETCC/JCC do not currently support floating point conditions. We'll see, I'll have to run more throughput tests to see if such an approach is feasible. For now I changed the lowering of |
@CarolEidt This will conflict with your own changes. Would you prefer to rebase this on top of your ElimLsraInfo branch to avoid conflicts? |
No; after getting to zero diffs with eliminating gtLsraInfo, I found that compile time had actually increased, due to accessing the map twice per node. So, I'm reworking and I expect that it will take some time. I'll review this again tomorrow, and consider your thoughts on |
src/jit/lsraxarch.cpp
Outdated
info->internalFloatCount = 1; | ||
info->setInternalCandidates(this, allSIMDRegs()); | ||
} | ||
if (info->isNoRegCompare) | ||
info->dstCount = 0; | ||
// Codegen of SIMD (in)Equality uses target integer reg only for setting flags. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment needs updating, it still mentions the target register even though it is never used.
Hmm, that's unfortunate. But I see that |
Unlike many other relop transforms we do this one is only triggerred by the presence of a conditional branch (JTRUE) so it makes more sense to do it when lowering JTRUE nodes, avoids unnecessary calls to TryGetUse.
Yes, I am hoping to do that eventually (i.e. #7257, the next step after this), but I was hoping to make a smaller increment by breaking into two issues (#7255 then #7257) (and, obviously, without actually making things worse in the meantime). It will still be the case that we'll have to find the use information (TreeNodeInfo now, Def RefPositions/Intervals later) in the map when building the use RefPositions. So I think the work I'm doing now will be directly leveragable to the next step. |
test Windows_NT arm64 Cross Checked Build and Test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with one suggested comment update
@@ -2156,8 +2146,11 @@ GenTree* Lowering::LowerTailCallViaHelper(GenTreeCall* call, GenTree* callTarget | |||
// - Transform cmp(and(x, y), 0) into test(x, y) (XARCH/Arm64 specific but could | |||
// be used for ARM as well if support for GT_TEST_EQ/GT_TEST_NE is added). | |||
// - Transform TEST(x, LSH(1, y)) into BT(x, y) (XARCH specific) | |||
// - Transform RELOP(OP, 0) into SETCC(OP) or JCC(OP) if OP can set the | |||
// condition flags appropriately (XARCH/ARM64 specific but could be extended |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is now handling ARM64, so this comment should be updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, it already says that ARM64 is handled, only ARM32 left to do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right; sorry for the confusion.
else // The relop is not used by a JTRUE or it is not used at all. | ||
{ | ||
// Transform the relop node it into a SETCC. If it's not used we could remove | ||
// it completely but that means doing more work to handle a rare case. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If/when we support some sort of limited or general data flow on the flags, this would be something we would expect liveness to do, as it is generally responsible for eliminating dead definitions after Lowering
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Incidentally, the lack of such data flow analysis prevents us from lowering all JTRUEs to JCCs. If we try this then we'll end up with cases where JCCs are removed by liveness but the associated CMPs are not.
Failure was an orthogonal issue fixed in tip. test Windows_NT arm64 Cross Checked Build and Test |
@mikedn - I think this is nearly ready to merge. One question - the changes to the handling of SIMD won't really show up as diffs if anything changed in that codegen (it doesn't look like it should, but ...) because jit-diff uses crossgen. Have you looked at any of the JIT tests, e.g. JIT/SIMD/VectorRelOp.cs, to see whether the codegen is the same? |
Hmm, I ran the first 2 commits through In any case, I did manually check (using corerun) that the generated code looks as expected. |
No, when you crossgen it skips anything with |
Ah, of course. I was confusing this with the more general case of using SSE or AVX instructions. |
Quick example: [MethodImpl(MethodImplOptions.NoInlining)]
static bool Test(Vector<int> x, Vector<int> y) => x != y; G_M1752_IG02:
C4E17D1001 vmovupd ymm0, ymmword ptr[rcx]
C4E17D100A vmovupd ymm1, ymmword ptr[rdx]
C4E17C28D0 vmovaps ymm2, ymm0
C4E16D76D1 vpcmpeqd ymm2, ymm1
C4E17DD7C2 vpmovmskb eax, ymm2
83F8FF cmp eax, -1
0F95C0 setne al
0FB6C0 movzx rax, al
0FB6C0 movzx rax, al Even the redundant movzx is still there. I was tempted to get rid of it but it's really a different issue. |
These optimizations tend to be spread across multiple functions and add more work to common code paths.
SIMD equality is first recognized in
ContainCheckJTrue
and then it needs to be recognized again during codegencoreclr/src/jit/codegenxarch.cpp
Lines 6289 to 6293 in c90a41b
because ContainCheckJTrue did not actually remove the redundant compare. SIMD equality compares are obviously rare and we're trying to recognize them every time we handle a JTRUE or a compare.
Similarly, the case where the condition flags set by a previous instruction can be used instead of emitting a zero test is handled in
ContainCheckCompare
andgenCompareInt
. The later is supposed to recognize such redundant compares by checking theGTF_USE_FLAGS
but that flag is really intended to indicate that the node is consuming the condition flags, not that no code should be generated for it. And like in the SIMD case it's again more work pushed down to a common code path, this optimization kicks in for less than 0.5% of all compares.In both cases lowering can take advantage of CMP, SETCC and JCC to alter the IR in such a way that no special casing is required in subsequent phases.