This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Simplify some relop/jtrue related optimizations #14027

Merged
merged 4 commits into from
Oct 3, 2017

Conversation

@mikedn mikedn commented Sep 17, 2017

These optimizations tend to be spread across multiple functions and add more work to common code paths.

SIMD equality is first recognized in ContainCheckJTrue and then it needs to be recognized again during codegen:

if ((targetReg == REG_NA) && tree->OperIs(GT_EQ, GT_NE))
{
// Is it a SIMD (in)Equality that doesn't need to materialize result into a register?
if ((op1->gtRegNum == REG_NA) && op1->IsSIMDEqualityOrInequality())
{

because ContainCheckJTrue did not actually remove the redundant compare. SIMD equality compares are obviously rare and we're trying to recognize them every time we handle a JTRUE or a compare.

Similarly, the case where the condition flags set by a previous instruction can be used instead of emitting a zero test is handled in ContainCheckCompare and genCompareInt. The latter is supposed to recognize such redundant compares by checking GTF_USE_FLAGS, but that flag is really intended to indicate that the node is consuming the condition flags, not that no code should be generated for it. And like in the SIMD case, it's again more work pushed down to a common code path: this optimization kicks in for less than 0.5% of all compares.

In both cases lowering can take advantage of CMP, SETCC and JCC to alter the IR in such a way that no special casing is required in subsequent phases.
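
As a rough illustration of the flags case, the following C++ sketch models the idea on a toy linear IR: a relop against zero feeding a JTRUE is replaced by a flag-setting operation followed by a JCC. Everything in it (`Node`, `Oper`, `CanSetFlags`, `LowerJTrue`, index-based operands) is a hypothetical stand-in rather than the JIT's actual data structures, and it deliberately ignores whether anything between the flag-setting node and the branch could clobber the flags.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy operators; the names echo the discussion but this is not the JIT's real IR.
enum class Oper { LclVar, CnsInt, And, Or, Xor, Add, Sub, Eq, Ne, JTrue, Jcc };

struct Node
{
    Oper oper      = Oper::LclVar;
    int  op1       = -1;       // index of the first operand, -1 if none
    int  op2       = -1;       // index of the second operand, -1 if none
    long value     = 0;        // constant value, used by CnsInt only
    bool setFlags  = false;    // the node must leave the condition flags set
    Oper condition = Oper::Eq; // condition tested by a Jcc
    bool removed   = false;    // the node was deleted by lowering
};

// Operations assumed (for this sketch) to set the flags as a side effect.
static bool CanSetFlags(Oper oper)
{
    return oper == Oper::And || oper == Oper::Or || oper == Oper::Xor ||
           oper == Oper::Add || oper == Oper::Sub;
}

// Rewrites "JTRUE(EQ/NE(OP(...), 0))" into "OP(...) [sets flags]; JCC(EQ/NE)".
// Returns true if the pattern matched and the range was changed.
bool LowerJTrue(std::vector<Node>& range, std::size_t jtrueIndex)
{
    assert(jtrueIndex < range.size());
    Node& jtrue = range[jtrueIndex];
    assert(jtrue.oper == Oper::JTrue && jtrue.op1 >= 0);

    Node& relop = range[jtrue.op1];
    if (relop.oper != Oper::Eq && relop.oper != Oper::Ne)
        return false;

    assert(relop.op1 >= 0 && relop.op2 >= 0);
    Node& op  = range[relop.op1];
    Node& cns = range[relop.op2];
    if (!CanSetFlags(op.oper) || cns.oper != Oper::CnsInt || cns.value != 0)
        return false;

    op.setFlags     = true;       // OP now produces the flags
    jtrue.oper      = Oper::Jcc;  // the branch reads the flags directly
    jtrue.condition = relop.oper;
    jtrue.op1       = -1;
    relop.removed   = true;       // no compare and no zero constant remain
    cns.removed     = true;
    return true;
}
```

After such a rewrite, later phases in this toy model see only a flag-setting node and a JCC, so they need no special case for a compare that produces no code.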

mikedn commented Sep 18, 2017

@CarolEidt @pgavlin I keep toying with the idea of lowering more RELOP/JTRUE nodes to CMP/TEST/SETCC/JCC. The main advantage is that various pieces of logic end up concentrated in one place (LowerCompare or LowerSIMD for SIMD equality/inequality) instead of spanning lowering/tree node info init/codegen. The main disadvantage would be that we need to use TryGetUse more often. However:

  • TryGetUse should be pretty fast in this case as the use typically follows the relop
  • In the case of LowerSIMD it doesn't matter as it's not common to have SIMD == and !=
  • When a JTRUE/JCC uses the flags set by a previous instruction the RELOP/CMP node gets removed, so there are fewer nodes to process in subsequent phases
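
To make the cost argument concrete, here is a minimal C++ sketch of a forward use search over a linear node list, assuming operands are stored as indices of earlier nodes. The `Node` layout and this `TryGetUse` signature are hypothetical and only model the idea; the step counter is there to show that the search is short when the user directly follows the definition.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical linear IR node: operands are indices of earlier nodes.
struct Node
{
    int op1 = -1; // -1 means "no operand"
    int op2 = -1;
};

// Scans forward from `def` for the first node that uses it as an operand.
// On success stores the user's index in *user and returns true; returns
// false if the value is unused. *steps receives the number of nodes inspected.
bool TryGetUse(const std::vector<Node>& range, std::size_t def,
               std::size_t* user, std::size_t* steps)
{
    assert(def < range.size());
    const int defIndex = static_cast<int>(def);
    *steps = 0;
    for (std::size_t i = def + 1; i < range.size(); i++)
    {
        ++*steps;
        if (range[i].op1 == defIndex || range[i].op2 == defIndex)
        {
            *user = i;
            return true;
        }
    }
    return false;
}
```

In this model a relop whose user is the very next node is found after inspecting a single node, while an unused value forces a scan to the end of the range.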

The impact on JIT throughput seems to be very small, below 0.05% instructions retired. This is basically below the noise level I get with ETW profiling. And it seems that there are fewer branch mispredictions, but that too is below the noise level.

Opinions? Even if this does have impact on JIT throughput would you consider this a worthwhile change due to code simplification?

pgavlin commented Sep 19, 2017

> The main disadvantage would be that we need to use TryGetUse more often. However:

It would be interesting to instrument TGU s.t. we can figure out the average search length before it returns. We might be able to get away with limiting the window in which it searches, especially for throughput-oriented scenarios (e.g. minopts/debuggable code).

> Opinions? Even if this does have impact on JIT throughput would you consider this a worthwhile change due to code simplification?

Can you try getting an instructions retired count using pin or callgrind? If the impact is as low as you say, then I'd imagine that this could be worth it.

mikedn commented Sep 19, 2017

> Can you try getting an instructions retired count using pin or callgrind?

Ha ha, I'm still trying to build pin.

mikedn commented Sep 20, 2017

@pgavlin Can you tell me which make/mingw you used to build pin? It would seem that those makefiles do not work well with mingw, at least. They pass compiler parameters using / instead of -, and the / gets treated by the mingw shell as if it were a path. I tried fixing the makefile and now building doesn't show any errors anymore, it just hangs. Sheesh.

mikedn commented Sep 20, 2017

> I tried fixing the makefile and now building doesn't show any errors anymore, it just hangs

Figured it out, I converted one too many / into -.

I used the icount pin tool and I've got numbers that are similar to what ETW reports but with a smaller (~order of magnitude) standard deviation. Now to analyze the numbers...

mikedn commented Sep 20, 2017

ETW and PIN data here: https://1drv.ms/x/s!Av4baJYSo5pjgrkusSKacdbhZjDttg

Both show a 0.03-0.04% increase in instructions retired. Let me see if I can improve this.

mikedn commented Sep 20, 2017

> It would be interesting to instrument TGU s.t. we can figure out the average search length before it returns.

| Function | Before | After | Increase |
|---|---|---|---|
| LIR::TryGetUse | 108590 | 108725 | 135 |
| GenTree::TryGetUse | 152056 | 152191 | 135 |
| Ratio | 1.40027 | 1.39977 | |

Hrm, the increase is so small that it makes the ETW/PIN numbers completely irrelevant. These additional 135 calls can't possibly turn into a few million instructions. And it's not like I only added code, I also removed code.

mikedn commented Sep 25, 2017

I've improved a few things and now ETW/PIN show a 0.01-0.02% improvement. I'd take that with a grain of salt. Let's just say that it's as fast as the old code.

It's probably more useful to look at various counts reported by manual instrumentation (crossgen corelib numbers):

| Code | Count |
|---|---|
| Lowering::LowerCompare | 60011 |
| LIR::TryGetUse | 109801 |
| "Flags reuse" optimization | 249 (0.4% of compares) |
| "Flags reuse" calls to LIR::TryGetUse | 14 (0.01% of all TryGetUse calls) |

Improvements since my first attempt:

  • The common case (95%) of GT_JTRUE immediately following a relop is handled without calling TryGetUse.
  • Got rid of GTF_ZSF_SET. That flag was set on ALL nodes that may trigger this optimization but only a few nodes will actually be used by a relop. It's preferable to simply use OperIs(GT_AND, GT_OR, GT_XOR, GT_ADD, GT_SUB) in LowerCompare than to waste time on all GT_AND & co. nodes.
  • Extended the optimization to all relops. It was somewhat arbitrarily limited to EQ/NE and that required yet another conditional branch.
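
The first improvement in the list above can be sketched roughly as follows: look at the node right after the relop first, and only fall back to a search when it is not a JTRUE consuming that relop. This is a toy C++ model with hypothetical names (`Node`, `Oper`, `FindJTrueUser`), not the code from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Oper { Other, Relop, JTrue };

// Hypothetical node with a single operand, stored as an index into the range.
struct Node
{
    Oper oper = Oper::Other;
    int  op1  = -1;
};

// Returns the index of the JTRUE that consumes `relop`, or -1 if the relop is
// unused or used by something else. *usedSearch reports whether the slow
// forward search was needed.
int FindJTrueUser(const std::vector<Node>& range, std::size_t relop, bool* usedSearch)
{
    assert(relop < range.size());
    const int relopIndex = static_cast<int>(relop);
    *usedSearch = false;

    // Fast path: the user is usually the node immediately after the relop.
    const std::size_t next = relop + 1;
    if (next < range.size() && range[next].oper == Oper::JTrue && range[next].op1 == relopIndex)
    {
        return static_cast<int>(next);
    }

    // Slow path: scan forward for the first user of the relop.
    *usedSearch = true;
    for (std::size_t i = next; i < range.size(); i++)
    {
        if (range[i].op1 == relopIndex)
        {
            return (range[i].oper == Oper::JTrue) ? static_cast<int>(i) : -1;
        }
    }
    return -1;
}
```

With the fast path in place, the search only runs for the minority of relops whose user is not the adjacent node.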

There are some additional improvements that may be made in the codegen code for JCC/SETCC (it's rather convoluted), but I'll leave that for another PR, if any.

@CarolEidt CarolEidt left a comment

I think this looks good overall, though it should be squashed & merged.
Have you run diffs?

@@ -2153,7 +2142,7 @@ GenTree* Lowering::LowerTailCallViaHelper(GenTreeCall* call, GenTree* callTarget
// be used for ARM as well if support for GT_TEST_EQ/GT_TEST_NE is added).
// - Transform TEST(x, LSH(1, y)) into BT(x, y) (XARCH specific)

void Lowering::LowerCompare(GenTree* cmp)

The function header needs to be updated to describe the return value.

{
LIR::Use simdUse;

if (BlockRange().TryGetUse(simdNode, &simdUse) && simdUse.User()->OperIs(GT_EQ, GT_NE) &&

This code needs comments explaining what is being done. The code removed from lower.cpp was pretty well described, and I think we need similar level of detail here.

@@ -2148,8 +2148,10 @@ void CodeGen::genSIMDIntrinsicRelOp(GenTreeSIMD* simdNode)
getEmitter()->emitIns_R_I(INS_cmp, EA_4BYTE, intReg, mask);
}

- if (targetReg != REG_NA)
+ if ((simdNode->gtFlags & GTF_SET_FLAGS) == 0)

I would add an else clause here, asserting that targetReg == REG_NA. I would actually be more inclined to reverse the sense of this. That is, I would check whether targetReg != REG_NA, and then assert that GTF_SET_FLAGS is not set/set in the if and else clause, but I guess they are basically equivalent.

Author

Yes, the existing if should stay as is but I have problems getting this to work because the RA insists on allocating a register even if dstCount is 0. Need to look into this a bit more.

Author

This is caused by this piece of code in lsra.cpp

coreclr/src/jit/lsra.cpp

Lines 4821 to 4828 in 10c320c

TreeNodeInfoInit(node);
// If the node produces an unused value, mark it as a local def-use
if (node->IsValue() && node->IsUnusedValue())
{
node->gtLsraInfo.isLocalDefUse = true;
node->gtLsraInfo.dstCount = 0;
}

It forces isLocalDefUse to true for unused value nodes. That's probably intended to cover the common case of x86 instructions (e.g. add eax, ebx) but it's not suitable in this situation.

Ah, I think this is similar to compares, where I had to add an isNoRegCompare bit to the TreeNodeInfo to handle the case where you've got a compare that you don't want to allocate a register for.

Author

I looked into isNoRegCompare before but its purpose seems to be rather different: it's used by "Contain" code to communicate to "TreeNodeInfoInit" code that dstCount should be 0. But LSRA itself doesn't appear to use isNoRegCompare in any way, so setting it on the SIMD node has no effect.

What's not clear to me is why LSRA forces isLocalDefUse to true based on IsValue/IsUnusedValue instead of relying on the information provided by TreeNodeInfoInit.

Author

Also note that the TreeNodeInfoInit function LSRA calls ends with the following piece of code:

if (tree->IsUnusedValue() && (info->dstCount != 0))
{
info->isLocalDefUse = true;
}
// We need to be sure that we've set info->srcCount and info->dstCount appropriately
assert((info->dstCount < 2) || (tree->IsMultiRegCall() && info->dstCount == MAX_RET_REG_COUNT));
}

This is very similar to the code I quoted above yet slightly different. Are these 2 pieces of code correct?

> What's not clear to me is why LSRA forces isLocalDefUse to true based on IsValue/IsUnusedValue instead of relying on the information provided by TreeNodeInfoInit.

Ultimately, the setting of dstCount is/should be based on:

  • If the node is contained it is 0 (though that's largely irrelevant because the value won't be used)
  • If !IsValue() it is 0.
  • If IsNoRegCompare() it is 0
  • If IsValue() it is 1, or more if it is a node that defines multiple registers

There are places outside of LSRA that need to know the number of registers defined by a node. So the plan is that one should be able to determine the dstCount with gtLsraInfo, which is being eliminated. And in my next round of changes, I'm adding an assert at the end of LinearScan::TreeNodeInfoInit():

    assert(info->dstCount == tree->GetRegisterDstCount());

Where GetRegisterDstCount() is a new method that does the above checks.
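
The rules listed above could be expressed along these lines. This is only a sketch of the described checks on a hypothetical `Node` struct with plain boolean fields; the actual `GetRegisterDstCount()` is not shown in this conversation and may well differ.

```cpp
#include <cassert>

// Hypothetical stand-in for the properties the rules above refer to.
struct Node
{
    bool isContained    = false; // the node is folded into its user
    bool isValue        = true;  // the node produces a value
    bool isNoRegCompare = false; // a compare that only sets the flags
    int  multiRegCount  = 1;     // registers defined when a value is produced
};

// Number of registers the node defines, following the rules in the order given.
int GetRegisterDstCount(const Node& node)
{
    assert(node.multiRegCount >= 1);
    if (node.isContained)
        return 0;
    if (!node.isValue)
        return 0;
    if (node.isNoRegCompare)
        return 0;
    return node.multiRegCount; // 1, or more for multi-register definitions
}
```

Having a single function like this is what would make an assert such as `info->dstCount == tree->GetRegisterDstCount()` checkable at the end of TreeNodeInfoInit.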

@CarolEidt CarolEidt left a comment

The changes look great overall. I had a couple of requests for additional comments.
I would also like to see the instrumenting of TGU in a separate PR, if that's not too much trouble.
(And it would be great to squash the remaining commits).

mikedn commented Sep 27, 2017

Thanks, the instrumentation code should not have been committed; looks like I forgot to unstage a file. I also need to fix conflicts and merge with the ARM64 work.

I'm not feeling so well at the moment so I'm not sure I'll be able to finish this in the next couple of days. Besides, you have your own changes that will likely conflict with this (especially isNoRegCompare).


if (BlockRange().TryGetUse(simdNode, &simdUse) && simdUse.User()->OperIs(GT_EQ, GT_NE) &&
simdUse.User()->gtGetOp2()->IsCnsIntOrI())
{
Author

The above condition mirrors the original code but it appears to be incomplete because it doesn't check that there are no nodes between JTRUE and SIMD (except the relop and its second op) that may change flags.

mikedn commented Sep 29, 2017

I wonder if adding more uses of isNoRegCompare is a good idea. The only other use of this flag is in ContainCheckJTrue and there we can replace it with something like:

    node->ChangeOper(GT_JCC);
    GenTreeCC* cc = node->AsCC();
    cc->gtCondition = node->OperGet();
    cc->gtFlags |= (node->gtFlags & GTF_UNSIGNED);
    node->SetOper(GT_CMP);

More expensive, yes. But then this code runs only for JTRUE nodes while isNoRegCompare has to be tested on every used value, these are far more common than JTRUE nodes.

A better solution would be to add GT_SIMD_CMP - a node similar to GT_CMP that doesn't produce a value and sets the flags. Then there would be no need for isNoRegCompare.

@mikedn mikedn force-pushed the simd-eq-opt branch 2 times, most recently from e793359 to a6371be Compare October 1, 2017 09:06

mikedn commented Oct 1, 2017

@sdmaclea I moved all the code from ContainCheckCompare, see 2nd and 4th commits. It would be nice to run an ARM64 build to be sure it works fine.

@mikedn mikedn changed the title [WIP] Simplify SIMD EQ/NE optimization [WIP] Simplify some relop/jtrue related optimizations Oct 1, 2017

mikedn commented Oct 1, 2017

> Have you run diffs?

The 3rd commit (Extend flag reuse optimization to all relops) generates diffs:

Total bytes of diff: -345 (0.00% of base)
    diff is an improvement.
Total byte diff includes 0 bytes from reconciling methods
        Base had    0 unique methods,        0 unique bytes
        Diff had    0 unique methods,        0 unique bytes
Top file improvements by size (bytes):
        -133 : System.Private.CoreLib.dasm (0.00% of base)
        -116 : System.Text.RegularExpressions.dasm (-0.12% of base)
         -27 : Microsoft.CSharp.dasm (-0.01% of base)
         -13 : Microsoft.CodeAnalysis.CSharp.dasm (0.00% of base)
         -12 : System.Collections.Concurrent.dasm (-0.02% of base)
14 total files with size differences (14 improved, 0 regressed), 65 unchanged.
Top method regessions by size (bytes):
           1 : System.Private.CoreLib.dasm - Task:FinishSlow(bool):this
           1 : System.Private.CoreLib.dasm - Task:ProcessChildCompletion(ref):this
           1 : System.Private.CoreLib.dasm - Task:WaitAllBlockingCore(ref,int,struct):bool
           1 : System.Private.CoreLib.dasm - SetOnCountdownMres:Invoke(ref):this
           1 : System.Private.CoreLib.dasm - WhenAllPromise:Invoke(ref):this
Top method improvements by size (bytes):
         -38 : System.Private.CoreLib.dasm - GenericArraySortHelper`1:PickPivotAndPartition(ref,int,int):int (18 methods)
         -22 : System.Text.RegularExpressions.dasm - RegexParser:ScanCharClass(bool,bool):ref:this
         -20 : System.Private.CoreLib.dasm - GenericArraySortHelper`1:DownHeap(ref,int,int,int) (18 methods)
         -18 : System.Text.RegularExpressions.dasm - RegexParser:ScanGroupOpen():ref:this
         -15 : System.Private.CoreLib.dasm - GenericArraySortHelper`1:SwapIfGreaterWithItems(ref,int,int) (18 methods)
88 total methods with size differences (80 improved, 8 regressed), 66865 unchanged.

I was after better throughput and simpler code but this is also a CQ improvement, we have #7566 for this.

Sample diffs:

  sub      ecx, dword ptr [rsi+56]
- test     ecx, ecx
  jle      SHORT G_M41791_IG61
- dec      eax
+ add      eax, -1
  jne      SHORT G_M51025_IG12
- dec      ecx
- test     ecx, ecx
+ add      ecx, -1
  jle      SHORT G_M59796_IG05

As seen above this sometimes results in a regression because it prevents ADD(x, +/-1) from being transformed into INC/DEC. This is because INC/DEC instructions do not set the CF flag that is required by many relops other than EQ/NE. This is ultimately a compromise between sometimes wasting a code byte versus having more complex means to communicate to codegen what condition flags are actually needed.
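
The flag behaviour behind this trade-off can be modelled in a few lines of C++. The sketch below simulates how 32-bit x86 `add` and `dec` update ZF, SF, OF and CF; the `Flags` struct and the `Add`, `Dec` and `CarryDiffers` helpers are illustrative names made up for this example. The point it shows is that `dec` leaves CF as it was, while `add x, -1` writes it.

```cpp
#include <cassert>
#include <cstdint>

struct Flags
{
    bool zf = false; // zero
    bool sf = false; // sign
    bool of = false; // signed overflow
    bool cf = false; // carry (unsigned overflow)
};

// Models "add dst, src": all four flags are written.
std::uint32_t Add(std::uint32_t dst, std::uint32_t src, Flags& f)
{
    const std::uint32_t result = dst + src;
    f.zf = (result == 0);
    f.sf = (result >> 31) != 0;
    f.of = ((~(dst ^ src) & (dst ^ result)) >> 31) != 0;
    f.cf = result < dst; // a carry out of bit 31 makes the sum wrap around
    return result;
}

// Models "dec dst": ZF, SF and OF are written, CF keeps its previous value.
std::uint32_t Dec(std::uint32_t dst, Flags& f)
{
    const std::uint32_t result = dst - 1;
    f.zf = (result == 0);
    f.sf = (result >> 31) != 0;
    f.of = (dst == 0x80000000u);
    return result;
}

// Returns true if "add x, -1" and "dec x" leave CF in different states,
// starting from a cleared CF. The other results are expected to agree.
bool CarryDiffers(std::uint32_t x)
{
    Flags viaAdd;
    Flags viaDec;
    const std::uint32_t a = Add(x, 0xFFFFFFFFu, viaAdd);
    const std::uint32_t d = Dec(x, viaDec);
    assert(a == d && viaAdd.zf == viaDec.zf && viaAdd.sf == viaDec.sf && viaAdd.of == viaDec.of);
    return viaAdd.cf != viaDec.cf;
}
```

In this model the two forms always agree on the result and on ZF, SF and OF, so a branch that only reads those flags would behave the same after either instruction; only a condition that reads CF can tell them apart.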

mikedn commented Oct 1, 2017

> And it would be great to squash the remaining commits

I squashed the original changes down to 3 commits and added another one for JCMP. They're pretty much independent and I don't think squashing to a single commit is helpful.

mikedn commented Oct 1, 2017

> I wonder if adding more uses of isNoRegCompare is a good idea. The only other use of this flag is in ContainCheckJTrue and there we can replace it with something like:

Unfortunately it's not that simple, SETCC/JCC do not currently support floating point conditions. We'll see, I'll have to run more throughput tests to see if such an approach is feasible. For now I changed the lowering of SIMD<OpEquality|OpInEquality> so that it always sets the condition flags and never produces a value. The 0/1 value, if needed, is produced via a SETCC.

mikedn commented Oct 1, 2017

@CarolEidt This will conflict with your own changes. Would you prefer to rebase this on top of your ElimLsraInfo branch to avoid conflicts?

@CarolEidt

> This will conflict with your own changes. Would you prefer to rebase this on top of your ElimLsraInfo branch to avoid conflicts?

No; after getting to zero diffs with eliminating gtLsraInfo, I found that compile time had actually increased, due to accessing the map twice per node. So, I'm reworking and I expect that it will take some time. I'll review this again tomorrow, and consider your thoughts on NoRegCompare. But I wouldn't hold off for my changes at this point.

info->internalFloatCount = 1;
info->setInternalCandidates(this, allSIMDRegs());
}
if (info->isNoRegCompare)
info->dstCount = 0;
// Codegen of SIMD (in)Equality uses target integer reg only for setting flags.
Author

This comment needs updating, it still mentions the target register even though it is never used.

mikedn commented Oct 2, 2017

> I found that compile time had actually increased, due to accessing the map twice per node

Hmm, that's unfortunate. But I see that TreeNodeInfo still exists, wasn't the initial idea to completely remove it? I was hoping that instead of setting things like dstCount TreeNodeInfoInit* functions would simply build the necessary ref positions to avoid doing intermediary setup work.

Unlike many other relop transforms we do, this one is only triggered by the presence of a conditional branch (JTRUE), so it makes more sense to do it when lowering JTRUE nodes; this avoids unnecessary calls to TryGetUse.
@CarolEidt

> I was hoping that instead of setting things like dstCount TreeNodeInfoInit* functions would simply build the necessary ref positions to avoid doing intermediary setup work.

Yes, I am hoping to do that eventually (i.e. #7257, the next step after this), but I was hoping to make a smaller increment by breaking into two issues (#7255 then #7257) (and, obviously, without actually making things worse in the meantime).

It will still be the case that we'll have to find the use information (TreeNodeInfo now, Def RefPositions/Intervals later) in the map when building the use RefPositions. So I think the work I'm doing now will be directly leveragable to the next step.

sdmaclea commented Oct 2, 2017

test Windows_NT arm64 Cross Checked Build and Test

@CarolEidt CarolEidt left a comment

LGTM with one suggested comment update

@@ -2156,8 +2146,11 @@ GenTree* Lowering::LowerTailCallViaHelper(GenTreeCall* call, GenTree* callTarget
// - Transform cmp(and(x, y), 0) into test(x, y) (XARCH/Arm64 specific but could
// be used for ARM as well if support for GT_TEST_EQ/GT_TEST_NE is added).
// - Transform TEST(x, LSH(1, y)) into BT(x, y) (XARCH specific)
// - Transform RELOP(OP, 0) into SETCC(OP) or JCC(OP) if OP can set the
// condition flags appropriately (XARCH/ARM64 specific but could be extended

I believe this is now handling ARM64, so this comment should be updated.

Author

Hmm, it already says that ARM64 is handled, only ARM32 left to do.

Right; sorry for the confusion.

else // The relop is not used by a JTRUE or it is not used at all.
{
// Transform the relop node into a SETCC. If it's not used we could remove
// it completely but that means doing more work to handle a rare case.

If/when we support some sort of limited or general data flow on the flags, this would be something we would expect liveness to do, as it is generally responsible for eliminating dead definitions after Lowering

Author

Incidentally, the lack of such data flow analysis prevents us from lowering all JTRUEs to JCCs. If we try this then we'll end up with cases where JCCs are removed by liveness but the associated CMPs are not.

@CarolEidt

@sdmaclea @jashook - do you have any clues about the arm64 failure? It doesn't appear to be JIT-related.

sdmaclea commented Oct 3, 2017

Failure was an orthogonal issue fixed in tip.

test Windows_NT arm64 Cross Checked Build and Test

@CarolEidt

@mikedn - I think this is nearly ready to merge. One question - the changes to the handling of SIMD won't really show up as diffs if anything changed in that codegen (it doesn't look like it should, but ...) because jit-diff uses crossgen. Have you looked at any of the JIT tests, e.g. JIT/SIMD/VectorRelOp.cs, to see whether the codegen is the same?

mikedn commented Oct 3, 2017

> Have you looked at any of the JIT tests, e.g. JIT/SIMD/VectorRelOp.cs, to see whether the codegen is the same?

Hmm, I ran the first 2 commits through jit-diff --tests so SIMD differences should have appeared if present. AFAIK crossgen does handle SIMD, the only difference from normal jitting is that it doesn't do AVX, only SSE, right?

In any case, I did manually check (using corerun) that the generated code looks as expected.

@CarolEidt

> crossgen does handle SIMD, the only difference from normal jitting is that it doesn't do AVX, only SSE, right?

No, when you crossgen it skips anything with Vector<T> since the size can't be determined until runtime. But if you did manual checks, that's fine. I'll try to run desktop diffs (since they run in JIT mode, not crossgen), but I wouldn't wait on that.

mikedn commented Oct 3, 2017

> No, when you crossgen it skips anything with Vector<T> since the size can't be determined until runtime.

Ah, of course. I was confusing this with the more general case of using SSE or AVX instructions.

mikedn commented Oct 3, 2017

Quick example:

[MethodImpl(MethodImplOptions.NoInlining)]
static bool Test(Vector<int> x, Vector<int> y) => x != y;
G_M1752_IG02:
       C4E17D1001           vmovupd  ymm0, ymmword ptr[rcx]
       C4E17D100A           vmovupd  ymm1, ymmword ptr[rdx]
       C4E17C28D0           vmovaps  ymm2, ymm0
       C4E16D76D1           vpcmpeqd ymm2, ymm1
       C4E17DD7C2           vpmovmskb eax, ymm2
       83F8FF               cmp      eax, -1
       0F95C0               setne    al
       0FB6C0               movzx    rax, al
       0FB6C0               movzx    rax, al

Even the redundant movzx is still there. I was tempted to get rid of it but it's really a different issue.

@mikedn mikedn changed the title [WIP] Simplify some relop/jtrue related optimizations Simplify some relop/jtrue related optimizations Oct 3, 2017
@CarolEidt CarolEidt merged commit a27c269 into dotnet:master Oct 3, 2017
@mikedn mikedn deleted the simd-eq-opt branch December 16, 2017 09:16