Improve Bmi2.MultiplyNoFlags codegen #1224

pentp · 2019-12-31T02:05:34Z

This is an updated version of dotnet/coreclr#22047 with more comments.
Fixes #11782
This also fixes containment for MULX that was disabled in dotnet/coreclr#23511.

@CarolEidt could you take a look at this?

BruceForstall · 2020-01-03T00:55:45Z

@dotnet/jit-contrib

stephentoub · 2020-02-17T15:01:48Z

@pentp, @echesakovMSFT, what's the status of this PR?

pentp · 2020-02-17T23:00:55Z

It's ready for review.

I tried to get all tests green twice by force pushing the same changes, but no luck.
Could try again if the test pipeline is more reliable now? (about 80% of PRs still have red X-es)

I've tested it locally using an enhanced decimal.DecCalc implementation that uses MULX.

stephentoub · 2020-03-13T14:35:15Z

@dotnet/jit-contrib, what are the next steps for this PR? Thanks.

CarolEidt · 2020-03-13T16:43:42Z

Sorry to have dropped the ball on this. I'll review this today.

stephentoub · 2020-03-13T17:57:23Z

Thanks, @CarolEidt.

CarolEidt · 2020-03-13T23:39:41Z

src/coreclr/src/jit/hwintrinsicxarch.cpp

+                // If op3 (low part of result) is an address of a local variable of the correct type:
+                //   MultiplyNoFlags(*, *, &someLocal);
+                // Then it is transformed to use a dedicated local for that whose value is immediately copied to the
+                // original op3 local after MultiplyNoFlags (but before producing the main result):


I'm not sure why you say "but before producing the main result". The copy must be done after the result is computed.

Clarified the wording a bit

CarolEidt · 2020-03-13T23:41:00Z

src/coreclr/src/jit/hwintrinsicxarch.cpp

+                    GenTree* mov = gtNewAssignNode(gtNewLclvNode(orgNum, callType), gtNewLclvNode(tmpNum, callType));
+
+                    GenTree* comma = gtNewOperNode(GT_COMMA, callType, mov, mulx);
+                    comma->gtFlags |= GTF_REVERSE_OPS;


Why do you create the COMMA with the 'mov' before the 'mulx' and then set GTF_REVERSE_OPS instead of creating the comma with the mulx first?

My understanding is that mulx must be the last op for COMMA because it will be the value produced from it. But they must be executed in the reverse order.

CarolEidt · 2020-03-13T23:50:24Z

src/coreclr/src/jit/lsraxarch.cpp

+                // If op1 is contained then op2 must be loaded into EDX (also optimize case where it's already in EDX)
+                if (op1->isContained() || (compiler->opts.OptimizationEnabled() && op2->OperGet() == GT_LCL_VAR &&
+                                           compiler->lvaTable[op2->AsLclVar()->GetLclNum()].lvTracked &&
+                                           getIntervalForLocalVarNode(op2->AsLclVar())->physReg == REG_EDX))


This is prior to register allocation, so the only case where the interval already has an assigned physReg is if it's an incoming argument. It would be more efficient (and clearer) to say "(also optimize for the case where it's an incoming arg in EDX)".
Also, a check like this would be clearer if you extracted the LclVarDsc* prior to the check (and use compiler->lvaGetDesc() rather than using lvaTable directly).

Cleaned up this condition.

CarolEidt · 2020-03-14T00:12:02Z

Sorry for the long delay in reviewing this. I'm still not comfortable with it. It couples instructions in a way that is not reflected in the IR, and it involves deleting a definition of the temp without a way to really verify that somehow it has not been reused somewhere. At the very least the nodes for that temp would have to be marked GTF_DONT_CSE. Also, the register lifetimes aren't accurate which seems risky. For example, what if the high result winds up being unused - the register allocator could allocate the same register for it and for the low result, in which case the low result would be overwritten, since in the register allocator's model it appears that the low result is defined after the high one. It could also decide to insert spill or move code between the instructions, which I believe would lead to incorrect results.
I've spent some time pondering this, and I think a better alternative would be to:

define two intrinsics- call them MultiplyNoFlagsHigh and MultiplyNoFlagsLow. If low is not a pointer to a local, we leave it alone. If it is a pointer to a local, we split the intrinsic into the two parts using these new intrinsics.
define a new JIT node called GT_MULX. The reason for this is that we can piggyback it on the existing support for GT_MUL_LONG on 32-bit architectures that defines two registers.
- Note that the front-end of the JIT doesn't really support defining two separate values; in the front end this node produces a single TYP_LONG result until longs are decomposed during Lowering.
during Lowering, transform the intrinsic node(s) into GT_MULX. If the MultiplyNoFlagsHigh and MultiplyNoFlagsLow are adjacent, they can be combined. Otherwise, code is generated to compute the results separately. This allows it to be correct, if sub-optimal.

tannergooding · 2020-03-14T00:18:18Z

@CarolEidt, would such an approach also be applicable to DivRem and BigMul (the latter being long * long = (high, low) with the result representing the two parts of a Int128)?

CarolEidt · 2020-03-14T00:23:11Z

would such an approach also be applicable to DivRem and BigMul (the latter being long * long = (high, low) with the result representing the two parts of a Int128)?

Yes, I think so. We may even want to consider modeling these kinds of intrinsics internally as producing a struct of two values in the JIT front-end, so that we don't have to rely on combining them later. But I think we might be able to do that as a separate step.

pentp · 2020-03-17T19:22:52Z

I added GTF_DONT_CSE to the temp local nodes and the mulx node (are these right?).

Register lifetimes should be accurate - tracking register lifetime is the primary reason for the BMI2_MultiplyNoFlags2ndValue special node.
If either the high or low result ends up being unused, then lowering will not introduce the special value node (!node->IsUnusedValue() and node->gtNext* checks).

I agree that this solution isn't ideal, but all the changes are entirely contained to MULX code paths and this would make MULX actually useful (as it currently does memory accesses and wastes registers, it's actually faster to do 3-4 regular IMULs to compute the same result).

The MultiplyNoFlagsHigh + MultiplyNoFlagsLow alternative looks more complicated as it would require the pair to flow through all the phases from importation to lowering (the MULX low part temp local is also flowed through the same phases, but it doesn't have any special semantics attached to it other than GTF_DONT_CSE). I'm not sure how the input nodes would be duplicated for the two MultiplyNoFlags nodes? Through temp locals? But lowering is done after CSE so they would be forced to go through these even in the best case?

Ideally there should be a way to specify two-value nodes in IR, not just for MULX, but also for DivRem, BigMul, AddWithCarry, etc. Maybe through linked nodes (two adjacent nodes that must not be separated) or something like that?

CarolEidt · 2020-03-17T23:05:47Z

If either the high or low result ends up being unused, then lowering will not introduce the special value node (!node->IsUnusedValue() and node->gtNext* checks).

They could be determined to be unused after Lowering.

tannergooding · 2020-06-01T17:17:25Z

@CarolEidt, what is the state of this PR with relation to the multi-reg return work you are doing?

That is, should the PR move forward with some needed changes or should it be closed as this isn't how we want to handle these long term?

pentp · 2020-06-01T18:19:01Z

Even if we want a more general solution in long term (based on the multi-reg work), this PR could still be used in the meantime as a very specific and contained fix.

They could be determined to be unused after Lowering.

I've tried to think of situations where this could happen (to test out the code), but I'm not sure how they could end up being marked unused after lowering but before LSRA. Even if they end up unused after that, the register lifetimes will still be exact and tracked by the two adjacent nodes.

CarolEidt · 2020-06-03T14:54:28Z

Having looked at this further, I remain concerned. At least the temp won't be optimized away, since it is marked address-taken and so is never tracked.

Here, however, are the issues that make this approach undesirable:

it introduces a temp in the importer, and then depends on the hidden/implicit property that the temp is used only once to ensure that it can be safely removed in Lowering.
it introduces a node during Lowering that implicitly uses the result of the previous intrinsic, with no dependency. Liveness analysis (which runs after Lowering and before register allocation) could determine that the result of the first intrinsic was unused, and then eliminate it.

I've added a note to the original issue, #11782 (comment), with a link to my branch that has a prototype for how we can support this more holistically.

jkotas · 2020-09-17T21:45:13Z

Closing state PRs.

Dotnet-GitSync-Bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Dec 31, 2019

BruceForstall requested review from CarolEidt and tannergooding January 3, 2020 00:55

pentp force-pushed the mulx branch 2 times, most recently from 30bfbef to e42cd47 Compare January 16, 2020 01:26

sandreenko requested a review from echesakov January 27, 2020 07:22

pentp force-pushed the mulx branch from e42cd47 to 1e3d95d Compare March 6, 2020 17:21

CarolEidt reviewed Mar 14, 2020

View reviewed changes

pentp added 2 commits March 17, 2020 13:31

Improve Bmi2.MultiplyNoFlags codegen

fe24b75

Address PR feedback

3b8a23f

pentp force-pushed the mulx branch from 0aae563 to 3b8a23f Compare March 17, 2020 18:46

jkotas mentioned this pull request May 23, 2020

BigMul with 64-bit operands #35975

Merged

jkotas closed this Sep 17, 2020

ghost locked as resolved and limited conversation to collaborators Dec 11, 2020

pentp deleted the mulx branch July 1, 2023 18:20

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve Bmi2.MultiplyNoFlags codegen #1224

Improve Bmi2.MultiplyNoFlags codegen #1224

pentp commented Dec 31, 2019 •

edited

Loading

BruceForstall commented Jan 3, 2020

stephentoub commented Feb 17, 2020

pentp commented Feb 17, 2020

stephentoub commented Mar 13, 2020

CarolEidt commented Mar 13, 2020

stephentoub commented Mar 13, 2020

CarolEidt Mar 13, 2020

pentp Mar 17, 2020

CarolEidt Mar 13, 2020

pentp Mar 17, 2020

CarolEidt Mar 13, 2020

pentp Mar 17, 2020

CarolEidt commented Mar 14, 2020

tannergooding commented Mar 14, 2020

CarolEidt commented Mar 14, 2020

pentp commented Mar 17, 2020

CarolEidt commented Mar 17, 2020

tannergooding commented Jun 1, 2020

pentp commented Jun 1, 2020

CarolEidt commented Jun 3, 2020

jkotas commented Sep 17, 2020

Improve Bmi2.MultiplyNoFlags codegen #1224

Improve Bmi2.MultiplyNoFlags codegen #1224

Conversation

pentp commented Dec 31, 2019 • edited Loading

BruceForstall commented Jan 3, 2020

stephentoub commented Feb 17, 2020

pentp commented Feb 17, 2020

stephentoub commented Mar 13, 2020

CarolEidt commented Mar 13, 2020

stephentoub commented Mar 13, 2020

CarolEidt Mar 13, 2020

Choose a reason for hiding this comment

pentp Mar 17, 2020

Choose a reason for hiding this comment

CarolEidt Mar 13, 2020

Choose a reason for hiding this comment

pentp Mar 17, 2020

Choose a reason for hiding this comment

CarolEidt Mar 13, 2020

Choose a reason for hiding this comment

pentp Mar 17, 2020

Choose a reason for hiding this comment

CarolEidt commented Mar 14, 2020

tannergooding commented Mar 14, 2020

CarolEidt commented Mar 14, 2020

pentp commented Mar 17, 2020

CarolEidt commented Mar 17, 2020

tannergooding commented Jun 1, 2020

pentp commented Jun 1, 2020

CarolEidt commented Jun 3, 2020

jkotas commented Sep 17, 2020

pentp commented Dec 31, 2019 •

edited

Loading