Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Bmi2.MultiplyNoFlags codegen #1224

Closed
wants to merge 2 commits into from
Closed

Improve Bmi2.MultiplyNoFlags codegen #1224

wants to merge 2 commits into from

Conversation

pentp
Copy link
Contributor

@pentp pentp commented Dec 31, 2019

This is an updated version of dotnet/coreclr#22047 with more comments.
Fixes #11782
This also fixes containment for MULX that was disabled in dotnet/coreclr#23511.

@CarolEidt could you take a look at this?

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Dec 31, 2019
@BruceForstall
Copy link
Member

@dotnet/jit-contrib

@stephentoub
Copy link
Member

@pentp, @echesakovMSFT, what's the status of this PR?

@pentp
Copy link
Contributor Author

pentp commented Feb 17, 2020

It's ready for review.

I tried to get all tests green twice by force pushing the same changes, but no luck.
Could try again if the test pipeline is more reliable now? (about 80% of PRs still have red X-es)

I've tested it locally using an enhanced decimal.DecCalc implementation that uses MULX.

@stephentoub
Copy link
Member

@dotnet/jit-contrib, what are the next steps for this PR? Thanks.

@CarolEidt
Copy link
Contributor

Sorry to have dropped the ball on this. I'll review this today.

@stephentoub
Copy link
Member

Thanks, @CarolEidt.

// If op3 (low part of result) is an address of a local variable of the correct type:
// MultiplyNoFlags(*, *, &someLocal);
// Then it is transformed to use a dedicated local for that whose value is immediately copied to the
// original op3 local after MultiplyNoFlags (but before producing the main result):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why you say "but before producing the main result". The copy must be done after the result is computed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarified the wording a bit

GenTree* mov = gtNewAssignNode(gtNewLclvNode(orgNum, callType), gtNewLclvNode(tmpNum, callType));

GenTree* comma = gtNewOperNode(GT_COMMA, callType, mov, mulx);
comma->gtFlags |= GTF_REVERSE_OPS;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you create the COMMA with the 'mov' before the 'mulx' and then set GTF_REVERSE_OPS instead of creating the comma with the mulx first?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that mulx must be the last op for COMMA because it will be the value produced from it. But they must be executed in the reverse order.

// If op1 is contained then op2 must be loaded into EDX (also optimize case where it's already in EDX)
if (op1->isContained() || (compiler->opts.OptimizationEnabled() && op2->OperGet() == GT_LCL_VAR &&
compiler->lvaTable[op2->AsLclVar()->GetLclNum()].lvTracked &&
getIntervalForLocalVarNode(op2->AsLclVar())->physReg == REG_EDX))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is prior to register allocation, so the only case where the interval already has an assigned physReg is if it's an incoming argument. It would be more efficient (and clearer) to say "(also optimize for the case where it's an incoming arg in EDX)".
Also, a check like this would be clearer if you extracted the LclVarDsc* prior to the check (and use compiler->lvaGetDesc() rather than using lvaTable directly).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cleaned up this condition.

@CarolEidt
Copy link
Contributor

Sorry for the long delay in reviewing this. I'm still not comfortable with it. It couples instructions in a way that is not reflected in the IR, and it involves deleting a definition of the temp without a way to really verify that somehow it has not been reused somewhere. At the very least the nodes for that temp would have to be marked GTF_DONT_CSE. Also, the register lifetimes aren't accurate which seems risky. For example, what if the high result winds up being unused - the register allocator could allocate the same register for it and for the low result, in which case the low result would be overwritten, since in the register allocator's model it appears that the low result is defined after the high one. It could also decide to insert spill or move code between the instructions, which I believe would lead to incorrect results.
I've spent some time pondering this, and I think a better alternative would be to:

  • define two intrinsics- call them MultiplyNoFlagsHigh and MultiplyNoFlagsLow. If low is not a pointer to a local, we leave it alone. If it is a pointer to a local, we split the intrinsic into the two parts using these new intrinsics.
  • define a new JIT node called GT_MULX. The reason for this is that we can piggyback it on the existing support for GT_MUL_LONG on 32-bit architectures that defines two registers.
    • Note that the front-end of the JIT doesn't really support defining two separate values; in the front end this node produces a single TYP_LONG result until longs are decomposed during Lowering.
  • during Lowering, transform the intrinsic node(s) into GT_MULX. If the MultiplyNoFlagsHigh and MultiplyNoFlagsLow are adjacent, they can be combined. Otherwise, code is generated to compute the results separately. This allows it to be correct, if sub-optimal.

@tannergooding
Copy link
Member

@CarolEidt, would such an approach also be applicable to DivRem and BigMul (the latter being long * long = (high, low) with the result representing the two parts of a Int128)?

@CarolEidt
Copy link
Contributor

would such an approach also be applicable to DivRem and BigMul (the latter being long * long = (high, low) with the result representing the two parts of a Int128)?

Yes, I think so. We may even want to consider modeling these kinds of intrinsics internally as producing a struct of two values in the JIT front-end, so that we don't have to rely on combining them later. But I think we might be able to do that as a separate step.

@pentp
Copy link
Contributor Author

pentp commented Mar 17, 2020

I added GTF_DONT_CSE to the temp local nodes and the mulx node (are these right?).

Register lifetimes should be accurate - tracking register lifetime is the primary reason for the BMI2_MultiplyNoFlags2ndValue special node.
If either the high or low result ends up being unused, then lowering will not introduce the special value node (!node->IsUnusedValue() and node->gtNext* checks).

I agree that this solution isn't ideal, but all the changes are entirely contained to MULX code paths and this would make MULX actually useful (as it currently does memory accesses and wastes registers, it's actually faster to do 3-4 regular IMULs to compute the same result).

The MultiplyNoFlagsHigh + MultiplyNoFlagsLow alternative looks more complicated as it would require the pair to flow through all the phases from importation to lowering (the MULX low part temp local is also flowed through the same phases, but it doesn't have any special semantics attached to it other than GTF_DONT_CSE). I'm not sure how the input nodes would be duplicated for the two MultiplyNoFlags nodes? Through temp locals? But lowering is done after CSE so they would be forced to go through these even in the best case?

Ideally there should be a way to specify two-value nodes in IR, not just for MULX, but also for DivRem, BigMul, AddWithCarry, etc. Maybe through linked nodes (two adjacent nodes that must not be separated) or something like that?

@CarolEidt
Copy link
Contributor

If either the high or low result ends up being unused, then lowering will not introduce the special value node (!node->IsUnusedValue() and node->gtNext* checks).

They could be determined to be unused after Lowering.

@tannergooding
Copy link
Member

@CarolEidt, what is the state of this PR with relation to the multi-reg return work you are doing?

That is, should the PR move forward with some needed changes or should it be closed as this isn't how we want to handle these long term?

@pentp
Copy link
Contributor Author

pentp commented Jun 1, 2020

Even if we want a more general solution in long term (based on the multi-reg work), this PR could still be used in the meantime as a very specific and contained fix.

They could be determined to be unused after Lowering.

I've tried to think of situations where this could happen (to test out the code), but I'm not sure how they could end up being marked unused after lowering but before LSRA. Even if they end up unused after that, the register lifetimes will still be exact and tracked by the two adjacent nodes.

@CarolEidt
Copy link
Contributor

Having looked at this further, I remain concerned. At least the temp won't be optimized away, since it is marked address-taken and so is never tracked.

Here, however, are the issues that make this approach undesirable:

  • it introduces a temp in the importer, and then depends on the hidden/implicit property that the temp is used only once to ensure that it can be safely removed in Lowering.
  • it introduces a node during Lowering that implicitly uses the result of the previous intrinsic, with no dependency. Liveness analysis (which runs after Lowering and before register allocation) could determine that the result of the first intrinsic was unused, and then eliminate it.

I've added a note to the original issue, #11782 (comment), with a link to my branch that has a prototype for how we can support this more holistically.

@jkotas
Copy link
Member

jkotas commented Sep 17, 2020

Closing state PRs.

@jkotas jkotas closed this Sep 17, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 11, 2020
@pentp pentp deleted the mulx branch July 1, 2023 18:20
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bmi2.MultiplyNoFlags issues
7 participants