Improve Bmi2.MultiplyNoFlags codegen #1224
@dotnet/jit-contrib
Force-pushed: 30bfbef to e42cd47
@pentp, @echesakovMSFT, what's the status of this PR?
It's ready for review. I tried to get all tests green twice by force pushing the same changes, but no luck. I've tested it locally using an enhanced decimal.DecCalc implementation that uses MULX.
@dotnet/jit-contrib, what are the next steps for this PR? Thanks.
Sorry to have dropped the ball on this. I'll review this today.
Thanks, @CarolEidt.
// If op3 (low part of result) is an address of a local variable of the correct type:
//   MultiplyNoFlags(*, *, &someLocal);
// Then it is transformed to use a dedicated local for that whose value is immediately copied to the
// original op3 local after MultiplyNoFlags (but before producing the main result):
I'm not sure why you say "but before producing the main result". The copy must be done after the result is computed.
Clarified the wording a bit
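For readers following the thread, here is a minimal sketch of the shape the code comment above describes: the low half goes to a dedicated temp and is copied to the original local only after the multiply has run. It is a plain C++ model of the semantics, not the JIT transformation itself. The names `multiply_no_flags` and `multiply_via_temp` are hypothetical, and the sketch assumes a compiler that provides `unsigned __int128` (GCC and Clang do).

```cpp
#include <cstdint>

// Assumption: the compiler supports the unsigned __int128 extension.
__extension__ typedef unsigned __int128 u128;

// Hypothetical model of the MultiplyNoFlags(a, b, &low) contract:
// returns the high 64 bits of the unsigned 128-bit product and
// stores the low 64 bits through the pointer.
inline uint64_t multiply_no_flags(uint64_t a, uint64_t b, uint64_t* low) {
    const u128 product = static_cast<u128>(a) * b;
    *low = static_cast<uint64_t>(product);
    return static_cast<uint64_t>(product >> 64);
}

// Hypothetical model of the rewritten form: the multiply writes its low
// half to a dedicated temp, and that temp is copied into the caller's
// local once the multiply has completed.
inline uint64_t multiply_via_temp(uint64_t a, uint64_t b, uint64_t& some_local) {
    uint64_t tmp = 0;                                     // dedicated temp
    const uint64_t high = multiply_no_flags(a, b, &tmp);  // multiply first
    some_local = tmp;                                     // then copy the low half
    return high;                                          // high half is the result
}
```

In this model the copy can only happen after the multiply, since that is when `tmp` receives its value, which is the ordering the reviewer asks the comment to state.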
GenTree* mov = gtNewAssignNode(gtNewLclvNode(orgNum, callType), gtNewLclvNode(tmpNum, callType));

GenTree* comma = gtNewOperNode(GT_COMMA, callType, mov, mulx);
comma->gtFlags |= GTF_REVERSE_OPS;
Why do you create the `COMMA` with the 'mov' before the 'mulx' and then set `GTF_REVERSE_OPS` instead of creating the comma with the `mulx` first?
My understanding is that `mulx` must be the last op for `COMMA` because it will be the value produced from it. But they must be executed in the reverse order.
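The ordering described in this reply can be illustrated with a toy model. This is only a sketch of the author's stated understanding (the second operand supplies the value, and a reverse flag makes it run first). It is not the JIT's `GenTree` and makes no claim about how the JIT actually treats `GTF_REVERSE_OPS` on a comma. `CommaNode`, `run_demo`, and the values 15 and 7 are hypothetical placeholders.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Toy two-operand comma node (hypothetical, not the JIT's GenTree).
struct CommaNode {
    std::function<uint64_t()> op1;  // evaluated for its side effect only
    std::function<uint64_t()> op2;  // its value becomes the comma's value
    bool reverse_ops = false;       // stands in for GTF_REVERSE_OPS

    uint64_t evaluate() const {
        if (reverse_ops) {
            const uint64_t value = op2();  // value-producing operand runs first
            op1();                         // side-effect operand runs second
            return value;
        }
        op1();
        return op2();
    }
};

// Builds comma(mov, mulx) and records the order in which operands execute.
// "mulx" writes a placeholder low half (15) to tmp and yields a placeholder
// high half (7); "mov" copies tmp into org_local.
inline uint64_t run_demo(bool reverse_ops, uint64_t& org_local,
                         std::vector<std::string>& order) {
    uint64_t tmp = 0;
    CommaNode comma;
    comma.op1 = [&]() -> uint64_t { order.push_back("mov"); org_local = tmp; return org_local; };
    comma.op2 = [&]() -> uint64_t { order.push_back("mulx"); tmp = 15; return 7; };
    comma.reverse_ops = reverse_ops;
    return comma.evaluate();
}
```

With `reverse_ops` set, the model runs `mulx` and then `mov`, so `org_local` receives the fresh low half. Without it, `mov` copies the stale temp before `mulx` has written it.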
src/coreclr/src/jit/lsraxarch.cpp
// If op1 is contained then op2 must be loaded into EDX (also optimize case where it's already in EDX)
if (op1->isContained() || (compiler->opts.OptimizationEnabled() && op2->OperGet() == GT_LCL_VAR &&
                           compiler->lvaTable[op2->AsLclVar()->GetLclNum()].lvTracked &&
                           getIntervalForLocalVarNode(op2->AsLclVar())->physReg == REG_EDX))
This is prior to register allocation, so the only case where the interval already has an assigned `physReg` is if it's an incoming argument. It would be more efficient (and clearer) to say "(also optimize for the case where it's an incoming arg in EDX)".
Also, a check like this would be clearer if you extracted the `LclVarDsc*` prior to the check (and use `compiler->lvaGetDesc()` rather than using `lvaTable` directly).
Cleaned up this condition.
Sorry for the long delay in reviewing this. I'm still not comfortable with it. It couples instructions in a way that is not reflected in the IR, and it involves deleting a definition of the temp without a way to really verify that somehow it has not been reused somewhere. At the very least the nodes for that temp would have to be marked
@CarolEidt, would such an approach also be applicable to |
Yes, I think so. We may even want to consider modeling these kinds of intrinsics internally as producing a struct of two values in the JIT front-end, so that we don't have to rely on combining them later. But I think we might be able to do that as a separate step.
I added Register lifetimes should be accurate - tracking register lifetime is the primary reason for the I agree that this solution isn't ideal, but all the changes are entirely contained to MULX code paths and this would make MULX actually useful (as it currently does memory accesses and wastes registers, it's actually faster to do 3-4 regular IMULs to compute the same result). The MultiplyNoFlagsHigh + MultiplyNoFlagsLow alternative looks more complicated as it would require the pair to flow through all the phases from importation to lowering (the Ideally there should be a way to specify two-value nodes in IR, not just for MULX, but also for |
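As background for the remark about computing the same result with several regular multiplies, here is a minimal sketch of one standard way to assemble an unsigned 64x64 to 128-bit product from four 32x32 to 64-bit multiplies. It illustrates the arithmetic only. It is not what the JIT emits and says nothing about relative performance. The function name `mul64x64_by_parts` is hypothetical.

```cpp
#include <cstdint>

// Computes the unsigned 128-bit product of a and b from four partial
// products of their 32-bit halves. Returns the high 64 bits and stores
// the low 64 bits through `low`.
inline uint64_t mul64x64_by_parts(uint64_t a, uint64_t b, uint64_t* low) {
    const uint64_t mask = 0xFFFFFFFFull;
    const uint64_t a_lo = a & mask, a_hi = a >> 32;
    const uint64_t b_lo = b & mask, b_hi = b >> 32;

    const uint64_t p0 = a_lo * b_lo;  // contributes to bits 0..63
    const uint64_t p1 = a_lo * b_hi;  // contributes to bits 32..95
    const uint64_t p2 = a_hi * b_lo;  // contributes to bits 32..95
    const uint64_t p3 = a_hi * b_hi;  // contributes to bits 64..127

    // Sum of everything landing in bits 32..63, plus carries; fits in 64 bits.
    const uint64_t mid = (p0 >> 32) + (p1 & mask) + (p2 & mask);

    *low = (mid << 32) | (p0 & mask);
    return p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
}
```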
They could be determined to be unused after |
@CarolEidt, what is the state of this PR with relation to the multi-reg return work you are doing? That is, should the PR move forward with some needed changes or should it be closed as this isn't how we want to handle these long term? |
Even if we want a more general solution in long term (based on the multi-reg work), this PR could still be used in the meantime as a very specific and contained fix.
I've tried to think of situations where this could happen (to test out the code), but I'm not sure how they could end up being marked unused after lowering but before LSRA. Even if they end up unused after that, the register lifetimes will still be exact and tracked by the two adjacent nodes. |
Having looked at this further, I remain concerned. At least the temp won't be optimized away, since it is marked address-taken and so is never tracked. Here, however, are the issues that make this approach undesirable:
I've added a note to the original issue, #11782 (comment), with a link to my branch that has a prototype for how we can support this more holistically.
Closing stale PRs.
This is an updated version of dotnet/coreclr#22047 with more comments.
Fixes #11782
This also fixes containment for MULX that was disabled in dotnet/coreclr#23511.
@CarolEidt could you take a look at this?