
Bmi2.MultiplyNoFlags issues #11782

Open
pentp opened this issue Jan 8, 2019 · 26 comments
Labels: arch-x64, area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), optimization

@pentp (Contributor) commented Jan 8, 2019

I have a working version of decimal.DecCalc in a branch that uses MultiplyNoFlags from dotnet/coreclr#21480, but I discovered two issues.

If MultiplyNoFlags is called without its return value (the high part) being used, it's assumed to be a no-op, even if the low part is used. While such use would be sub-optimal, it should still be valid (fixed in dotnet/coreclr#21928).
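
For example (a minimal sketch; the helper name is illustrative), the high half returned by the call is discarded here, but the low half written through the pointer is still consumed, so the call must not be dropped:

static unsafe ulong LowHalfOnly(ulong a, ulong b)
{
	ulong lo;
	X86.Bmi2.X64.MultiplyNoFlags(a, b, &lo); // return value (high half) ignored
	return lo;                               // low half still used
}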

The second problem is that performance improves by at most 3% for some methods, while others suffer a penalty of up to 20%! This is primarily caused by forcing the low result to be written to memory and by excessive temporary register use, compounded by forced zero-initialization of the locals (even with no .locals init), which affects all code paths of the function.

static unsafe ulong mulx(ulong a, ulong b)
{
	ulong r;
	return X86.Bmi2.X64.MultiplyNoFlags(a, b, &r) + r;
}
; Assembly listing for method DecCalc:mulx(long,long):long
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )    long  ->  rcx        
;  V01 arg1         [V01,T01] (  3,  3   )    long  ->  [rsp+0x18]  
;  V02 loc0         [V02    ] (  2,  2   )    long  ->  [rsp+0x00]   do-not-enreg[X] must-init addr-exposed ld-addr-op
;# V03 OutArgs      [V03    ] (  1,  1   )  lclBlk ( 0) [rsp+0x00]   "OutgoingArgSpace"
;
; Lcl frame size = 8

G_M42317_IG01:
       push     rax
       xor      rax, rax
       mov      qword ptr [rsp], rax
       mov      qword ptr [rsp+18H], rdx

G_M42317_IG02:
       lea      rax, bword ptr [rsp]
       mov      rdx, rcx
       mulx     rdx, r8, qword ptr [rsp+18H]
       mov      qword ptr [rax], r8
       mov      rax, rdx
       add      rax, qword ptr [rsp]

G_M42317_IG03:
       add      rsp, 8
       ret      

; Total bytes of code 41, prolog size 7 for method DecCalc:mulx(long,long):long

Ideally, this should be just:

       mulx     rax, rcx, rcx
       add      rax, rcx
       ret      

category:cq
theme:vector-codegen
skill-level:expert
cost:medium
impact:small

@pentp (Contributor, Author) commented Jan 8, 2019

Note that temp register use depends on the order of parameters (with respect to the implied EDX/RDX parameter), so reversing a and b in the sample improves the codegen somewhat (this might be relatively easy to fix). Getting rid of the lea for locals could probably even be done at emit time. And must-init could probably be turned off as well.
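
For example, this variant of the earlier sample (a sketch; only the operand order changes, the product is the same) already produces somewhat better code:

static unsafe ulong mulx2(ulong a, ulong b)
{
	ulong r;
	return X86.Bmi2.X64.MultiplyNoFlags(b, a, &r) + r;
}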

The hard part is getting MultiplyNoFlags working without forcing a memory store if a local variable is used for the low result.
@fiigii @mikedn, how hard would it be to improve MULX codegen with the constraint that the low part is stored to a local variable?

@fiigii (Contributor) commented Jan 9, 2019

If MultiplyNoFlags is called without its return value (the high part) being used, it's assumed to be a no-op

I will fix the first problem tomorrow.

how hard would it be to improve MULX codegen with the constraint that the low part is stored to a local variable?

As far as I know, RyuJIT only builds SSA for normal local vars and does not have a general memory analysis mechanism (e.g., memory SSA), so eliminating this kind of redundant memory stores/loads looks like a big job. But I may be wrong.
@CarolEidt @AndyAyersMS Do you have suggestions?

Originally, I suggested exposing this intrinsic as

ValueTuple<ulong, ulong> MultiplyNoFlags(ulong left, ulong right);

This API could be optimized via struct promotion, avoiding memory access in most cases. However, I was told that ValueTuple is not considered good in public APIs...
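
For example, a caller could then write (a sketch assuming the first tuple element is the low half):

(ulong lo, ulong hi) = Bmi2.X64.MultiplyNoFlags(a, b);
ulong sum = lo + hi; // both halves can stay in registers; no address-taken local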

@mikedn (Contributor) commented Jan 9, 2019

As far as I know, RyuJIT only builds SSA for normal local vars and does not have a general memory analysis mechanism (e.g., memory SSA), so eliminating this kind of redundant memory stores/loads looks like a big job.

It has less to do with SSA and more to do with the ability to represent this in the IR.

This API could be optimized via struct promotion, avoiding memory access in most cases.

Unlikely, you'll end up with dependent struct promotion and thus in memory. Multireg return calls (including long return calls on 32 bit) also suffer from a similar problem.

For example:

       FF153C7B1305 call     [C:GetLong():long]
       8945F0       mov      dword ptr [ebp-10H], eax
       8955F4       mov      dword ptr [ebp-0CH], edx
       8B75F0       mov      esi, dword ptr [ebp-10H]
       8B7DF4       mov      edi, dword ptr [ebp-0CH]
       FF153C7B1305 call     [C:GetLong():long]
       8945E8       mov      dword ptr [ebp-18H], eax
       8955EC       mov      dword ptr [ebp-14H], edx
       8BC6         mov      eax, esi
       0345E8       add      eax, dword ptr [ebp-18H]
       8BD7         mov      edx, edi
       1355EC       adc      edx, dword ptr [ebp-14H]

which should be something like

       call     [C:GetLong():long]
       mov      esi, eax
       mov      edi, edx
       call     [C:GetLong():long]
       add      eax, esi
       adc      edx, edi
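
For reference, a C# shape that could produce the pattern above on 32-bit x86 (a sketch; GetLong stands in for any non-inlined method returning long):

using System.Runtime.CompilerServices;

static class C
{
	[MethodImpl(MethodImplOptions.NoInlining)]
	public static long GetLong() => 42;

	// Each call returns its result in the EAX:EDX pair; the JIT currently
	// spills that pair to the stack before using it.
	public static long Sum() => GetLong() + GetLong();
}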

Not to say that this cannot be fixed. But it's not that simple.

And yes, returning a pair of values might make more sense than using an "out" parameter.

@pentp (Contributor, Author) commented Jan 9, 2019

The public API is probably not holding us back - it could just forward to a private ValueTuple version if it helped, or this transformation could be achieved during JIT import. Importation could ensure that the result is always stored to a (temp) local variable, which might simplify/improve the codegen somewhat. But fundamentally the problem is modeling a two-value IR node (this applies to all three variants).

@fiigii (Contributor) commented Jan 9, 2019

Unlikely, you'll end up with dependent struct promotion and thus in memory. Multireg return calls (including long return calls on 32 bit) also suffer from a similar problem.

@mikedn Ah, thanks for the info. Are there issues to track this problem?

@pentp (Contributor, Author) commented Jan 9, 2019

What if the importer transformed local variable use cases from

ldloca/ldarga op3
conv.u
call MultiplyNoFlags(op1, op2, op3)

into

call MultiplyNoFlags(op1, op2)
ldmagic /*load low part*/
stloc/starg op3

and all other cases into

stloc addr_tmp
call MultiplyNoFlags(op1, op2)
ldloc addr_tmp
ldmagic /*load low part*/
stind

where ldmagic would be some new IR node type that is tied to the second return value of mulx? And it could be stitched together using a GT_COMMA maybe?

@mikedn (Contributor) commented Jan 9, 2019

Ah, thanks for the info. Are there issues to track this problem?

None that I know of.

where ldmagic would be some new IR node type that is tied to the second return value of mulx

There's no such thing as "tied to the second return value" of a node in IR. That would imply that a node has multiple uses (it cannot) and that a use can specify which value is used (it cannot).

Something like this might work in lowering, where the "magic" node would act like a second return value. But by the time you reach lowering, the damage is already done.

@pentp (Contributor, Author) commented Jan 10, 2019

Flags from instructions like and are a kind of second return value, so maybe something similar would work? Or some kind of implied return value that appears out of thin air at first, but is later tied to the preceding mulx in LSRA/emit?
Most of the intermediate phases are not very useful for a two-value MultiplyNoFlags anyway (value numbering, CSE, assertions); the primary requirement would be that this "magic" load is kept right after mulx throughout the process.

@mikedn (Contributor) commented Jan 10, 2019

Flags from instructions like and are a kind of second return value, so maybe something similar would work?

Yes, that's what I was thinking of when I said that something like this might work in lowering. Though it's a bit different, as LSRA does not model the flag register but, of course, does model integer registers. Besides, the flag stuff is a bit on the hackish side; it gets the job done, but it's not quite the right thing to do.

Most of the intermediate phases are not very useful for a two-value MultiplyNoFlags anyway (value numbering, CSE, assertions); the primary requirement would be that this "magic" load is kept right after mulx throughout the process.

Yes, but in the case of the current API it's kind of difficult to deal with the address taken variable.

A version that returns a struct, hmm, I think that's really the best option and the only one that doesn't look too hackish. But as I told @fiigii before, there's a problem with struct promotion. I should check what's up with that; perhaps we can make it so that certain cases of dependent struct promotion don't force the struct into memory. As I already mentioned, that would help other cases as well.

@tannergooding (Member):

A version that returns a struct, hmm, I think that's really the best option and the only one that doesn't look too hackish

I think, if we did this, it would either need to be an internal method or would need to use a strongly named type (like Int128/UInt128). The guidance so far has been don't use ValueTuple in the public surface area (and the only time we tried, which was fairly recently, it ended up getting backed out). NOTE: Exposing a strongly named type (like Int128/UInt128) would probably not get reviewed/approved/implemented in time for the initial release (as there are a lot of other considerations here as well).

It might also be nice to still find a solution for the address-taken scenario, since we have some existing methods (like Math.DivRem) whose public surface area we can't change but which will require similar fixes, and since we would still want good codegen for the address-taken scenario...

@pentp (Contributor, Author) commented Jan 10, 2019

The address taken scenario can always be transformed into a stloc addr; call mulx; ldmagic; stloc tmp; ldloc addr; ldloc tmp; stind so that the problem is again reduced to this call mulx; ldmagic pair. But this variant is actually pretty optimal already (low address points to some non-local memory location).

If I'm not mistaken, then after struct promotion ValueTuple would end up with the same problem of two return values and two (promoted) variables.

The current API shape might actually be a blessing because it allows very easy detection of the case where the out parameter points to a local.

@mikedn (Contributor) commented Jan 10, 2019

The address taken scenario can always be transformed into a stloc addr; call mulx; ldmagic; stloc tmp; ldloc addr; ldloc tmp; stind so that the problem is again reduced to this call mulx; ldmagic pair

Yes, introducing a temp could help but there are certain problems that will have to be solved for this to actually work. The idea would be to introduce this temp during import such that

ASG high, MULX op1, op2, op3
; becomes
MULX op1, op2, ADDR temp
ASG IND(op3), temp
; if op3 represents the address of a lcl then this reduces to
ASG low, temp

This prevents the original local variable from becoming address taken. Instead, temp becomes address taken, which is a good thing because it should prevent the optimizer from doing anything that would then prevent lowering from recognizing this pattern. In particular, no other uses of temp should ever appear.

Then lowering could probably transform this into something like:

STORE_LCL_VAR high, MULX …
STORE_LCL_VAR low, MAGIC

So far so good. The problem is dealing with register allocation for such a thing. A couple of observations for now:

  • MULX will likely need to be a multi-reg node; one way or another, you need to allocate 2 destination registers for it.
  • MAGIC might need to have MULX as a "hidden" operand, "hidden" in the sense that it has a pointer to the MULX node but doesn't consider it an operand. That would violate the requirement that an IR node must have a single use. The question is whether LSRA would tolerate that.
  • Care needs to be taken to avoid an extra reg-reg move: low and MULX's corresponding low destination register should be the same whenever possible.

So I don't know, it might be doable. Some additional investigation and perhaps experimentation will be needed.

On top of that, I'm not sure if the JIT team will be happy with that magic node.

@mikedn (Contributor) commented Jan 10, 2019

The current API shape might actually be a blessing because it allows very easy detection of the case where the out parameter points to a local.

But "easy detection of the case where the out parameter points to a local" isn't that much of a problem. There are already places where the JIT detects such cases and eliminates them, provided that it has something to replace the tree with. For example, IND(ADDR(LCL_VAR)) is usually transformed into LCL_VAR and the local variable is not considered address taken.

But in the case of MULX the problem is that you don't have something reasonable to transform to.

@mikedn (Contributor) commented Jan 10, 2019

If I'm not mistaken, then after struct promotion ValueTuple would end up with the same problem of two return values and two (promoted) variables.

Sort of. The problem with relying on struct promotion is that you have to do something like:

ASG LCL_VAR pair, MULX ; pair is a struct var having 2 int fields
; later you can use the individual fields
… LCL_VAR pair.low
… LCL_VAR pair.high

It's pretty natural. The problem with this is that if you promote, then you shouldn't use the entire struct variable anymore; you should only use its individual fields. And if you do access the entire struct variable, then it cannot be enregistered and it ends up in memory, which is what we were trying to avoid in the first place.
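
A hedged C# illustration of that constraint, assuming the tuple-returning MultiplyNoFlags shape discussed above (MultiplyNoFlags and Consume here are stand-ins, not real APIs):

static (ulong, ulong) MultiplyNoFlags(ulong a, ulong b) => (a * b, 0); // stand-in
static void Consume((ulong, ulong) pair) { }                           // stand-in

static ulong FieldsOnly(ulong a, ulong b)
{
	var pair = MultiplyNoFlags(a, b);
	return pair.Item1 + pair.Item2; // only field uses: fields can be enregistered
}

static ulong WholeUse(ulong a, ulong b)
{
	var pair = MultiplyNoFlags(a, b);
	Consume(pair);                  // whole-struct use: pair ends up in memory
	return pair.Item1;
}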

So the problem here is to investigate whether it's possible to change this. It will probably end up in register allocation territory as well: registers must be allocated to the individual fields as usual, but then we need to be able to store to both registers using the same STORE_LCL_VAR node.

Again, it might be doable. But it needs investigation.

@mikedn (Contributor) commented Jan 10, 2019

And I suspect that either way we'll need MULX to be a multi-reg node. That means we'll need to create a special oper for it and use GenTreeMultiRegOp instead of the current HW intrinsic oper and class.

@CarolEidt (Contributor):

Unlikely, you'll end up with dependent struct promotion and thus in memory. Multireg return calls (including long return calls on 32 bit) also suffer from a similar problem.

Are there issues to track this problem?

There are actually multiple issues: #5024, #6316, #6318, #9839, #5112 and #8571 (the last two are probably the most relevant to this specific case). While I continue to make a little progress on the struct issues, it's a process of many tiny steps.

If we go the route of introducing a temp during import, I would think a better approach would be to transform it, earlier in the JIT (morph? optimizer?), into an appropriate multi-reg op, as @mikedn hints at above. The "original" form would continue to be a memory store, but the alternate form would write two registers as an internal two-field struct. It might take some work to eliminate the copies that may be induced (especially the memory copies), but I think that's the only reasonable path to take.

On top of that, I'm not sure if the JIT team will be happy with that magic node.

I'd really like to avoid such a thing!

@mikedn (Contributor) commented Jan 11, 2019

I'd really like to avoid such a thing!

Thought so 😄. Even the CMP/JCC case is a bit ugly in this regard, but at least that one doesn't interfere with register allocation.

There are actually multiple issues: #5024, #6316, #6318, #9839, #5112 and #8571 (the last two are probably the most relevant to this specific case). While I continue to make a little progress on the struct issues, it's a process of many tiny steps.

Thanks for the list of issues, I'll take a look. Anyway, I'm looking at struct issues these days due to my attempt at getting rid of GT_ASG.

I would think a better approach would be to transform it, earlier in the JIT (morph? optimizer?)

That's local address-taken stuff; it should probably be done in LocalAddressVisitor. But this is a minor issue.

@fiigii (Contributor) commented Jan 15, 2019

@pentp The first problem has been solved by dotnet/coreclr#21928; could you please update this issue?

@CarolEidt (Contributor):

#34822 added support for intrinsics that define two registers, and once #36862 is merged we'll be able to store that multi-reg result to an enregistered lclVar.

I have a WIP branch, https://github.com/CarolEidt/runtime/tree/Mulx2, that builds on #36862 and adds a mulx intrinsic that returns a ValueTuple and can enregister both fields of that result.

@pentp (Contributor, Author) commented Jun 7, 2020

Should I create an API proposal for the new ValueTuple-returning MULX to expedite the process?
Maybe we could name the new method just Bmi2.Multiply to solve the naming conflict (instead of MultiplyNoFlags2)? There isn't any functional difference whether flags are affected or not (they are inaccessible anyway).
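
Something like this, for instance (a hypothetical shape; naming and tuple element order not settled):

public static (ulong Lo, ulong Hi) Multiply(ulong a, ulong b);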

@CarolEidt (Contributor):

Should I create an API proposal for the new ValueTuple-returning MULX to expedite the process?
Maybe we could name the new method just Bmi2.Multiply to solve the naming conflict (instead of MultiplyNoFlags2)? There isn't any functional difference whether flags are affected or not (they are inaccessible anyway).

I was planning to leave that decision to the folks with API design expertise. The other alternative would be not to expose it, and leave it as an internal method that the other method calls. One could theoretically make the same kind of transformation in the JIT, but it doesn't currently have the capability to create something like a ValueTuple on its own.

@tannergooding - perhaps this would be a good general design discussion to have, as I believe there are other intrinsics that define multiple registers.

@tannergooding (Member):

This was brought up minimally for an unrelated API last week, and the general consensus was that, especially with the HWIntrinsics, returning value tuples would be fine.
API proposals would still be needed and would need to go through review.

If one was put up, it would be nice to cover the "core" APIs; that is, not just Bmi2.MultiplyNoFlags, but also Math.DivRem and Math.BigMul, for example.

The conflict, given that we already have a variant of MultiplyNoFlags that just returns the high bits, is a bit unfortunate and would likely require some more thought.
Math.BigMul is in a semi-related boat but doesn't quite have a conflict: it has long BigMul(int, int), ulong BigMul(ulong, ulong, out ulong), and long BigMul(long, long, out long).
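
For illustration, tuple-returning counterparts could look like this (hypothetical sketches, not approved API):

public static (long High, long Low) BigMul(long left, long right);
public static (int Quotient, int Remainder) DivRem(int left, int right);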

@pentp (Contributor, Author) commented Jun 8, 2020

Yes, naming might best be left to the API designers; I'm just thinking of options.
But there is definitely value in exposing the API, because the ulong* parameter currently forces users into unsafe code.

@BruceForstall BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020
CarolEidt added a commit to CarolEidt/runtime that referenced this issue Nov 17, 2020
echesakov pushed a commit to echesakov/runtime that referenced this issue Nov 19, 2020
@benaadams (Member):

Has an API review been opened for the ValueTuple return for BigMul?

Or will this be resolved by MultiplyNoFlags returning a ValueTuple, with the BigMul API remaining the same and the JIT resolving it at inline time, since it's address-taken in C# rather than address-taken for the intrinsic?

/cc @lemire

@benaadams (Member):

Ah, looks like it will be resolved by #37928 and then by changing the internal C# implementation, without the need for another API? #42156 (comment)

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Nov 9, 2021
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Dec 10, 2021
@BruceForstall BruceForstall removed the JitUntriaged CLR JIT issues needing additional triage label Oct 30, 2022
@tevador commented Jun 17, 2024

It's 2024 and .NET 8.0 still suffers from this issue. MultiplyNoFlags has terrible codegen for the low half of the result. In fact, it's much faster to use only the high half returned by MultiplyNoFlags and recalculate the low half with a separate multiplication.

Benchmark sample:

using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics.X86;
using BenchmarkDotNet.Attributes;

public class MulxTests
{
    [Params(0x0123456789abcdef)]
    public ulong TestA { get; set; }

    [Params(0xdeadbeefdeadbeef)]
    public ulong TestB { get; set; }

    [Benchmark]
    public ulong BenchMulx1()
    {
        ulong accLo = TestA;
        ulong accHi = TestB;
        BigMulAcc1(accLo, accHi, ref accHi, ref accLo);
        BigMulAcc1(accLo, accHi, ref accHi, ref accLo);
        BigMulAcc1(accLo, accHi, ref accHi, ref accLo);
        BigMulAcc1(accLo, accHi, ref accHi, ref accLo);
        return accLo + accHi;
    }

    [Benchmark]
    public ulong BenchMulx2()
    {
        ulong accLo = TestA;
        ulong accHi = TestB;
        BigMulAcc2(accLo, accHi, ref accHi, ref accLo);
        BigMulAcc2(accLo, accHi, ref accHi, ref accLo);
        BigMulAcc2(accLo, accHi, ref accHi, ref accLo);
        BigMulAcc2(accLo, accHi, ref accHi, ref accLo);
        return accLo + accHi;
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static unsafe void BigMulAcc1(ulong a, ulong b, ref ulong accHi, ref ulong accLo)
    {
        ulong lo;
        ulong hi = Bmi2.X64.MultiplyNoFlags(a, b, &lo);
        accHi += hi;
        accLo += lo;
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static void BigMulAcc2(ulong a, ulong b, ref ulong accHi, ref ulong accLo)
    {
        ulong hi = Bmi2.X64.MultiplyNoFlags(a, b);
        ulong lo = a * b;
        accHi += hi;
        accLo += lo;
    }
}

This benchmark looks artificial, but it's in fact fairly close to actual code used when implementing high-speed elliptic-curve-based cryptography.

Results on my machine:

| Method     | TestA             | TestB                | Mean     | Error     | StdDev    | Code Size |
|----------- |------------------ |--------------------- |---------:|----------:|----------:|----------:|
| BenchMulx1 | 81985529216486895 | 16045690984833335023 | 2.521 ns | 0.0075 ns | 0.0070 ns |     129 B |
| BenchMulx2 | 81985529216486895 | 16045690984833335023 | 1.349 ns | 0.0026 ns | 0.0024 ns |      84 B |

So throwing away the low half of the mulx result and recalculating it with a second multiplication is currently almost twice as fast.

The reason why BenchMulx1 is much slower is that the compiler always forces the low half to be stored in memory. If there are multiple calls to MultiplyNoFlags within a method, each will use a different stack location to store its low result, which causes excessive stack usage. For example, with 25 inlined calls to MultiplyNoFlags, the compiler will allocate 200 bytes just for the temporaries!
