Bmi2.MultiplyNoFlags issues #11782
Comments
Note that temp register use depends on the order of parameters (with respect to the implied EDX/RDX parameter), so reversing The hard part is getting |
I will fix the first problem tomorrow.
As far as I know, RyuJIT only builds SSA on normal local vars and does not have a general memory analysis mechanism (e.g., Memory SSA), so eliminating this kind of redundant memory store/load looks like a big job. But I may be wrong. Originally, I suggested exposing this intrinsic as ValueTuple<ulong, ulong> MultiplyNoFlags(ulong left, ulong right); This API could be optimized via struct promotion and would not need memory access in most cases. However, I was noticed that |
It has less to do with SSA and more to do with the ability to represent this in the IR.
Unlikely, you'll end up with dependent struct promotion and thus in memory. Multireg return calls (including For example:
FF153C7B1305 call [C:GetLong():long]
8945F0 mov dword ptr [ebp-10H], eax
8955F4 mov dword ptr [ebp-0CH], edx
8B75F0 mov esi, dword ptr [ebp-10H]
8B7DF4 mov edi, dword ptr [ebp-0CH]
FF153C7B1305 call [C:GetLong():long]
8945E8 mov dword ptr [ebp-18H], eax
8955EC mov dword ptr [ebp-14H], edx
8BC6 mov eax, esi
0345E8 add eax, dword ptr [ebp-18H]
8BD7 mov edx, edi
1355EC adc edx, dword ptr [ebp-14H]
which should be something like
call [C:GetLong():long]
mov esi, eax
mov edi, edx
call [C:GetLong():long]
add eax, esi
adc edx, edi
Not to say that this cannot be fixed. But it's not that simple. And yes, returning a pair of values might make more sense than using an "out" parameter. |
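For readers less familiar with the add/adc pair in the listings above, here is a minimal C# sketch of what those two instructions compute when a 64-bit value is split across two 32-bit registers (eax holding the low half, edx the high half). The class and method names are hypothetical and exist only for illustration; they are not part of the JIT or of any API discussed in this thread.

```csharp
using System;

public static class AddWithCarrySketch
{
    // Adds two 64-bit values that are each held as a (low, high) pair of
    // 32-bit halves, mirroring "add eax, ...; adc edx, ..." on 32-bit x86.
    public static (uint Low, uint High) Add64(uint aLo, uint aHi, uint bLo, uint bHi)
    {
        // "add": low halves, wrapping on overflow.
        uint lo = unchecked(aLo + bLo);

        // The carry flag is set exactly when the low addition wrapped around.
        uint carry = lo < aLo ? 1u : 0u;

        // "adc": high halves plus the carry out of the low addition.
        uint hi = unchecked(aHi + bHi + carry);

        return (lo, hi);
    }
}
```

The point of the comparison in the comment above is that both halves of each call result could stay in registers for this computation; the stores to and loads from the stack frame are not needed by the arithmetic itself.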
The public API is probably not holding us back - it could just forward to a private ValueTuple version if it helped or this transformation could be achieved in JIT import. Importation could ensure that the result is always stored to a (temp) local variable - this might simplify/improve the codegen somewhat. But fundamentally the problem is modeling a two-value IR node (applies to all three variants). |
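A rough sketch of the forwarding idea mentioned above, under several assumptions: the private method name MultiplyNoFlagsTuple and the (Low, High) field order are hypothetical, Math.BigMul stands in for the real MULX intrinsic so that the code runs on any hardware, and an out parameter replaces the pointer parameter of the actual API so that the sketch compiles without unsafe code.

```csharp
using System;

public static class MulxForwardingSketch
{
    // Hypothetical private form that returns both halves as a value tuple.
    // Math.BigMul is only a portable stand-in for the intrinsic here.
    private static (ulong Low, ulong High) MultiplyNoFlagsTuple(ulong left, ulong right)
    {
        ulong high = Math.BigMul(left, right, out ulong low);
        return (low, high);
    }

    // Public shape similar to the existing API: returns the high half and
    // stores the low half through a by-reference parameter.
    // Assumption: "out ulong low" replaces the real API's "ulong* low".
    public static ulong MultiplyNoFlags(ulong left, ulong right, out ulong low)
    {
        (ulong lo, ulong hi) = MultiplyNoFlagsTuple(left, right);
        low = lo;
        return hi;
    }
}
```

Whether such a wrapper actually avoids the memory round trip depends on how the JIT handles the tuple, which is the struct promotion question raised earlier in the thread.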
@mikedn Ah, thanks for the info. Are there issues to track this problem? |
What if the importer transformed local variable use cases from
into
and all other cases into
where |
None that I know of.
There's no such thing as "tied to the second return value" of a node in IR. That would imply that a node has multiple uses (it cannot) and that a use can specify which value is used (it cannot). Something like this might work in lowering, where the "magic" node would act like second return value. But by the time you reach lowering the damage is already done. |
Flags from |
Yes, that's what I was thinking of when I said that something like this might work in lowering. Though it's a bit different as LSRA does not model the flag register but of course, it does model integer registers. Besides, the flag stuff is a bit on the hackish side, it gets the job done but it's not quite the right thing to do.
Yes, but in the case of the current API it's kind of difficult to deal with the address taken variable. A version that returns a struct, hmm, I think that's really the best option and the only one that doesn't look too hackish. But as I told @fiigii before, there's a problem with struct promotion. I should check what's up with that, perhaps we can make it so that certain cases of dependent struct promotion don't force the struct in memory. As I already mentioned, that would help other cases as well. |
I think, if we did this, it would either need to be an internal method or would need to use a strongly named type (like It might also be nice to still find a solution for the address taken scenario; since we have some existing methods (like |
The address taken scenario can always be transformed into a If I'm not mistaken, then after struct promotion The current API shape might actually be a blessing because it allows very easy detection of the case where the out parameter points to a local. |
Yes, introducing a temp could help but there are certain problems that will have to be solved for this to actually work. The idea would be to introduce this temp during import such that
This prevents the original local variable from becoming address taken. Instead, Then lowering could probably transform this into something like:
So far so good. The problem is dealing with register allocation for such a thing. Couple of observations for now:
So I don't know, it might be doable. Some additional investigation and perhaps experimentation will be needed. On top of that, I'm not sure if the JIT team will be happy with that magic node. |
But "easy detection of the case where the out parameter points to a local" isn't that much of a problem. There are already places where the JIT detects such cases and eliminates them, provided that it has something to replace the tree with. For example, But in the case of MULX the problem is that you don't have something reasonable to transform to. |
Sort of. The problem with relying on struct promotion is that you have to do something like:
It's pretty natural. The problem with this is that if you promote then you shouldn't use the entire struct variable anymore, you should only use its individual fields. And if you do access the entire struct variable then it cannot be enregistered and it ends up in memory, which is what we were trying to avoid in the first place. So the problem here is to investigate if it's possible to change this. This probably will end up in the register allocation territory as well. Registers must be allocated to the individual fields as usual but then we need to be able to store to both registers using the same Again, it might be doable. But it needs investigation. |
And I suspect that either way need |
There are actually multiple issues: #5024, #6316, #6318, #9839, #5112 and #8571 (the last two are probably the most relevant to this specific case). While I continue to make a little progress on the struct issues, it's a process of many tiny steps. If we go the route of introducing a temp during import, I would think a better way to transform it would be, earlier in the JIT (morph? optimizer?) to transform it to an appropriate multi-reg op, as @mikedn hints at above. The "original" form would continue to be a memory store, but the alternate form would write two registers as an internal two-field struct. It might take some work to eliminate the copies that may be induced (especially the memory copies), but I think that's the only reasonable path to take.
I'd really like to avoid such a thing! |
Thought so 😄. Even the CMP/JCC case is a bit ugly in this regard, but at least that one doesn't interfere with register allocation.
Thanks for the list of issues, I'll take a look. Anyway I'm looking at struct issues these days due to my attempt at getting rid of
Local address taken stuff, probably this should be done in |
@pentp The first problem has been solved by dotnet/coreclr#21928, could you please update this issue? |
#34822 added support for intrinsics that define two registers, and once #36862 is merged we'll be able to store that multi-reg result to an enregistered lclVar. I have a WIP branch, https://github.com/CarolEidt/runtime/tree/Mulx2, that builds on #36862 and adds a mulx intrinsic that returns a ValueTuple, and can enregister both fields of that result. |
Should I create an API proposal for the new ValueTuple returning MULX to expedite the process? |
I was planning to leave that decision to the folks with API design expertise. The other alternative would be not to expose it, and leave it as an internal method that the other method calls. One could theoretically make the same kind of transformation in the JIT, but it doesn't currently have the capability to create something like a @tannergooding - perhaps this would be a good general design discussion to have, as I believe there are other intrinsics that define multiple registers. |
This was brought up minimally for an unrelated API last week and the general consensus was that, especially with the HWIntrinsics, returning value tuples would be fine. If one was put up, it would be nice to cover the "core" APIs, that is not just The conflict given we have a variant of |
Yes, naming might best be left for API designers. I'm just thinking of options. |
Has an API review been opened for the ValueTuple return for BigMul? Or will this be resolved by MultiplyNoFlags returning a ValueTuple, so that the BigMul API can remain the same with the JIT resolving it at inline time, since it's address taken in C# rather than address taken for the intrinsic? /cc @lemire |
Ah looks like it will be resolved by #37928 and then changing the internal C# implementation without need for another api? #42156 (comment) |
It's 2024 and dotnet 8.0 still suffers from this issue. Benchmark sample:
public class MulxTests
{
    [Params(0x0123456789abcdef)]
    public ulong TestA { get; set; }
    [Params(0xdeadbeefdeadbeef)]
    public ulong TestB { get; set; }
    [Benchmark]
    public ulong BenchMulx1()
    {
        ulong accLo = TestA;
        ulong accHi = TestB;
        BigMulAcc1(accLo, accHi, ref accHi, ref accLo);
        BigMulAcc1(accLo, accHi, ref accHi, ref accLo);
        BigMulAcc1(accLo, accHi, ref accHi, ref accLo);
        BigMulAcc1(accLo, accHi, ref accHi, ref accLo);
        return accLo + accHi;
    }
    [Benchmark]
    public ulong BenchMulx2()
    {
        ulong accLo = TestA;
        ulong accHi = TestB;
        BigMulAcc2(accLo, accHi, ref accHi, ref accLo);
        BigMulAcc2(accLo, accHi, ref accHi, ref accLo);
        BigMulAcc2(accLo, accHi, ref accHi, ref accLo);
        BigMulAcc2(accLo, accHi, ref accHi, ref accLo);
        return accLo + accHi;
    }
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static unsafe void BigMulAcc1(ulong a, ulong b, ref ulong accHi, ref ulong accLo)
    {
        ulong lo;
        ulong hi = Bmi2.X64.MultiplyNoFlags(a, b, &lo);
        accHi += hi;
        accLo += lo;
    }
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static void BigMulAcc2(ulong a, ulong b, ref ulong accHi, ref ulong accLo)
    {
        ulong hi = Bmi2.X64.MultiplyNoFlags(a, b);
        ulong lo = a * b;
        accHi += hi;
        accLo += lo;
    }
}
This benchmark looks artificial, but it's in fact fairly close to actual code used when implementing high-speed elliptic-curve based cryptography. Results on my machine:
So throwing away the low half of the result of mulx and recalculating it with a second multiplication is currently almost twice as fast. The reason why |
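The workaround in the comment above relies on the low 64 bits of the full product being equal to an ordinary wrapping 64-bit multiplication. A minimal sketch of that equivalence follows, with the assumption that Math.BigMul is used as a portable stand-in for Bmi2.X64.MultiplyNoFlags so that it runs on hardware without BMI2; the class and method names are hypothetical.

```csharp
using System;

public static class MulxEquivalenceSketch
{
    // Takes both halves from a single widening multiply
    // (the shape of BigMulAcc1 in the benchmark above).
    public static (ulong Hi, ulong Lo) FullProduct(ulong a, ulong b)
    {
        ulong hi = Math.BigMul(a, b, out ulong lo);
        return (hi, lo);
    }

    // Keeps only the high half of the widening multiply and recomputes the
    // low half with a second, wrapping multiply (the shape of BigMulAcc2).
    public static (ulong Hi, ulong Lo) HighThenRemultiply(ulong a, ulong b)
    {
        ulong hi = Math.BigMul(a, b, out _);
        ulong lo = unchecked(a * b);
        return (hi, lo);
    }
}
```

Both methods return the same pair for any inputs, so the second form trades one extra multiplication for not having to pass the low half through memory.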
I have a working version of decimal.DecCalc which uses MultiplyNoFlags from dotnet/coreclr#21480 in a branch, but I discovered two issues.
If MultiplyNoFlags is called without having its result used, then it's assumed to be a no-op, even if the low part is used. While such use would be sub-optimal, it should still be valid (fixed in dotnet/coreclr#21928).
The second problem is that performance is increased only up to 3% for some methods, while others suffer a performance penalty up to 20%! This is primarily caused by forcing the low result to be written to memory and excessive temporary register use, compounded by forced zero-init of the locals (even with no .locals init) which affects all code paths of the function.
While ideally this should be just:
category:cq
theme:vector-codegen
skill-level:expert
cost:medium
impact:small