-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use BigMul for 32x32=64 in decimal #93345
Conversation
Tagging subscribers to this area: @dotnet/area-system-numerics Issue DetailsNo need to have a separate implementation in DecCalc
|
@@ -184,20 +184,7 @@ private static ulong UInt32x32To64(uint a, uint b) | |||
|
|||
private static void UInt64x64To128(ulong a, ulong b, ref DecCalc result) | |||
{ | |||
ulong low = UInt32x32To64((uint)a, (uint)b); // lo partial prod |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you delete UInt32x32To64
as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could yes, but:
- Unfortunately does not exist any BigMul method for two uint32.
- It is used in a lot of places where narrowing/casting from ulongs to uints are done, and the code is pretty convoluted already. This shows intent quite nicely
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you measure the performance of this change? Last time I checked, a similar change caused a significant performance regression because the codegen for Bmi2.MultiplyNoFlags
is suboptimal.
@@ -184,20 +184,7 @@ private static ulong UInt32x32To64(uint a, uint b) | |||
|
|||
private static void UInt64x64To128(ulong a, ulong b, ref DecCalc result) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be replaced, technically speaking, by Math.BigMul(ulong, ulong, out ulong)
as well.
Then it can also be deleted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah sure, NB that overflow check and insertion to DecCalc result are still in body below. Do we prefer to move those around to where this is now called or should i keep this method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably better to keep the extra checks centralized here and just defer the algorithm to Math.BigMul
@dotnet-policy-service agree |
@@ -381,7 +376,7 @@ private static uint Div96By64(ref Buf12 bufNum, ulong den) | |||
|
|||
// Compute full remainder, rem = dividend - (quo * divisor). | |||
// | |||
ulong prod = UInt32x32To64(quo, (uint)den); // quo * lo divisor | |||
ulong prod = quo * (den & uint.MaxValue); // quo * lo divisor |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This probably needs a comment on what it's doing.
But, notably, this may regress 32-bit platforms as it will now do a more expensive 64x64=64
multiplication, rather than doing the cheaper 32x32=64
.
In general an internal Math.BigMul(uint a, uint b, out uint low)
could be defined that uses Bmi2.MultiplyNoFlags
, ArmBase.MultiplyHigh
, and otherwise falls back to the naive algorithm of (ulong)a * b
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved all the stuff into an internal ulong Math.BigMul(uint, uint)
(since we had a long Math.BigMul(int, int)
). I tried using initrinsic for x86 on 32 bit. Doesn't seem to exist a ArmBase.MultiplyHigh
yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks!
Closing and reopening to requeue CI and try to get the tests all passing. They look unrelated. |
if (Bmi2.IsSupported) | ||
{ | ||
uint low; | ||
uint high = Bmi2.MultiplyNoFlags(a, b, &low); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you measure the performance for this? It's very likely to be significantly worse than the previous codegen that the JIT automatically generated for uint*uint->ulong multiplication on x86.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is indeed a lot worse, which is strange since it shouldn't be. mulx
is the preferred instruction to do this on x86, at least for cases like this.
Guessing the JIT has various work that needs to be done to ensure it does the right thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry no i didn't know how to do benchmarks. Probably we should just remove the intrinsic block for x86 in BigMul methods if they are slow? Same for 64 x 64 = 128
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably we should just remove the intrinsic block for x86 in BigMul methods if they are slow
Yes, preferably with an issue around it.
Same for 64 x 64 = 128
No. There isn't a primitive type and no corresponding decomp or other work that makes the naive thing more efficient.
There's still more improvements that could be done around Bmi2.X64.MultiplyNoFlags
, but nothing that makes it less efficient than the FOIL based fallback.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And a 3rd issue that is very closely related to this DecCalc
code: #58263
No need to have a separate implementation in DecCalc