Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify/optimize Int128 comparison operators #101848

Merged
merged 4 commits into from
May 9, 2024

Conversation

lilinus
Copy link
Contributor

@lilinus lilinus commented May 3, 2024

We can remove a some branches in comparison operators for Int128 by using signed comparison for upper 64-bit values.
Maybe this approach was already considered but not chosen, in that case I will close the PR.

Also added some test cases for some better coverage on said operators.

It was hard to benchmark any performance gain because (getting ZeroMeasurement) so I had to run the operators inside a loop.
Link to benchmark source code here.
Below are benchmarks for > operator (other operators gives similar results):

Method Job Toolchain val1 val2 Mean Error StdDev Median Min Max Ratio Allocated Alloc Ratio
op_GreaterThan Job-QLZCZD main -170141183460(...)3715884105728 [40] -1 28.16 ns 0.092 ns 0.086 ns 28.15 ns 28.04 ns 28.32 ns 1.00 - NA
op_GreaterThan Job-PVMDKE pr -170141183460(...)3715884105728 [40] -1 27.86 ns 0.046 ns 0.043 ns 27.87 ns 27.81 ns 27.95 ns 0.99 - NA
op_GreaterThan Job-QLZCZD main -170141183460(...)3715884105728 [40] 1 27.89 ns 0.061 ns 0.051 ns 27.91 ns 27.83 ns 27.97 ns 1.00 - NA
op_GreaterThan Job-PVMDKE pr -170141183460(...)3715884105728 [40] 1 28.02 ns 0.072 ns 0.067 ns 28.02 ns 27.92 ns 28.13 ns 1.00 - NA
op_GreaterThan Job-QLZCZD main -1 -1 28.90 ns 0.088 ns 0.073 ns 28.94 ns 28.71 ns 28.97 ns 1.00 - NA
op_GreaterThan Job-PVMDKE pr -1 -1 27.85 ns 0.047 ns 0.041 ns 27.86 ns 27.80 ns 27.93 ns 0.96 - NA
op_GreaterThan Job-QLZCZD main -1 1 27.86 ns 0.064 ns 0.054 ns 27.85 ns 27.73 ns 27.93 ns 1.00 - NA
op_GreaterThan Job-PVMDKE pr -1 1 27.86 ns 0.068 ns 0.064 ns 27.84 ns 27.75 ns 28.00 ns 1.00 - NA
op_GreaterThan Job-QLZCZD main 0 -1 27.85 ns 0.054 ns 0.051 ns 27.87 ns 27.79 ns 27.92 ns 1.00 - NA
op_GreaterThan Job-PVMDKE pr 0 -1 28.01 ns 0.096 ns 0.080 ns 28.00 ns 27.86 ns 28.11 ns 1.01 - NA
op_GreaterThan Job-QLZCZD main 0 1 28.02 ns 0.083 ns 0.078 ns 27.98 ns 27.88 ns 28.14 ns 1.00 - NA
op_GreaterThan Job-PVMDKE pr 0 1 27.84 ns 0.055 ns 0.049 ns 27.84 ns 27.79 ns 27.94 ns 0.99 - NA
op_GreaterThan Job-QLZCZD main 1 -1 27.81 ns 0.087 ns 0.081 ns 27.82 ns 27.68 ns 27.98 ns 1.00 - NA
op_GreaterThan Job-PVMDKE pr 1 -1 28.03 ns 0.067 ns 0.063 ns 28.04 ns 27.88 ns 28.12 ns 1.01 - NA
op_GreaterThan Job-QLZCZD main 1 1 27.83 ns 0.087 ns 0.078 ns 27.84 ns 27.64 ns 27.97 ns 1.00 - NA
op_GreaterThan Job-PVMDKE pr 1 1 28.00 ns 0.104 ns 0.086 ns 28.02 ns 27.82 ns 28.10 ns 1.01 - NA
op_GreaterThan Job-QLZCZD main 1701411834604(...)3715884105727 [39] -1 28.92 ns 0.066 ns 0.059 ns 28.92 ns 28.76 ns 28.99 ns 1.00 - NA
op_GreaterThan Job-PVMDKE pr 1701411834604(...)3715884105727 [39] -1 27.58 ns 0.054 ns 0.051 ns 27.59 ns 27.50 ns 27.66 ns 0.95 - NA
op_GreaterThan Job-QLZCZD main 1701411834604(...)3715884105727 [39] 1 27.79 ns 0.082 ns 0.077 ns 27.81 ns 27.66 ns 27.92 ns 1.00 - NA
op_GreaterThan Job-PVMDKE pr 1701411834604(...)3715884105727 [39] 1 27.64 ns 0.061 ns 0.054 ns 27.63 ns 27.54 ns 27.74 ns 0.99 - NA

In some cases it seems we get ~5% perf improvements.
Code (and codegen) is more compact. sharplab.

@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label May 3, 2024
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label May 3, 2024
@tannergooding
Copy link
Member

Your benchmark code is invalid and not testing what you think it's testing.

Since the result of val1 > val2 is unused, the JIT is able to optimize it out resulting in you testing an empty loop: https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABAJgEYBYAKGIGYACMhgYQYG8aHuGAHKAJYA3bBhgNIAO1wYGAybIFioogRGkMAvA3IAGXQG4uPY93pMUDCLwD6AcViiYUACoALbJIAUASQXlSAA4GEQAbcjQGPwwA4LDSAEpTDmSeBgAzaAYveUUtBkM5BgAeOWVVdVwDOQBqGqTqNKaeG3yw8gYAPhDsUNIjRp4AXxohoA==

You can somewhat fix that by doing something like this, where you combine all the comparisons together: https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABAJgEYBYAKGIGYACMhgYQYG8aHuGAHKAJYA3bBhgNIAO1wYGAybIFioogRGkMAvA3IAGXQG4uPY93oNgECABsGEXgH0A4rFEwoAFQAW2SQAoASQVyUgAOBhFrcjQGIIwQ8MjSAEpTDjSeCytbWFwAV2tZbQAzbGtcGCNqTMyMnmLoBj95RS0GQzkGAB45ZVV1XAM5AGph1OqaydyC2UJtSPIGAD4IstIqye46swB2BmnCjZ4AXxpjoA==

However, that itself is also a "point in time" consideration as some runtimes or future versions of RyuJIT may introspect that val1 > val2 is invariant and therefore it could be rewritten to simply return val1 > val2;

Copy link
Member

@tannergooding tannergooding left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code is purely simpler and is still correct

That being said, it's worth noting that this simplification doesn't buy much and the real optimization requires some JIT work so we can simply emit a cmp; sbb; setcc sequence instad.

{
return IsNegative(left);
}
return ((long)left._upper < (long)right._upper)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A comment (in the code) explaining why this is valid would be beneficial.

Actually brute forcing this for some struct S16 { byte _lower; byte _upper; } is pretty trivial to do, but so is describing why this works out given the two's complement representation. -- Notably this comes down to the fact that all negative values have the sign bit set, so if lhs.upper < rhs.upper in signed form, then either lhs/rhs have different signs, and the single comparison was sufficient or they have the same sign and the same was true.


There may be other optimizations available here as well... In particular, comparisons x cmp y in general are done as if x - y and then setting CPU flags, without actually producing the result. That general premise is why the "most optimal" pattern is actually cmp lower; sbb upper; setcc

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried my best documenting it with comments.

@lilinus
Copy link
Contributor Author

lilinus commented May 5, 2024

Code is purely simpler and is still correct

That being said, it's worth noting that this simplification doesn't buy much and the real optimization requires some JIT work so we can simply emit a cmp; sbb; setcc sequence instad.

Thanks for feedback.

And indeed, that approach would be the most optimal. But I had a hunch it sadly is not really possible in C# at the moment.

@tannergooding tannergooding merged commit 38467bb into dotnet:main May 9, 2024
143 of 146 checks passed
@lilinus lilinus deleted the optimize-int128-comparisons branch May 13, 2024 13:02
Ruihan-Yin pushed a commit to Ruihan-Yin/runtime that referenced this pull request May 30, 2024
* Optimize In128 comparison operators

* Document how Int128 comparison operators work
@github-actions github-actions bot locked and limited conversation to collaborators Jun 14, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
community-contribution Indicates that the PR has been added by a community member needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants