Simplify/optimize Int128 comparison operators #101848

lilinus · 2024-05-03T14:38:25Z

We can remove a some branches in comparison operators for Int128 by using signed comparison for upper 64-bit values.
Maybe this approach was already considered but not chosen, in that case I will close the PR.

Also added some test cases for some better coverage on said operators.

It was hard to benchmark any performance gain because (getting ZeroMeasurement) so I had to run the operators inside a loop.
Link to benchmark source code here.
Below are benchmarks for > operator (other operators gives similar results):

Method	Job	Toolchain	val1	val2	Mean	Error	StdDev	Median	Min	Max	Ratio	Allocated	Alloc Ratio
op_GreaterThan	Job-QLZCZD	main	-170141183460(...)3715884105728 [40]	-1	28.16 ns	0.092 ns	0.086 ns	28.15 ns	28.04 ns	28.32 ns	1.00	-	NA
op_GreaterThan	Job-PVMDKE	pr	-170141183460(...)3715884105728 [40]	-1	27.86 ns	0.046 ns	0.043 ns	27.87 ns	27.81 ns	27.95 ns	0.99	-	NA
op_GreaterThan	Job-QLZCZD	main	-170141183460(...)3715884105728 [40]	1	27.89 ns	0.061 ns	0.051 ns	27.91 ns	27.83 ns	27.97 ns	1.00	-	NA
op_GreaterThan	Job-PVMDKE	pr	-170141183460(...)3715884105728 [40]	1	28.02 ns	0.072 ns	0.067 ns	28.02 ns	27.92 ns	28.13 ns	1.00	-	NA
op_GreaterThan	Job-QLZCZD	main	-1	-1	28.90 ns	0.088 ns	0.073 ns	28.94 ns	28.71 ns	28.97 ns	1.00	-	NA
op_GreaterThan	Job-PVMDKE	pr	-1	-1	27.85 ns	0.047 ns	0.041 ns	27.86 ns	27.80 ns	27.93 ns	0.96	-	NA
op_GreaterThan	Job-QLZCZD	main	-1	1	27.86 ns	0.064 ns	0.054 ns	27.85 ns	27.73 ns	27.93 ns	1.00	-	NA
op_GreaterThan	Job-PVMDKE	pr	-1	1	27.86 ns	0.068 ns	0.064 ns	27.84 ns	27.75 ns	28.00 ns	1.00	-	NA
op_GreaterThan	Job-QLZCZD	main	0	-1	27.85 ns	0.054 ns	0.051 ns	27.87 ns	27.79 ns	27.92 ns	1.00	-	NA
op_GreaterThan	Job-PVMDKE	pr	0	-1	28.01 ns	0.096 ns	0.080 ns	28.00 ns	27.86 ns	28.11 ns	1.01	-	NA
op_GreaterThan	Job-QLZCZD	main	0	1	28.02 ns	0.083 ns	0.078 ns	27.98 ns	27.88 ns	28.14 ns	1.00	-	NA
op_GreaterThan	Job-PVMDKE	pr	0	1	27.84 ns	0.055 ns	0.049 ns	27.84 ns	27.79 ns	27.94 ns	0.99	-	NA
op_GreaterThan	Job-QLZCZD	main	1	-1	27.81 ns	0.087 ns	0.081 ns	27.82 ns	27.68 ns	27.98 ns	1.00	-	NA
op_GreaterThan	Job-PVMDKE	pr	1	-1	28.03 ns	0.067 ns	0.063 ns	28.04 ns	27.88 ns	28.12 ns	1.01	-	NA
op_GreaterThan	Job-QLZCZD	main	1	1	27.83 ns	0.087 ns	0.078 ns	27.84 ns	27.64 ns	27.97 ns	1.00	-	NA
op_GreaterThan	Job-PVMDKE	pr	1	1	28.00 ns	0.104 ns	0.086 ns	28.02 ns	27.82 ns	28.10 ns	1.01	-	NA
op_GreaterThan	Job-QLZCZD	main	1701411834604(...)3715884105727 [39]	-1	28.92 ns	0.066 ns	0.059 ns	28.92 ns	28.76 ns	28.99 ns	1.00	-	NA
op_GreaterThan	Job-PVMDKE	pr	1701411834604(...)3715884105727 [39]	-1	27.58 ns	0.054 ns	0.051 ns	27.59 ns	27.50 ns	27.66 ns	0.95	-	NA
op_GreaterThan	Job-QLZCZD	main	1701411834604(...)3715884105727 [39]	1	27.79 ns	0.082 ns	0.077 ns	27.81 ns	27.66 ns	27.92 ns	1.00	-	NA
op_GreaterThan	Job-PVMDKE	pr	1701411834604(...)3715884105727 [39]	1	27.64 ns	0.061 ns	0.054 ns	27.63 ns	27.54 ns	27.74 ns	0.99	-	NA

In some cases it seems we get ~5% perf improvements.
Code (and codegen) is more compact. sharplab.

tannergooding · 2024-05-03T15:08:15Z

Your benchmark code is invalid and not testing what you think it's testing.

Since the result of val1 > val2 is unused, the JIT is able to optimize it out resulting in you testing an empty loop: https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABAJgEYBYAKGIGYACMhgYQYG8aHuGAHKAJYA3bBhgNIAO1wYGAybIFioogRGkMAvA3IAGXQG4uPY93pMUDCLwD6AcViiYUACoALbJIAUASQXlSAA4GEQAbcjQGPwwA4LDSAEpTDmSeBgAzaAYveUUtBkM5BgAeOWVVdVwDOQBqGqTqNKaeG3yw8gYAPhDsUNIjRp4AXxohoA==

You can somewhat fix that by doing something like this, where you combine all the comparisons together: https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABAJgEYBYAKGIGYACMhgYQYG8aHuGAHKAJYA3bBhgNIAO1wYGAybIFioogRGkMAvA3IAGXQG4uPY93oNgECABsGEXgH0A4rFEwoAFQAW2SQAoASQVyUgAOBhFrcjQGIIwQ8MjSAEpTDjSeCytbWFwAV2tZbQAzbGtcGCNqTMyMnmLoBj95RS0GQzkGAB45ZVV1XAM5AGph1OqaydyC2UJtSPIGAD4IstIqye46swB2BmnCjZ4AXxpjoA==

However, that itself is also a "point in time" consideration as some runtimes or future versions of RyuJIT may introspect that val1 > val2 is invariant and therefore it could be rewritten to simply return val1 > val2;

tannergooding

Code is purely simpler and is still correct

That being said, it's worth noting that this simplification doesn't buy much and the real optimization requires some JIT work so we can simply emit a cmp; sbb; setcc sequence instad.

tannergooding · 2024-05-03T15:59:33Z

src/libraries/System.Private.CoreLib/src/System/Int128.cs

-            {
-                return IsNegative(left);
-            }
+            return ((long)left._upper < (long)right._upper)


A comment (in the code) explaining why this is valid would be beneficial.

Actually brute forcing this for some struct S16 { byte _lower; byte _upper; } is pretty trivial to do, but so is describing why this works out given the two's complement representation. -- Notably this comes down to the fact that all negative values have the sign bit set, so if lhs.upper < rhs.upper in signed form, then either lhs/rhs have different signs, and the single comparison was sufficient or they have the same sign and the same was true.

There may be other optimizations available here as well... In particular, comparisons x cmp y in general are done as if x - y and then setting CPU flags, without actually producing the result. That general premise is why the "most optimal" pattern is actually cmp lower; sbb upper; setcc

I tried my best documenting it with comments.

lilinus · 2024-05-05T10:22:33Z

Code is purely simpler and is still correct

That being said, it's worth noting that this simplification doesn't buy much and the real optimization requires some JIT work so we can simply emit a cmp; sbb; setcc sequence instad.

Thanks for feedback.

And indeed, that approach would be the most optimal. But I had a hunch it sadly is not really possible in C# at the moment.

* Optimize In128 comparison operators * Document how Int128 comparison operators work

Optimize In128 comparison operators

70e457c

dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label May 3, 2024

dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label May 3, 2024

Merge branch 'main' into optimize-int128-comparisons

848c12b

tannergooding approved these changes May 3, 2024

View reviewed changes

tannergooding reviewed May 3, 2024

View reviewed changes

Document how Int128 comparison operators work

75a3579

Merge branch 'main' into optimize-int128-comparisons

0630152

build-analysis bot mentioned this pull request May 6, 2024

Test failure in System.Numerics.Tensors.Tests.SingleGenericTensorPrimitives.SpanDestinationFunctions_SpecialValues #101731

Closed

tannergooding merged commit 38467bb into dotnet:main May 9, 2024
143 of 146 checks passed

lilinus deleted the optimize-int128-comparisons branch May 13, 2024 13:02

Ruihan-Yin pushed a commit to Ruihan-Yin/runtime that referenced this pull request May 30, 2024

Simplify/optimize Int128 comparison operators (dotnet#101848)

ea99b50

* Optimize In128 comparison operators * Document how Int128 comparison operators work

github-actions bot locked and limited conversation to collaborators Jun 14, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Simplify/optimize Int128 comparison operators #101848

Simplify/optimize Int128 comparison operators #101848

lilinus commented May 3, 2024

tannergooding commented May 3, 2024

tannergooding left a comment

tannergooding May 3, 2024

lilinus May 5, 2024

lilinus commented May 5, 2024

Simplify/optimize Int128 comparison operators #101848

Simplify/optimize Int128 comparison operators #101848

Conversation

lilinus commented May 3, 2024

tannergooding commented May 3, 2024

tannergooding left a comment

Choose a reason for hiding this comment

tannergooding May 3, 2024

Choose a reason for hiding this comment

lilinus May 5, 2024

Choose a reason for hiding this comment

lilinus commented May 5, 2024