Refine Bin/Hex parsing of BigInteger #95543

Merged
tannergooding merged 25 commits into dotnet:main from biginteger-hex-vectorize
Feb 3, 2024

Conversation

huoyaoyuan
Member

Instead of counting by digits, the new algorithm parses in uint blocks. It also uses vectorized hex conversion for large numbers.

This introduces a new reference from S.R.Numerics to S.R.Intrinsics. I think that's expected if we start using more SIMD operations in BigInteger.

Performance was measured across different sizes and corner cases to ensure there is no regression for small values:

| Method | Job | Toolchain | input | Mean | Error | StdDev | Ratio |
|---|---|---|---|---|---|---|---|
| ParseHex | Job-FGNQZB | \1-main\corerun.exe | 123 | 15.59 ns | 0.133 ns | 0.118 ns | 1.00 |
| ParseHex | Job-AZXDDT | \5-vector-generic\corerun.exe | 123 | 10.67 ns | 0.048 ns | 0.045 ns | 0.68 |
| ParseHex | Job-FGNQZB | \1-main\corerun.exe | 123456789 | 19.80 ns | 0.158 ns | 0.148 ns | 1.00 |
| ParseHex | Job-AZXDDT | \5-vector-generic\corerun.exe | 123456789 | 17.39 ns | 0.098 ns | 0.082 ns | 0.88 |
| ParseHex | Job-FGNQZB | \1-main\corerun.exe | 1234567890ABCDEF | 25.11 ns | 0.208 ns | 0.195 ns | 1.00 |
| ParseHex | Job-AZXDDT | \5-vector-generic\corerun.exe | 1234567890ABCDEF | 20.42 ns | 0.164 ns | 0.153 ns | 0.81 |
| ParseHex | Job-FGNQZB | \1-main\corerun.exe | 12345(...)45678 [24] | 31.16 ns | 0.146 ns | 0.122 ns | 1.00 |
| ParseHex | Job-AZXDDT | \5-vector-generic\corerun.exe | 12345(...)45678 [24] | 19.53 ns | 0.233 ns | 0.194 ns | 0.63 |
| ParseHex | Job-FGNQZB | \1-main\corerun.exe | 1234(...)CDEF [315] | 274.36 ns | 3.454 ns | 3.062 ns | 1.00 |
| ParseHex | Job-AZXDDT | \5-vector-generic\corerun.exe | 1234(...)CDEF [315] | 41.58 ns | 0.367 ns | 0.307 ns | 0.15 |
| ParseHex | Job-FGNQZB | \1-main\corerun.exe | 80000000 | 19.08 ns | 0.092 ns | 0.077 ns | 1.00 |
| ParseHex | Job-AZXDDT | \5-vector-generic\corerun.exe | 80000000 | 14.83 ns | 0.198 ns | 0.176 ns | 0.78 |
| ParseHex | Job-FGNQZB | \1-main\corerun.exe | FEDCBA9876543210 | 26.67 ns | 0.180 ns | 0.160 ns | 1.00 |
| ParseHex | Job-AZXDDT | \5-vector-generic\corerun.exe | FEDCBA9876543210 | 21.57 ns | 0.109 ns | 0.091 ns | 0.81 |
| ParseHex | Job-FGNQZB | \1-main\corerun.exe | FEDC(...)4321 [315] | 281.83 ns | 2.357 ns | 2.205 ns | 1.00 |
| ParseHex | Job-AZXDDT | \5-vector-generic\corerun.exe | FEDC(...)4321 [315] | 46.24 ns | 0.912 ns | 0.976 ns | 0.16 |
| ParseHex | Job-FGNQZB | \1-main\corerun.exe | FFFE00000 | 21.26 ns | 0.162 ns | 0.144 ns | 1.00 |
| ParseHex | Job-AZXDDT | \5-vector-generic\corerun.exe | FFFE00000 | 16.44 ns | 0.060 ns | 0.056 ns | 0.77 |

Please run the outer loop tests to get more coverage of parsing.
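As a rough illustration of what "parsing in uint blocks" can look like, here is a minimal scalar sketch. It is not the code from this PR: `HexBlockSketch` and `TryParseHexBlocks` are hypothetical names, the input is treated as an unsigned magnitude (BigInteger's sign handling for hex input is deliberately left out), and the blocks are assumed to be stored least-significant first.

```csharp
using System;
using System.Globalization;

static class HexBlockSketch
{
    // 8 hex digits fill exactly one 32-bit block.
    private const int DigitsPerBlock = 8;

    // Hypothetical helper (not the PR's implementation): parses an unsigned hex
    // string into uint blocks, least-significant block first.
    public static bool TryParseHexBlocks(ReadOnlySpan<char> input, out uint[] blocks)
    {
        blocks = Array.Empty<uint>();
        if (input.IsEmpty)
        {
            return false;
        }

        int wholeBlocks = Math.DivRem(input.Length, DigitsPerBlock, out int leadingDigits);
        uint[] result = new uint[wholeBlocks + (leadingDigits > 0 ? 1 : 0)];

        // The unaligned leading block (fewer than 8 digits) holds the most significant bits.
        if (leadingDigits > 0)
        {
            if (!uint.TryParse(input.Slice(0, leadingDigits), NumberStyles.AllowHexSpecifier,
                    CultureInfo.InvariantCulture, out uint leading))
            {
                return false;
            }

            result[result.Length - 1] = leading;
        }

        // Each remaining block is exactly 8 digits; text order is most significant first,
        // so block i in the text lands at index wholeBlocks - 1 - i.
        for (int i = 0; i < wholeBlocks; i++)
        {
            ReadOnlySpan<char> block = input.Slice(leadingDigits + i * DigitsPerBlock, DigitsPerBlock);
            if (!uint.TryParse(block, NumberStyles.AllowHexSpecifier,
                    CultureInfo.InvariantCulture, out uint value))
            {
                return false;
            }

            result[wholeBlocks - 1 - i] = value;
        }

        blocks = result;
        return true;
    }
}
```

For example, the 9-digit input `123456789` splits into a 1-digit leading block and one whole block, so the sketch produces two uint values.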

@ghost added the community-contribution label (indicates that the PR has been added by a community member) Dec 2, 2023
@ghost

ghost commented Dec 2, 2023

Tagging subscribers to this area: @dotnet/area-system-numerics
See info in area-owners.md if you want to be subscribed.

Issue Details

Author: huoyaoyuan
Assignees: -
Labels:

area-System.Numerics, community-contribution

Milestone: -

Comment on lines 1337 to 1338
static virtual bool TryParseSingleBlock(ReadOnlySpan<TChar> input, out uint result)
    => TParsingInfo.TryParseUnalignedBlock(input, out result);
Member Author

This leaves room for a vectorized Vector128<char> -> uint conversion. It may not be necessary if that optimization is done on the uint side.

Comment on lines 1374 to 1386
if (Convert.FromHexString(MemoryMarshal.Cast<TChar, char>(input), MemoryMarshal.AsBytes(destiniation), out _, out _) != OperationStatus.Done)
{
    return false;
}

if (BitConverter.IsLittleEndian)
{
    MemoryMarshal.AsBytes(destiniation).Reverse();
}
else
{
    destiniation.Reverse();
}
Member Author

This could be improved with a vectorized path that parses in reverse byte order. Performance for huge numbers should improve, but for the most interesting 96-256 bit cases I'd expect the performance comparison to be very complicated. So I'm choosing the simplest approach, which depends only on public API.

Convert.FromHexString has a slight overhead over HexConverter, but the latter is only vectorized in CoreLib.
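To make the byte-order reasoning above concrete, here is a small standalone sketch of the same idea. The class and method names are hypothetical, the generic `TChar` is replaced with plain `char`, and it assumes a runtime that includes the span-based `Convert.FromHexString` overload used in the snippet under review.

```csharp
using System;
using System.Buffers;
using System.Runtime.InteropServices;

static class HexReverseSketch
{
    // Hypothetical standalone version of the reviewed snippet.
    // Expects exactly 8 hex digits per destination block.
    public static bool TryParseWholeBlocks(ReadOnlySpan<char> input, Span<uint> destination)
    {
        if (input.Length != destination.Length * 8)
        {
            return false;
        }

        // Decode the text straight into the block storage, most significant byte first.
        Span<byte> bytes = MemoryMarshal.AsBytes(destination);
        if (Convert.FromHexString(input, bytes, out _, out _) != OperationStatus.Done)
        {
            return false;
        }

        if (BitConverter.IsLittleEndian)
        {
            // Reversing every byte fixes both the block order and the byte order within each block.
            bytes.Reverse();
        }
        else
        {
            // On big-endian machines each block already reads correctly; only the block order flips.
            destination.Reverse();
        }

        return true;
    }
}
```

With the input `0123456789ABCDEF` and a two-element destination, the least significant block `0x89ABCDEF` ends up at index 0 and `0x01234567` at index 1.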

@@ -1342,7 +1342,7 @@ static virtual bool TryParseWholeBlocks(ReadOnlySpan<TChar> input, Span<uint> de
  Debug.Assert(destiniation.Length * TParser.DigitsPerBlock == input.Length);
  ref TChar lastWholeBlockStart = ref Unsafe.Add(ref MemoryMarshal.GetReference(input), input.Length - TParser.DigitsPerBlock);

- for (int i = 0; i < destiniation.Length - 1; i++)
+ for (int i = 0; i < destiniation.Length; i++)
Member

not something to handle in this PR, but I noticed this is destiniation not destination 😆
(we can handle it separately after this goes in to avoid making it harder to review)

Member

oh, never mind, this is net new code and I was looking at the wrong diff view

It'd be great to fix the minor typo in this PR then.

Comment on lines 131 to 133
Vector512<uint> vector = Vector512.LoadUnsafe(ref start, (nuint)offset);
Vector512<uint> complement = Vector512.OnesComplement(vector);
Vector512.StoreUnsafe(complement, ref start, (nuint)offset);
Member

Suggested change
- Vector512<uint> vector = Vector512.LoadUnsafe(ref start, (nuint)offset);
- Vector512<uint> complement = Vector512.OnesComplement(vector);
- Vector512.StoreUnsafe(complement, ref start, (nuint)offset);
+ Vector512<uint> vector = ~Vector512.LoadUnsafe(ref start, (nuint)offset);
+ vector.StoreUnsafe(ref start, (nuint)offset);

offset += Vector256<uint>.Count;
}

while (Vector128.IsHardwareAccelerated && d.Length - offset >= Vector128<uint>.Count)
Member

The code here is correct, but it's also slightly pessimized: it's going to hit multiple mispredicted branches because of the loops, especially for small payloads.

We could easily get extra perf by refactoring it to be done a bit differently. That can always be done separately, however.

Member Author

Maybe TensorPrimitives can provide optimized code for this pattern?

Member Author

There is TensorPrimitives.OnesComplement. Do you think S.R.Numerics should start to take a dependency on TensorPrimitives?

Member

It has to wait at least until TensorPrimitives is in-box.
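For readers following this thread, the in-place ones-complement pattern under discussion can be sketched as below. This is a simplified illustration with hypothetical names, reduced to a single Vector128 loop plus a scalar tail. It keeps the loop-based shape that the review calls slightly pessimized for small payloads and does not attempt the refactoring mentioned above.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static class OnesComplementSketch
{
    // Hypothetical helper: flips every bit of every element in place.
    public static void OnesComplementInPlace(Span<uint> d)
    {
        ref uint start = ref MemoryMarshal.GetReference(d);
        int offset = 0;

        // Vector path: process Vector128<uint>.Count elements per iteration.
        while (Vector128.IsHardwareAccelerated && d.Length - offset >= Vector128<uint>.Count)
        {
            Vector128<uint> vector = ~Vector128.LoadUnsafe(ref start, (nuint)offset);
            vector.StoreUnsafe(ref start, (nuint)offset);
            offset += Vector128<uint>.Count;
        }

        // Scalar tail: handles whatever the vector loop did not cover.
        for (; offset < d.Length; offset++)
        {
            d[offset] = ~d[offset];
        }
    }
}
```

The scalar tail also serves as the full fallback when Vector128 is not hardware accelerated, so the result is the same on every platform.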

@tannergooding
Member

CC. @stephentoub, could you give this a secondary review since you helped with the primitive integer parsing logic as well?

@huoyaoyuan
Member Author

#95402 also touches the generic parser pattern for BigInteger. Could you provide some insights as well? Thanks!

@huoyaoyuan
Member Author

Converting to draft to make more performance improvements.

@huoyaoyuan
Member Author

Latest performance numbers:

| Method | Job | Toolchain | input | Mean | Error | StdDev | Ratio |
|---|---|---|---|---|---|---|---|
| Parse | Job-AJQWKE | \PR\corerun.exe | 123 | 11.49 ns | 0.052 ns | 0.046 ns | 0.88 |
| Parse | Job-FEBANE | \main\corerun.exe | 123 | 13.05 ns | 0.081 ns | 0.076 ns | 1.00 |
| Parse | Job-AJQWKE | \PR\corerun.exe | 123456789 | 17.91 ns | 0.118 ns | 0.110 ns | 0.89 |
| Parse | Job-FEBANE | \main\corerun.exe | 123456789 | 20.04 ns | 0.117 ns | 0.109 ns | 1.00 |
| Parse | Job-AJQWKE | \PR\corerun.exe | 1234567890ABCDEF | 21.18 ns | 0.097 ns | 0.091 ns | 0.86 |
| Parse | Job-FEBANE | \main\corerun.exe | 1234567890ABCDEF | 24.69 ns | 0.183 ns | 0.171 ns | 1.00 |
| Parse | Job-AJQWKE | \PR\corerun.exe | 12345(...)23456 [22] | 20.15 ns | 0.128 ns | 0.120 ns | 0.67 |
| Parse | Job-FEBANE | \main\corerun.exe | 12345(...)23456 [22] | 30.04 ns | 0.229 ns | 0.203 ns | 1.00 |
| Parse | Job-AJQWKE | \PR\corerun.exe | 80000000 | 13.76 ns | 0.039 ns | 0.035 ns | 0.70 |
| Parse | Job-FEBANE | \main\corerun.exe | 80000000 | 19.78 ns | 0.132 ns | 0.123 ns | 1.00 |
| Parse | Job-AJQWKE | \PR\corerun.exe | FEDCBA9876543210 | 23.59 ns | 0.102 ns | 0.090 ns | 0.92 |
| Parse | Job-FEBANE | \main\corerun.exe | FEDCBA9876543210 | 25.56 ns | 0.093 ns | 0.087 ns | 1.00 |
| Parse | Job-AJQWKE | \PR\corerun.exe | FFFE00000 | 18.34 ns | 0.092 ns | 0.082 ns | 0.94 |
| Parse | Job-FEBANE | \main\corerun.exe | FFFE00000 | 19.44 ns | 0.131 ns | 0.123 ns | 1.00 |

I'm not experienced with branch tuning. I think this is all I can do for now.

@huoyaoyuan huoyaoyuan marked this pull request as ready for review February 2, 2024 17:27
@huoyaoyuan
Member Author

The test failure looks unrelated now.

blockCount = Math.DivRem(totalDigitCount, DigitsPerBlock, out int remainder);
if (remainder == 0)
uint leading = signBits;
// First parse unanligned leading block if exists.
Member

Suggested change
- // First parse unanligned leading block if exists.
+ // First parse unaligned leading block if exists.

@tannergooding tannergooding merged commit 7901202 into dotnet:main Feb 3, 2024
108 of 111 checks passed
@huoyaoyuan huoyaoyuan deleted the biginteger-hex-vectorize branch February 4, 2024 02:48
@github-actions github-actions bot locked and limited conversation to collaborators Mar 5, 2024