
Upgrading SpanHelpers with Vector512 #86655

Merged (7 commits) on Jun 27, 2023

Conversation

@DeepakRajendrakumaran (Contributor) commented May 23, 2023

What this PR does:

  1. This PR upgrades the Span implementations by enabling acceleration with AVX-512 instructions where possible. This is achieved by adding Vector512 paths to the relevant SpanHelpers implementations (a minimal sketch of the pattern is shown below).
  2. It also modifies the Vector512.IsHardwareAccelerated implementation so that it now returns true only on targets where AVX-512 does not incur frequency throttling (this is based on discussion with Tanner).
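For readers following along, the Vector512 paths mentioned in (1) mirror the existing Vector128/Vector256 structure in SpanHelpers. The following is a minimal, simplified sketch of that pattern, not the PR's actual code; the method name and the tail handling are illustrative only:

using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

internal static class SpanHelpersSketch
{
    // Sketch only: a simplified Vector512 search loop in the style of the existing
    // Vector128/Vector256 paths. The real code also has 256/128-bit tiers.
    public static int IndexOfByte(ref byte searchSpace, byte value, int length)
    {
        if (Vector512.IsHardwareAccelerated && length >= Vector512<byte>.Count)
        {
            Vector512<byte> target = Vector512.Create(value);
            nuint lastVectorStart = (nuint)(length - Vector512<byte>.Count);
            nuint offset = 0;

            do
            {
                Vector512<byte> matches = Vector512.Equals(target, Vector512.LoadUnsafe(ref searchSpace, offset));
                if (matches != Vector512<byte>.Zero)
                {
                    // The lowest set bit of the mask identifies the first matching element.
                    return (int)offset + BitOperations.TrailingZeroCount(matches.ExtractMostSignificantBits());
                }
                offset += (nuint)Vector512<byte>.Count;
            }
            while (offset <= lastVectorStart);

            // Handle the remainder with one final, possibly overlapping, vector.
            if (offset < (nuint)length)
            {
                Vector512<byte> matches = Vector512.Equals(target, Vector512.LoadUnsafe(ref searchSpace, lastVectorStart));
                if (matches != Vector512<byte>.Zero)
                {
                    return (int)lastVectorStart + BitOperations.TrailingZeroCount(matches.ExtractMostSignificantBits());
                }
            }
            return -1;
        }

        // Scalar fallback for short inputs (the real code tries Vector256/128 first).
        for (int i = 0; i < length; i++)
        {
            if (Unsafe.Add(ref searchSpace, i) == value)
                return i;
        }
        return -1;
    }
}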

Performance testing for the PR

Ran the Microbenchmarks (--filter "System.Memory.*") and compared against the main branch on ICX, using the following thresholds for ResultComparator (--threshold 5% --noise 1ns). I have re-run the benchmarks that showed a regression locally and do not see any that are consistently slower.
[image: System.Memory microbenchmark comparison vs. main]

@ghost added the community-contribution label (indicates that the PR has been added by a community member) on May 23, 2023
@ghost commented May 23, 2023

Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

Issue Details

This is a draft PR created to get some feedback from @tannergooding regarding a few specific questions I have:

  1. IndexOfNullByte(): the algorithm tries to align to the vector length, which feels a little clumsy when extended to Vector512. Does this look alright, or should I try to optimize it further?
  2. Reverse(): the current AVX2 implementation uses a combination of shuffle and permute. We have a vectorized implementation for Vector512 (https://github.com/dotnet/runtime/pull/85129/files#diff-8637c5f447a4d14a3186228689895867f51da490cc1815f164af705f783d8cf2) which uses permw, permd, permq, permps, and permpd, so I assume using Vector512.Shuffle() directly is an acceptable solution for the non-byte scenarios (a sketch of this pattern follows the issue details below). For Vector512<byte>.Shuffle() we default to a software fallback, since vpermb requires VBMI support. I was originally going to work around that (a combination of AVX512F.Shuffle() + permutexvar_epi64) but noticed that you have added VBMI support. Is it okay to update Vector512.Shuffle() to use vpermb when VBMI is supported, and then use Vector512.Shuffle() for Vector512<byte>.Shuffle()?
Author: DeepakRajendrakumaran
Assignees: -
Labels: area-System.Memory
Milestone: -
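As a rough illustration of the Reverse() point in question 2, reversing the elements of a non-byte Vector512 can be expressed with Vector512.Shuffle and a constant descending index vector. This is a sketch for explanation, not the PR's implementation; with AVX-512F available the JIT should be able to lower it to a single permute (e.g. vpermd for ints):

using System.Runtime.Intrinsics;

internal static class ReverseSketch
{
    // Sketch: reverse the 16 int elements of a Vector512<int> with one shuffle.
    public static Vector512<int> Reverse(Vector512<int> value)
    {
        Vector512<int> descendingIndices = Vector512.Create(
            15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);
        return Vector512.Shuffle(value, descendingIndices);
    }
}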

@DeepakRajendrakumaran marked this pull request as draft May 23, 2023 17:53
@gfoidl (Member) left a comment


Had just a quick view -- same applies to the char variant.

// Find the last unique (which is not equal to ch1) byte
// the algorithm is fine if both are equal, just a little bit less efficient
byte ch2Val = Unsafe.Add(ref value, valueTailLength);
nint ch1ch2Distance = valueTailLength;
Member


valueTailLength is of type int, so to avoid the sign-extending move use

Suggested change
nint ch1ch2Distance = valueTailLength;
nint ch1ch2Distance = (nint)(uint)valueTailLength;

See codegen

goto CANDIDATE_FOUND;
}

LOOP_FOOTER:
Member


Codegen-wise: is it good to have these labels in the loop?

Member


It is sort of hand-written PGO, less relevant with Dynamic PGO being enabled by default, but we have other targets.

@EgorBo (Member) commented May 23, 2023

Would be nice to see benchmarks for small inputs (which are more commonly used in these perf-sensitive primitives), maybe it even makes sense to wrap avx-512 path into a call?..

@DeepakRajendrakumaran (Contributor, Author)

Would be nice to see benchmarks for small inputs (which are more commonly used in these perf-sensitive primitives), maybe it even makes sense to wrap avx-512 path into a call?..

Yes, that's part of the plan. I currently ran it only for IndexOf() for byte; I will add the rest soon.

[image: IndexOf() byte benchmark results for small inputs]
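For reference, the "wrap the AVX-512 path into a call" idea EgorBo mentions could take roughly the following shape. The method names are hypothetical and the bodies are stand-ins; the point is only that [MethodImpl(MethodImplOptions.NoInlining)] keeps the large 512-bit loop out of the code the JIT generates for the common small-input path:

using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

internal static class WrappedPathSketch
{
    public static int IndexOf(ref byte searchSpace, byte value, int length)
    {
        // The rarely-taken large-input path lives behind a non-inlined call.
        if (Vector512.IsHardwareAccelerated && length >= Vector512<byte>.Count)
            return IndexOfVector512(ref searchSpace, value, length);

        // Small inputs stay on a compact path (scalar here; the real code
        // would also have Vector128/Vector256 tiers).
        for (int i = 0; i < length; i++)
        {
            if (Unsafe.Add(ref searchSpace, i) == value)
                return i;
        }
        return -1;
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static int IndexOfVector512(ref byte searchSpace, byte value, int length)
    {
        // The 512-bit search loop from the earlier sketch would go here; a scalar
        // stand-in keeps this illustration self-contained and runnable.
        for (int i = 0; i < length; i++)
        {
            if (Unsafe.Add(ref searchSpace, i) == value)
                return i;
        }
        return -1;
    }
}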

Comment on lines 23594 to 23556
else if (elementSize == 1)
{
for (uint32_t i = 0; i < elementCount; i++)
{
vecCns.u8[i] = (uint8_t)(vecCns.u8[i * elementSize] / elementSize);
}

op2 = gtNewVconNode(type);
op2->AsVecCon()->gtSimdVal = vecCns;

// swap the operands to match the encoding requirements
retNode = gtNewSimdHWIntrinsicNode(type, op2, op1, NI_AVX512VBMI_PermuteVar64x8, simdBaseJitType, simdSize);
}
Member


It would be nice to extract this acceleration into its own PR and also cover Vector256 at the same time, if possible.

Contributor Author


Moved to separate PR

@@ -2361,7 +2361,7 @@ GenTree* Compiler::impSpecialIntrinsic(NamedIntrinsic intrinsic,
}
else if (simdSize == 64)
{
if (varTypeIsByte(simdBaseType))
if (varTypeIsByte(simdBaseType) && (!compExactlyDependsOn(InstructionSet_AVX512VBMI)))
@tannergooding (Member) commented May 30, 2023


nit: Unnecessary parentheses and we want an opportunistic check, since the fallback behavior is identical, just slower:

Suggested change
if (varTypeIsByte(simdBaseType) && (!compExactlyDependsOn(InstructionSet_AVX512VBMI)))
if (varTypeIsByte(simdBaseType) && !compOpportunisticallyDependsOn(InstructionSet_AVX512VBMI))

Contributor Author


Part of different PR

Comment on lines +70 to +73
byte ch2Val = Unsafe.Add(ref value, valueTailLength);
nint ch1ch2Distance = (nint)(uint)valueTailLength;
while (ch2Val == value && ch1ch2Distance > 1)
ch2Val = Unsafe.Add(ref value, --ch1ch2Distance);
Member


Nothing for you to do here, just calling out that this kind of duplication between Vector512/256/128 is another reason why having an ISimdVector would be nice: #76423

It might be feasible for us to define/expose it for internal use only for the time being. Interested in @stephentoub's thoughts on it.
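For illustration, the kind of sharing that #76423 would enable might look roughly like the following. The interface below is a hypothetical approximation sketched for this discussion, not the proposed API surface, and the member names are illustrative only:

// Hypothetical sketch: one generic search loop instead of near-identical
// Vector128/Vector256/Vector512 copies.
internal interface ISimdVectorSketch<TSelf, T> where TSelf : ISimdVectorSketch<TSelf, T>
{
    static abstract int Count { get; }
    static abstract TSelf Create(T value);
    static abstract TSelf LoadUnsafe(ref T source, nuint elementOffset);
    static abstract TSelf Equals(TSelf left, TSelf right);
    static abstract int IndexOfFirstMatch(TSelf matches); // -1 if no lane matched
}

internal static class SharedSearchSketch
{
    // Caller guarantees length >= TVector.Count; tail handling elided for brevity.
    public static int IndexOfCore<TVector, T>(ref T searchSpace, T value, int length)
        where TVector : ISimdVectorSketch<TVector, T>
    {
        TVector target = TVector.Create(value);
        nuint lastVectorStart = (nuint)(length - TVector.Count);
        nuint offset = 0;

        do
        {
            TVector matches = TVector.Equals(target, TVector.LoadUnsafe(ref searchSpace, offset));
            int lane = TVector.IndexOfFirstMatch(matches);
            if (lane >= 0)
                return (int)offset + lane;
            offset += (nuint)TVector.Count;
        }
        while (offset <= lastVectorStart);

        return -1;
    }
}

A concrete Vector128-, Vector256-, or Vector512-backed struct implementing the interface would then let this single loop serve all three widths.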

Member


Certainly worth experimenting with internally at first.

@tannergooding (Member)

The changes overall LGTM; there's just quite a lot of essentially duplicated code that we can't currently share between Vector512/256/128. I think we also want to consider how to avoid this negatively impacting 1st-gen AVX-512 hardware, so I proposed an internal-only property to hang off X86Base. We could consider exposing that more broadly as well, but that's a different discussion.

@DeepakRajendrakumaran force-pushed the spanHelperUpgrade branch 2 times, most recently from 3abe035 to 8903c22 on May 31, 2023 16:52
@DeepakRajendrakumaran force-pushed the spanHelperUpgrade branch 3 times, most recently from d3c2331 to bc62579 on June 7, 2023 23:29
@DeepakRajendrakumaran marked this pull request as ready for review June 14, 2023 16:25
@BruceForstall mentioned this pull request Jun 15, 2023 (56 tasks)
@DeepakRajendrakumaran force-pushed the spanHelperUpgrade branch 2 times, most recently from 3b99276 to 6a2ec3a on June 16, 2023 17:28
@DeepakRajendrakumaran (Contributor, Author)

Ran the Microbenchmarks (--filter "System.Memory.*") and compared against the main branch, using the following thresholds for ResultComparator (--threshold 5% --noise 1ns).

[image: System.Memory microbenchmark comparison vs. main]

@DeepakRajendrakumaran (Contributor, Author)

@tannergooding Let me know if there are any other changes you want me to make here.

@tannergooding (Member)

@stephentoub, @jeffhandley

This should be ready for merge; it would just be good to get a secondary sign-off. There's no perf diff for hardware without AVX-512 support, since the entire code path is treated as dead code. Numbers generally look good for hardware with AVX-512 support, with the additional branch causing some very minor TP hits for < 64 bytes of data.

The code is effectively the V256 code path, but duplicated. We can reduce this duplication longer term with the ISimdVector<TSelf, TElement> approach, but that should be its own PR.

@DeepakRajendrakumaran (Contributor, Author) commented Jun 23, 2023

@stephentoub @tannergooding Bump. Anything else required from me here?

@BruceForstall (Member)

@stephentoub @tannergooding Do you have any further requests or comments on this PR? E.g., any request for additional testing or performance analysis? Or is it ready to be merged?

@danmoseley (Member)

Nice wins, thanks @DeepakRajendrakumaran!

Labels: area-System.Memory, avx512 (Related to the AVX-512 architecture), community-contribution (Indicates that the PR has been added by a community member)