
Upgrading SpanHelpers with Vector512 #86655

Merged (7 commits) on Jun 27, 2023

Conversation

@DeepakRajendrakumaran (Contributor) commented May 23, 2023

What this PR does:

  1. This PR upgrades the Span implementations by enabling acceleration with AVX-512 instructions where possible. This is achieved by adding Vector512 paths to the relevant SpanHelpers implementations (a minimal sketch of the pattern is shown below).
  2. It also modifies the Vector512.IsHardwareAccelerated implementation so that it now returns true only on targets where AVX-512 does not incur frequency throttling (this is based on discussion with Tanner).
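For readers following along, the Vector512 paths mentioned in (1) mirror the existing Vector128/Vector256 structure in SpanHelpers. The following is a minimal, simplified sketch of that pattern, not the PR's actual code; the method name and the tail handling are illustrative only:

using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

internal static class SpanHelpersSketch
{
    // Sketch only: a simplified Vector512 search loop in the style of the existing
    // Vector128/Vector256 paths. The real code also has 256/128-bit tiers.
    public static int IndexOfByte(ref byte searchSpace, byte value, int length)
    {
        if (Vector512.IsHardwareAccelerated && length >= Vector512<byte>.Count)
        {
            Vector512<byte> target = Vector512.Create(value);
            nuint lastVectorStart = (nuint)(length - Vector512<byte>.Count);
            nuint offset = 0;

            do
            {
                Vector512<byte> matches = Vector512.Equals(target, Vector512.LoadUnsafe(ref searchSpace, offset));
                if (matches != Vector512<byte>.Zero)
                {
                    // The lowest set bit of the mask identifies the first matching element.
                    return (int)offset + BitOperations.TrailingZeroCount(matches.ExtractMostSignificantBits());
                }
                offset += (nuint)Vector512<byte>.Count;
            }
            while (offset <= lastVectorStart);

            // Handle the remainder with one final, possibly overlapping, vector.
            if (offset < (nuint)length)
            {
                Vector512<byte> matches = Vector512.Equals(target, Vector512.LoadUnsafe(ref searchSpace, lastVectorStart));
                if (matches != Vector512<byte>.Zero)
                {
                    return (int)lastVectorStart + BitOperations.TrailingZeroCount(matches.ExtractMostSignificantBits());
                }
            }
            return -1;
        }

        // Scalar fallback for short inputs (the real code tries Vector256/128 first).
        for (int i = 0; i < length; i++)
        {
            if (Unsafe.Add(ref searchSpace, i) == value)
                return i;
        }
        return -1;
    }
}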

Performance testing for the PR

Ran the Microbenchmarks (--filter "System.Memory.*") and compared against the main branch on ICX, using the following thresholds for ResultComparator (--threshold 5% --noise 1ns). I have re-run the benchmarks that showed a regression locally and do not see any that are consistently slower.
[image: System.Memory microbenchmark comparison vs. main]

@ghost added the community-contribution label (indicates that the PR has been added by a community member) on May 23, 2023
@ghost commented May 23, 2023

Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

Issue Details

This is a draft PR created to get some feedback from @tannergooding regarding a few specific questions I have:

  1. IndexOfNullByte(): the algorithm tries to align to the vector length, which feels a little clumsy when extended to Vector512. Does this look alright, or should I try to optimize it further?
  2. Reverse(): the current AVX2 implementation uses a combination of shuffle and permute. We have a vectorized implementation for Vector512 (https://github.com/dotnet/runtime/pull/85129/files#diff-8637c5f447a4d14a3186228689895867f51da490cc1815f164af705f783d8cf2) which uses permw, permd, permq, permps, and permpd, so I assume using Vector512.Shuffle() directly is an acceptable solution for the non-byte scenarios (a sketch of this pattern follows the issue details below). For Vector512<byte>.Shuffle() we default to a software fallback, since vpermb requires VBMI support. I was originally going to work around that (a combination of AVX512F.Shuffle() + permutexvar_epi64) but noticed that you have added VBMI support. Is it okay to update Vector512.Shuffle() to use vpermb when VBMI is supported, and then use Vector512.Shuffle() for Vector512<byte>.Shuffle()?
Author: DeepakRajendrakumaran
Assignees: -
Labels: area-System.Memory
Milestone: -
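As a rough illustration of the Reverse() point in question 2, reversing the elements of a non-byte Vector512 can be expressed with Vector512.Shuffle and a constant descending index vector. This is a sketch for explanation, not the PR's implementation; with AVX-512F available the JIT should be able to lower it to a single permute (e.g. vpermd for ints):

using System.Runtime.Intrinsics;

internal static class ReverseSketch
{
    // Sketch: reverse the 16 int elements of a Vector512<int> with one shuffle.
    public static Vector512<int> Reverse(Vector512<int> value)
    {
        Vector512<int> descendingIndices = Vector512.Create(
            15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);
        return Vector512.Shuffle(value, descendingIndices);
    }
}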

@DeepakRajendrakumaran marked this pull request as draft May 23, 2023 17:53
@gfoidl (Member) left a comment


Had just a quick view -- same applies to the char variant.

// Find the last unique (which is not equal to ch1) byte
// the algorithm is fine if both are equal, just a little bit less efficient
byte ch2Val = Unsafe.Add(ref value, valueTailLength);
nint ch1ch2Distance = valueTailLength;
Member


valueTailLength is of type int, so to avoid the sign-extending move use

Suggested change
nint ch1ch2Distance = valueTailLength;
nint ch1ch2Distance = (nint)(uint)valueTailLength;

See codegen

goto CANDIDATE_FOUND;
}

LOOP_FOOTER:
Member


Codegen-wise: is it good to have these labels in the loop?

Member


It is sort of hand-written PGO, less relevant with Dynamic PGO being enabled by default, but we have other targets.

@EgorBo (Member) commented May 23, 2023

Would be nice to see benchmarks for small inputs (which are more commonly used in these perf-sensitive primitives), maybe it even makes sense to wrap avx-512 path into a call?..

@DeepakRajendrakumaran (Contributor, Author)

Would be nice to see benchmarks for small inputs (which are more commonly used in these perf-sensitive primitives), maybe it even makes sense to wrap avx-512 path into a call?..

Yes, that's part of the plan. I currently ran it only for IndexOf() for byte; I will add the rest soon.

[image: IndexOf() byte benchmark results for small inputs]
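For reference, the "wrap the AVX-512 path into a call" idea EgorBo mentions could take roughly the following shape. The method names are hypothetical and the bodies are stand-ins; the point is only that [MethodImpl(MethodImplOptions.NoInlining)] keeps the large 512-bit loop out of the code the JIT generates for the common small-input path:

using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

internal static class WrappedPathSketch
{
    public static int IndexOf(ref byte searchSpace, byte value, int length)
    {
        // The rarely-taken large-input path lives behind a non-inlined call.
        if (Vector512.IsHardwareAccelerated && length >= Vector512<byte>.Count)
            return IndexOfVector512(ref searchSpace, value, length);

        // Small inputs stay on a compact path (scalar here; the real code
        // would also have Vector128/Vector256 tiers).
        for (int i = 0; i < length; i++)
        {
            if (Unsafe.Add(ref searchSpace, i) == value)
                return i;
        }
        return -1;
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static int IndexOfVector512(ref byte searchSpace, byte value, int length)
    {
        // The 512-bit search loop from the earlier sketch would go here; a scalar
        // stand-in keeps this illustration self-contained and runnable.
        for (int i = 0; i < length; i++)
        {
            if (Unsafe.Add(ref searchSpace, i) == value)
                return i;
        }
        return -1;
    }
}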

Comment on lines 23594 to 23556
else if (elementSize == 1)
{
for (uint32_t i = 0; i < elementCount; i++)
{
vecCns.u8[i] = (uint8_t)(vecCns.u8[i * elementSize] / elementSize);
}

op2 = gtNewVconNode(type);
op2->AsVecCon()->gtSimdVal = vecCns;

// swap the operands to match the encoding requirements
retNode = gtNewSimdHWIntrinsicNode(type, op2, op1, NI_AVX512VBMI_PermuteVar64x8, simdBaseJitType, simdSize);
}
Member


It would be nice to extract this acceleration into its own PR and also cover Vector256 at the same time, if possible.

Contributor Author


Moved to separate PR

@@ -2361,7 +2361,7 @@ GenTree* Compiler::impSpecialIntrinsic(NamedIntrinsic intrinsic,
}
else if (simdSize == 64)
{
if (varTypeIsByte(simdBaseType))
if (varTypeIsByte(simdBaseType) && (!compExactlyDependsOn(InstructionSet_AVX512VBMI)))
@tannergooding (Member) commented May 30, 2023


nit: Unnecessary parentheses and we want an opportunistic check, since the fallback behavior is identical, just slower:

Suggested change
if (varTypeIsByte(simdBaseType) && (!compExactlyDependsOn(InstructionSet_AVX512VBMI)))
if (varTypeIsByte(simdBaseType) && !compOpportunisticallyDependsOn(InstructionSet_AVX512VBMI))

Contributor Author


Part of different PR

Comment on lines +70 to +73
byte ch2Val = Unsafe.Add(ref value, valueTailLength);
nint ch1ch2Distance = (nint)(uint)valueTailLength;
while (ch2Val == value && ch1ch2Distance > 1)
ch2Val = Unsafe.Add(ref value, --ch1ch2Distance);
Member


Nothing for you to do here, just calling out that this kind of duplication between Vector512/256/128 is another reason why having an ISimdVector would be nice: #76423

It might be feasible for us to define/expose it for internal use only for the time being. Interested in @stephentoub's thoughts on it.
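For illustration, the kind of sharing that #76423 would enable might look roughly like the following. The interface below is a hypothetical approximation sketched for this discussion, not the proposed API surface, and the member names are illustrative only:

// Hypothetical sketch: one generic search loop instead of near-identical
// Vector128/Vector256/Vector512 copies.
internal interface ISimdVectorSketch<TSelf, T> where TSelf : ISimdVectorSketch<TSelf, T>
{
    static abstract int Count { get; }
    static abstract TSelf Create(T value);
    static abstract TSelf LoadUnsafe(ref T source, nuint elementOffset);
    static abstract TSelf Equals(TSelf left, TSelf right);
    static abstract int IndexOfFirstMatch(TSelf matches); // -1 if no lane matched
}

internal static class SharedSearchSketch
{
    // Caller guarantees length >= TVector.Count; tail handling elided for brevity.
    public static int IndexOfCore<TVector, T>(ref T searchSpace, T value, int length)
        where TVector : ISimdVectorSketch<TVector, T>
    {
        TVector target = TVector.Create(value);
        nuint lastVectorStart = (nuint)(length - TVector.Count);
        nuint offset = 0;

        do
        {
            TVector matches = TVector.Equals(target, TVector.LoadUnsafe(ref searchSpace, offset));
            int lane = TVector.IndexOfFirstMatch(matches);
            if (lane >= 0)
                return (int)offset + lane;
            offset += (nuint)TVector.Count;
        }
        while (offset <= lastVectorStart);

        return -1;
    }
}

A concrete Vector128-, Vector256-, or Vector512-backed struct implementing the interface would then let this single loop serve all three widths.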

Member


Certainly worth experimenting with internally at first.

@tannergooding (Member)

The changes overall LGTM; there's just quite a lot of essentially duplicated code that we can't currently share between Vector512/256/128. I think we also want to consider how to avoid this negatively impacting 1st-gen AVX-512 hardware, so I proposed an internal-only property to hang off X86Base. We could consider exposing that more broadly as well, but that's a different discussion.

@DeepakRajendrakumaran force-pushed the spanHelperUpgrade branch 2 times, most recently from 3abe035 to 8903c22 on May 31, 2023 16:52
@DeepakRajendrakumaran force-pushed the spanHelperUpgrade branch 3 times, most recently from d3c2331 to bc62579 on June 7, 2023 23:29
@DeepakRajendrakumaran marked this pull request as ready for review June 14, 2023 16:25
@BruceForstall mentioned this pull request Jun 15, 2023 (56 tasks)
@DeepakRajendrakumaran force-pushed the spanHelperUpgrade branch 2 times, most recently from 3b99276 to 6a2ec3a on June 16, 2023 17:28
@DeepakRajendrakumaran (Contributor, Author)

Ran the Microbenchmarks (--filter "System.Memory.*") and compared against the main branch, using the following thresholds for ResultComparator (--threshold 5% --noise 1ns).

[image: System.Memory microbenchmark comparison vs. main]

@DeepakRajendrakumaran (Contributor, Author)

@tannergooding Let me know if there are any other changes you want me to make here.

@tannergooding (Member)

@stephentoub, @jeffhandley

This should be ready for merge; it would just be good to get a secondary sign-off. There's no perf diff for hardware without AVX-512 support, since the entire code path is treated as dead code. Numbers generally look good for hardware with AVX-512 support, with the additional branch causing some very minor TP hits for < 64 bytes of data.

The code is effectively the V256 code path, but duplicated. We can reduce this duplication longer term with the ISimdVector<TSelf, TElement> approach, but that should be its own PR.

@DeepakRajendrakumaran (Contributor, Author) commented Jun 23, 2023

@stephentoub @tannergooding Bump. Anything else required from me here?

@BruceForstall (Member)

@stephentoub @tannergooding Do you have any further requests or comments on this PR? E.g., any request for additional testing or performance analysis? Or is it ready to be merged?

@danmoseley (Member)

Nice wins, thanks @DeepakRajendrakumaran!

Labels: area-System.Memory, avx512 (Related to the AVX-512 architecture), community-contribution (Indicates that the PR has been added by a community member)