Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vector128.ShiftRightLogical doesn't use machine-instructions for some types (byte / sbyte) #75770

Closed
gfoidl opened this issue Sep 16, 2022 · 10 comments
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Milestone

Comments

@gfoidl
Copy link
Member

gfoidl commented Sep 16, 2022

Vector128.ShiftRightLogical has overloads for all primitive numeric types, and so it puts into mind that these operations are backed by efficient hw-code. This may be a trap for some types, as it falls back to a software solution.

While porting some code from SSE to xplat-instrinsics that made me wonder (why are there movsxd which shouldn't be there, and from where comes a loop where none should be) until I remembered that in Sse2 there's no such instruction...

This table summarizes the CQ for the numeric types (I don't know about AdvSimd, thus left that column out):

Type codegen Sse2-method available
byte
short ✔️ ✔️
int ✔️ ✔️
long ✔️ ✔️
nint ✔️
nuint ✔️
sbyte
ushort ✔️ ✔️
uint ✔️ ✔️
ulong ✔️ ✔️

As can be seen there's no strict correlation between codegen for the xplat-ShiftRightLogical and the availability of a Sse2-method (in both directions).

If this is not just a JIT's CQ-issue, then there should be any measure / indicator for the user to don't be surprised by the codegen or more important by perf.

We now push users towards using the xplat-instrinsics (which makes sense IMO), but I think we can't expect that every user inspects the generated machine-code.

Generalizing a bit: if there are known methods that aren't really intrinsified (by architecture, by design, ran out of time, etc.) maybe it's possible to provide any kind of hint in intellisense like it does for supported APIs:
grafik

Where it states something like:

SSE2 - intrinsified
AdvSimd - software fallback

Would it be enough for such info to have them in the xml doc-comments?

Having an analyzer that issues an suggestion / hint that software fallback will be used is another option.
To keep this up-to-date maybe some kind of meta-data would be needed on each Vector128-method that indicates the state of intrinsification. Maybe that overkill, though.

Repro

[MethodImpl(MethodImplOptions.NoInlining)]
static Vector128<sbyte> DoSse(Vector128<sbyte> vec)
{
    return Sse2.ShiftRightLogical(vec.AsInt32(), 4).AsSByte();
}

[MethodImpl(MethodImplOptions.NoInlining)]
static Vector128<sbyte> DoXplatNaive(Vector128<sbyte> vec)
{
    return Vector128.ShiftRightLogical(vec, 4);
}

[MethodImpl(MethodImplOptions.NoInlining)]
static Vector128<sbyte> DoXplat(Vector128<sbyte> vec)
{
    return Vector128.ShiftRightLogical(vec.AsInt32(), 4).AsSByte();
}

produces on .NET 7 RC 2

; Assembly listing for method Program:<<Main>$>g__DoSse|0_0(Vector128`1):Vector128`1
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; rsp based frame
; partially interruptible
; No PGO data

G_M000_IG01:                ;; offset=0000H
       C5F877               vzeroupper

G_M000_IG02:                ;; offset=0003H
       C5F91002             vmovupd  xmm0, xmmword ptr [rdx]
       C5F972D004           vpsrld   xmm0, xmm0, 4
       C5F91101             vmovupd  xmmword ptr [rcx], xmm0
       488BC1               mov      rax, rcx

G_M000_IG03:                ;; offset=0013H
       C3                   ret

; Total bytes of code 20

; Assembly listing for method Program:<<Main>$>g__DoXplatNaive|0_1(Vector128`1):Vector128`1
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 2 single block inlinees; 1 inlinees without PGO data

G_M000_IG01:                ;; offset=0000H
       57                   push     rdi
       56                   push     rsi
       4883EC48             sub      rsp, 72
       C5F877               vzeroupper
       488BF1               mov      rsi, rcx

G_M000_IG02:                ;; offset=000CH
       C5F91002             vmovupd  xmm0, xmmword ptr [rdx]
       C5F929442420         vmovapd  xmmword ptr [rsp+20H], xmm0
       33FF                 xor      edi, edi

G_M000_IG03:                ;; offset=0018H
       488D4C2420           lea      rcx, bword ptr [rsp+20H]
       4863D7               movsxd   rdx, edi
       480FBE0C11           movsx    rcx, byte  ptr [rcx+rdx]
       BA04000000           mov      edx, 4
       FF1538261A00         call     [Scalar`1:ShiftRightLogical(byte,int):byte]
       488D542430           lea      rdx, bword ptr [rsp+30H]
       4863CF               movsxd   rcx, edi
       88040A               mov      byte  ptr [rdx+rcx], al
       FFC7                 inc      edi
       83FF10               cmp      edi, 16
       7CD6                 jl       SHORT G_M000_IG03

G_M000_IG04:                ;; offset=0042H
       C5F928442430         vmovapd  xmm0, xmmword ptr [rsp+30H]
       C5F91106             vmovupd  xmmword ptr [rsi], xmm0
       488BC6               mov      rax, rsi

G_M000_IG05:                ;; offset=004FH
       4883C448             add      rsp, 72
       5E                   pop      rsi
       5F                   pop      rdi
       C3                   ret

; Total bytes of code 86

; Assembly listing for method Program:<<Main>$>g__DoXplat|0_2(Vector128`1):Vector128`1
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; rsp based frame
; partially interruptible
; No PGO data

G_M000_IG01:                ;; offset=0000H
       C5F877               vzeroupper

G_M000_IG02:                ;; offset=0003H
       C5F91002             vmovupd  xmm0, xmmword ptr [rdx]
       C5F972D004           vpsrld   xmm0, xmm0, 4
       C5F91101             vmovupd  xmmword ptr [rcx], xmm0
       488BC1               mov      rax, rcx

G_M000_IG03:                ;; offset=0013H
       C3                   ret

; Total bytes of code 20

PS: didn't check Vector256 and the other shift-variants.

category:cq
theme:vector-codegen
skill-level:beginner
cost:small
impact:small

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Sep 16, 2022
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Sep 16, 2022
@ghost
Copy link

ghost commented Sep 16, 2022

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Vector128.ShiftRightLogical has overloads for all primitive numeric types, and so it puts into mind that these operations are backed by efficient hw-code. This may be a trap for some types, as it falls back to a software solution.

While porting some code from SSE to xplat-instrinsics that made me wonder (why are there movsxd which shouldn't be there, and from where comes a loop where none should be) until I remembered that in Sse2 there's no such instruction...

This table summarizes the CQ for the numeric types (I don't know about AdvSimd, thus left that column out):

Type codegen Sse2-method available
byte
short ✔️ ✔️
int ✔️ ✔️
long ✔️ ✔️
nint ✔️
nuint ✔️
sbyte
ushort ✔️ ✔️
uint ✔️ ✔️
ulong ✔️ ✔️

As can be seen there's no strict correlation between codegen for the xplat-ShiftRightLogical and the availability of a Sse2-method (in both directions).

If this is not just a JIT's CQ-issue, then there should be any measure / indicator for the user to don't be surprised by the codegen or more important by perf.

We now push users towards using the xplat-instrinsics (which makes sense IMO), but I think we can't expect that every user inspects the generated machine-code.

Generalizing a bit: if there are known methods that aren't really intrinsified (by architecture, by design, ran out of time, etc.) maybe it's possible to provide any kind of hint in intellisense like it does for supported APIs:
grafik

Where it states something like:

SSE2 - intrinsified
AdvSimd - software fallback

Would it be enough for such info to have them in the xml doc-comments?

Having an analyzer that issues an suggestion / hint that software fallback will be used is another option.
To keep this up-to-date maybe some kind of meta-data would be needed on each Vector128-method that indicates the state of intrinsification. Maybe that overkill, though.

Repro

[MethodImpl(MethodImplOptions.NoInlining)]
static Vector128<sbyte> DoSse(Vector128<sbyte> vec)
{
    return Sse2.ShiftRightLogical(vec.AsInt32(), 4).AsSByte();
}

[MethodImpl(MethodImplOptions.NoInlining)]
static Vector128<sbyte> DoXplatNaive(Vector128<sbyte> vec)
{
    return Vector128.ShiftRightLogical(vec, 4);
}

[MethodImpl(MethodImplOptions.NoInlining)]
static Vector128<sbyte> DoXplat(Vector128<sbyte> vec)
{
    return Vector128.ShiftRightLogical(vec.AsInt32(), 4).AsSByte();
}

produces on .NET 7 RC 2

; Assembly listing for method Program:<<Main>$>g__DoSse|0_0(Vector128`1):Vector128`1
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; rsp based frame
; partially interruptible
; No PGO data

G_M000_IG01:                ;; offset=0000H
       C5F877               vzeroupper

G_M000_IG02:                ;; offset=0003H
       C5F91002             vmovupd  xmm0, xmmword ptr [rdx]
       C5F972D004           vpsrld   xmm0, xmm0, 4
       C5F91101             vmovupd  xmmword ptr [rcx], xmm0
       488BC1               mov      rax, rcx

G_M000_IG03:                ;; offset=0013H
       C3                   ret

; Total bytes of code 20

; Assembly listing for method Program:<<Main>$>g__DoXplatNaive|0_1(Vector128`1):Vector128`1
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 2 single block inlinees; 1 inlinees without PGO data

G_M000_IG01:                ;; offset=0000H
       57                   push     rdi
       56                   push     rsi
       4883EC48             sub      rsp, 72
       C5F877               vzeroupper
       488BF1               mov      rsi, rcx

G_M000_IG02:                ;; offset=000CH
       C5F91002             vmovupd  xmm0, xmmword ptr [rdx]
       C5F929442420         vmovapd  xmmword ptr [rsp+20H], xmm0
       33FF                 xor      edi, edi

G_M000_IG03:                ;; offset=0018H
       488D4C2420           lea      rcx, bword ptr [rsp+20H]
       4863D7               movsxd   rdx, edi
       480FBE0C11           movsx    rcx, byte  ptr [rcx+rdx]
       BA04000000           mov      edx, 4
       FF1538261A00         call     [Scalar`1:ShiftRightLogical(byte,int):byte]
       488D542430           lea      rdx, bword ptr [rsp+30H]
       4863CF               movsxd   rcx, edi
       88040A               mov      byte  ptr [rdx+rcx], al
       FFC7                 inc      edi
       83FF10               cmp      edi, 16
       7CD6                 jl       SHORT G_M000_IG03

G_M000_IG04:                ;; offset=0042H
       C5F928442430         vmovapd  xmm0, xmmword ptr [rsp+30H]
       C5F91106             vmovupd  xmmword ptr [rsi], xmm0
       488BC6               mov      rax, rsi

G_M000_IG05:                ;; offset=004FH
       4883C448             add      rsp, 72
       5E                   pop      rsi
       5F                   pop      rdi
       C3                   ret

; Total bytes of code 86

; Assembly listing for method Program:<<Main>$>g__DoXplat|0_2(Vector128`1):Vector128`1
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; rsp based frame
; partially interruptible
; No PGO data

G_M000_IG01:                ;; offset=0000H
       C5F877               vzeroupper

G_M000_IG02:                ;; offset=0003H
       C5F91002             vmovupd  xmm0, xmmword ptr [rdx]
       C5F972D004           vpsrld   xmm0, xmm0, 4
       C5F91101             vmovupd  xmmword ptr [rcx], xmm0
       488BC1               mov      rax, rcx

G_M000_IG03:                ;; offset=0013H
       C3                   ret

; Total bytes of code 20
Author: gfoidl
Assignees: -
Labels:

area-CodeGen-coreclr

Milestone: -

@danmoseley
Copy link
Member

Cc @tannergooding

@tannergooding
Copy link
Member

I think we can't expect that every user inspects the generated machine-code.

When writing perf oriented SIMD code, profiling and measuring the code is a minimum expectation. Realistically, inspecting machine code may also be appropriate in a number of scenarios given how such APIs are typically used.

Generalizing a bit: if there are known methods that aren't really intrinsified (by architecture, by design, ran out of time, etc.) maybe it's possible to provide any kind of hint in intellisense like it does for supported APIs:

There are different levels of intrinsification and many of these may change over time.

For example, ExtractMostSignificantBits is generally a single instruction on x86/x64. For some types the logic may be more complicated since there is no pmovmskw. However, on Arm64 there is no single instruction and it falls back to an optimized sequence instead.

Integer division doesn't have any instructions on any current platforms and we don't vectorize today. However, we could vectorize specific scenarios in the future depending on if one input is a constant.

ShiftLeftLogical is a case where everything except byte/sbyte is vectorized. We could and likely should be vectorizing those types as well, it just requires two shifts and a blend on x64 instead.

Would it be enough for such info to have them in the xml doc-comments?

I don't believe this is maintainable and won't map well across all possible runtimes that support SIMD (RyuJIT, Mono LLVM, Mono JIT, WASM, and eventually IL2CPP/Burst/etc).

Having an analyzer that issues an suggestion / hint that software fallback will be used is another option.
To keep this up-to-date maybe some kind of meta-data would be needed on each Vector128-method that indicates the state of intrinsification. Maybe that overkill, though.

We're working on adding an analyzer for the case where a method expects a constant input and one isn't given. Having a another analyzer to call out well-known cases where codegen could be suboptimal might be feasible.

That being said, nothing is going to win out over profiling and measuring your hot paths. Likewise, inspecting the disassembly (or using a tool like Intel VTune or AMD uProf) for SIMD code on platforms is highly recommended as it will help call out things that cannot be covered in an analyzer or which are prone to changes over time.

@JulieLeeMSFT JulieLeeMSFT removed the untriaged New issue has not been triaged by the area owner label Sep 16, 2022
@JulieLeeMSFT JulieLeeMSFT added this to the 8.0.0 milestone Sep 16, 2022
@JulieLeeMSFT
Copy link
Member

cc @dotnet/jit-contrib .

@gfoidl
Copy link
Member Author

gfoidl commented Sep 16, 2022

nothing is going to win out over profiling and measuring your hot paths. Likewise, inspecting the disassembly (or using a tool like Intel VTune or AMD uProf) for SIMD code on platforms is highly recommended

I fully agree with that (so I found the issue anyway).
Also I have to state that it is astonishing / unexpected if such an API exists -- no matter if generated code is inspected or not. It's like a tempting offer that doesn't hold what it promises 😉.

For inspecting disassembly, etc. I fear this is more a "insider thing" which a lot people that use C# / .NET won't do for different reasons, and then blame .NET for being slow. This is easier than hunting down and solving the problem -- e.g. by inspecting code and using more suitable instruction (we're aware of this).
So we should aid them as much as possible in not getting trapped in such situations.
Advanced users will find a better solution anyway, but maybe not the standard user? (I see these kind of things in my work in a german-language based forum).

Keep in mind that with making the use of vectorized instructions easier and pushing people towards using it (which is great), many of them maybe won't have deep knowledge in that simd-area, but expect that Vector128, et. al. work as promised.

We're working on adding an analyzer for the case where a method expects a constant input and one isn't given. Having a another analyzer to call out well-known cases where codegen could be suboptimal might be feasible.

👍🏻

Realistically, inspecting machine code may also be appropriate

Fortunately this got a lot easier with BDN and especially with DOTNET_JitDisasm (no need for checked build anymore -- for these kind of analysis).


Just found a similar "problem" with Vector128.Shuffle -- for sbyte falls back to software.

@tannergooding
Copy link
Member

Also I have to state that it is astonishing / unexpected if such an API exists -- no matter if generated code is inspected or not. It's like a tempting offer that doesn't hold what it promises 😉.

The basic premise is that not all platforms are created equal and providing a guaranteed xplat baseline is impossible. This effectively reduces it to one of three choices:
A) We provide nothing, devs must always use platform specific APIs
B) We provide an extremely limited subset of functionality
C) We provide a general set of core functionality that most platforms support

We opted for C and this is the same option that was taken for System.Numerics.Vector<T> and the other System.Numerics.Vector2/3/4 types.

Ultimately it is a tool and while it generally does the correct thing, there are cases where that tool may not be as optimal or may even be a worse choice than an alternative. As a low-level/perf oriented API, there is some onus on the user to use it "correctly" and to learn some of the minor nuance that exists between different platforms.

We believe that providing nothing or providing an extremely limited subset (which would effectively be int and float) would have a been an overall much worse experience.

Advanced users will find a better solution anyway, but maybe not the standard user? (I see these kind of things in my work in a german-language based forum).

Keep in mind that with making the use of vectorized instructions easier and pushing people towards using it (which is great), many of them maybe won't have deep knowledge in that simd-area, but expect that Vector128, et. al. work as promised.

While there certainly may be more users that start taking it up, it remains an intermediate to advanced scenario that will require some learning and getting used to. As with any API we expose, there may be potential pitfalls or gotchas to keep in mind and it is ultimately no "worse" than some of the same pitfalls that exist in much more broadly used APIs like LINQ or even the basic collection types like List<T>.

We have discussed providing more comprehensive docs that will help users get ramped up on writing SIMD code and many of the considerations they may have to take into account that can vastly differ from standard "scalar" logic they are used to writing. Part of that would be the process in which to take "scalar" algorithms and the steps required to "vectorize" them. But also would be some basic coverage on how to properly test and profile as well as pointers to tooling that can help analyze the disassembly (at a high level) to look for potential negative patterns.

@gfoidl
Copy link
Member Author

gfoidl commented Sep 16, 2022

C) We provide a general set of core functionality that most platforms support

That was a good decision.

some basic coverage on how to properly test and profile as well as pointers to tooling that can help analyze the disassembly (at a high level) to look for potential negative patterns.

Sounds perfect -- thanks for the response.

@BruceForstall
Copy link
Member

ShiftLeftLogical is a case where everything except byte/sbyte is vectorized. We could and likely should be vectorizing those types as well, it just requires two shifts and a blend on x64 instead.

@tannergooding Should this issue remain open to cover implementing this specific case?

@tannergooding
Copy link
Member

Yes. The change isn't really difficult to do either, particularly with the gtNewSimdBinOp and other helpers we have now.

@tannergooding
Copy link
Member

This was handled in #86841

@ghost ghost locked as resolved and limited conversation to collaborators Sep 7, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

No branches or pull requests

5 participants