Add the appropriate ABI handling for the SIMD HWIntrinsic types #9578

Open

tannergooding opened this issue Jan 20, 2018 · 26 comments
Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI)
@tannergooding
Member

tannergooding commented Jan 20, 2018

The SIMD HWIntrinsic types (Vector64&lt;T&gt;, Vector128&lt;T&gt;, and Vector256&lt;T&gt;) are special and represent the __m64, __m128, and __m256 ABI types.

These types have special handling in both the System V and Windows ABIs and are treated as "scalar" (i.e. non-aggregate and non-union) types for the purposes of parameter passing and value returns. They additionally play some role in the selection of MultiReg or HVA (also known as HFA) structs.

We should add the appropriate support for these types to ensure we are meeting the requirements of the underlying ABI for a given platform/system.
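As a hedged illustration of the mapping (the library and function names below are invented, and the runtime does not fully support such interop signatures today, which is part of what this issue tracks):

using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static class NativeMath
{
    // Hypothetical native counterpart: __m128 scale_vector(__m128 value, float factor);
    // System V passes and returns the __m128 in XMM registers; the Windows x64
    // default convention passes it by reference but still returns it in XMM0.
    [DllImport("nativemath")]
    public static extern Vector128<float> ScaleVector(Vector128<float> value, float factor);
}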

category:correctness
theme:runtime
skill-level:expert
cost:large
impact:medium

@jkotas
Member

jkotas commented Jan 20, 2018

This would change calling convention for these types. In general, calling convention changes invalidate pre-compiled code out there. I think it would be a good idea to disable loading of these types during crossgen so that they are not baked into any precompiled code and we do not have our hands tied with changing the calling convention.

jkotas referenced this issue in jkotas/coreclr Jan 20, 2018
The timezone ids used case insensitive comparisons everywhere, except in the dictionary used to cache timezones.

Fixes dotnet/coreclr#15943
@tannergooding
Member Author

@jkotas, Is there a good example of existing types disabled during crossgen?

I would like to include that change in dotnet/coreclr#15942, if possible.

@jkotas
Member

jkotas commented Jan 20, 2018

Take a look how Vector<T> is handled in vm\methodtablebuilder.cpp. Look for IDS_EE_SIMD_NGEN_DISALLOWED.

@tannergooding
Member Author

@jkotas, @CarolEidt. Part of the ABI work for these types is respecting their larger packing sizes (8 for __m64, 16 for __m128, 32 for __m256).

Do you think it is reasonable to have the packing sizes respected for v1 (it looks like it only needs a relatively small update in the VM layer)?

@jkotas
Member

jkotas commented Jan 22, 2018

Do you think it is reasonable to have the packing sizes respected for v1

I think it is reasonable.

@AndyAyersMS
Member

Should this be in 2.1 (not 2.0.x) ?

@CarolEidt
Contributor

Should this be in 2.1 (not 2.0.x) ?

Yes, I believe so.

@tannergooding
Member Author

An explicit example of where the current ABI is wrong is for x64 Windows with SIMD returns.

The default calling convention for x64 Windows specifies that __m128, __m128i, and __m128d are returned in XMM0: https://docs.microsoft.com/en-us/cpp/build/x64-calling-convention?view=vs-2019#return-values

These types correspond to the System.Runtime.Intrinsics.Vector128 type on the managed side, and it is not currently being returned in XMM0.

dotnet/coreclr#23899 adds support for passing Vector128 across interop boundaries and so this will need to be correctly handled.
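A hedged sketch of the shape being discussed (names invented):

using System.Runtime.Intrinsics;

// Native side (hypothetical): extern "C" __m128 get_unit(void);
// Per the Windows x64 convention the result comes back in XMM0, so the JIT
// needs to read the Vector128<float> return from XMM0 rather than expecting
// a hidden return buffer.
static unsafe Vector128<float> CallGetUnit(delegate* unmanaged[Cdecl]<Vector128<float>> getUnit)
    => getUnit();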

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@CarolEidt CarolEidt self-assigned this Jun 25, 2020
@CarolEidt CarolEidt modified the milestones: Future, 5.0.0 Jun 25, 2020
@CarolEidt CarolEidt modified the milestones: 5.0.0, 6.0.0 Jul 14, 2020
@BruceForstall BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020
@CarolEidt CarolEidt removed their assignment Dec 4, 2020
@JulieLeeMSFT JulieLeeMSFT added the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Mar 23, 2021
@sandreenko sandreenko removed the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Jun 3, 2021
@sandreenko sandreenko modified the milestones: 6.0.0, 7.0.0 Jun 3, 2021
@sandreenko sandreenko removed their assignment Nov 13, 2021
@AndyAyersMS
Member

@tannergooding is this still relevant?

@tannergooding
Member Author

Yes. We still need to ensure that these types are being correctly handled at the ABI level so we can enable them for interop scenarios and so we know we have good/correct/performant codegen even in managed only land.

@jakobbotsch
Member

This won't make 7.0.

@jakobbotsch jakobbotsch modified the milestones: 7.0.0, 8.0.0 May 30, 2022
@BruceForstall BruceForstall removed the JitUntriaged CLR JIT issues needing additional triage label Jan 24, 2023
@tgjones
Contributor

tgjones commented Apr 9, 2023

I came up with a hacky workaround for Vector128<T> value returns on Windows x64 in interop scenarios, which I'm posting here in case it's helpful for others: https://gist.github.com/tgjones/2e8ea8e365837e67c1f2d8d67d92105e

I use a custom marshaller for the return type of a [LibraryImport] declaration:

[LibraryImport("MyNativeLibrary")]
[return: MarshalUsing(typeof(Vector128ReturnMarshaller<>))]
public static partial Vector128<float> MyNativeFunction(Vector128<float>* input);

And in the custom marshaller, I create a little bit of assembly code to copy the XMM0 register into an instance field in the marshaller. I write this assembly code to some native memory that is marked as executable, and then create a function pointer for it. Then I call that function pointer, and the instance field is returned from the marshaller's ToManaged implementation.

This is just for Windows at x64 - I haven't looked at other platforms / architectures.

As I say, this is just a hacky workaround, and there may well be better ways to do it, but this is just what I came up with.
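For context, a rough sketch of the same trick outside the marshaller machinery (names invented, Windows x64 only, and fragile: it relies on nothing clobbering XMM0 between the native call and the thunk call, which is not guaranteed):

using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static unsafe class Xmm0Capture
{
    [DllImport("kernel32.dll", SetLastError = true)]
    static extern IntPtr VirtualAlloc(IntPtr lpAddress, nuint dwSize, uint flAllocationType, uint flProtect);

    // vmovups xmmword ptr [rcx], xmm0 ; ret -- spills XMM0 to the pointer passed in RCX
    static readonly byte[] Code = { 0xC5, 0xF8, 0x11, 0x01, 0xC3 };

    public static readonly delegate* unmanaged[Cdecl]<Vector128<float>*, void> SpillXmm0;

    static Xmm0Capture()
    {
        const uint MEM_COMMIT = 0x1000, MEM_RESERVE = 0x2000, PAGE_EXECUTE_READWRITE = 0x40;
        IntPtr mem = VirtualAlloc(IntPtr.Zero, (nuint)Code.Length, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
        Marshal.Copy(Code, 0, mem, Code.Length);
        SpillXmm0 = (delegate* unmanaged[Cdecl]<Vector128<float>*, void>)(void*)mem;
    }
}

The native import would then be declared with a dummy non-vector return, and SpillXmm0(&result) called immediately afterwards to pick up the value.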

@xoofx
Member

xoofx commented Apr 12, 2023

I also ran into a similar case recently where I had to call an interop function taking/returning a Vector128<float> and couldn't make it work without hacking things together like @tgjones did.

Another thing I discovered is that even function pointers over managed functions that take/return Vector128<float> (e.g. delegate*<Vector128<float>, Vector128<float>>) generate invalid code as well (they assume the vector is a struct and pass it by pointer on the stack).

I'm wondering what happened to #32278 not to be merged actually?
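For reference, a minimal illustration of the pattern being described (assuming the .NET 7 operators; names invented):

using System.Runtime.Intrinsics;

static unsafe class FnPtrDemo
{
    static Vector128<float> AddOne(Vector128<float> v) => v + Vector128.Create(1f);

    static Vector128<float> Run()
    {
        // Under the Windows x64 default convention the Vector128<float>
        // argument is passed indirectly (by pointer) rather than in an XMM
        // register, which is the behavior being discussed here.
        delegate*<Vector128<float>, Vector128<float>> fp = &AddOne;
        return fp(Vector128.Create(2f));
    }
}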

@tannergooding
Member Author

they assume the Vector is a struct and pass it by pointer on the stack

This is correct for Windows. The default calling convention for x64 Windows does not pass any SIMD values in register; it only returns them in register (which we aren't doing today).

To be passed in register, you must use __vectorcall which is on the backlog to eventually support.

@xoofx
Member

xoofx commented Apr 12, 2023

To be passed in register, you must use __vectorcall which is on the backlog to eventually support.

But for function pointers to regular static managed functions, would that require this? Shouldn't it be part of the built-in support for managed functions? (Or should we make a Roslyn error when trying to take the address of a static managed function that has e.g. Vector128<float>?)

@tannergooding
Member Author

But for function pointers to regular static managed functions, would that require this?

The managed calling convention currently defaults to the "default calling convention" for the operating system. That is ms_abi for Windows and sysv_abi on Unix. We could certainly change this in the future to be __vectorcall on Windows instead, but that would be a transparent change for end users.

For unmanaged, you must annotate your function as UnmanagedCallersOnly; this also defaults to the "default calling convention" for the operating system. When we eventually get __vectorcall support added, there will be a corresponding CallConvVectorcall member added and you would annotate your UnmanagedCallersOnly with this as well as ensuring your fnptr is marked as delegate* unmanaged[Vectorcall]<...>.
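For reference, a hedged sketch of what that annotation looks like today with the default convention (CallConvVectorcall does not exist yet, and as noted elsewhere in this thread the runtime may still reject Vector128<T> in such signatures; the names below are illustrative only):

using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static unsafe class Exports
{
    // Uses the platform default convention today; once __vectorcall support
    // exists, a CallConvVectorcall entry would be added to CallConvs and the
    // function pointer type would become delegate* unmanaged[Vectorcall]<...>.
    [UnmanagedCallersOnly(CallConvs = new[] { typeof(CallConvCdecl) })]
    static Vector128<float> Doubled(Vector128<float> v) => v + v;

    public static readonly delegate* unmanaged[Cdecl]<Vector128<float>, Vector128<float>> DoubledPtr = &Doubled;
}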

(Or should we make a Roslyn error when trying to take the address of a static managed function that has e.g. Vector128<float>?)

I don't see the need or benefit; it would simply block a scenario that already works and works correctly. The scenario that doesn't work today is interop with a native function that uses __vectorcall (we also block Vector128<T> in general due to the bug in how vector returns work, but there are workarounds you can do to make it "work").

@xoofx
Member

xoofx commented Apr 12, 2023

I don't see the need or benefit it would simply block a scenario that already works and works correctly.

Oh right, I mixed things up: the managed code is still passing VectorXXX<T> arguments through the stack, but I thought it was not and was somehow expecting a vectorcall calling convention.

So that's indeed only for the case of unmanaged.

@jakobbotsch
Member

Sadly won't get to this one in 8.0 either.

@dmchurch

The managed calling convention currently defaults to the "default calling convention" for the operating system. That is ms_abi for Windows and sysv_abi on Unix. We could certainly change this in the future to be __vectorcall on Windows instead, but that would be a transparent change for end users.

Out of curiosity, what's the roadblock in executing on this? I just finished experimenting with using the SIMD intrinsic classes as a primitive for 4-component float/double vector math - originally wrapping them in a readonly struct for ease-of-use but eventually just using the native types and methods directly - and determined that it was wildly unsuitable for that. (I'm aware that, despite their name, vector math is not the primary use case for SIMD vector instructions.) The SIMD intrinsic methods (Vector128.AndNot and such, and also the operators) don't allow themselves to be inlined - they have to appear by name in the source of the method where they're used, or they'll emit a function call into the asm.

That wouldn't be quite as much of a problem as it is, if the vectors weren't going through a store/load with each and every call and return, and function prologs and epilogs everywhere. I wouldn't mind the extra calls nearly so much if they were just vaddps xmm0,xmm0,xmm1; ret - I'd expect the CPU cache to be able to inline that away itself, making it nearly as good as a compiler inline, but all these stack operations make that an impossibility.

It's obviously and sadly too late to do anything about this on current desktop systems, but is there anything that can be done to help move this along so that maybe in another five years or so, it might be possible to use SIMD vectors as vector math primitives in C# in an end-user-facing application?

Also, I had one other question: why are managed calls executed solely within the runtime required to conform to the system ABI, anyway? I'd have thought that the CLR JIT would make better use of system resources, given that - unlike basically every other language platform in the world - it knows what hardware it will run on as it compiles, and it has full executive control over both sides of every function call (at least, the ones that are managed-to-managed code). I don't think any other platform could say "we could change the calling convention to __vectorcall and it would be transparent to end users", and I don't understand why that hasn't been done, given the potential performance benefits. I'm sure there's a very good reason for the internal implementation of the code to respect the ABI even in cases where no unmanaged code is being used, but I have to admit I have not the slightest clue what it could be.

@tannergooding
Member Author

Out of curiosity, what's the roadblock in executing on this?

Mostly just being low priority. Interop with native code that directly takes SIMD parameters is rare, especially when most functionality can be trivially written in C#/F# instead and the places where it isn't trivial typically take arrays/pointers as input, not SIMD types.

I just finished experimenting with using the SIMD intrinsic classes as a primitive for 4-component float/double vector math

Why not just use System.Numerics.Vector4, which is already optimized to do this? It doesn't handle double today, but using double for this is rare in the first place.
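For the float case, a minimal sketch of what that looks like:

using System.Numerics;

Vector4 a = new(1f, 2f, 3f, 0f);
Vector4 b = new(4f, 5f, 6f, 0f);
Vector4 sum = a + b;              // accelerated by the JIT (addps on x64)
float dot = Vector4.Dot(a, b);    // dot product without writing any intrinsics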

The SIMD intrinsic methods (Vector128.AndNot and such, and also the operators) don't allow themselves to be inlined - they have to appear by name in the source of the method where they're used, or they'll emit a function call into the asm.

Not sure what you mean? Vector128.AndNot(x, y) emits andnps, andnpd, or pandn on x86/x64; there is no inlining as it's directly intrinsic. The same is true for most of the vector APIs.
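For example (a small sketch; the instruction names in the comment assume x64):

using System.Runtime.Intrinsics;

// Computes value & ~mask; the JIT emits andnps/pandn directly for this
// (with the operands swapped to match the hardware semantics) rather than a call.
static Vector128<float> MaskOff(Vector128<float> value, Vector128<float> mask)
    => Vector128.AndNot(value, mask);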

That wouldn't be quite as much of a problem as it is, if the vectors weren't going through a store/load with each and every call and return, and function prologs and epilogs everywhere. I wouldn't mind the extra calls nearly so much if they were just vaddps xmm0,xmm0,xmm1; ret - I'd expect the CPU cache to be able to inline that away itself, making it nearly as good as a compiler inline, but all these stack operations make that an impossibility.

Could you share some code, your CPU, what runtime you're targeting, and what you're seeing vs expecting to see? From what you're describing, it sounds like you're doing something unique or not looking in the right places to see the real codegen output.

Also, I had one other question: why are managed calls executed solely within the runtime required to conform to the system ABI, anyway?

They are not, but as a matter of convenience and overall performance (especially as it pertains to interop, which regularly happens at all layers of the stack), it tends to be the best approach.

That is, if we deviate from the standard, then every context switch now has to account for that and do additional save/restore work, which often doesn't pay off as compared to simply matching the underlying default ABI.

Additionally, things like __vectorcall aren't that common; you typically have the calls inlined, which avoids the need to pass things across a method boundary in register altogether.

@dmchurch

Why not just use System.Numerics.Vector4, which is already optimized to do this? It doesn't handle double today, but using double for this is rare in the first place.

I'd love to, but this is for game code - collision detection and the like, in a custom engine. It has to be at double precision, or the FP errors just add up too fast, especially when checking collisions at some distance from the measurement origin. (Also, ideally, it should be possible to use an int-based vector for doing block/tile math, without having to go through FP conversion.) The existing code uses unaligned structs to pass around vectors of all three types, and I'd really been hoping I could make a drop-in replacement for them based on the SIMD vectors.

They are not, but as a matter of convenience and overall performance (especially as it pertains to interop, which regularly happens at all layers of the stack), it tends to be the best approach.

Ah, okay, that makes sense! Yeah, if the managed codegen isn't using the system ABI, you'd have to keep two copies of the jitted code in memory or dynamically check and recompile on encountering an unmanaged ⇒ managed transition - and yeah, that's extra processor time spent on something that doesn't generally provide a benefit. Makes sense. It's good to know that isn't a hard requirement of the CLR, though.

Could you share some code, your CPU, what runtime you're targeting, and what you're seeing vs expecting to see? From what you're describing, it sounds like you're doing something unique or not looking in the right places to see the real codegen output.

I'd be delighted! I'm running release-optimized .NET 7 code compiled by Visual Studio Community 2022 on a Windows 11 laptop with a 12th-gen Core i9 processor. I followed these steps for peeking at the codegen - I'm compiling in release mode, it's including symbols, it's not disabling managed code optimization in the debug settings, and I'm setting a breakpoint in the method in question, which I've reproduced below:

public EnumCollideFlags ParticlePhysics.UpdateMotion(pos, ref motion, size)
using V128 = System.Runtime.Intrinsics.Vector128;
using V256 = System.Runtime.Intrinsics.Vector256;

using SIMDVec4i = System.Runtime.Intrinsics.Vector128<int>;
using SIMDVec4f = System.Runtime.Intrinsics.Vector128<float>;
using SIMDVec4l = System.Runtime.Intrinsics.Vector256<long>;
using SIMDVec4d = System.Runtime.Intrinsics.Vector256<double>;

    public partial class ParticlePhysics
    {
        public EnumCollideFlags UpdateMotion(SIMDVec4d pos, ref SIMDVec4f motion, float size)
        {
            double halfSize = size / 2;
            SIMDVec4d collMin = pos - V256.Create(halfSize, 0d, halfSize, 0d);
            SIMDVec4d collMax = pos + V256.Create(halfSize, halfSize, halfSize, 0d);
            FastCuboidd particleCollBox = new(collMin, collMax);

            motion = motion.Clamp(-MotionCap, MotionCap);

            SIMDVec4i minLoc = (particleCollBox.Min + motion.ToDouble()).ToInt();
            minLoc -= V128.Create(0, 1, 0, 0);  // -1 for the extra high collision box of fences
            SIMDVec4i maxLoc = (particleCollBox.Max + motion.ToDouble()).ToInt();

            minPos.Set(minLoc.X(), minLoc.Y(), minLoc.Z());
            maxPos.Set(maxLoc.X(), maxLoc.Y(), maxLoc.Z());

            EnumCollideFlags flags = 0;

            // It's easier to compute collisions if we're traveling forward on each axis
// BREAKPOINT BELOW ↓
            SIMDVec4d negativeComponents = motion.ToDouble().LessThan(SIMDVec4d.Zero);
            SIMDVec4d flipSigns = V256.ConditionalSelect(negativeComponents, V256.Create(-1d), V256.Create(1d));
            //SIMDVec4d flipSigns = negativeComponents.ConditionalSelect(SIMDVec.Double(-1), SIMDVec.Double(1));

            SIMDVec4d flippedMotion = motion.ToDouble() * flipSigns;
            particleCollBox.From *= flipSigns;
            particleCollBox.To *= flipSigns;

            particleCollBox.Normalize();

            SIMDVec4d motionMask = flippedMotion.GreaterThan(SIMDVec4d.Zero);

            fastBoxCount = 0;
            BlockAccess.WalkBlocks(minPos, maxPos, (cblock, x, y, z) => {
                Cuboidf[] collisionBoxes = cblock.GetParticleCollisionBoxes(BlockAccess, tmpPos.Set(x, y, z));
                if (collisionBoxes != null) {
                    foreach (Cuboidf collisionBox in collisionBoxes) {
                        FastCuboidd box = collisionBox;
                        box.From *= flipSigns;
                        box.To *= flipSigns;
                        box.Normalize();
                        while (fastBoxCount >= fastBoxList.Length) {
                            Array.Resize(ref fastBoxList, fastBoxList.Length * 2);
                        }
                        fastBoxList[fastBoxCount++] = box;
                    }
                }
            }, false);

            for (int i = 0; i < fastBoxCount; i++) {
                ref FastCuboidd box = ref fastBoxList[i];

                flags |= box.PushOutNormalized(particleCollBox, ref flippedMotion);
            }

            // Restore motion to non-flipped
            motion = (flippedMotion * flipSigns).ToFloat();

            return flags;
        }
    }

There are two declarations of the flipSigns variable, one commented out. When the code is in the state as I've listed it here, using all the SIMD intrinsics directly in the UpdateMotion method, the generated code shows up as follows in the Disassembly window:

            ;SIMDVec4d flipSigns = V256.ConditionalSelect(negativeComponents, V256.Create(-1d), V256.Create(1d));
00007FFA6CF2DFE8  vmovupd     ymm0,ymmword ptr [rbp-130h]  
00007FFA6CF2DFF0  vmovupd     ymm1,ymmword ptr [rbp-130h]  
00007FFA6CF2DFF8  vandpd      ymm0,ymm0,ymmword ptr [Vintagestory.API.Client.ParticlePhysics.UpdateMotion(System.Runtime.Intrinsics.Vector256`1<Double>, System.Runtime.Intrinsics.Vector128`1<Single> ByRef, Single)+07D0h (07FFA6CF2E340h)]  
00007FFA6CF2E000  vandnpd     ymm1,ymm1,ymmword ptr [Vintagestory.API.Client.ParticlePhysics.UpdateMotion(System.Runtime.Intrinsics.Vector256`1<Double>, System.Runtime.Intrinsics.Vector128`1<Single> ByRef, Single)+07F0h (07FFA6CF2E360h)]  
00007FFA6CF2E008  vorpd       ymm0,ymm0,ymm1  
00007FFA6CF2E00C  mov         rdx,qword ptr [rbp-40h]  
00007FFA6CF2E010  vmovupd     ymmword ptr [rdx+28h],ymm0  
            ;//SIMDVec4d flipSigns = negativeComponents.ConditionalSelect(SIMDVec.Double(-1), SIMDVec.Double(1));
            ;SIMDVec4d flippedMotion = motion.ToDouble() * flipSigns;

When I switch to the other declaration, which uses convenience static/extension methods which simply call and return the relevant intrinsic, the codegen is as follows:

            ;//SIMDVec4d flipSigns = V256.ConditionalSelect(negativeComponents, V256.Create(-1d), V256.Create(1d));
            ;SIMDVec4d flipSigns = negativeComponents.ConditionalSelect(SIMDVec.Double(-1), SIMDVec.Double(1));
00007FFA6FCFA518  lea         rcx,[rbp-290h]  
00007FFA6FCFA51F  vmovsd      xmm1,qword ptr [Vintagestory.API.Client.ParticlePhysics.UpdateMotion(System.Runtime.Intrinsics.Vector256`1<Double>, System.Runtime.Intrinsics.Vector128`1<Single> ByRef, Single)+0800h (07FFA6FCFA8B0h)]  
00007FFA6FCFA527  call        qword ptr [CLRStub[MethodDescPrestub]@00007FFA6FDA7630 (07FFA6FDA7630h)]  ; pointer to SIMDVec.Double
00007FFA6FCFA52D  lea         rcx,[rbp-2B0h]  
00007FFA6FCFA534  vmovsd      xmm1,qword ptr [Vintagestory.API.Client.ParticlePhysics.UpdateMotion(System.Runtime.Intrinsics.Vector256`1<Double>, System.Runtime.Intrinsics.Vector128`1<Single> ByRef, Single)+0808h (07FFA6FCFA8B8h)]  
00007FFA6FCFA53C  call        qword ptr [CLRStub[MethodDescPrestub]@00007FFA6FDA7630 (07FFA6FDA7630h)]  ; pointer to SIMDVec.Double
00007FFA6FCFA542  mov         rcx,qword ptr [rbp-40h]  
00007FFA6FCFA546  cmp         byte ptr [rcx],cl  
00007FFA6FCFA548  mov         rcx,qword ptr [rbp-40h]  
00007FFA6FCFA54C  add         rcx,28h  
00007FFA6FCFA550  mov         qword ptr [rbp-538h],rcx  
00007FFA6FCFA557  vmovupd     ymm0,ymmword ptr [rbp-130h]  
00007FFA6FCFA55F  vmovupd     ymmword ptr [rbp-4F0h],ymm0  
00007FFA6FCFA567  vmovupd     ymm0,ymmword ptr [rbp-290h]  
00007FFA6FCFA56F  vmovupd     ymmword ptr [rbp-510h],ymm0  
00007FFA6FCFA577  vmovupd     ymm0,ymmword ptr [rbp-2B0h]  
00007FFA6FCFA57F  vmovupd     ymmword ptr [rbp-530h],ymm0  
00007FFA6FCFA587  mov         rcx,qword ptr [rbp-538h]  
00007FFA6FCFA58E  lea         rdx,[rbp-4F0h]  
00007FFA6FCFA595  lea         r8,[rbp-510h]  
00007FFA6FCFA59C  lea         r9,[rbp-530h]  
00007FFA6FCFA5A3  call        qword ptr [CLRStub[MethodDescPrestub]@00007FFA6FE959C0 (07FFA6FE959C0h)]  ;pointer to SIMDVec.ConditionalSelect
            ;SIMDVec4d flippedMotion = motion.ToDouble() * flipSigns;
public static SIMDVec4d Double(double xyzw) codegen
        ;[MethodImpl(MethodImplOptions.AggressiveInlining)] public static SIMDVec4d Double(double xyzw) => V256.Create(xyzw);
00007FFA6FD01C00  push        rbp  
00007FFA6FD01C01  vzeroupper  
00007FFA6FD01C04  mov         rbp,rsp  
00007FFA6FD01C07  mov         qword ptr [rbp+10h],rcx  
00007FFA6FD01C0B  vmovsd      qword ptr [rbp+18h],xmm1  
00007FFA6FD01C10  vbroadcastsd ymm0,mmword ptr [rbp+18h]  
00007FFA6FD01C16  mov         rax,qword ptr [rbp+10h]  
00007FFA6FD01C1A  vmovupd     ymmword ptr [rax],ymm0  
00007FFA6FD01C1E  mov         rax,qword ptr [rbp+10h]  
00007FFA6FD01C22  vzeroupper  
00007FFA6FD01C25  pop         rbp  
00007FFA6FD01C26  ret  
public static Vector256 ConditionalSelect(this condition, valueIfTrue, valueIfFalse) codegen
        ;public static Vector256<T> ConditionalSelect<T>(this Vector256<T> condition, Vector256<T> valueIfTrue, Vector256<T> valueIfFalse) where T : struct => V256.ConditionalSelect(condition, valueIfTrue, valueIfFalse);
00007FFA6FD0C730  push        rbp  
00007FFA6FD0C731  sub         rsp,30h  
00007FFA6FD0C735  vzeroupper  
00007FFA6FD0C738  lea         rbp,[rsp+30h]  
00007FFA6FD0C73D  mov         qword ptr [rbp+10h],rcx  
00007FFA6FD0C741  mov         qword ptr [rbp+18h],rdx  
00007FFA6FD0C745  mov         qword ptr [rbp+20h],r8  
00007FFA6FD0C749  mov         qword ptr [rbp+28h],r9  
00007FFA6FD0C74D  mov         rax,qword ptr [rbp+18h]  
00007FFA6FD0C751  vmovupd     ymm0,ymmword ptr [rax]  
00007FFA6FD0C755  vmovupd     ymmword ptr [rbp-30h],ymm0  
00007FFA6FD0C75A  vmovupd     ymm0,ymmword ptr [rbp-30h]  
00007FFA6FD0C75F  vmovupd     ymm1,ymmword ptr [rbp-30h]  
00007FFA6FD0C764  mov         rax,qword ptr [rbp+20h]  
00007FFA6FD0C768  vandpd      ymm0,ymm0,ymmword ptr [rax]  
00007FFA6FD0C76C  mov         rax,qword ptr [rbp+28h]  
00007FFA6FD0C770  vandnpd     ymm1,ymm1,ymmword ptr [rax]  
00007FFA6FD0C774  vorpd       ymm0,ymm0,ymm1  
00007FFA6FD0C778  mov         rax,qword ptr [rbp+10h]  
00007FFA6FD0C77C  vmovupd     ymmword ptr [rax],ymm0  
00007FFA6FD0C780  mov         rax,qword ptr [rbp+10h]  
00007FFA6FD0C784  vzeroupper  
00007FFA6FD0C787  add         rsp,30h  
00007FFA6FD0C78B  pop         rbp  
00007FFA6FD0C78C  ret  

As you can see, all the relevant opcodes eventually get executed, but the amount of overhead dwarfs the actual business logic by an order of magnitude or more. This pattern - that the intrinsic instructions only ever get emitted into the method that mentions the VectorNNN class by name (or, in this case, by type alias) - held no matter what MethodImpl attributes I applied and what qualifiers I added to the various parameters (in, out, ref, scoped, etc). When I was using a readonly struct to hold the VectorNNN value (in order to take advantage of custom operator overloads, type conversions, and so forth) the result was the same, but with an additional layer of indirection (despite my efforts, I never managed to convince the runtime to reinterpret a VectorNNN as a wrapper struct, or a wrapper struct as a VectorNNN, without at least an extra copy, and usually including an additional function call to the non-inlined conversion, as well).

If you've got any insight, or any thoughts on other avenues to try, I'd be extremely interested to hear!

@tannergooding
Member Author

I'd love to, but this is for game code - collision detection and the like, in a custom engine. It has to be at double precision, or the FP errors just add up too fast, especially when checking collisions at some distance from the measurement origin

This is a common bug in how you're doing your logic. You should implement something akin to a floating origin and chunk your world so that you're never doing computations in a way that could cause such large errors to exist.

There are many talks about this presented at things like GDC, including deep dives from AAA game companies; they all tend towards using float for performance (both CPU and GPU), and some have even switched to half (GPU) for performance reasons in really hot code.

I'd be delighted! I'm running release-optimized .NET 7 code compiled by Visual Studio Community 2022 on a Windows 11 laptop with a 12th-gen Core i9 processor. I followed these steps for peeking at the codegen - I'm compiling in release mode, it's including symbols, it's not disabling managed code optimization in the debug settings, and I'm setting a breakpoint in the method in question, which I've reproduced below:

Notably, in .NET 7+ you can also just use DOTNET_JitDisasm="MethodName" and run your program, that's a bit easier/more reliable and will let you see both the T0 (unoptimized) and T1 (optimized) codegen, as well as whether or not it was invoked enough to allow rejit to occur and therefore T1 code to be produced.
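For example (assuming the method name from the snippet above), setting DOTNET_JitDisasm="UpdateMotion" in the environment before launching prints the JIT's disassembly for that method to the console, and additionally setting DOTNET_TieredCompilation=0 skips tier-0 so only the optimized code is produced.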

@EgorBo has also provided the amazing https://github.com/EgorBo/Disasmo extension for VS which makes it very trivial to right click->Disasm this and see the optimized disassembly output (with UI to help with a number of other common scenarios as well).

As you can see, all the relevant opcodes eventually get executed, but the amount of overhead dwarfs the actual business logic by an order of magnitude or more.

This should be fixed in .NET 8+ (not enough code here for me to 100% confirm that, however). I expect the issue was due to the shadow copy and lack of forward sub. If you had taken the parameters as in, the JIT would likely do the right thing on .NET 7 as well, since it would avoid the shadow copy and see it's not mutated after inlining.

Notably, rather than multiplying by -1 or +1, just x ^ V256.Create(-0.0) to flip the sign (float/double use a sign-magnitude representation, so you just need to toggle the sign bit, hence xor with negative zero).
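As a sketch of wiring that into the conditional flip above (variable names assumed from the earlier snippet):

// negativeComponents is the LessThan mask; motionDouble stands in for motion.ToDouble().
Vector256<double> signMask = negativeComponents & Vector256.Create(-0.0);  // sign bit set only in negative lanes
Vector256<double> flippedMotion = motionDouble ^ signMask;                 // flips the sign only where negative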

@dmchurch

Notably, rather than multiplying by -1 or +1, just x ^ V256.Create(-0.0) to flip the sign (float/double use a sign-magnitude representation, so you just need to toggle the sign bit, hence xor with negative zero).

Brilliant, using -0.0! I'd thought of xor'ing the sign bit, but I was feeling lazy and couldn't recall what the exact representation of a double was. (And in any case, I was just doing this as a proof-of-concept, to see what the runtime does with the SIMD vectors in general - most especially, whether they could be passed in registers across method calls. This game passes a lot of vectors across method calls. It's also worth noting that the implementation in question does not, in fact, actually work - I didn't bother to work out any of the bugs, I just wanted a set of representative SIMD operations in a method that does vector math.)

You should implement something akin to a floating-origin and chunking your world so that you're never doing computations in a way that could cause such large errors to exist.

Yeah, that'd have been my thought as well, actually - I may have seen one or two of those GDC talks 😂 This isn't my game or my code, though, so I mainly wanted to see if it was even a possibility to make a drop-in replacement for the existing double-based vector structs, because if so, I could toggle between the two implementations with nothing more than a build flag.

Notably, in .NET 7+ you can also just use DOTNET_JitDisasm="MethodName" and run your program

Oooh, that's neat! I'm definitely gonna have to play around with both that and the Disasmo extension, because I love poking into implementations and seeing how they work.

If you had taken the parameters as in then the JIT would likely do the right thing on .NET 7 as well

I did try putting in or scoped in on various parameters, as well as moving things between instance methods and extension methods (when I was using the readonly struct wrapper), but I wasn't able to figure out a combination that would allow it to inline an intrinsic. (That doesn't mean there isn't one, but if there is, I couldn't figure out what it should be.) That said, even if I could get that working properly, my gut instinct is that without being able to pass a SIMD vector in a single register, there will just be too much overhead for it to grant any performance benefit.

Which, I mean, again - this isn't what SIMD is for, so I knew I was gonna be working against the grain when I started poking into this 😅

@tannergooding
Member Author

Is the code open source on GitHub? I might be able to give it a lookover and provide suggested fixes for the obvious cases and ensure we have tracking bugs for any of the less obvious ones.

@dmchurch

Not yet but I'll clean this up a touch and push it so you can take a look! It's a fork of https://github.com/anegostudios/vsapi, with the existing non-readonly structs in https://github.com/anegostudios/vsapi/tree/master/Math/Vector (they're the FastVec types - the Vec types are classes).

@dmchurch

I've gone ahead and pushed my code to dmchurch/vsapi@simd-experiments but please, don't spend too much time on this! Especially since Disasmo has shown me that what I was looking at was, in fact, the T0 compilation - looks like the learn.microsoft link I followed is a bit out of date on its recommendations, ha.

Anyway, feel free to take a look, but please don't take this as anything other than me experimenting with the SIMD capabilities of C#. 😂
