Add the appropriate ABI handling for the SIMD HWIntrinsic types #9578
Comments
This would change the calling convention for these types. In general, calling convention changes invalidate precompiled code out there. I think it would be a good idea to disable loading of these types during crossgen so that they are not baked into any precompiled code and we do not have our hands tied with changing the calling convention.
@jkotas, is there a good example of existing types disabled during crossgen? I would like to include that change in dotnet/coreclr#15942, if possible.
Take a look at how this is done for similar types.
@jkotas, @CarolEidt - part of the ABI work for these types is respecting their larger packing sizes (8 for `__m64`, 16 for `__m128`, 32 for `__m256`). Do you think it is reasonable to have the packing sizes respected for v1? (It looks like it only needs a relatively small update in the VM layer.)
I think it is reasonable.
Should this be in 2.1 (not 2.0.x)?
Yes, I believe so.
An explicit example of where the current ABI handling is wrong is SIMD returns on x64 Windows. The default calling convention for x64 Windows specifies that `__m128`, `__m128i`, and `__m128d` are returned in XMM0: https://docs.microsoft.com/en-us/cpp/build/x64-calling-convention?view=vs-2019#return-values These types correspond to the `System.Runtime.Intrinsics.Vector128` type on the managed side, and it is not currently being returned in XMM0. dotnet/coreclr#23899 adds support for passing `Vector128` across interop boundaries, so this will need to be correctly handled.
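For illustration, a minimal sketch of the kind of interop signature that hits this return-value path (the library and function names here are hypothetical, and depending on runtime version this scenario may require opting out of runtime marshalling):

```csharp
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

internal static class Native
{
    // Native side (C): __m128 get_vector(void);
    // Per the x64 Windows ABI, the native result comes back in XMM0, so the
    // managed side must also read the return value from XMM0 to be correct.
    [DllImport("mynative")] // hypothetical library name
    internal static extern Vector128<float> GetVector();
}
```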
@tannergooding is this still relevant?
Yes. We still need to ensure that these types are being correctly handled at the ABI level so we can enable them for interop scenarios, and so we know we have good/correct/performant codegen even in managed-only land.
This won't make 7.0.
I came up with a hacky workaround for this. I use a custom marshaller for the return type of a `LibraryImport` function:

```csharp
[LibraryImport("MyNativeLibrary")]
[return: MarshalUsing(typeof(Vector128ReturnMarshaller<>))]
public static partial Vector128<float> MyNativeFunction(Vector128<float>* input);
```

In the custom marshaller, I create a little bit of assembly code to copy the XMM0 register into an instance field in the marshaller. I write this assembly code to some native memory that is marked as executable, and then create a function pointer for it. Then I call that function pointer, and the instance field is returned from the marshaller.

This is just for Windows on x64 - I haven't looked at other platforms/architectures. As I say, this is just a hacky workaround, and there may well be better ways to do it, but this is just what I came up with.
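A minimal sketch of the executable-thunk part of that trick, assuming Windows x64 (the machine code bytes and helper names are illustrative, not the commenter's exact code):

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

internal static unsafe class Xmm0Capture
{
    [DllImport("kernel32")]
    private static extern void* VirtualAlloc(void* addr, nuint size, uint allocType, uint protect);

    private const uint MEM_COMMIT = 0x1000, MEM_RESERVE = 0x2000, PAGE_EXECUTE_READWRITE = 0x40;

    // x64 machine code for: movups [rcx], xmm0 ; ret
    // i.e. "store whatever is currently in XMM0 to the buffer passed in RCX".
    private static readonly byte[] Thunk = { 0x0F, 0x11, 0x01, 0xC3 };

    public static delegate* unmanaged<Vector128<float>*, void> Create()
    {
        // Allocate executable memory, copy the thunk bytes in, and hand back
        // a callable function pointer. (No error handling, sketch only.)
        void* mem = VirtualAlloc(null, (nuint)Thunk.Length, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
        Thunk.AsSpan().CopyTo(new Span<byte>(mem, Thunk.Length));
        return (delegate* unmanaged<Vector128<float>*, void>)mem;
    }
}
```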
I also ran into a similar case recently, where I had to call an interop function taking/returning a SIMD vector type. One thing I also discovered is that even function pointers over managed functions that take/return these types are affected. I'm wondering what happened to #32278 - why was it never merged?
This is correct for Windows. The default calling convention for x64 Windows does not pass any SIMD values in registers; it only returns them in a register (and we aren't doing the latter today). To be passed in registers, you must use `__vectorcall`.
But for function pointers to regular static managed functions, would that require this? Shouldn't it be part of the built-in support for managed functions? (Or should we make a Roslyn error when trying to take the address of a static managed function that has e.g. SIMD parameters?)
The managed calling convention currently defaults to the "default calling convention" for the operating system - that is, one where SIMD values are not passed in registers.
I don't see the need or benefit; it would simply block a scenario that already works, and works correctly. The scenario that doesn't work today is interop with a native function that uses `__vectorcall`.
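For reference, a sketch of the managed-to-managed scenario that already works (all names here are illustrative, assuming .NET 7+ `Vector128` operators):

```csharp
using System.Runtime.Intrinsics;

internal static unsafe class Works
{
    // A regular static managed function using SIMD types.
    private static Vector128<float> AddOne(Vector128<float> v)
        => v + Vector128.Create(1f);

    public static void Demo()
    {
        // Taking the address of a managed function and calling it through a
        // managed function pointer works today, because both sides agree on
        // the managed calling convention (SIMD values passed by reference).
        delegate*<Vector128<float>, Vector128<float>> fp = &AddOne;
        Vector128<float> result = fp(Vector128.Create(2f));
    }
}
```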
Oh right, I mixed up the fact that managed code is still passing these values by reference. So that's indeed only for the unmanaged case.
Sadly won't get to this one in 8.0 either.
Out of curiosity, what's the roadblock in executing on this? I just finished experimenting with using the SIMD intrinsic classes as a primitive for 4-component float/double vector math - originally wrapping them in a wrapper struct.

That wouldn't be quite as much of a problem as it is, if the vectors weren't going through a store/load with each and every call and return, and function prologs and epilogs everywhere. I wouldn't mind the extra calls nearly so much if they were just passing the vectors in registers.

It's obviously and sadly too late to do anything about this on current desktop systems, but is there anything that can be done to help move this along, so that maybe in another five years or so it might be possible to use SIMD vectors as vector math primitives in C# in an end-user-facing application?

Also, I had one other question: why are managed calls executed solely within the runtime required to conform to the system ABI, anyway? I'd have thought that the CLR JIT would make better use of system resources, given that - unlike basically every other language platform in the world - it knows what hardware it will run on as it compiles, and it has full executive control over both sides of every function call (at least, the ones that are managed-to-managed code). I don't think any other platform could say "we could change the calling convention to something faster" and actually do it.
Mostly it's just been low priority. Interop with native code that directly takes SIMD parameters is rare, especially when most functionality can be trivially written in C#/F# instead, and the places where it isn't trivial typically take arrays/pointers as input, not SIMD types.
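For example, a typical interop surface avoids SIMD types entirely and just hands the native side a buffer to vectorize over (hypothetical library and function names):

```csharp
using System.Runtime.InteropServices;

internal static partial class NativeMath
{
    // Rather than passing Vector128<float> values directly, real-world
    // interop usually passes a pointer plus a length.
    [LibraryImport("mynative")] // hypothetical library name
    internal static unsafe partial void TransformVectors(float* data, nuint count);
}
```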
Why not just use `System.Numerics.Vector4`?
Not sure what you mean?
Could you share some code, your CPU, what runtime you're targeting, and what you're seeing vs expecting to see? From what you're describing, it sounds like you're doing something unique or not looking in the right places to see the real codegen output.
They are not, but as a matter of convenience and overall performance (especially as it pertains to interop, which regularly happens at all layers of the stack), it tends to be the best approach. That is, if we deviate from the standard, then every context switch now has to account for that and do additional save/restore work, which often doesn't pay off as compared to simply matching the underlying default ABI. Additionally, things like debuggers, profilers, and stack unwinding all expect the default ABI.
I'd love to, but this is for game code - collision detection and the like, in a custom engine. It has to be at double precision, or the FP errors just add up too fast, especially when checking collisions at some distance from the measurement origin. (Also, ideally, it should be possible to use an existing SIMD vector type directly, rather than wrapping it.)
Ah, okay, that makes sense! Yeah, if the managed codegen isn't using the system ABI, you'd have to keep two copies of the jitted code in memory or dynamically check and recompile on encountering an unmanaged ⇒ managed transition - and yeah, that's extra processor time spent on something that doesn't generally provide a benefit. Makes sense. It's good to know that isn't a hard requirement of the CLR, though.
I'd be delighted! I'm running release-optimized .NET 7 code compiled by Visual Studio Community 2022 on a Windows 11 laptop with a 12th-gen Core i9 processor. I followed these steps for peeking at the codegen - I'm compiling in release mode, it's including symbols, it's not disabling managed code optimization in the debug settings, and I'm setting a breakpoint in the method in question, `ParticlePhysics.UpdateMotion(pos, ref motion, size)`, which I've reproduced below:

```csharp
using V128 = System.Runtime.Intrinsics.Vector128;
using V256 = System.Runtime.Intrinsics.Vector256;
using SIMDVec4i = System.Runtime.Intrinsics.Vector128<int>;
using SIMDVec4f = System.Runtime.Intrinsics.Vector128<float>;
using SIMDVec4l = System.Runtime.Intrinsics.Vector256<long>;
using SIMDVec4d = System.Runtime.Intrinsics.Vector256<double>;

public partial class ParticlePhysics
{
    public EnumCollideFlags UpdateMotion(SIMDVec4d pos, ref SIMDVec4f motion, float size)
    {
        double halfSize = size / 2;
        SIMDVec4d collMin = pos - V256.Create(halfSize, 0d, halfSize, 0d);
        SIMDVec4d collMax = pos + V256.Create(halfSize, halfSize, halfSize, 0d);
        FastCuboidd particleCollBox = new(collMin, collMax);

        motion = motion.Clamp(-MotionCap, MotionCap);

        SIMDVec4i minLoc = (particleCollBox.Min + motion.ToDouble()).ToInt();
        minLoc -= V128.Create(0, 1, 0, 0); // -1 for the extra high collision box of fences
        SIMDVec4i maxLoc = (particleCollBox.Max + motion.ToDouble()).ToInt();

        minPos.Set(minLoc.X(), minLoc.Y(), minLoc.Z());
        maxPos.Set(maxLoc.X(), maxLoc.Y(), maxLoc.Z());

        EnumCollideFlags flags = 0;

        // It's easier to compute collisions if we're traveling forward on each axis
        // BREAKPOINT BELOW ↓
        SIMDVec4d negativeComponents = motion.ToDouble().LessThan(SIMDVec4d.Zero);
        SIMDVec4d flipSigns = V256.ConditionalSelect(negativeComponents, V256.Create(-1d), V256.Create(1d));
        //SIMDVec4d flipSigns = negativeComponents.ConditionalSelect(SIMDVec.Double(-1), SIMDVec.Double(1));

        SIMDVec4d flippedMotion = motion.ToDouble() * flipSigns;
        particleCollBox.From *= flipSigns;
        particleCollBox.To *= flipSigns;
        particleCollBox.Normalize();

        SIMDVec4d motionMask = flippedMotion.GreaterThan(SIMDVec4d.Zero);
        fastBoxCount = 0;
        BlockAccess.WalkBlocks(minPos, maxPos, (cblock, x, y, z) => {
            Cuboidf[] collisionBoxes = cblock.GetParticleCollisionBoxes(BlockAccess, tmpPos.Set(x, y, z));
            if (collisionBoxes != null) {
                foreach (Cuboidf collisionBox in collisionBoxes) {
                    FastCuboidd box = collisionBox;
                    box.From *= flipSigns;
                    box.To *= flipSigns;
                    box.Normalize();
                    while (fastBoxCount >= fastBoxList.Length) {
                        Array.Resize(ref fastBoxList, fastBoxList.Length * 2);
                    }
                    fastBoxList[fastBoxCount++] = box;
                }
            }
        }, false);

        for (int i = 0; i < fastBoxCount; i++) {
            ref FastCuboidd box = ref fastBoxList[i];
            flags |= box.PushOutNormalized(particleCollBox, ref flippedMotion);
        }

        // Restore motion to non-flipped
        motion = (flippedMotion * flipSigns).ToFloat();
        return flags;
    }
}
```

There are two declarations of the `flipSigns` variable above (one commented out). With the first declaration, which calls the `V256` intrinsics directly, the codegen at the breakpoint is:

```asm
;SIMDVec4d flipSigns = V256.ConditionalSelect(negativeComponents, V256.Create(-1d), V256.Create(1d));
00007FFA6CF2DFE8 vmovupd ymm0,ymmword ptr [rbp-130h]
00007FFA6CF2DFF0 vmovupd ymm1,ymmword ptr [rbp-130h]
00007FFA6CF2DFF8 vandpd ymm0,ymm0,ymmword ptr [Vintagestory.API.Client.ParticlePhysics.UpdateMotion(System.Runtime.Intrinsics.Vector256`1<Double>, System.Runtime.Intrinsics.Vector128`1<Single> ByRef, Single)+07D0h (07FFA6CF2E340h)]
00007FFA6CF2E000 vandnpd ymm1,ymm1,ymmword ptr [Vintagestory.API.Client.ParticlePhysics.UpdateMotion(System.Runtime.Intrinsics.Vector256`1<Double>, System.Runtime.Intrinsics.Vector128`1<Single> ByRef, Single)+07F0h (07FFA6CF2E360h)]
00007FFA6CF2E008 vorpd ymm0,ymm0,ymm1
00007FFA6CF2E00C mov rdx,qword ptr [rbp-40h]
00007FFA6CF2E010 vmovupd ymmword ptr [rdx+28h],ymm0
;//SIMDVec4d flipSigns = negativeComponents.ConditionalSelect(SIMDVec.Double(-1), SIMDVec.Double(1));
;SIMDVec4d flippedMotion = motion.ToDouble() * flipSigns;
```

When I switch to the other declaration, which uses convenience static/extension methods that simply call and return the relevant intrinsic, the codegen is as follows:

```asm
;//SIMDVec4d flipSigns = V256.ConditionalSelect(negativeComponents, V256.Create(-1d), V256.Create(1d));
;SIMDVec4d flipSigns = negativeComponents.ConditionalSelect(SIMDVec.Double(-1), SIMDVec.Double(1));
00007FFA6FCFA518 lea rcx,[rbp-290h]
00007FFA6FCFA51F vmovsd xmm1,qword ptr [Vintagestory.API.Client.ParticlePhysics.UpdateMotion(System.Runtime.Intrinsics.Vector256`1<Double>, System.Runtime.Intrinsics.Vector128`1<Single> ByRef, Single)+0800h (07FFA6FCFA8B0h)]
00007FFA6FCFA527 call qword ptr [CLRStub[MethodDescPrestub]@00007FFA6FDA7630 (07FFA6FDA7630h)] ; pointer to SIMDVec.Double
00007FFA6FCFA52D lea rcx,[rbp-2B0h]
00007FFA6FCFA534 vmovsd xmm1,qword ptr [Vintagestory.API.Client.ParticlePhysics.UpdateMotion(System.Runtime.Intrinsics.Vector256`1<Double>, System.Runtime.Intrinsics.Vector128`1<Single> ByRef, Single)+0808h (07FFA6FCFA8B8h)]
00007FFA6FCFA53C call qword ptr [CLRStub[MethodDescPrestub]@00007FFA6FDA7630 (07FFA6FDA7630h)] ; pointer to SIMDVec.Double
00007FFA6FCFA542 mov rcx,qword ptr [rbp-40h]
00007FFA6FCFA546 cmp byte ptr [rcx],cl
00007FFA6FCFA548 mov rcx,qword ptr [rbp-40h]
00007FFA6FCFA54C add rcx,28h
00007FFA6FCFA550 mov qword ptr [rbp-538h],rcx
00007FFA6FCFA557 vmovupd ymm0,ymmword ptr [rbp-130h]
00007FFA6FCFA55F vmovupd ymmword ptr [rbp-4F0h],ymm0
00007FFA6FCFA567 vmovupd ymm0,ymmword ptr [rbp-290h]
00007FFA6FCFA56F vmovupd ymmword ptr [rbp-510h],ymm0
00007FFA6FCFA577 vmovupd ymm0,ymmword ptr [rbp-2B0h]
00007FFA6FCFA57F vmovupd ymmword ptr [rbp-530h],ymm0
00007FFA6FCFA587 mov rcx,qword ptr [rbp-538h]
00007FFA6FCFA58E lea rdx,[rbp-4F0h]
00007FFA6FCFA595 lea r8,[rbp-510h]
00007FFA6FCFA59C lea r9,[rbp-530h]
00007FFA6FCFA5A3 call qword ptr [CLRStub[MethodDescPrestub]@00007FFA6FE959C0 (07FFA6FE959C0h)] ;pointer to SIMDVec.ConditionalSelect
;SIMDVec4d flippedMotion = motion.ToDouble() * flipSigns;
```

Here is the codegen for `public static SIMDVec4d Double(double xyzw)`:

```asm
;[MethodImpl(MethodImplOptions.AggressiveInlining)] public static SIMDVec4d Double(double xyzw) => V256.Create(xyzw);
00007FFA6FD01C00 push rbp
00007FFA6FD01C01 vzeroupper
00007FFA6FD01C04 mov rbp,rsp
00007FFA6FD01C07 mov qword ptr [rbp+10h],rcx
00007FFA6FD01C0B vmovsd qword ptr [rbp+18h],xmm1
00007FFA6FD01C10 vbroadcastsd ymm0,mmword ptr [rbp+18h]
00007FFA6FD01C16 mov rax,qword ptr [rbp+10h]
00007FFA6FD01C1A vmovupd ymmword ptr [rax],ymm0
00007FFA6FD01C1E mov rax,qword ptr [rbp+10h]
00007FFA6FD01C22 vzeroupper
00007FFA6FD01C25 pop rbp
00007FFA6FD01C26 ret
```

And the codegen for `public static Vector256<T> ConditionalSelect<T>(this Vector256<T> condition, Vector256<T> valueIfTrue, Vector256<T> valueIfFalse)`:

```asm
;public static Vector256<T> ConditionalSelect<T>(this Vector256<T> condition, Vector256<T> valueIfTrue, Vector256<T> valueIfFalse) where T : struct => V256.ConditionalSelect(condition, valueIfTrue, valueIfFalse);
00007FFA6FD0C730 push rbp
00007FFA6FD0C731 sub rsp,30h
00007FFA6FD0C735 vzeroupper
00007FFA6FD0C738 lea rbp,[rsp+30h]
00007FFA6FD0C73D mov qword ptr [rbp+10h],rcx
00007FFA6FD0C741 mov qword ptr [rbp+18h],rdx
00007FFA6FD0C745 mov qword ptr [rbp+20h],r8
00007FFA6FD0C749 mov qword ptr [rbp+28h],r9
00007FFA6FD0C74D mov rax,qword ptr [rbp+18h]
00007FFA6FD0C751 vmovupd ymm0,ymmword ptr [rax]
00007FFA6FD0C755 vmovupd ymmword ptr [rbp-30h],ymm0
00007FFA6FD0C75A vmovupd ymm0,ymmword ptr [rbp-30h]
00007FFA6FD0C75F vmovupd ymm1,ymmword ptr [rbp-30h]
00007FFA6FD0C764 mov rax,qword ptr [rbp+20h]
00007FFA6FD0C768 vandpd ymm0,ymm0,ymmword ptr [rax]
00007FFA6FD0C76C mov rax,qword ptr [rbp+28h]
00007FFA6FD0C770 vandnpd ymm1,ymm1,ymmword ptr [rax]
00007FFA6FD0C774 vorpd ymm0,ymm0,ymm1
00007FFA6FD0C778 mov rax,qword ptr [rbp+10h]
00007FFA6FD0C77C vmovupd ymmword ptr [rax],ymm0
00007FFA6FD0C780 mov rax,qword ptr [rbp+10h]
00007FFA6FD0C784 vzeroupper
00007FFA6FD0C787 add rsp,30h
00007FFA6FD0C78B pop rbp
00007FFA6FD0C78C ret
```

As you can see, all the relevant opcodes eventually get executed, but the amount of overhead dwarfs the actual business logic by an order of magnitude or more. This pattern - that the intrinsic instructions only ever get emitted into the method that mentions the VectorNNN class by name (or, in this case, by type alias) - held no matter what I tried. If you've got any insight, or any thoughts on other avenues to try, I'd be extremely interested to hear!
This is a common bug in how you're doing your logic. You should implement something akin to a floating origin and chunk your world, so that you're never doing computations in a way that could cause such large errors to exist. There are many talks about this presented at things like GDC, and deep-dive talks given by AAA game companies; they all tend towards using `float` with a floating origin rather than `double` everywhere.
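A minimal sketch of the floating-origin idea, under the assumption that positions are rebased around a local origin whenever the camera drifts too far (all names are illustrative):

```csharp
using System;
using System.Numerics;

// Positions are stored relative to a movable local origin, so the float
// coordinates actually used for math stay small and precise.
internal sealed class FloatingOrigin
{
    private const float RebaseThreshold = 4096f;

    public Vector3 OriginOffset { get; private set; } // running world-space offset

    // Call once per frame with the camera's local-space position.
    public Vector3 MaybeRebase(Vector3 cameraLocal, Span<Vector3> objectsLocal)
    {
        if (cameraLocal.Length() < RebaseThreshold)
            return cameraLocal;

        // Shift everything so the camera is back near (0,0,0).
        OriginOffset += cameraLocal;
        for (int i = 0; i < objectsLocal.Length; i++)
            objectsLocal[i] -= cameraLocal;
        return Vector3.Zero;
    }
}
```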
Notably, in .NET 7+ you can also just use the `DOTNET_JitDisasm` environment variable to dump the actual codegen for a method. @EgorBo has also provided the amazing https://github.com/EgorBo/Disasmo extension for VS, which makes it very trivial to view the disassembly of a method right from the editor.
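For instance, a minimal sketch of isolating a method for inspection (assuming .NET 7+, run with the `DOTNET_JitDisasm` environment variable set to the method name):

```csharp
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

internal static class DisasmDemo
{
    // Keep the method out-of-line so its codegen is easy to find; then run
    // with e.g.: DOTNET_JitDisasm=Twice DOTNET_TieredCompilation=0
    [MethodImpl(MethodImplOptions.NoInlining)]
    internal static Vector256<double> Twice(Vector256<double> v)
        => v + v;
}
```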
This should be fixed in .NET 8+ (not enough code here for me to 100% confirm that, however). I expect the issue was due to the shadow copy and lack of forward substitution; if you had taken the parameters another way, it may not have reproduced.

Notably, rather than multiplying by `-1`/`+1`, you can conditionally XOR the sign bit using `-0.0`, which avoids the multiply entirely.
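A small sketch of that trick, assuming .NET 7+ `Vector256` operators (names are illustrative):

```csharp
using System.Runtime.Intrinsics;

internal static class SignFlip
{
    // Conditionally negate the lanes of `motion` that are negative:
    // the comparison produces an all-ones mask per negative lane, the AND
    // keeps -0.0 (just the sign bit) in those lanes, and the XOR flips them.
    internal static Vector256<double> Abs(Vector256<double> motion)
    {
        Vector256<double> negative = Vector256.LessThan(motion, Vector256<double>.Zero);
        Vector256<double> signBits = negative & Vector256.Create(-0.0);
        return motion ^ signBits; // equivalent to a per-lane Math.Abs
    }
}
```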
Brilliant, using -0.0! I'd thought of xor'ing the sign bit, but I was feeling lazy and couldn't recall what the exact representation of a double was. (And in any case, I was just doing this as a proof-of-concept, to see what the runtime does with the SIMD vectors in general - most especially, whether they could be passed in registers across method calls. This game passes a lot of vectors across method calls. It's also worth noting that the implementation in question does not, in fact, actually work - I didn't bother to work out any of the bugs, I just wanted a set of representative SIMD operations in a method that does vector math.)
Yeah, that'd have been my thought as well, actually - I may have seen one or two of those GDC talks 😂 This isn't my game or my code, though, so I mainly wanted to see if it was even a possibility to make a drop-in replacement for the existing double-based vector structs, because if so, I could toggle between the two implementations with nothing more than a build flag.
Oooh, that's neat! I'm definitely gonna have to play around with both that and the Disasmo extension, because I love poking into implementations and seeing how they work.
I did try a few other arrangements, without much change. Which, I mean, again - this isn't what SIMD is for, so I knew I was gonna be working against the grain when I started poking into this 😅
Is the code open source on GitHub? I might be able to give it a lookover and provide suggested fixes for the obvious cases and ensure we have tracking bugs for any of the less obvious ones.
Not yet, but I'll clean this up a touch and push it so you can take a look! It's a fork of https://github.com/anegostudios/vsapi, with the existing non-readonly structs in https://github.com/anegostudios/vsapi/tree/master/Math/Vector (they're the ones I'm experimenting with replacing).
I've gone ahead and pushed my code to dmchurch/vsapi@simd-experiments, but please, don't spend too much time on this! Especially since Disasmo has shown me that what I was looking at was, in fact, the T0 compilation - looks like the learn.microsoft link I followed is a bit out of date on its recommendations, ha. Anyway, feel free to take a look, but please don't take this as anything other than me experimenting with the SIMD capabilities of C#. 😂
The SIMD HWIntrinsic types (`Vector64<T>`, `Vector128<T>`, and `Vector256<T>`) are special and represent the `__m64`, `__m128`, and `__m256` ABI types.

These types have special handling in both the System V and Windows ABIs and are treated as "scalar" (e.g. non-aggregate and non-union) types for the purpose of parameter passing and value returns. They additionally play some role in the selection of MultiReg or HVA (also known as HFA) structs.
We should add the appropriate support for these types to ensure we are meeting the requirements of the underlying ABI for a given platform/system.
category:correctness
theme:runtime
skill-level:expert
cost:large
impact:medium