Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added SIMD path for Vector3, Quaternion and Matrix4x4 operations #99547

Closed
wants to merge 7 commits into from

Conversation

martenf
Copy link
Contributor

@martenf martenf commented Mar 11, 2024

Added SIMD path using Vector128 (and Vector256 for Matrix4x4) for:

Vector3.Cross(Vector3 vector1, Vector3 vector2)

Method Mean Error StdDev Ratio RatioSD
Before 1.3289 ns 0.0531 ns 0.0708 ns 1.00 0.00
After 0.6613 ns 0.0399 ns 0.0833 ns 0.51 0.05

Vector3.Transform(Vector3 value, Quaternion rotation)

Method Mean Error StdDev Ratio RatioSD
Before 3.922 ns 0.1034 ns 0.1380 ns 1.00 0.00
After 1.258 ns 0.0508 ns 0.0776 ns 0.32 0.02

Quaternion operator *(Quaternion value1, Quaternion value2)

Method Mean Error StdDev Ratio RatioSD
dotnet8 4.0531 ns 0.1008 ns 0.0990 ns 1.00 0.00
dotnet9 0.7203 ns 0.0312 ns 0.0292 ns 0.18 0.01
after 0.5752 ns 0.0185 ns 0.0173 ns 0.14 0.00

Quaternion operator /(Quaternion value1, Quaternion value2)

Method Mean Error StdDev Ratio RatioSD
before 6.010 ns 0.1399 ns 0.1240 ns 1.00 0.00
after 2.377 ns 0.0334 ns 0.0296 ns 0.40 0.01

Quaternion Concatenate(Quaternion value1, Quaternion value2)

Method Mean Error StdDev Ratio RatioSD
before 4.166 ns 0.0344 ns 0.0321 ns 0.69 0.02
after 1.266 ns 0.0443 ns 0.0414 ns 0.21 0.01

Impl operator *(in Impl left, in Impl right)

Method Mean Error StdDev Ratio
before 8.009 ns 0.0856 ns 0.0801 ns 1.00
afterVector128 5.407 ns 0.0604 ns 0.0565 ns 0.68
afterVector256 3.142 ns 0.0616 ns 0.0546 ns 0.39

Vector3.Cross(Vector3 vector1, Vector3 vector2)
Vector3.Transform(Vector3 value, Quaternion rotation)
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Mar 11, 2024
@martenf
Copy link
Contributor Author

martenf commented Mar 11, 2024

@dotnet-policy-service agree

added simd paths for quaternion multiplication, division and concatenate
added simd path for matrix4x4 multiplication
@martenf martenf changed the title Added SIMD path for Vector3.Cross and Vector3.Transform Added SIMD path for Vector3, Quaternion and Matrix4x4 operations Mar 11, 2024
@martenf
Copy link
Contributor Author

martenf commented Apr 4, 2024

@dotnet/area-system-numerics

Comment on lines +531 to +546
if (Vector128.IsHardwareAccelerated)
{
var vVector = value.AsVector128();
var rVector = Unsafe.BitCast<Quaternion, Vector128<float>>(rotation);

// v + 2 * (q x (q.W * v + q x v)
return (vVector + Vector128.Create(2f) * Cross(rVector, Vector128.Shuffle(rVector, Vector128.Create(3, 3, 3, 3)) * vVector + Cross(rVector, vVector))).AsVector3();

static Vector128<float> Cross(Vector128<float> v1, Vector128<float> v2)
{
return (Vector128.Shuffle(v1, Vector128.Create(1, 2, 0, 3)) *
Vector128.Shuffle(v2, Vector128.Create(2, 0, 1, 3))) -
(Vector128.Shuffle(v1, Vector128.Create(2, 0, 1, 3)) *
Vector128.Shuffle(v2, Vector128.Create(1, 2, 0, 3)));
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not immediately following where this implementation is from...

The standard naive algorithm is typically broken down to something like:

return ((Quaternion.Conjugate(rotation) * new Quaternion(value, 0.0f)) * rotation).AsVector128().AsVector3();

(noting that Quaternion.operator *(Quaternion, Quaterion) is not currently marked as aggressively inline)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got it from a private C library I wrote a decade ago which quotes a dead link.
The closest I could find on the internet after a quick search is this forum post which gets 95% of the way there and just misses the last 'simplification for readability' step.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we update this to follow the "standard" algorithm I gave above.

I expect we'll get better support for things like constant folding, more consistent performance across a range of hardware, and we can accurately quote the root implementation to a known source (which avoids any potential licensing issues): https://github.com/microsoft/DirectXMath/blob/main/Inc/DirectXMathVector.inl#L10307-L10317

IIRC, glm uses a similar algorithm for it's implementation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The naive implementation is even slower then the dotnet8 implementation, even with the dotnet9 improvements to quaternion multiplication. It will also always be slower then the reduced version because it contains significantly more math operation.

I also don't see any potential licensing problem. I wrote the c# code. Btw it looks nothing like c code I based it on which I also wrote. Both version are based on math that cannot be copyrighted and that has been floating around the internet for at least decade at this point.

To reduce any variation between different archs the implementation could be changed to

return (vVector + 2 * Cross(rVector, Vector128.Create(rotation.W) * vVector + Cross(rVector, vVector))).AsVector3();

which would give the JIT a bit more leeway, but I doubt it makes any differens.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The naive implementation is even slower then the dotnet8 implementation, even with the dotnet9 improvements to quaternion multiplication

This is in part because quaternion multiplication is not being inlined currently. But, also because it's doing some "unnecessary" work, as you indicated. But that can be accounted for as well.

I also don't see any potential licensing problem. I wrote the c# code. Btw it looks nothing like c code I based it on which I also wrote. Both version are based on math that cannot be copyrighted and that has been floating around the internet for at least decade at this point.

In general we require correct attribution to exist and based on the statement given above, the code (even if you authored it) was itself based on some internet blog/forum/site/etc post that can no longer be referenced (due to a dead link). Such a post may have itself may have been based on something else which overall makes it very difficult for us to use "safely" as the chain of citations is broken.

The actual requirements for attribution, whether or not something can be copyrighted, and the like can get quite complex. When there isn't a clear chain anymore, then it requires additional effort on our part, potentially even requiring checks with legal, so an actual lawyer can make the determination on whether or not it is safe to do.

.NET is a huge open source project depending on by millions. We must do the due diligence and ensure that all the boxes are checked.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What makes the result of Q*vQ any more correct that any of the other ways you could use to rotate a vector by the rotation saved in a quaternion? Rodrigues' formula is actually older than Hamilton's, but it's not like any of this is a universal law written in the stars. This entire conversation feels pointless so I will stop wasting everyone's time, close this pull request and just use a private library.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From a mathematical perspective they all produce the "same" result. The consideration is that they are not the same from a programmatic perspective due to additional considerations.

The intent was to push this PR in a direction where we could still improve the performance but without changing certain invariants that are extremely important to maintain. That requires more in depth consideration and documentation to show how those requirements are being met.

The intent was not to waste time or get into an argument over what is right vs wrong. The contribution was appreciated and was otherwise nearly in the right shape to be taken.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know and appreciate that you weren't trying to waste any time, but I just don't have the spare time to have long debates about computational vs mathematical correctness right now, so I think it's just easier if I leave it at that so that someone else can make a pull request if they run into the same bottlenecks.

Copy link
Member

@tannergooding tannergooding Jun 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍, are you fine with me cherry-picking your commits into a PR so I can fix the last couple bits of feedback and get merged? That way you can still get credit for the overall contribution

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure

Comment on lines 163 to 178
var l0012 = Vector128.Shuffle(left, Vector128.Create(0, 0, 1, 2));
var l1120 = Vector128.Shuffle(left, Vector128.Create(1, 1, 2, 0));
var r0333 = Vector128.Shuffle(right, Vector128.Create(0, 3, 3, 3));
var r1201 = Vector128.Shuffle(right, Vector128.Create(1, 2, 0, 1));

var t0 = l0012 * r0333 + l1120 * r1201;
var mask = Vector128.Create(0x80000000, 0, 0, 0);
var t0m = Vector128.Xor(t0, mask.AsSingle());

var l2201 = Vector128.Shuffle(left, Vector128.Create(2, 2, 0, 1));
var l3333 = Vector128.Shuffle(left, Vector128.Create(3, 3, 3, 3));
var r2120 = Vector128.Shuffle(right, Vector128.Create(2, 1, 2, 0));
var r3012 = Vector128.Shuffle(right, Vector128.Create(3, 0, 1, 2));

var t1 = l3333 * r3012 - l2201 * r2120 + t0m;
var result = Vector128.Shuffle(t1, Vector128.Create(1, 2, 3, 0));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new code is much harder to read for basically a cycle difference in perf (numbers above report 0.7 -> 0.5 ns). It may also not optimize as well when considering all platforms or ISA baselines (SSE2, SSE4.1, AVX2, AVX512F, AdvSimd, WASM, etc).

It's also substantially different from Concatenate despite them doing the "same thing" (just one is x * y and the other is y * x)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's an operation that tends to be in the hot path for 3d applications so thought the reduced readability might be worth the extra performance on all the platforms I tested it on.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its not that hot for 3D apps. The actual hot code tends to happen on the GPU instead.

Its especially not so hot that the complexity is worth 0.15ns (which is effectively 1 CPU cycle).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is still pending. I don't think the new code is worth the 0.15ns savings (it's notably slightly slower on my box too). The code is more complex, harder to follow, and so I don't think the justification is there to take this part of the change.

.NET is huge, it is millions of lines of code. We ultimately have to balance readability, maintainability, security, perf, etc.

While Vector3 is itself a performance oriented type, we still have to factor in the other considerations and balance them out. It also applies beyond what a simple benchmark may measure, as we have multiple platforms, ISAs, and ABIs to consider. We also have to consider how certain optimizations may function (such as inlining and constant folding).

As such, even with it being a performance oriented type, there's many places where we are happy to give up a half nanosecond (which is in practice 1-5 instruction cycles) to improve the other areas.

@tannergooding
Copy link
Member

@martenf is this something you're still working on? There's still some pending feedback above that needs to be resolved, mostly just around the tradeoff of additional complexity vs perf increase (that is, where the additional complexity isn't worth the 1-2 instruction or 0.1-0.2 nanosecond savings, so using a more readable/maintainable/portable implementation is preferred).

@tannergooding tannergooding added the needs-author-action An issue or pull request that requires more info or actions from the author. label Jun 4, 2024
@martenf
Copy link
Contributor Author

martenf commented Jun 4, 2024

Sorry I was quite busy the last couple of weeks.
I think that except for Vector3.Transform they are all resolved, just not marked as resolved.

@dotnet-policy-service dotnet-policy-service bot removed the needs-author-action An issue or pull request that requires more info or actions from the author. label Jun 4, 2024
@martenf martenf closed this Jun 13, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Jul 14, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Numerics community-contribution Indicates that the PR has been added by a community member
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants