Attempt to optimize the Vector4 element constructor in mono interp #86782
Conversation
Tagging subscribers to this area: @dotnet/area-system-numerics
Learned today that `ref this` is legal in structs now; I don't think it used to be? In any case, that eliminates the `ldflda`.
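To illustrate the point above with a hedged sketch (the type and method names here are mine, not the PR's): inside a struct instance method, `ref this` gives a single reference to the whole struct, which can then be reinterpreted in one step instead of loading each field's address (`ldflda`) separately. Assumes .NET's `System.Runtime.Intrinsics` and `Unsafe` APIs:

```csharp
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

// Hypothetical stand-in for System.Numerics.Vector4, not the real type.
public struct MyVector4
{
    public float X, Y, Z, W;

    public MyVector4(float x, float y, float z, float w)
        => (X, Y, Z, W) = (x, y, z, w);

    // `ref this` is legal in (non-readonly) struct instance methods, so the
    // 16-byte struct can be reinterpreted through one reference; no per-field
    // field-address (ldflda) opcodes are needed.
    public Vector128<float> AsVector128Sketch()
        => Unsafe.As<MyVector4, Vector128<float>>(ref this);
}
```

This is only a sketch of the mechanism being discussed, not the PR's actual diff.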
I'm not convinced having this on the C# side is overall good design. I think it would make more sense for the interpreter to recognize the pattern itself. Also, the output doesn't seem properly optimized, and I think doing this by intrinsifying the ctor method as a whole might be better.
I'm currently working on reusing vectorization for […]
The Vector4 ctor is marked `[Intrinsic]`, so that suggests the runtime should optimize it.
Agreed, and that's how RyuJIT handles it today. As an aside, it was recently brought up whether we could do more of these in managed code, so we don't have to duplicate logic between RyuJIT and Mono; but there is a general concern over how that would impact codegen and throughput in scenarios where these are heavily used. It will certainly put more pressure on the inliner to do the right thing.
My thinking was that doing it in managed code would mean AOT automatically picks up the improvement, but it sounds like intrinsifying it in the interpreter will produce better code without stressing the inliner, so I'm not opposed to doing that instead. I'll give that a try later.
Looking at dotnet/perf-autofiling-issues#18031 in wasm+interp, the hot path of some of the failures boils down to `foo.AsVector128()`, where `AsVector128` itself boils down to `new Vector4(this, ...)`, which then translates to `new Vector4(this.X, this.Y, ...)`. Because the Vector4 ctor just assigns 4 fields sequentially, this generates a bunch of interp opcodes.

If we initialize Vector4 as if it were a Vector128, we can tap into the interpreter's SIMD support and initialize it in (almost) one vectorized operation, and then on WASM that will use the native vector instruction set.

With these changes a `Vector2.AsVector128()` call looks roughly like this in the mono interpreter, annotated with my reading of the opcodes:

I don't know if we want to do this, or whether this is a good way to do it. But I figured a draft PR is a good way to start the discussion. I suspect if we optimize this one ctor a bunch of stuff will get faster in the wasm interp scenario (and maybe all interp targets, but it's hard to be certain).
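As a hedged sketch of the idea described above (this is my own illustrative code, not the PR's diff): the scalar path stores four fields one at a time, while the vectorized path packs the four lanes into a `Vector128<float>` in one operation and reinterprets the result. Assumes a runtime where `Vector128.Create` is public (`System.Runtime.Intrinsics`):

```csharp
using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

static class VectorCtorSketch
{
    // Scalar path: four separate field assignments inside the Vector4 ctor,
    // which the mono interpreter executes as a sequence of load/store opcodes.
    public static Vector4 CreateScalar(Vector2 v, float z, float w)
        => new Vector4(v.X, v.Y, z, w);

    // Vectorized path (hypothetical): build a Vector128<float> in one
    // SIMD-friendly operation, then reinterpret the 16 bytes as a Vector4.
    public static Vector4 CreateVectorized(Vector2 v, float z, float w)
    {
        Vector128<float> packed = Vector128.Create(v.X, v.Y, z, w);
        return Unsafe.As<Vector128<float>, Vector4>(ref packed);
    }
}
```

Both helpers produce the same Vector4; the difference is only in how many interpreter opcodes the construction takes, which is the point of the proposed change.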