JIT should emit "movbe" instead of "mov / bswap" on compatible hardware #953
Comments
Probably. The instruction does not specify if it operates on signed or unsigned data and our convention has been to expose both in this scenario.
Also probably. Exposing overloads that take […]
I think, based on the naming we've been giving the other intrinsics, these should be […]. I'm also not sure […]
Also CC @fiigii, @CarolEidt
Can we just light up the existing BinaryPrimitives methods using the intrinsics? Who would ever want to use these intrinsics directly instead of the methods on BinaryPrimitives?
@jkotas I imagine most people would continue to go through BinaryPrimitives. My understanding is that there's a desire to expose most of the non-base instruction sets as we have already done for SSE, POPCNT, BMI1/2, etc. This could be useful for people writing very high-performance code. If I'm reading Agner's instruction tables correctly, on Skylake and later architectures a 64-bit MOVBE load can be dispatched to a wider range of execution ports and at a lower latency than a 64-bit MOV load followed by a 64-bit BSWAP. A developer who has hand-unrolled a hot loop may wish to choose a different unrolling strategy based on the availability of this instruction set. (Somebody more familiar with CPU architecture will have to check me on this.)
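For illustration, this is the kind of hot loop where that distinction would show up; a minimal sketch (the helper and the workload are hypothetical), where each iteration does a 64-bit big-endian load that currently compiles to a mov followed by a bswap:

```csharp
using System;
using System.Buffers.Binary;

static class BigEndianSum
{
    // Sums 64-bit big-endian values from a byte buffer. Today each
    // ReadUInt64BigEndian call JITs to a 64-bit load (mov) plus a bswap;
    // with MOVBE support, the load and the byte swap could be one instruction.
    public static ulong Sum(ReadOnlySpan<byte> data)
    {
        ulong acc = 0;
        while (data.Length >= sizeof(ulong))
        {
            acc += BinaryPrimitives.ReadUInt64BigEndian(data);
            data = data.Slice(sizeof(ulong));
        }
        return acc;
    }
}
```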
The hardware intrinsics made complete sense for SIMD instructions. I think they have diminishing value for regular bit operations. Looking at the patterns that are developing around Bmi1, I am not sure whether it was a good idea to have that one as an intrinsic either. I understand that you can always find somebody who wants super low-level control, but .NET is not the right environment for those needs and it does not make sense to try to turn it into one.
@jkotas, I'm not sure what the concern here is. Whether it is making the relevant […] or exposing new HWIntrinsics: the only real drawback that I can see of doing the latter (exposing new HWIntrinsics) is that we have a larger public API surface area. However, I can see multiple benefits, including (but not limited to): […]
My concern is that new intrinsic APIs like this have diminishing value. If I put the new public type and methods on one side of the scale and all the benefits you have mentioned on the other side, I do not think it is worth adding the new intrinsic APIs. If we take this approach to the extreme, we would have a hardware intrinsic for […]
I do agree that some of these intrinsics, including […]
I would think the primary difference here is that […]. Looking at native code and the types of intrinsics they expose, it doesn't include things like […]
So my thoughts are basically that, in the cases where we need pattern-based recognition, it is probably better to just have an explicitly handled method (at the very least it makes the resulting code more readable, and we tend to wrap such things in helper methods ourselves anyway), and since we will then have explicitly handled methods, it would be beneficial to have them centralized, exposed, and handled in a consistent manner. The […] I do also think we want to be careful about exposing intrinsics for just anything, and that we should ensure new intrinsics are properly prototyped and tested to show they are worth supporting and exposing. I think, on top of what we have already exposed, there are probably a handful of instructions that are both commonly supported and frequently used enough to be worthwhile (e.g. AddWithCarry, BorrowWithSubtract, 128-bit Multiply/Divide, RotateLeft/Right). However, the good news is that the existing […]
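As a concrete example of the pattern-recognition approach (the helper below is just a sketch): RyuJIT already recognizes the classic shift/or rotation idiom and emits a single rotate instruction on x86, with no dedicated rotate intrinsic in the public surface.

```csharp
static class RotateSketch
{
    // The standard rotate-left idiom; the JIT pattern-matches this shape
    // and emits a single rol on x86 rather than two shifts and an or.
    public static uint RotateLeft(uint value, int offset)
        => (value << offset) | (value >> (32 - offset));
}
```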
Is a public hardware-specific intrinsic API necessary though, rather than a platform-neutral API method? Implementing it efficiently may require an internal intrinsic plus a software fallback, but that's different. To me the question is: would you write different code if you had access to the intrinsic versus an API that falls back to software? For vector-type stuff, yes, as the available intrinsics determine the algorithm; Popcnt and TrailingZeroCount meet this bar more loosely, as they impact loop unrolling choices dotnet/aspnetcore#5715 (comment) if used in sequence, due to instruction latency. I'll take a […]
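A sketch of that platform-neutral shape (the software fallback here is the standard SWAR bit count, included only for illustration): callers see one method, and the intrinsic stays an implementation detail behind an IsSupported check.

```csharp
using System.Runtime.Intrinsics.X86;

static class BitCount
{
    public static int PopCount(uint value)
    {
        if (Popcnt.IsSupported)
        {
            // Hardware path: a single popcnt instruction.
            return (int)Popcnt.PopCount(value);
        }

        // Software fallback: SWAR population count.
        value -= (value >> 1) & 0x55555555u;
        value = (value & 0x33333333u) + ((value >> 2) & 0x33333333u);
        return (int)((((value + (value >> 4)) & 0x0F0F0F0Fu) * 0x01010101u) >> 24);
    }
}
```

This is roughly the shape the System.Numerics.BitOperations.PopCount helper takes: the intrinsic when available, software otherwise.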
Fair, having the same intrinsic but having it be […]
I think almost all the […]
That's more a reason why you'd want the intrinsic to be used in preference to the software fallback, which I'd agree with 😄 The PopCnt example dotnet/aspnetcore#5715 (comment) is because the intrinsic is a higher-latency instruction (3 cycles) but with high throughput (1 per cycle), so to get maximum throughput and hide the latency you'd want to unroll the loop 3 times; whereas unconditionally unrolling (including the software fallback) would cause unnecessary code bloat.
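A sketch of that unrolling choice (the method name and unroll factor are illustrative): three independent accumulators keep three popcnt instructions in flight, hiding the 3-cycle latency while sustaining the 1-per-cycle throughput, and the unrolled path is only taken when the instruction is actually available.

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics.X86;

static class PopCountLoop
{
    public static ulong CountBits(ReadOnlySpan<ulong> values)
    {
        ulong a = 0, b = 0, c = 0;
        int i = 0;

        if (Popcnt.X64.IsSupported)
        {
            // Three independent dependency chains hide popcnt's latency.
            for (; i + 3 <= values.Length; i += 3)
            {
                a += Popcnt.X64.PopCount(values[i]);
                b += Popcnt.X64.PopCount(values[i + 1]);
                c += Popcnt.X64.PopCount(values[i + 2]);
            }
        }

        // Remainder, and the whole input when popcnt is not available.
        for (; i < values.Length; i++)
        {
            a += (ulong)BitOperations.PopCount(values[i]);
        }

        return a + b + c;
    }
}
```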
Right, but it does support execution on multiple ports, which makes it a possible candidate for pipelining optimizations.
For my particular scenario, I'd be fine with an enlightened BinaryPrimitives. Consider the line below (from the UTF-8 transcoding prototype), which occurs on a hot path:

return BinaryPrimitives.ReverseEndianness((ushort)(temp + value + 0xC080u)); // [ 00000000 00000000 10xxxxxx 110yyyyy ]

This eventually makes its way into an […]. However, let me also share my experience with implementing […]
An enlightened […]*

*Which may be what's needed inside the […]
That would make more sense than adding yet another intrinsic.
If we want to do this purely at the JIT pattern-recognition level, I'll support that. But my fear is that if we go this route it will turn into yet another issue that stays open for years, whereas a new API (even an internal one used as an implementation detail of an existing public helper) could allow us to get this up and running in much less time.
We'd make use of such an API in networking: socket addresses, HTTP/2, and HTTP/3 all use big-endian integers.
We should just improve the BinaryPrimitives implementation now. Whoever improves BinaryPrimitives can do so in whatever way makes the most sense, be it an internal intrinsic that's used or a JIT optimization.
Would we actually do anything differently from what we currently do? We use (or should use) BinaryPrimitives, and that just gets better.
JIT optimization could help with users of […]
To clarify, if this were implemented wholly in the JIT (no public API surface), then under this proposal the patterns the JIT would recognize would be:
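Presumably these are the shapes that pair a memory access with a byte reversal (ReverseEndianness); a rough sketch, with illustrative method names rather than a proposed API:

```csharp
using System.Buffers.Binary;
using System.Runtime.CompilerServices;

static class MovbeShapes
{
    // Big-endian load: unaligned read + ReverseEndianness,
    // a candidate for a single movbe load.
    public static uint LoadBigEndian(ref byte source)
        => BinaryPrimitives.ReverseEndianness(Unsafe.ReadUnaligned<uint>(ref source));

    // Big-endian store: ReverseEndianness + unaligned write,
    // a candidate for a single movbe store.
    public static void StoreBigEndian(ref byte destination, uint value)
        => Unsafe.WriteUnaligned(ref destination, BinaryPrimitives.ReverseEndianness(value));
}
```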
Helper APIs like […]
Unfortunately, it's hard to see how this proposed JIT optimization will be prioritized higher than many, many others.
This might actually be something where having a small doc indicating how to add new instructions for x86/ARM/ARM64 is worthwhile. The process here for x86 is basically: […]
The last three steps are generally not necessary unless there is something new/unique about the instruction/intrinsic that isn't already covered or expressible by the flags. |
Looks like it's basically described in https://github.com/dotnet/runtime/blob/master/docs/design/features/hw-intrinsics.md, but if you have suggested improvements, please make them. This issue, though, seems to have morphed from an API design to a specific JIT pattern-based optimization request.
movbe is an intrinsic that allows you to perform a big-endian read from or a big-endian write to main memory. It's approximately equivalent to a mov followed by a bswap (or vice versa) but is more efficient. It was introduced in the Intel Atom line and eventually made its way to later generations.

Proposal:
Open questions:
/cc @tannergooding
category:cq
theme:optimization
skill-level:intermediate
cost:medium