Clarify that the SIMD narrowing instructions interpret their inputs as signed integers #91
Conversation
I'd suppose that for the signed …
Yes, consider … The corresponding NEON instructions are … I think NEON could do U32->U16 saturation with …
Makes sense.
IIUC, with the current text (without this PR), the …
The … I think the current text for the … That doesn't seem to have a direct mapping to any SSE instruction, so I doubted that was the intent. But now I see that the V8 implementation that is the inspiration for these instructions uses that interpretation (at the cost of some performance on SSE). I'd appreciate it if somebody involved in the V8 implementation could explain the trade-offs.
The intent in the text is to have it saturate an unsigned input to an unsigned output. The ARM variants use vqmovn variants, as @AndrewScheidecker also pointed out earlier in the thread, or Uxtn/Uxtn2 in the ARM64 case. I think the motivation was to ensure consistency in how the signed/unsigned versions of these instructions are interpreted across the proposal; making a special case for saturation here is possibly confusing. In the past, whenever we've had to make these arguments, if ARM has a valid codegen that keeps the operation set consistent and SSE performance is slightly subpar, the tradeoff we've made is that the entire operation may still be comparatively fast on Intel chips, i.e., emulating these would be much slower on ARM than on Intel. This is what the SSE/AVX codegen looks like in the V8 implementation. Also cc @billbudge who opened the issue, in case I'm reading this wrong or missed something that motivated this set of operations.
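[Editor's note] For reference, a minimal scalar sketch of the unsigned->unsigned interpretation described above (an illustration of the lane semantics only, not V8's actual codegen; the function name is made up for this example):

```c
#include <stdint.h>

/* Hypothetical scalar model of one lane of i16x8.narrow_i32x4_u under
 * the unsigned->unsigned interpretation: each u32 lane saturates to
 * the u16 range [0, 0xFFFF]. */
static uint16_t narrow_u32_to_u16_unsigned(uint32_t x) {
    return x > 0xFFFFu ? 0xFFFFu : (uint16_t)x;
}
```

Under this reading a large input like 0x12345 clamps to 0xFFFF rather than wrapping.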
@AndrewScheidecker good catch! @dtig, what do you think SSE emulation for unsigned inputs would look like? In my opinion we should include more information about instruction availability in reviews. CC @rrwinterton
@penzn - I've linked the SSE/AVX codegen in my previous response; linking here as well. I guess this didn't stand out to me quite as much, as there are quite a few other instructions we emulate on SSE/AVX because some instructions are unavailable; the instruction set is a little asymmetric, especially when it comes to the narrower types. (I16x8Splat, I8x16Mul, I8x16 shifts are some examples I can think of off the top of my head.)
@dtig, sorry, I meant: what would this codegen do? I can get the idea by looking at the V8 source, but it is easy to get the wrong idea about code you are not familiar with. Would it unpack the vector, extend every lane, and pack it back, or is there a sequence of SSE/AVX instructions that you have in mind?
@dtig Do we know the cost of emulating narrowing instructions in the v8 x86 backend? It is useful to understand how costly these emulations are when instruction support is uneven across architectures. Does the v8 implementation track performance across architectures? cc @PeterJensen
@penzn The generated codegen uses the same instructions as in the link, except with registers/operands allocated; kScratchDoubleReg is used as a temp register to create a mask of 0x7FFFFFFF, which is what the first two instructions do. The inout operand here can be a register or an operand. The other instructions map directly to the mnemonics of the instructions used to emulate the narrowing instructions. I'm not sure if I'm answering your question, but apart from the representation of operands/registers, the mnemonics map exactly to the x64 instructions used. @arunetm Can you clarify what you mean by cost, and what I should be comparing it to? Do you mean latency/throughput? If so, the cost of emulating the narrowing instructions can be calculated using instruction tables for each of the instructions in the codegen, but I'm not sure what to compare it to, as there is no native instruction, and it would depend on the sequence used in an application. V8 has benchmarks that we test our code against; these are whole applications supported across all architectures (Arm, Arm64, IA32, x64). I run them for testing, but they are not part of our CI yet, as the Spec isn't stable and we are still adding/implementing more recently included operations in both the engine and the toolchain.
@dtig Is there a reason one can't put the mask in read-only memory? I haven't really tested it, but I expect something like:

```asm
;; input is in xmm0, mask is in LABEL
pminud   xmm0, xmmword ptr [LABEL]
packusdw xmm0, xmm0
```

which is predicted to run in ~2-3 cycles (https://gcc.godbolt.org/z/hH_8ZQ) on many micro-architectures. It definitely isn't a one-cycle thing, but it doesn't appear that lowering this to SSE4.2 would be a massive gap either (even if one always recomputes the mask).
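[Editor's note] As a sanity check on that two-instruction sequence, here is a scalar model of what it does to one lane (my reading of the Intel semantics, worth checking against the SDM; function names are made up): pminud first clamps each unsigned 32-bit lane to 0xFFFF, so the following packusdw, which saturates *signed* 32-bit inputs to unsigned 16-bit, can never see a value outside the u16 range.

```c
#include <stdint.h>

/* pminud: unsigned minimum per 32-bit lane. */
static uint32_t pminud_lane(uint32_t x, uint32_t m) {
    return x < m ? x : m;
}

/* packusdw: interpret the 32-bit lane as signed, saturate to u16. */
static uint16_t packusdw_lane(uint32_t x) {
    int32_t s = (int32_t)x;
    if (s < 0)       return 0;
    if (s > 0xFFFF)  return 0xFFFF;
    return (uint16_t)s;
}

/* The proposed two-instruction lowering of an unsigned->unsigned
 * saturating narrow, modeled lane-by-lane. */
static uint16_t narrow_u32_u16_via_sse(uint32_t x) {
    return packusdw_lane(pminud_lane(x, 0xFFFFu));
}
```

Note that without the pminud step, an input like 0x80000000 would look negative to packusdw and saturate to 0 instead of 0xFFFF, which is why the clamp is needed.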
@gnzlbg No particular reason, just that in V8 we don't use read-only memory in that way (or not that I know of), so we would have to materialize the mask; we could use a constant, but that would still result in at least one extra move. There are probably other ways to materialize that mask as well. I also want to clarify that my earlier comment was about how we've made these tradeoffs in V8 only; if there are concerns about adding these instructions to the Spec, or we should reconsider what variants of the narrowing operations make sense, feel free to open issues or PRs so we can hash out the details.
I think it makes sense to prioritize efficiency on mobile CPUs when a trade-off must be made. In this case, it seems possible to compile a signed->unsigned saturating narrow to …
Can you point me to these? I'm curious whether any of the uses of these saturating narrow instructions have a natural preference for either unsigned->unsigned or signed->unsigned semantics. |
I don't know that I believe that "it makes sense to prioritize efficiency on mobile CPUs when a trade-off must be made." We should provide the best experience on both platforms and allow the application to make the decision. I know we have had these discussions in the past. It is not in WebAssembly's best interest to default to optimizing for the least common denominator. What if we provided two instructions and allowed the developer to choose? In many cases the differences are irrelevant to the larger application's functionality, and providing two options in the implementation allows the developer to decide. I can provide examples if you like. It would be useful for others to provide examples of why this approach wouldn't work.
Sure, the ones that we have used to benchmark with that are currently open source are the Skia and Halide WebAssembly benchmarks. More are in progress, both some that we are looking at optimizing and some in the pipeline to be open sourced. The narrowing/widening operations are a part of V8 but not (yet) a part of the toolchain, as they were just merged, so the WebAssembly versions of these benchmarks will not generate these instructions when compiled right now. IIRC, the motivation to include these when they were previously discussed was that -
Thanks for the pointer to vqmovun; I hadn't looked at this before, and it's possible we would be able to use it - I'll look into it as well. @rrwinterton The intent here is not to default to optimizing for the lowest common denominator, but to provide a useful abstraction that will be beneficial to applications that may use it. It is possible that there are applications that need the particular version you mention; if so, please follow up with a PR with more details and the exact semantics of the operations you think will be useful, so we can address that separately. Potential use cases with the issue/PR are also appreciated.
There was a discussion about criteria for adding new SIMD operations in #37. I am sorry if I missed it, but @dtig and @ngzhian, can you please remind us of such analysis shared for #89 or #21? The PR was proposed and closed in about a day, and most of us did not have a chance to respond to it; I had completely missed this side of the proposed operations, for example.
IIRC this was discussed in an in-person meeting some time ago, there were no objections to including narrowing/widening conversions in the Spec, and they were implemented in V8. I figured it was more of an oversight on our part that the PR didn't get submitted over time, and we were fixing that. I'd like to add that the issue has been open since 2017, and the PR was sent in to fix that as there were no objections on the issue (similarly, for example, HorizontalAdd was discussed but not included due to concerns raised on the issue). As I've previously mentioned on this thread: if you have concerns about including these, or have variants you would rather see included, this is an iterative process and not set in stone, so please follow up with PRs if you would like to see changes to the set of operations that were merged.
Agree. Including features beyond the "lowest common denominator" (with due diligence) is even more relevant as new non-browser wasm applications continue to evolve, often comparing their performance against native versions. Thoughts on punting the unsigned versions of the narrow operations (i8x16.narrow_i16x8_u & i16x8.narrow_i32x4_u) to SIMD v2, i.e. until we verify that SIMD kernels using these can avoid any regressions and provide at least reasonable gains vs. scalar when emulated? Slightly off topic: maybe one for the next meeting.
My current back of the envelope calculation is that emulating this would take four instructions: two to get the clamp values ( |
Verification that they don't cause a regression sounds good, though it's not clear to me what this means. I suspect emulating this operation with a scalar version would be much slower than the instructions used in codegen, but please share if you have more data or clarifications here. My preference would be not to split a particular set of instructions over different versions of the Spec in the absence of better feature detection; we may already have enough information to make this decision, and this can be put to a vote if necessary.
Do we know of any use cases / workloads benefiting from these two unsigned->unsigned narrowing instructions? If anyone can point to a use case, I am not opposed to including these. Also, happy to verify that scalar is in fact slower than the SIMD codegen based on the implementations we have. My suggestion is to take a closer look at usefulness and portable performance when adding any new instructions to the core spec that are not well supported across architectures. This will help avoid baking inefficiencies into the early spec (forever!). We can always revisit these in future iterations.
This is what the SSE `packuswb` and `packusdw` instructions do, so I'm guessing that's how these instructions are supposed to work?
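[Editor's note] A scalar sketch of the packuswb lane behavior backing this reading: the input lane is interpreted as signed, with negative values saturating to 0 and values above 0xFF saturating to 0xFF (my paraphrase of the Intel semantics; the function name is made up for illustration).

```c
#include <stdint.h>

/* packuswb: saturate a signed 16-bit lane to an unsigned 8-bit lane,
 * matching the signed->unsigned interpretation this PR clarifies. */
static uint8_t packuswb_lane(int16_t x) {
    if (x < 0)    return 0;
    if (x > 0xFF) return 0xFF;
    return (uint8_t)x;
}
```

This is exactly the signed-input reading the PR title describes: a lane holding -1 produces 0, not 0xFF.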