
Clarify that the SIMD narrowing instructions interpret their inputs as signed integers #91

Conversation

AndrewScheidecker
Contributor

This is what the SSE packuswb and packusdw instructions do, so I'm guessing that's how these instructions are supposed to work?
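For concreteness, here's a minimal C sketch (illustrative only, assuming SSE4.1) of the packusdw behavior in question: each 32-bit input lane is treated as signed and clamped to the unsigned 16-bit range [0, 65535].

#include <stdio.h>
#include <smmintrin.h>  /* SSE4.1: _mm_packus_epi32 == packusdw */

int main(void) {
    /* Lanes (low to high): 42, 70000, -1, INT32_MIN */
    __m128i in = _mm_set_epi32((int)0x80000000, -1, 70000, 42);
    __m128i packed = _mm_packus_epi32(in, in);  /* signed input, unsigned-saturated output */
    unsigned short out[8];
    _mm_storeu_si128((__m128i *)out, packed);
    for (int i = 0; i < 4; i++)
        printf("%u\n", (unsigned)out[i]);  /* prints 42, 65535, 0, 0 */
    return 0;
}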

@gnzlbg
Contributor

gnzlbg commented Aug 12, 2019

This is what the SSE packuswb and packusdw instructions do, so I'm guessing that's how these instructions are supposed to work?

I'd suppose that the signed _s and unsigned _u versions interpret their inputs as signed and unsigned integers, respectively. For the unsigned version, does it matter whether the inputs are interpreted as signed or unsigned integers? (e.g. does interpreting the input as a signed integer alter the result?)

@AndrewScheidecker
Contributor Author

For the unsigned version, does it matter whether the inputs are interpreted as signed or unsigned integers? (e.g. does interpreting the input as a signed integer alter the result?)

Yes, consider i16x8.narrow_i32x4_u: if a lane is 0x80000000, should it be saturated to 0 or 65535?
packusdw on X86 will saturate it to 0.

The corresponding NEON instructions are a VQMOVUNB.S32 + VQMOVUNT.S32 pair: "Vector Saturating Move and Narrow, signed operand with Unsigned result".

I think NEON could do U32->U16 saturation with VQMOVNB.U32 + VQMOVNT.U32. AFAICT SSE doesn't have a corresponding instruction.
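To make the difference concrete, here's a scalar C sketch (my illustration, not normative text) of the two candidate semantics for a single i16x8.narrow_i32x4_u lane:

#include <stdint.h>
#include <stdio.h>

/* Candidate A: treat the input lane as signed (what packusdw / vqmovun do). */
static uint16_t narrow_u_signed_input(uint32_t lane) {
    int32_t s = (int32_t)lane;
    if (s < 0) return 0;
    if (s > 65535) return 65535;
    return (uint16_t)s;
}

/* Candidate B: treat the input lane as unsigned (what a U32->U16 vqmovn would do). */
static uint16_t narrow_u_unsigned_input(uint32_t lane) {
    return lane > 65535 ? 65535 : (uint16_t)lane;
}

int main(void) {
    uint32_t lane = 0x80000000u;
    printf("signed interpretation:   %u\n", (unsigned)narrow_u_signed_input(lane));   /* 0 */
    printf("unsigned interpretation: %u\n", (unsigned)narrow_u_unsigned_input(lane)); /* 65535 */
    return 0;
}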

@gnzlbg
Contributor

gnzlbg commented Aug 12, 2019

Makes sense.

should it be saturated to 0 or 65535?

IIUC, with the current text (without this PR), the _s version would saturate to 0, and the _u version to 65535, is that correct? (and with this change, both versions would saturate to 0).

@AndrewScheidecker
Contributor Author

IIUC, with the current text (without this PR), the _s version would saturate to 0, and the _u version to 65535, is that correct? (and with this change, both versions would saturate to 0).

The _s version would saturate it to -32768 in both cases.

I think the current text for the _u case is a little ambiguous, but if you interpret it as saturating an unsigned input to an unsigned output, then it would saturate 0x80000000 to 65535.

That doesn't seem to have a direct mapping to any SSE instruction, so I doubted that was the intent. But now I see that the V8 implementation that is the inspiration for these instructions uses that interpretation (at the cost of some performance on SSE).

I'd appreciate if somebody involved in the V8 implementation could explain the trade-offs.

@tlively
Member

tlively commented Aug 12, 2019

cc @ngzhian @dtig

@dtig
Member

dtig commented Aug 12, 2019

The intent in the text is for it to saturate an unsigned input to an unsigned output. The ARM variants use Vqmovn variants, as @AndrewScheidecker also pointed out earlier in the thread, or Uxtn/Uxtn2 in the ARM64 case.

I think the motivation was to ensure consistency in how the signed/unsigned versions of these instructions are interpreted across the proposal; making a special case for saturation here is possibly confusing. In the past, when we've had to make these tradeoffs, if ARM has a valid codegen mapping that keeps the operation set consistent and SSE performance is only slightly subpar, we've accepted it, since the operation as a whole may still be comparatively fast on Intel chips, i.e. emulating these would be much slower on ARM than on Intel. This is what the SSE/AVX codegen looks like in the V8 implementation.

Also cc @billbudge who opened the issue in case I'm reading this wrong or missed something that motivated this set of operations.

@penzn
Contributor

penzn commented Aug 12, 2019

@AndrewScheidecker good catch! @dtig, what do you think SSE emulation for unsigned inputs would look like?

In my opinion we should include more information about instruction availability in reviews.

CC @rrwinterton

@dtig
Member

dtig commented Aug 12, 2019

@penzn - I've linked the SSE/AVX codegen in my previous response, linking here as well.

I guess this didn't stand out to me quite as much, as there are quite a few other instructions we emulate on SSE/AVX because of unavailable instructions; the instruction set is a little asymmetric, especially when it comes to the narrower types. (I16x8Splat, I8x16Mul, and the I8x16 shifts are some examples I can think of off the top of my head.)

@penzn
Contributor

penzn commented Aug 12, 2019

@dtig, sorry, I meant what would this codegen do? I can get the idea by looking at the V8 source, but it is easy to get the wrong idea about code you are not familiar with. Would it unpack the vector, extend every lane, and pack it back, or is there a sequence of SSE/AVX instructions that you have in mind?

@arunetm
Collaborator

arunetm commented Aug 12, 2019

@dtig Do we know the cost of emulating the narrowing instructions in the v8 x86 backend?
Agree that it's not a good idea to add exceptions to the consistency of instruction interpretations unless it breaks developers' assumptions about performance across architectures. SIMD being a performance feature, developers shouldn't face any surprise loss in performance either.

It is useful to understand how costly these emulations are when instruction support is not that good across architectures. Does the v8 implementation track performance across architectures?

cc @PeterJensen

@dtig
Member

dtig commented Aug 12, 2019

@penzn The generated codegen uses the same instructions as in the link, except with registers/operands allocated. kScratchDoubleReg is used as a temp register to create a mask of 0x7FFFFFFF - that's what the first two instructions do. The inout operand here can be a register or a memory operand. The other instructions map directly to the mnemonics of the instructions used to emulate the narrowing operations. I'm not sure if I'm answering your question, but apart from the representation of operands/registers, the mnemonics map exactly to the x64 instructions used.
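As a rough illustration, here's one way that shape could look in SSE4.1 intrinsics (my reconstruction from the description above, not the literal V8 source):

#include <smmintrin.h>  /* SSE4.1 */

/* Emulated unsigned i32x4 -> i16x8 saturating narrow: clamp each lane to
   0x7FFFFFFF so packusdw never sees a "negative" lane, then let packusdw
   saturate to the unsigned 16-bit range. */
static __m128i narrow_i32x4_u(__m128i lo, __m128i hi) {
    __m128i mask = _mm_cmpeq_epi32(lo, lo);  /* all ones */
    mask = _mm_srli_epi32(mask, 1);          /* 0x7FFFFFFF in every lane */
    lo = _mm_min_epu32(lo, mask);            /* pminud */
    hi = _mm_min_epu32(hi, mask);
    return _mm_packus_epi32(lo, hi);         /* packusdw */
}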

@arunetm Can you clarify what you mean by cost, and what I should be comparing it to? Do you mean latency/throughput? If so, the cost of emulating the narrowing instructions can be calculated from the instruction tables for each of the instructions in the codegen, but I'm not sure what to compare it to as there is no native instruction, and it would depend on the sequence that's used in an application. V8 has benchmarks that we use to test our code against; these are whole applications that are supported across all architectures (Arm, Arm64, IA32, x64). I run them to test with, but they are not a part of our CI yet, as the Spec isn't stable yet and we are still adding/implementing more recently included operations in both the engine and the toolchain.

@gnzlbg
Contributor

gnzlbg commented Aug 12, 2019

kScratchDoubleReg is used as a temp register to create a mask of 0x7FFFFFFF - that's what the first two instructions do.

@dtig Is there a reason one can't put the mask in read-only memory? I haven't really tested it, but I expect something like:

;; input is in xmm0, mask is in LABEL
pminud xmm0, xmmword ptr [LABEL]
packusdw xmm0, xmm0

which is predicted to run in ~2-3 cycles (https://gcc.godbolt.org/z/hH_8ZQ) on many micro-architectures. It definitely isn't a one cycle thing, but it doesn't appear that lowering this to SSE4.2 would be a massive gap either (even if one always recomputes the mask).
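In intrinsics terms, a sketch of the same idea with the clamp constant left to the compiler to place in read-only memory (I'm assuming a per-lane constant of 0x0000FFFF here; 0x7FFFFFFF would work as well):

#include <smmintrin.h>  /* SSE4.1 */

static __m128i narrow_u32x4_lo(__m128i v) {
    /* The constant typically ends up in .rodata, so this is one pminud with a
       memory operand followed by packusdw, matching the sequence above. */
    const __m128i clamp = _mm_set1_epi32(0x0000FFFF);
    v = _mm_min_epu32(v, clamp);     /* pminud   */
    return _mm_packus_epi32(v, v);   /* packusdw */
}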

@dtig
Member

dtig commented Aug 12, 2019

@gnzlbg No particular reason, just that in V8 we don't use read-only memory in that way (or not that I know of), so we would have to materialize the mask; we could use a constant, but that would still result in at least one extra move. There are probably other ways to materialize that mask as well.

I also want to clarify that my earlier comment was about how we've made these tradeoffs in V8 only. If there are concerns about adding these instructions to the Spec, or if we should reconsider which variants of the narrowing operations make sense, feel free to open issues or PRs so we can hash out the details.

@AndrewScheidecker
Contributor Author

In the past, when we've had to make these tradeoffs, if ARM has a valid codegen mapping that keeps the operation set consistent and SSE performance is only slightly subpar, we've accepted it, since the operation as a whole may still be comparatively fast on Intel chips, i.e. emulating these would be much slower on ARM than on Intel.

I think it makes sense to prioritize efficiency on mobile CPUs when a trade-off must be made.

In this case, it seems possible to compile a signed->unsigned saturating narrow to vqmovun, so a trade-off isn't necessary if those semantics are equally useful to applications.
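For reference, a minimal NEON intrinsics sketch (my illustration) of that signed->unsigned saturating narrow:

#include <arm_neon.h>

/* Each signed 32-bit lane is clamped to [0, 65535]; vqmovun_s32 compiles to a
   single vqmovun.s32 (sqxtun on AArch64) per half. */
static uint16x8_t narrow_s32x4_to_u16x8(int32x4_t lo, int32x4_t hi) {
    return vcombine_u16(vqmovun_s32(lo), vqmovun_s32(hi));
}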

V8 has benchmarks that we use to test our code against; these are whole applications that are supported across all architectures (Arm, Arm64, IA32, x64). I run them to test with, but they are not a part of our CI yet, as the Spec isn't stable yet and we are still adding/implementing more recently included operations in both the engine and the toolchain.

Can you point me to these? I'm curious whether any of the uses of these saturating narrow instructions have a natural preference for either unsigned->unsigned or signed->unsigned semantics.

@rrwinterton

I don't know that I believe that "it makes sense to prioritize efficiency on mobile CPUs when a trade-off must be made." We should provide the best experience on both platforms and allow the application to make the decision. I know we have had these discussions in the past. It is not in WebAssembly's best interest to default to optimizing for the least common denominator. What if we provided two instructions and allowed the developer to choose? In many cases the differences are irrelevant to the larger application's functionality, and providing two options in the implementation lets the developer decide. I can provide examples if you like. It would be useful for others to provide examples of why this approach wouldn't work.

@dtig
Member

dtig commented Aug 12, 2019

In the past, when we've had to make these tradeoffs, if ARM has a valid codegen mapping that keeps the operation set consistent and SSE performance is only slightly subpar, we've accepted it, since the operation as a whole may still be comparatively fast on Intel chips, i.e. emulating these would be much slower on ARM than on Intel.

I think it makes sense to prioritize efficiency on mobile CPUs when a trade-off must be made.

In this case, it seems possible to compile a signed->unsigned saturating narrow to vqmovun, so a trade-off isn't necessary if those semantics are equally useful to applications.

V8 has benchmarks that we use to test our code against; these are whole applications that are supported across all architectures (Arm, Arm64, IA32, x64). I run them to test with, but they are not a part of our CI yet, as the Spec isn't stable yet and we are still adding/implementing more recently included operations in both the engine and the toolchain.

Can you point me to these? I'm curious whether any of the uses of these saturating narrow instructions have a natural preference for either unsigned->unsigned or signed->unsigned semantics.

Sure, the ones we have used to benchmark with that are currently open source are the Skia and Halide WebAssembly benchmarks. There are more in progress, both some that we are looking at optimizing and some in the pipeline to be open sourced. The narrowing/widening operations are part of V8 but not (yet) part of the toolchain as they were just merged, so the WebAssembly versions of these benchmarks, when compiled, will not generate these instructions right now.

IIRC, the motivation to include these when they were previously discussed was that -

  • They are awkward to emulate using the current set of SIMD operations in the spec
  • Aside from the unsigned versions under discussion, they are all broadly supported on architectures
  • Useful for codecs

Thanks for the pointer to vqmovun; I hadn't looked at this before, and it's possible we would be able to use it - I'll look into it as well.

@rrwinterton The intent here is not to default to optimizing for the lowest common denominator, but to provide a useful abstraction that will be beneficial to applications that may use it. It is possible that there are applications that need the particular version you mention - if so, please follow up with a PR with more details and the exact semantics of the operations that you think would be useful, so we can address that separately. Potential use cases with the issue/PR are also appreciated.

@penzn
Contributor

penzn commented Aug 13, 2019

There was a discussion about criteria for adding new SIMD operations in #37. I am sorry if I missed it, but @dtig and @ngzhian, can you please remind me of any such analysis shared for #89 or #21? The PR was proposed and closed in about a day, and most of us did not have a chance to respond to it - I completely missed this side of the proposed operations, for example.

@dtig
Member

dtig commented Aug 13, 2019

IIRC this was discussed in an in-person meeting some time ago and there were no objections to including narrowing/widening conversions in the Spec, and they were implemented in V8. I figured it was more of an oversight on our part that the PR didn't get submitted over time, and we were fixing that. I'd like to add that the issue has been open since 2017, and the PR was sent in to fix that as there were no objections on the issue (similarly, for example, HorizontalAdd was discussed but not included due to concerns raised on the issue). As I've previously mentioned on this thread, if you have concerns about including these, or have variants you would rather see included, this is an iterative process and not set in stone, so please follow up with PRs if you would like to see changes to the set of operations that were merged.

@arunetm
Collaborator

arunetm commented Aug 13, 2019

The intent here is not to default to optimizing for the lowest common denominator, but to provide a useful abstraction that will be beneficial to applications that may use it.

Agree. Including features beyond "lowest common denominator" (with due diligence) is even more relevant as new non-browser wasm applications continue to evolve, often comparing their performance against native versions.

Thoughts on punting unsigned versions of narrow operations (i8x16.narrow_i16x8_u & i16x8.narrow_i32x4_u) to SIMD v2, i.e. until we verify that SIMD kernels using these can avoid any regressions and provide at least reasonable gains vs. scalar when emulated?
(If it looks like a good idea, happy to put together the PR).

Slightly off topic: maybe one for the next meeting.
It's ideal to hold back or document instructions needing significant emulation that may not speed up, or may even degrade, kernel performance compared to scalar on certain platforms. Tools may be able to warn developers when generating those, or provide ways (flags?) to avoid them when portable performance is important for their applications (or to prefer them when apps are targeting certain platforms only).

@penzn
Contributor

penzn commented Aug 13, 2019

My current back-of-the-envelope calculation is that emulating this would take four instructions: two to get the clamp value (0x7F..F) into every lane and two for the narrowing (one for fixing up the input and one for packing).

@penzn mentioned this pull request on Aug 14, 2019
@dtig
Member

dtig commented Aug 15, 2019

The intent here is not to default to optimizing for the lowest common denominator, but to provide a useful abstraction that will be beneficial to applications that may use it.

Agree. Including features beyond "lowest common denominator" (with due diligence) is even more relevant as new non-browser wasm applications continue to evolve, often comparing their performance against native versions.

Thoughts on punting unsigned versions of narrow operations (i8x16.narrow_i16x8_u & i16x8.narrow_i32x4_u) to SIMD v2, i.e. until we verify that SIMD kernels using these can avoid any regressions and provide at least reasonable gains vs. scalar when emulated?
(If it looks like a good idea, happy to put together the PR).

Verification that they don't cause a regression sounds good, though it's not clear to me what this means. Emulating this operation, or a scalar version of it, would I suspect be much slower than the instructions used in codegen, but please share if you have more data or clarifications here. My preference would be not to split a particular set of instructions over different versions of the Spec in the absence of better feature detection - we may already have enough information to make this decision, and this can be put to a vote if necessary.

Slightly off topic: maybe one for the next meeting.
It's ideal to hold back or document instructions needing significant emulation that may not speed up, or may even degrade, kernel performance compared to scalar on certain platforms. Tools may be able to warn developers when generating those, or provide ways (flags?) to avoid them when portable performance is important for their applications (or to prefer them when apps are targeting certain platforms only).

@arunetm
Collaborator

arunetm commented Aug 19, 2019

Verification that they don't cause a regression sounds good, though it's not clear to me what this means. Emulating this operation, or a scalar version of it, would I suspect be much slower than the instructions used in codegen, but please share if you have more data or clarifications here. My preference would be not to split a particular set of instructions over different versions of the Spec in the absence of better feature detection - we may already have enough information to make this decision, and this can be put to a vote if necessary.

Do we know of any use cases / workloads benefiting from these two unsigned->unsigned narrowing instructions? If anyone can point to a use case, I am not opposed to including these. Also, happy to verify that scalar is in fact slower than the SIMD codegen based on the implementations we have.

My suggestion here is to take a closer look at usefulness and portable performance when adding to the core spec any new instructions that are not well supported across architectures. This will help avoid baking inefficiencies into an early spec (forever!). We can always revisit these in future iterations.

@AndrewScheidecker merged commit 85e6e7a into WebAssembly:master on Aug 30, 2019