blend instructions #17

nemequ · 2021-03-29T00:34:11Z

Proposed instructions

i8x16.blend(a: v128, b: v128, c: v128) -> v128
i16x8.blend(a: v128, b: v128, c: v128) -> v128
i32x4.blend(a: v128, b: v128, c: v128) -> v128
i64x2.blend(a: v128, b: v128, c: v128) -> v128

I have used the "blend" name to match the instructions on x86, but other possibilities include select or laneselect to complement the existing bitselect instruction. I don't have a strong preference.

Semantics

Using implementation-defined bit(s) of each element of c, return the corresponding bit(s) from a or b depending on whether the bit(s) in c are 1 or 0, respectively.

For each lane in c, if all bits within the lane contain the same value (i.e., the value of the entire lane is either 0 or ~0) the result must be the same as it would be using bitselect. For all other values the result is purely implementation-defined.
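The guarantee above can be sketched as a small reference model. This is an illustrative sketch only, not part of the proposal: vectors are modelled as Python lists of unsigned lane values, and the helper names `bitselect` and `blend_is_conforming` are made up for this example.

```python
# Illustrative model: a vector is a list of unsigned lane values.

def bitselect(a, b, c, lane_bits):
    """Per-bit select: take bits of a where c is 1 and bits of b where c is 0."""
    full = (1 << lane_bits) - 1
    return [((x & m) | (y & ~m)) & full for x, y, m in zip(a, b, c)]


def blend_is_conforming(result, a, b, c, lane_bits):
    """Check a candidate blend result against the guaranteed part of the semantics.

    Only lanes whose mask is 0 or all-ones are constrained. Any other lane
    may hold an implementation-defined value, so it is not checked.
    """
    full = (1 << lane_bits) - 1
    expected = bitselect(a, b, c, lane_bits)
    return all(
        r == e
        for r, e, m in zip(result, expected, c)
        if m in (0, full)
    )
```

In this model a lane with a mixed mask (for example 0x0F in an 8-bit lane) places no constraint on the result, which is exactly the freedom the proposal gives to implementations.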

Rationale

This instruction would mostly be used to choose between two values based on the result of a comparison.

While the bitselect instruction already provides similar functionality, its implementations on x86 CPUs are generally sub-optimal unless the vpcmov instruction from XOP or the 128-bit vpternlogd from AVX-512VL is available. See WebAssembly/simd#192 for a discussion of the issue.
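As a sketch of the comparison use case, a lane-wise unsigned maximum can be built from a compare followed by a select. The helpers `compare_gt`, `lane_select`, and `lanewise_max` are hypothetical names for this example. Because a comparison only ever produces masks that are 0 or all-ones per lane, this pattern stays within the well-defined part of the proposed semantics.

```python
def compare_gt(a, b, lane_bits):
    """Unsigned lane-wise a > b, producing all-ones or zero per lane."""
    full = (1 << lane_bits) - 1
    return [full if x > y else 0 for x, y in zip(a, b)]


def lane_select(a, b, c):
    """Whole-lane select; only meaningful here for masks that are 0 or all-ones."""
    return [x if m else y for x, y, m in zip(a, b, c)]


def lanewise_max(a, b, lane_bits=32):
    """Pick a where a > b, otherwise b."""
    return lane_select(a, b, compare_gt(a, b, lane_bits))
```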

Possible implementations

On most architectures this would be a duplicate of bitselect, which is handled efficiently by a native instruction on, for example:

Arm NEON: bsl
MIPS MSA: bsel
POWER AltiVec: vsel (xxsel with VSX)

However, unlike bitselect, an implementation is free to use a subset of the bit(s) from c to determine which bits to select from a and which to select from b. For example, SSE4.1 has several blendv instructions which use only the most significant bit to select an entire lane from either a or b. These instructions could be used to implement the proposed blend instruction:

8-bit and 16-bit: pblendvb
32-bit: blendvps
64-bit: blendvpd

Note that x86 has no variable 16-bit blend, but it is safe to use the 8-bit variant: as long as the most significant bit of the least significant byte matches the most significant bit of the entire 16-bit lane, the result will be the same. Technically pblendvb could be used for all lane widths, which may be advantageous on some micro-architectures to avoid transitions between the integer and floating-point units.

If falling back on WASM SIMD128, this instruction could degrade to a bitselect.
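The byte-granular trick for 16-bit lanes can be modelled as follows. This is a sketch under assumptions: `blendv_msb` imitates only the "most significant bit picks the whole lane" behaviour described above, the byte split is little-endian, and all function names are invented for the example.

```python
def blendv_msb(a, b, c, lane_bits):
    """Whole-lane select driven only by each mask lane's most significant bit."""
    msb = 1 << (lane_bits - 1)
    return [x if m & msb else y for x, y, m in zip(a, b, c)]


def split_bytes(lanes16):
    """Split 16-bit lanes into byte lanes, low byte first."""
    out = []
    for v in lanes16:
        out += [v & 0xFF, v >> 8]
    return out


def join_bytes(bytes8):
    """Inverse of split_bytes: recombine (low, high) byte pairs."""
    return [lo | (hi << 8) for lo, hi in zip(bytes8[0::2], bytes8[1::2])]


def blend16_via_bytes(a, b, c):
    """16-bit blend performed with a byte-granular MSB blend."""
    blended = blendv_msb(split_bytes(a), split_bytes(b), split_bytes(c), 8)
    return join_bytes(blended)
```

With masks of 0x0000 or 0xFFFF the byte-wise and 16-bit versions agree. A mask such as 0x8000 makes them diverge, because the low byte of the mask has its top bit clear while the lane as a whole has its top bit set.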

Differences across processors

If all bits in a lane are set to the same value all implementations will behave the same.

If any bits within a lane differ, however, some implementations (such as x86) may copy the entire lane from either a or b depending on a value of a bit of their choosing. On x86 the most significant bit is used, but I don't see a compelling reason to require any particular bit be used.

Potential fingerprinting exposure

It would be trivial to detect whether the process is running on a CPU with SSE4.1 or on something else by examining the result of an operation where the bits in the control mask vary within a lane. To my knowledge only x86 CPUs would currently use an implementation which differs from bitselect, though it is conceivable that some other CPU may implement a similar instruction based on (for example) the least significant bit, which would also allow for easy identification of that architecture.

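A single mixed-mask probe is enough to tell such strategies apart, as the following sketch shows. The three `blend_*` functions are models of possible lowerings on 8-bit lanes (the least-significant-bit variant is purely hypothetical, as noted above), and `classify` is an invented helper, not an API of any engine.

```python
FULL = 0xFF  # 8-bit lanes throughout this sketch


def blend_bitselect(a, b, c):
    """Model of a bitselect-based lowering."""
    return [((x & m) | (y & ~m)) & FULL for x, y, m in zip(a, b, c)]


def blend_msb(a, b, c):
    """Model of a lowering that selects whole lanes by the mask's top bit."""
    return [x if m & 0x80 else y for x, y, m in zip(a, b, c)]


def blend_lsb(a, b, c):
    """Hypothetical lowering that selects whole lanes by the mask's low bit."""
    return [x if m & 0x01 else y for x, y, m in zip(a, b, c)]


def classify(blend):
    """Guess the lowering from one probe lane: a=0xFF, b=0x00, mask=0x80."""
    probe = blend([0xFF], [0x00], [0x80])[0]
    return {
        0x80: "bitselect-like",
        0xFF: "msb-lane-select",
        0x00: "lsb-lane-select",
    }.get(probe, "unknown")
```

All three models return identical results for masks of 0x00 or 0xFF, so code that only uses comparison results as masks cannot observe the difference.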
Maratyszcza · 2021-03-29T08:08:29Z

I'd suggest to change the semantics a bit: the instruction would guarantee the result only when all bits of the mask lane are 0 or all bits of the mask lane are 1. In other cases, the result is implementation-specific.

With the above formulation, 16-bit blends could use [V]PBLENDVB too.

nemequ · 2021-03-29T14:00:15Z

I believe what I had already did that, but I've tried to make it a bit clearer, mainly by moving the part about a possible implementation using the most significant bit only to the possible implementations section. If it's still not clear let me know what is causing confusion (or, if you have permission, feel free to edit it yourself).

You're right about pblendvb for 16-bit values. Actually, I guess you could use pblendvb for all implementations; I've updated that part, too, thanks!

Maratyszcza · 2021-03-29T17:09:20Z

It is worth noting that this will potentially expose fingerprinting of Atom-like processors (Silvermont, Airmont, Goldmont cores). On these processors BLENDV instructions are more expensive than the v128.bitselect emulation sequence (see WebAssembly/simd#124 for details), and optimized WAsm engines might prefer to generate the latter.

ngzhian · 2021-03-30T17:39:59Z

Nice write-up!

When all bits in c contain the same value (i.e., the value is either 0 or ~0) the result must be the same as it would be using bitselect. For all other values the result is purely implementation-defined.

I agree with Marat's comment on the wording in the semantics; currently it looks too constrained (in that the behavior is specified only for these 2 values). We could specify it as "all bits in each lane of c", and the resulting implementation would be the same, right? And this semantics would match the instruction names more.

nemequ · 2021-03-30T23:54:42Z

The original version had a paragraph about possibly using only the most-significant bit to determine the entire lane; I think that's what Marat was talking about. I removed that part (which really didn't belong in that section).

currently it looks too constrained (in that the behavior is specified only for these 2 values). We could specify it as "all bits in each lane of c", and the resulting implementation would be the same, right?

Argh, you're right, that's phrased very badly. The original version had a paragraph about possibly using only the most-significant bit to determine the entire lane; I thought that's what Marat was talking about.

"all bits in each lane of c" could be misinterpreted… How about something like: "For each lane in c, if all bits within the lane contain the same value (i.e., the value of the entire lane is either 0 or ~0) the result must be the same as it would be using bitselect. For all other values the result is purely implementation-defined."?

And this semantics would match the instruction names more.

"Blend" really feels wrong here. I know it's what x86 uses so that's what I went with, but I'm thinking select or laneselect better communicates the idea to people not coming from x86.

ngzhian · 2021-03-31T17:51:19Z

"all bits in each lane of c" could be misinterpreted… How about something like: "For each lane in c, if all bits within the lane contain the same value (i.e., the value of the entire lane is either 0 or ~0) the result must be the same as it would be using bitselect. For all other values the result is purely implementation-defined."?

Sg!

"Blend" really feels wrong here. I know it's what x86 uses so that's what I went with, but I'm thinking select or laneselect better communicates the idea to people not coming from x86.

I was referring to the "i8x16" in the instruction name, and not the "blend" part of it :) The "i8" suggests that each lane is 8 bits, and operated on separately from the other "i8" lanes. That matches the "for each lane" wording you suggested.

I'm not attached to the name "blend", it could cause confusion, but will be helpful for someone with Intel experience to understand what's going on. Laneselect sounds good too.

zeux · 2021-10-01T05:45:28Z

FWIW I've also found the name "blend" confusing even though my main experience lies within the SSE domain. "laneselect" seems much more clear.

ngzhian · 2021-11-01T19:10:25Z

Note: vbsl is not available on Armv8-M.

ngzhian · 2021-11-01T19:40:15Z

Note: for RISC-V V, it probably maps to a code sequence that uses vrgather, where an out-of-range index returns 0, but it only selects from 1 register, so we will need 2 calls to this.
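One possible reading of this note, sketched as a model rather than real RISC-V code: gather from each source with indices that go out of range for the lanes that should come from the other source, then OR the two results. The functions `vrgather` and `laneselect_via_gather` are illustrative names, and the way the index vectors are derived from the mask is an assumption of this sketch.

```python
def vrgather(src, idx):
    """Model of a gather in which an out-of-range index yields 0."""
    n = len(src)
    return [src[i] if i < n else 0 for i in idx]


def laneselect_via_gather(a, b, c):
    """Select lanes of a where the mask lane is nonzero, else lanes of b."""
    out_of_range = len(a)  # any index >= number of lanes reads as 0
    idx_a = [i if m else out_of_range for i, m in enumerate(c)]
    idx_b = [out_of_range if m else i for i, m in enumerate(c)]
    # Each lane is nonzero in at most one of the two gathers, so OR merges them.
    return [x | y for x, y in zip(vrgather(a, idx_a), vrgather(b, idx_b))]
```

In this model a mixed mask lane is treated like an all-ones lane, which is permitted because the result for such lanes is implementation-defined.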

nemequ added the instruction-proposal label Mar 29, 2021

ngzhian mentioned this issue Jul 7, 2021

Add relaxed blend to overview #29

Merged

ngzhian added a commit to ngzhian/relaxed-simd that referenced this issue Oct 1, 2021

Rename to laneselect

bfdea53

As suggested in WebAssembly#17 (comment).

ngzhian mentioned this issue Oct 1, 2021

Rename to laneselect #41

Merged

ngzhian added a commit that referenced this issue Oct 5, 2021

Rename to laneselect (#41)

e53d41f

As suggested in #17 (comment).

dtig mentioned this issue Feb 17, 2022

SIMD subgroup meeting on 2022-02-18 #50

Closed

ngzhian added the in-overview Instruction has been added to Overview.md label Feb 18, 2022

tomrittervg mentioned this issue Jun 16, 2022

WebAssembly Relaxed SIMD mozilla/standards-positions#651

Open

alexcrichton mentioned this issue Feb 22, 2023

Codegen of i16x8.relaxed_laneselect #125

Open
