blend instructions #17

Open
nemequ opened this issue Mar 29, 2021 · 9 comments
Labels
in-overview Instruction has been added to Overview.md instruction-proposal

Comments

@nemequ

nemequ commented Mar 29, 2021

Proposed instructions

  • i8x16.blend(v1: v128, v2: v128, c: v128) -> v128
  • i16x8.blend(v1: v128, v2: v128, c: v128) -> v128
  • i32x4.blend(v1: v128, v2: v128, c: v128) -> v128
  • i64x2.blend(v1: v128, v2: v128, c: v128) -> v128

I have used the "blend" name to match the instructions on x86, but other possibilities include select or laneselect to complement the existing bitselect instruction. I don't have a strong preference.

Semantics

Using implementation-defined bit(s) of each element of c, return the corresponding bit(s) from a or b (v1 and v2 in the signatures above) depending on whether the bit(s) in c are 1 or 0, respectively.

For each lane in c, if all bits within the lane contain the same value (i.e., the value of the entire lane is either 0 or ~0) the result must be the same as it would be using bitselect. For all other values the result is purely implementation-defined.
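The rule above can be sketched as a small reference model. This is only an illustration under stated assumptions: 128-bit vectors are modelled as Python integers, and the helper names (pack, lanes, control_is_well_defined, blend_reference) are hypothetical, not part of any proposal text. The model returns None where the result would be implementation-defined.

```python
MASK128 = (1 << 128) - 1


def pack(lane_values, width):
    """Pack lane values (lane 0 first) into one 128-bit integer."""
    v = 0
    for i, lane in enumerate(lane_values):
        v |= (lane & ((1 << width) - 1)) << (i * width)
    return v


def lanes(v, width):
    """Split a 128-bit integer into lanes of `width` bits, lane 0 first."""
    lane_mask = (1 << width) - 1
    return [(v >> (i * width)) & lane_mask for i in range(128 // width)]


def bitselect(a, b, c):
    """Bitwise select: bits of a where c is 1, bits of b where c is 0."""
    return ((a & c) | (b & ~c)) & MASK128


def control_is_well_defined(c, width):
    """True when every lane of c is either 0 or all ones (~0)."""
    all_ones = (1 << width) - 1
    return all(lane in (0, all_ones) for lane in lanes(c, width))


def blend_reference(a, b, c, width):
    """Hypothetical reference model of the proposed instruction.

    Returns the bitselect result when every lane of c is 0 or ~0, and
    None when the proposal leaves the result implementation-defined.
    """
    if not control_is_well_defined(c, width):
        return None
    return bitselect(a, b, c)
```

Note that whether a mask is well defined depends on the lane width: a mask whose 32-bit lanes alternate between ~0 and 0 is fine for i32x4 but not for i64x2.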

Rationale

This instruction would mostly be used to choose between two values based on the result of a comparison.

While the bitselect instruction already provides similar functionality, its implementations on x86 CPUs are generally sub-optimal unless the vpcmov instruction from XOP or the 128-bit vpternlogd from AVX-512VL is available. See WebAssembly/simd#192 for a discussion of the issue.
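For context, here is what bitselect computes when written with plain bitwise operations. It is a sketch of the arithmetic only, modelling vectors as Python integers, and not a description of any engine's actual code generation; the function names are hypothetical. Both forms need a chain of three logical operations, which is roughly the cost being discussed when no single-instruction form is available.

```python
MASK128 = (1 << 128) - 1


def bitselect_and_or(a, b, c):
    """and / and-not / or form: (a & c) | (b & ~c)."""
    return ((a & c) | (b & ~c)) & MASK128


def bitselect_xor(a, b, c):
    """Equivalent xor form: ((a ^ b) & c) ^ b."""
    return (((a ^ b) & c) ^ b) & MASK128
```

Where a bit of c is 1 the xor form reduces to (a ^ b) ^ b = a, and where it is 0 it reduces to 0 ^ b = b, so the two functions agree for every input.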

Possible implementations

On most architectures this would be a duplicate of bitselect, which is already handled efficiently on, among others:

  • Arm NEON: bsl
  • MIPS MSA: bsel
  • POWER AltiVec: vsel (xxsel with VSX)

However, unlike bitselect, an implementation is free to use a subset of bit(s) from c to determine which bits to select from a and which to select from b. For example, on SSE4.1 there are several blendv instructions which use only the most-significant bit to select an entire lane from either a or b. These instructions could be used to implement the proposed blend instruction.

Note that 16-bit blends are not present on x86, but it is safe to use the 8-bit variant: as long as the most significant bit of the least significant byte matches the most significant bit of the entire 16-bit value, the result will be the same. Technically pblendvb could be used for all implementations, which may be advantageous on some micro-architectures to avoid transitions between integer and floating point units.
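The argument about reusing the 8-bit variant can be checked with a small model. This is a sketch, not the x86 instruction itself: blendv_msb is a hypothetical helper that selects each lane whole according to the top bit of the control lane, and it uses the operand order of this proposal (a where the bit is set) rather than the x86 encoding.

```python
MASK128 = (1 << 128) - 1


def pack(lane_values, width):
    """Pack lane values (lane 0 first) into one 128-bit integer."""
    v = 0
    for i, lane in enumerate(lane_values):
        v |= (lane & ((1 << width) - 1)) << (i * width)
    return v


def lanes(v, width):
    """Split a 128-bit integer into lanes of `width` bits, lane 0 first."""
    lane_mask = (1 << width) - 1
    return [(v >> (i * width)) & lane_mask for i in range(128 // width)]


def bitselect(a, b, c):
    """Bitwise select: bits of a where c is 1, bits of b where c is 0."""
    return ((a & c) | (b & ~c)) & MASK128


def blendv_msb(a, b, c, width):
    """Model of an MSB-controlled lane select (hypothetical helper).

    Each lane is taken whole from a when the most significant bit of the
    matching lane of c is set, and from b otherwise.
    """
    out = []
    for la, lb, lc in zip(lanes(a, width), lanes(b, width), lanes(c, width)):
        out.append(la if (lc >> (width - 1)) & 1 else lb)
    return pack(out, width)


# Sample data: bytes 1..16 in a, bytes 101..116 in b.
a = pack(range(1, 17), 8)
b = pack(range(101, 117), 8)

# Every 16-bit lane of this mask is 0 or ~0, so byte-wise selection,
# 16-bit lane selection and bitselect all agree.
uniform = pack([0xFFFF, 0, 0xFFFF, 0, 0, 0xFFFF, 0, 0xFFFF], 16)
same = blendv_msb(a, b, uniform, 8) == blendv_msb(a, b, uniform, 16) == bitselect(a, b, uniform)

# Only the top bit of lane 0 is set: the byte-wise and 16-bit models now differ.
mixed = pack([0x8000] + [0] * 7, 16)
differ = blendv_msb(a, b, mixed, 8) != blendv_msb(a, b, mixed, 16)
```

With the uniform mask every byte of the control is 0x00 or 0xFF, so selecting at byte granularity cannot split a 16-bit lane. With the mixed mask the two bytes of lane 0 disagree, which is exactly the case the proposal leaves implementation-defined.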

If falling back on WASM SIMD128, this instruction could degrade to a bitselect.

Differences across processors

If all bits in a lane are set to the same value all implementations will behave the same.

If any bits within a lane differ, however, some implementations (such as x86) may copy the entire lane from either a or b depending on a value of a bit of their choosing. On x86 the most significant bit is used, but I don't see a compelling reason to require any particular bit be used.

Potential fingerprinting exposure

It would be trivial to detect whether the process is running on a CPU with SSE4.1 or something else by examining the result of an operation in which the bits of the control mask vary within a lane. To my knowledge only x86 CPUs would currently use an implementation which differs from bitselect, though it is conceivable that some other CPU may implement a similar instruction based on (for example) the least significant bit, which would also allow for easy identification of that architecture.
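A probe of this kind might look like the following sketch. Everything here is a model: the candidate implementations are plain Python functions standing in for what an engine might emit for the i8x16 variant, and the names (classify, single_bit_lane_model) are hypothetical.

```python
MASK128 = (1 << 128) - 1


def pack(lane_values, width):
    """Pack lane values (lane 0 first) into one 128-bit integer."""
    v = 0
    for i, lane in enumerate(lane_values):
        v |= (lane & ((1 << width) - 1)) << (i * width)
    return v


def lanes(v, width):
    """Split a 128-bit integer into lanes of `width` bits, lane 0 first."""
    lane_mask = (1 << width) - 1
    return [(v >> (i * width)) & lane_mask for i in range(128 // width)]


def bitselect_model(a, b, c):
    """Stand-in for an engine that lowers the instruction to bitselect."""
    return ((a & c) | (b & ~c)) & MASK128


def single_bit_lane_model(bit):
    """Stand-in for an engine that picks whole 8-bit lanes from one bit of c."""
    def impl(a, b, c):
        out = [la if (lc >> bit) & 1 else lb
               for la, lb, lc in zip(lanes(a, 8), lanes(b, 8), lanes(c, 8))]
        return pack(out, 8)
    return impl


def classify(blend_impl):
    """Feed a mask with only the top bit of each byte set and inspect the result."""
    a = pack([0xFF] * 16, 8)
    b = 0
    c = pack([0x80] * 16, 8)
    result = blend_impl(a, b, c)
    if result == c:   # only the masked bits came from a
        return "bitwise"
    if result == a:   # whole lanes came from a
        return "msb-lane"
    return "other"
```

Here classify(bitselect_model) yields "bitwise" and classify(single_bit_lane_model(7)) yields "msb-lane", while a model keyed on the least significant bit falls into "other". A single call with a mixed-bit mask is enough to tell the strategies apart.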

@Maratyszcza
Collaborator

Maratyszcza commented Mar 29, 2021

I'd suggest changing the semantics a bit: the instruction would guarantee the result only when all bits of the mask lane are 0 or all bits of the mask lane are 1. In other cases, the result is implementation-specific.

With the above formulation, 16-bit blends could use [V]PBLENDVB too.

@nemequ
Author

nemequ commented Mar 29, 2021

I believe what I had already did that, but I've tried to make it a bit clearer, mainly by moving the part about a possible implementation using the most significant bit only to the possible implementations section. If it's still not clear let me know what is causing confusion (or, if you have permission, feel free to edit it yourself).

You're right about pblendvb for 16-bit values. Actually, I guess you could use pblendvb for all implementations; I've updated that part, too, thanks!

@Maratyszcza
Collaborator

It is worth noting that this will potentially expose fingerprinting of Atom-like processors (Silvermont, Airmont, Goldmont cores). On these processors BLENDV instructions are more expensive than the v128.bitselect emulation sequence (see WebAssembly/simd#124 for details), and optimized Wasm engines might prefer to generate the latter.

@ngzhian
Member

ngzhian commented Mar 30, 2021

Nice write-up!

When all bits in c contain the same value (i.e., the value is either 0 or ~0) the result must be the same as it would be using bitselect. For all other values the result is purely implementation-defined.

I agree with Marat's comment on the wording in the semantics; currently it looks too constrained (in that the behavior is specified only for these two values). We could specify it as "all bits in each lane of c", and the resulting implementation would be the same, right? And these semantics would match the instruction names more.

@nemequ
Author

nemequ commented Mar 30, 2021

The original version had a paragraph about possibly using only the most-significant bit to determine the entire lane; I think that's what Marat was talking about. I removed that part (which really didn't belong in that section).

currently it looks too constrained (in that the behavior is specified only for these two values). We could specify it as "all bits in each lane of c", and the resulting implementation would be the same, right?

Argh, you're right, that's phrased very badly. The original version had a paragraph about possibly using only the most-significant bit to determine the entire lane; I thought that's what Marat was talking about.

"all bits in each lane of c" could be misinterpreted… How about something like: "For each lane in c, if all bits within the lane contain the same value (i.e., the value of the entire lane is either 0 or ~0) the result must be the same as it would be using bitselect. For all other values the result is purely implementation-defined."?

And this semantics would match the instruction names more.

"Blend" really feels wrong here. I know it's what x86 uses so that's what I went with, but I'm thinking select or laneselect better communicates the idea to people not coming from x86.

@ngzhian
Member

ngzhian commented Mar 31, 2021

"all bits in each lane of c" could be misinterpreted… How about something like: "For each lane in c, if all bits within the lane contain the same value (i.e., the value of the entire lane is either 0 or ~0) the result must be the same as it would be using bitselect. For all other values the result is purely implementation-defined."?

Sg!

"Blend" really feels wrong here. I know it's what x86 uses so that's what I went with, but I'm thinking select or laneselect better communicates the idea to people not coming from x86.

I was referring to the "i8x16" in the instruction name, and not the "blend" part of it :) The "i8" suggests that each lane is 8-bits, and operated on separately from the other "i8" lanes. That matches the "for each lane" wording you suggested.

I'm not attached to the name "blend", it could cause confusion, but will be helpful for someone with Intel experience to understand what's going on. Laneselect sounds good too.

@zeux

zeux commented Oct 1, 2021

FWIW I've also found the name "blend" confusing even though my main experience lies within the SSE domain. "laneselect" seems much more clear.

ngzhian added a commit to ngzhian/relaxed-simd that referenced this issue Oct 1, 2021
ngzhian added a commit that referenced this issue Oct 5, 2021
@ngzhian
Member

ngzhian commented Nov 1, 2021

Note: vbsl is not available on Armv8-M.

@ngzhian
Member

ngzhian commented Nov 1, 2021

Note: for RISC-V V, it probably maps to a code sequence that uses vrgather, where out-of-range indices return 0; but vrgather only selects from one register, so we will need two calls to it.
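One possible reading of that note, sketched with Python lists standing in for vector registers. The vrgather function below only models the "out-of-range index yields 0" behaviour; how the index vectors would be derived from the mask, and the name laneselect_via_two_gathers, are assumptions made for illustration rather than a proposed lowering.

```python
def vrgather(src, idx):
    """Model of a gather from one register: out-of-range indices produce 0."""
    n = len(src)
    return [src[i] if i < n else 0 for i in idx]


def laneselect_via_two_gathers(a, b, mask):
    """Select lanes from a where mask is true, from b otherwise.

    Each gather reads from a single source, so two are needed; lanes that
    should come from the other source get an out-of-range index and read 0.
    """
    out_of_range = len(a)  # any index >= the lane count works in this model
    idx_a = [i if m else out_of_range for i, m in enumerate(mask)]
    idx_b = [out_of_range if m else i for i, m in enumerate(mask)]
    from_a = vrgather(a, idx_a)
    from_b = vrgather(b, idx_b)
    # Exactly one of the two gathered lanes is non-zero-capable, so OR merges them.
    return [x | y for x, y in zip(from_a, from_b)]
```

For example, laneselect_via_two_gathers([1, 2, 3, 4], [10, 20, 30, 40], [True, False, True, False]) combines lanes 0 and 2 of the first list with lanes 1 and 3 of the second.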


4 participants