blend instructions #17
Comments
I'd suggest changing the semantics a bit: the instruction would guarantee the result only when all bits of the mask lane are 0 or all bits of the mask lane are 1. In other cases, the result is implementation-specific. With the above formulation, 16-bit blends could use |
I believe what I had already did that, but I've tried to make it a bit clearer, mainly by moving the part about a possible implementation using the most significant bit only to the possible implementations section. If it's still not clear let me know what is causing confusion (or, if you have permission, feel free to edit it yourself). You're right about pblendvb for 16-bit values. Actually, I guess you could use pblendvb for all implementations; I've updated that part, too, thanks! |
It is worth noting that this will potentially expose fingerprinting of Atom-like processors (Silvermont, Airmont, Goldmont cores). On these processors |
Nice write-up!
I agree with Marat's comment on the wording in the semantics; currently it looks too constrained (in that the behavior is specified only for these two values). We could specify it as "all bits in each lane of c", and the resulting implementation would be the same, right? These semantics would also match the instruction names more closely. |
The original version had a paragraph about possibly using only the most-significant bit to determine the entire lane; I think that's what Marat was talking about. I removed that part (which really didn't belong in that section).
Argh, you're right, that's phrased very badly. The original version had a paragraph about possibly using only the most-significant bit to determine the entire lane; I thought that's what Marat was talking about. "all bits in each lane of c" could be misinterpreted… How about something like: "For each lane in
"Blend" really feels wrong here. I know it's what x86 uses so that's what I went with, but I'm thinking select or laneselect better communicates the idea to people not coming from x86. |
Sg!
I was referring to the "i8x16" in the instruction name, and not the "blend" part of it :) The "i8" suggests that each lane is 8 bits, and operated on separately from the other "i8" lanes. That matches the "for each lane" wording you suggested. I'm not attached to the name "blend"; it could cause confusion, but it will help someone with Intel experience understand what's going on. Laneselect sounds good too. |
FWIW I've also found the name "blend" confusing even though my main experience lies within the SSE domain. "laneselect" seems much more clear. |
As suggested in WebAssembly#17 (comment).
Note: vbsl is not available on Armv8-M. |
Note: for RISC-V V, it probably maps to a code sequence that uses |
Proposed instructions
i8x16.blend(v1: v128, v2: v128, c: v128) -> v128
i16x8.blend(v1: v128, v2: v128, c: v128) -> v128
i32x4.blend(v1: v128, v2: v128, c: v128) -> v128
i64x2.blend(v1: v128, v2: v128, c: v128) -> v128
I have used the "blend" name to match the instructions on x86, but other possibilities include select or laneselect to complement the existing bitselect instruction. I don't have a strong preference.
Semantics
Using implementation-defined bit(s) of each element of c, return the corresponding bit(s) from a or b depending on whether the bit(s) in c are 1 or 0, respectively.
For each lane in c, if all bits within the lane contain the same value (i.e., the value of the entire lane is either 0 or ~0) the result must be the same as it would be using bitselect. For all other values the result is purely implementation-defined.
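The guarantee above can be modelled in a few lines. This is only a minimal Python sketch, not part of the proposal: 128-bit vectors are represented as plain integers, and the helper names (bitselect, lanes, conforms) are hypothetical. It assumes a and b are the two data operands and c is the mask, as in the Semantics text.

```python
MASK128 = (1 << 128) - 1  # a v128 is modelled as a Python int in [0, 2**128)

def bitselect(a, b, c):
    """Bit-wise select: bits of a where c is 1, bits of b where c is 0."""
    return ((a & c) | (b & ~c)) & MASK128

def lanes(v, width):
    """Split a 128-bit value into lanes of `width` bits (lane 0 is least significant)."""
    lane_mask = (1 << width) - 1
    return [(v >> shift) & lane_mask for shift in range(0, 128, width)]

def conforms(result, a, b, c, width):
    """Check `result` against the guarantee: every lane whose mask is 0 or ~0
    must equal the bitselect result; all other lanes are unconstrained."""
    all_ones = (1 << width) - 1
    expected = lanes(bitselect(a, b, c), width)
    for got, want, mask in zip(lanes(result, width), expected, lanes(c, width)):
        if mask in (0, all_ones) and got != want:
            return False
    return True
```

Under this model, any candidate implementation can be checked by passing its output to conforms together with the inputs; lanes with a mixed mask never cause a failure, which mirrors the "purely implementation-defined" wording.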
Rationale
This instruction would mostly be used to choose between two values based on the result of a comparison.
While the bitselect instruction already provides similar functionality, the implementations on x86 CPUs are generally sub-optimal unless the vpcmov instruction from XOP or the 128-bit vpternlogd from AVX-512VL is available. See WebAssembly/simd#192 for a discussion of the issue.
Possible implementations
On most architectures this would be a duplicate of bitselect, which can be handled efficiently on most architectures, including:
bsl
bsel
xxsel
However, unlike bitselect, an implementation is free to use a subset of bit(s) from c to determine which bits to select from a and which to select from b. For example, on SSE4.1 there are several blendv instructions which use only the most-significant bit to select an entire lane from either a or b. These instructions could be used to implement the proposed blend instruction:
pblendvb
blendvps
blendvpd
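For readers less familiar with the blendv family, the behaviour described above can be sketched as follows. This is a simplified Python model on 128-bit integers, not the actual instruction definitions; the function name msb_laneselect is hypothetical, and the mapping of widths to instructions in the comment is only a loose correspondence.

```python
def msb_laneselect(a, b, c, width):
    """Model of a blendv-style select on 128-bit integers: for each lane of
    `width` bits, copy the whole lane from a if the lane's top mask bit is set,
    otherwise copy it from b."""
    lane_mask = (1 << width) - 1
    result = 0
    for shift in range(0, 128, width):
        top_bit = (c >> (shift + width - 1)) & 1
        source = a if top_bit else b
        result |= ((source >> shift) & lane_mask) << shift
    return result

# width=8 loosely corresponds to pblendvb, 32 to blendvps, 64 to blendvpd.
a = 0x1111_1111_2222_2222
b = 0xAAAA_AAAA_BBBB_BBBB
c = 0xFFFF_FFFF_0000_0000  # second 32-bit lane all ones, every other lane zero
print(hex(msb_laneselect(a, b, c, 32)))  # 0x11111111bbbbbbbb
```

Only the top bit of each mask lane is ever read, so the remaining mask bits have no influence on the result in this model.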
Note that a 16-bit blendv variant is not present on x86, but it is safe to use the 8-bit variant: as long as the most significant bit of the least significant byte matches the most significant bit of the entire 16-bit value, the result will be the same. Technically pblendvb could be used for all implementations, which may be advantageous on some micro-architectures to avoid transitions between integer and floating point units.
If falling back on WASM SIMD128, this instruction could degrade to a bitselect.
Differences across processors
If all bits in a lane are set to the same value, all implementations will behave the same.
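As a quick sanity check of this claim, and of the earlier note that a byte-wise blend can stand in for 16-bit lanes, here is a small Python sketch restricted to a single 16-bit lane. The helper names are hypothetical and the byte-wise function is only a pblendvb-like model, not the instruction itself.

```python
def bitselect16(a, b, c):
    """bitselect restricted to a single 16-bit lane."""
    return ((a & c) | (b & ~c)) & 0xFFFF

def blendv_bytes16(a, b, c):
    """Byte-wise MSB select over one 16-bit lane (a pblendvb-like model):
    each byte comes from a if the top bit of the matching mask byte is set."""
    out = 0
    for shift in (0, 8):
        source = a if (c >> (shift + 7)) & 1 else b
        out |= ((source >> shift) & 0xFF) << shift
    return out

a, b = 0x1234, 0xABCD

# Uniform masks: both models agree, as the text states.
agree_on_uniform = all(
    blendv_bytes16(a, b, c) == bitselect16(a, b, c) for c in (0x0000, 0xFFFF)
)

# Mixed mask: the results diverge (0x12CD vs. 0x2BCD), which is permitted.
mixed_blendv = blendv_bytes16(a, b, 0x8000)
mixed_bitselect = bitselect16(a, b, 0x8000)
```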
If any bits within a lane differ, however, some implementations (such as x86) may copy the entire lane from either a or b depending on the value of a bit of their choosing. On x86 the most significant bit is used, but I don't see a compelling reason to require any particular bit be used.
Potential fingerprinting exposure
It would be trivial to detect whether the process is running on a CPU with SSE4.1 or something else by examining the results of an operation where the bits in the control mask vary within a lane. To my knowledge only x86 CPUs would currently use an implementation which differs from bitselect, though it is conceivable that some other CPU may implement a similar instruction based on (for example) the least significant bit, which would also allow for easy identification of that architecture.
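To make the exposure concrete, here is a rough Python sketch of such a probe. Both candidate implementations are simplified models on 128-bit integers, and the names (msb_byte_select, probe) and returned labels are hypothetical; a real probe would call the actual instruction rather than a model.

```python
MASK128 = (1 << 128) - 1

def bitselect(a, b, c):
    """Bit-wise select, as on targets that lower the instruction to bitselect."""
    return ((a & c) | (b & ~c)) & MASK128

def msb_byte_select(a, b, c):
    """Byte lanes chosen by the top mask bit only (a pblendvb-like model)."""
    out = 0
    for shift in range(0, 128, 8):
        source = a if (c >> (shift + 7)) & 1 else b
        out |= ((source >> shift) & 0xFF) << shift
    return out

def probe(laneselect):
    """Classify an i8x16 implementation by feeding it a non-uniform mask lane."""
    lane0 = laneselect(MASK128, 0, 0x80) & 0xFF  # a = all ones, b = zero
    if lane0 == 0x80:
        return "bitselect-like"
    if lane0 == 0xFF:
        return "msb-select-like"
    return "something else"
```

A single call with a mask of 0x80 in one lane is enough to separate the two models here; an implementation keyed on the least significant bit would fall into the third category.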