This proposal adds a set of useful SIMD instructions that introduce local non-determinism (where the results of the instructions may vary based on hardware support).
Applications running on Wasm SIMD cannot take full advantage of hardware capabilities, for three reasons:
- Instructions whose results depend on hardware support
- Approximate instructions that are underspecified in hardware
- Some SIMD instructions penalize particular architectures
See these slides for more details.
Broadly, there are three categories of instructions that fit into the Relaxed SIMD proposal:
- Integer instructions where the inputs are interpreted differently (e.g. swizzle, 4-D dot-product)
- Floating-point instructions whose behavior for out-of-range and NaNs differ (e.g. float-to-int conversions, float min/max)
- Floating-point instructions where the precision or order of operations differ (e.g. FMA, reciprocal instructions, sum reduction)
Example of some instructions we would like to add:
- Fused Multiply Add (single rounding if hardware supports it, double rounding if not)
- Approximate reciprocal/reciprocal sqrt
- Relaxed Swizzle (implementation defined out of bounds behavior)
- Relaxed Rounding Q-format Multiplication (optional saturation)
This proposal introduces non-deterministic instructions: given the same inputs, two calls to the same instruction can return different results. For example:

```wast
(module
  (func (param v128 v128 v128)
    (f32x4.qfma (local.get 0) (local.get 1) (local.get 2))   ;; (1)
    ;; some other computation
    (f32x4.qfma (local.get 0) (local.get 1) (local.get 2)))) ;; (2)
```
The same instruction at (1) and (2), when given the same inputs, can return two different results. This is compliant, as the instruction is non-deterministic, though it is unlikely on certain embeddings like the Web (where the same implementation of `f32x4.qfma` is likely to be used for all calls to that instruction). One can imagine splitting an application's modules and running them on multiple runtimes that produce different results, which can be surprising to the application.
The specification is updated with the idea of "Relaxed operations":
Some operators are host-dependent, because the set of possible results may depend on properties of the host environment (such as hardware). Technically, each such operator produces a fixed-size list of sets of allowed values. For each execution of the operator in the same environment, only values from the set at the same position in the list are returned, i.e., each environment globally chooses a fixed projection for each operator.
`IMPLEMENTATION_DEFINED_ONE_OF(...)` returns any one of the values in the argument list, depending on the implementation.
- `relaxed i8x16.swizzle(a : v128, s : v128) -> v128`

`relaxed i8x16.swizzle(a, s)` selects lanes from `a` using indices in `s`. An index `i` in the range `[0, 15]` selects the `i`-th element of `a`; the result for any out-of-range index (i.e., an index in `[16, 255]`) is implementation defined.
```python
def relaxed_i8x16_swizzle(a : i8x16, s : i8x16):
  result = []
  for i in range(16):
    if s[i] < 16:
      result[i] = a[s[i]]
    elif s[i] < 128:
      result[i] = IMPLEMENTATION_DEFINED_ONE_OF(0, a[s[i]%16])
    else:
      result[i] = 0
  return result
```
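As a concrete (non-normative) illustration of why both outcomes are allowed, the two natural lowerings are x86 `PSHUFB`, which only zeroes a lane when the index's top bit is set and otherwise uses the index modulo 16, and Arm `TBL`, which zeroes any index at or above 16. The function names below are ours, sketching one byte-list lane model:

```python
def swizzle_pshufb(a, s):
    # Model of x86 PSHUFB: a set top bit zeroes the lane; otherwise
    # only the low 4 index bits are used (index mod 16).
    return [0 if si & 0x80 else a[si & 0x0F] for si in s]

def swizzle_tbl(a, s):
    # Model of Arm TBL: any index >= 16 yields 0.
    return [a[si] if si < 16 else 0 for si in s]

a = list(range(100, 116))
s = [0, 15, 20, 200] + [0] * 12
swizzle_pshufb(a, s)[:4]  # [100, 115, 104, 0] -- 20 % 16 == 4
swizzle_tbl(a, s)[:4]     # [100, 115, 0, 0]
```

For indices in `[0, 15]` the two lowerings agree; for `[16, 127]` they differ exactly as `IMPLEMENTATION_DEFINED_ONE_OF(0, a[s[i]%16])` permits, and for `[128, 255]` both return 0.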
- `relaxed i32x4.trunc_f32x4_s` (relaxed version of `i32x4.trunc_sat_f32x4_s`)
- `relaxed i32x4.trunc_f32x4_u` (relaxed version of `i32x4.trunc_sat_f32x4_u`)
- `relaxed i32x4.trunc_f64x2_s_zero` (relaxed version of `i32x4.trunc_sat_f64x2_s_zero`)
- `relaxed i32x4.trunc_f64x2_u_zero` (relaxed version of `i32x4.trunc_sat_f64x2_u_zero`)
These instructions have the same behavior as the non-relaxed instructions for lanes that are in the range of an `i32` (signed or unsigned depending on the instruction). The result for lanes which contain NaN is implementation defined: either 0, or `INT32_MIN` for signed and `UINT32_MAX` for unsigned. The result for lanes which are out of range of `INT32` or `UINT32` is implementation defined: it can be either the saturated result, or `INT32_MIN` for signed inputs above `INT32_MAX` and `UINT32_MAX` for unsigned inputs below 0.
```python
def relaxed_i32x4_trunc_f32x4_s(a : f32x4) -> i32x4:
  result = [0, 0, 0, 0]
  for i in range(4):
    if isnan(a[i]):
      result[i] = IMPLEMENTATION_DEFINED_ONE_OF(0, INT32_MIN)
      continue
    r = truncate(a[i])
    if r < INT32_MIN:
      result[i] = INT32_MIN
    elif r > INT32_MAX:
      result[i] = IMPLEMENTATION_DEFINED_ONE_OF(INT32_MIN, INT32_MAX)
    else:
      result[i] = r
  return result

def relaxed_i32x4_trunc_f32x4_u(a : f32x4) -> i32x4:
  result = [0, 0, 0, 0]
  for i in range(4):
    if isnan(a[i]):
      result[i] = IMPLEMENTATION_DEFINED_ONE_OF(0, UINT32_MAX)
      continue
    r = truncate(a[i])
    if r < UINT32_MIN:
      result[i] = IMPLEMENTATION_DEFINED_ONE_OF(UINT32_MIN, UINT32_MAX)
    elif r > UINT32_MAX:
      result[i] = UINT32_MAX
    else:
      result[i] = r
  return result

def relaxed_i32x4_trunc_f64x2_zero_s(a : f64x2) -> i32x4:
  result = [0, 0, 0, 0]
  for i in range(2):
    if isnan(a[i]):
      result[i] = IMPLEMENTATION_DEFINED_ONE_OF(0, INT32_MIN)
      continue
    r = truncate(a[i])
    if r < INT32_MIN:
      result[i] = INT32_MIN
    elif r > INT32_MAX:
      result[i] = IMPLEMENTATION_DEFINED_ONE_OF(INT32_MIN, INT32_MAX)
    else:
      result[i] = r
  return result

def relaxed_i32x4_trunc_f64x2_zero_u(a : f64x2) -> i32x4:
  result = [0, 0, 0, 0]
  for i in range(2):
    if isnan(a[i]):
      result[i] = IMPLEMENTATION_DEFINED_ONE_OF(0, UINT32_MAX)
      continue
    r = truncate(a[i])
    if r < UINT32_MIN:
      result[i] = IMPLEMENTATION_DEFINED_ONE_OF(UINT32_MIN, UINT32_MAX)
    elif r > UINT32_MAX:
      result[i] = UINT32_MAX
    else:
      result[i] = r
  return result
```
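The two outcomes permitted for NaN and out-of-range lanes correspond to the natural lowerings: x86 `CVTTPS2DQ` produces the "integer indefinite" value `INT32_MIN` for any unrepresentable input, while Arm `FCVTZS` saturates and converts NaN to 0. A runnable single-lane sketch of the signed case (the function names and hardware models are illustrative assumptions, not part of the spec):

```python
import math

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def trunc_lane_x86(x):
    # Model of one x86 CVTTPS2DQ lane: NaN and any out-of-range
    # input produce the "integer indefinite" value INT32_MIN.
    if math.isnan(x) or x < INT32_MIN or x > INT32_MAX:
        return INT32_MIN
    return math.trunc(x)

def trunc_lane_arm(x):
    # Model of one Arm FCVTZS lane: NaN -> 0, out-of-range saturates.
    if math.isnan(x):
        return 0
    return max(INT32_MIN, min(INT32_MAX, math.trunc(x)))

trunc_lane_x86(3e9)  # INT32_MIN
trunc_lane_arm(3e9)  # INT32_MAX
```

Both results are members of the allowed set `IMPLEMENTATION_DEFINED_ONE_OF(INT32_MIN, INT32_MAX)` in the pseudocode above, and the two models agree on every in-range lane.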
- `relaxed f32x4.madd`
- `relaxed f32x4.nmadd`
- `relaxed f64x2.madd`
- `relaxed f64x2.nmadd`

All the instructions take 3 operands, `a`, `b`, and `c`, and perform `a * b + c` or `-(a * b) + c`:

```
relaxed f32x4.madd(a, b, c)  = a * b + c
relaxed f32x4.nmadd(a, b, c) = -(a * b) + c
relaxed f64x2.madd(a, b, c)  = a * b + c
relaxed f64x2.nmadd(a, b, c) = -(a * b) + c
```
where either:

- the intermediate `a * b` is rounded first, and the final result rounded again (for a total of 2 roundings), or
- the entire expression is evaluated with higher precision and then rounded only once (if supported by hardware).
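The two rounding regimes can produce observably different results. A runnable sketch (not part of the spec) using exact rational arithmetic to model the fused, single-rounding case; the inputs are chosen so that `a * b` is not exactly representable as a double:

```python
from fractions import Fraction

def fma_single_rounding(a, b, c):
    # Compute a*b + c exactly in rational arithmetic, then round once
    # to the nearest double -- models a hardware fused multiply-add.
    return float(Fraction(a) * Fraction(b) + Fraction(c))

def fma_double_rounding(a, b, c):
    # a*b is rounded to a double first, then the add rounds again.
    return a * b + c

a = b = float(2**27 + 1)     # a*b = 2^54 + 2^28 + 1, which is not
c = -float(2**54 + 2**28)    # exactly representable as a double
fma_single_rounding(a, b, c)  # 1.0
fma_double_rounding(a, b, c)  # 0.0 -- the +1 is lost in the first rounding
```

Under the relaxed semantics both 1.0 and 0.0 are valid results for `f64x2.relaxed_madd` on these inputs.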
- `i8x16.laneselect(a: v128, b: v128, m: v128) -> v128`
- `i16x8.laneselect(a: v128, b: v128, m: v128) -> v128`
- `i32x4.laneselect(a: v128, b: v128, m: v128) -> v128`
- `i64x2.laneselect(a: v128, b: v128, m: v128) -> v128`

Select lanes from `a` or `b` based on masks in `m`. If each lane-sized mask in `m` has all bits set or all bits unset, these instructions behave the same as `v128.bitselect`. Otherwise, the result is implementation defined.
```python
def laneselect(a : v128, b : v128, m : v128, lanes : int):
  result = [0] * lanes
  for i in range(lanes):
    mask = m[i]
    if mask == ~0:
      result[i] = a[i]
    elif mask == 0:
      result[i] = b[i]
    elif topbit(mask) == 1:
      result[i] = IMPLEMENTATION_DEFINED_ONE_OF(bitselect(a[i], b[i], mask), a[i])
    else:
      result[i] = IMPLEMENTATION_DEFINED_ONE_OF(bitselect(a[i], b[i], mask), b[i])
  return result
```
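The divergence comes from the two common lowerings: a true bitwise blend (e.g. Arm `BSL`) versus x86 `PBLENDVB`, which consults only the top bit of each lane's mask. A single 8-bit-lane sketch (the function names and hardware modeling are our own illustration):

```python
def bitselect(a, b, m):
    # v128.bitselect on one 8-bit lane: bits of a where m is 1, else b.
    return (a & m) | (b & ~m & 0xFF)

def laneselect_blendv(a, b, m):
    # Model of one x86 PBLENDVB lane: only the top bit of the mask
    # is consulted, and the whole lane comes from a or b.
    return a if m & 0x80 else b

a, b, m = 0xAA, 0x55, 0x0F   # mixed mask: neither all-ones nor all-zeros
bitselect(a, b, m)           # 0x5A -- bitwise blend
laneselect_blendv(a, b, m)   # 0x55 -- top mask bit is 0, lane comes from b
```

For all-ones or all-zeros masks the two lowerings agree, which is why only mixed masks are implementation defined.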
- `f32x4.min(a: v128, b: v128) -> v128`
- `f32x4.max(a: v128, b: v128) -> v128`
- `f64x2.min(a: v128, b: v128) -> v128`
- `f64x2.max(a: v128, b: v128) -> v128`

Return the lane-wise minimum or maximum of two values. If either value is NaN, or the values are -0.0 and +0.0, the return value is implementation defined.
```python
def min_or_max(a : v128, b : v128, lanes : int, is_min : bool):
  result = []
  for i in range(lanes):
    if isNaN(a[i]) or isNaN(b[i]):
      result[i] = IMPLEMENTATION_DEFINED_ONE_OF(a[i], b[i])
    elif (a[i] == -0.0 and b[i] == +0.0) or (a[i] == +0.0 and b[i] == -0.0):
      result[i] = IMPLEMENTATION_DEFINED_ONE_OF(a[i], b[i])
    else:
      result[i] = min(a[i], b[i]) if is_min else max(a[i], b[i])
  return result
```
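The relaxation exists because x86 `MINPS`/`MAXPS` are not commutative in these cases: the hardware effectively computes `(a < b) ? a : b`, so NaN operands and `(-0.0, +0.0)` pairs fall through to the second operand. A one-lane sketch (illustrative model, not part of the spec):

```python
import math

def min_x86(a, b):
    # Model of one x86 MINPS lane: (a < b) ? a : b. The comparison is
    # false for NaN operands and for -0.0 < +0.0, so b is returned.
    return a if a < b else b

min_x86(float('nan'), 1.0)   # 1.0 -- the NaN disappears
min_x86(1.0, float('nan'))   # nan
min_x86(-0.0, 0.0)           # 0.0, not the IEEE minimum -0.0
```

Each of these results is one of the two values `IMPLEMENTATION_DEFINED_ONE_OF(a[i], b[i])` permits, and swapping the operands may change the answer.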
- `i16x8.q15mulr_s(a: v128, b: v128) -> v128`

Returns the multiplication of 2 fixed-point numbers in Q15 format. If both inputs are `INT16_MIN`, the result overflows, and the return value is implementation defined (either `INT16_MIN` or `INT16_MAX`).
```python
def q15mulr(a, b):
  result = []
  for i in range(8):
    if a[i] == INT16_MIN and b[i] == INT16_MIN:
      result[i] = IMPLEMENTATION_DEFINED_ONE_OF(INT16_MIN, INT16_MAX)
    else:
      result[i] = (a[i] * b[i] + 0x4000) >> 15
  return result
```
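The two allowed overflow results mirror the natural lowerings: x86 `PMULHRSW` wraps the overflowing `INT16_MIN * INT16_MIN` case to `INT16_MIN`, while Arm `SQRDMULH` saturates it to `INT16_MAX`. A one-lane sketch (the helper names are ours):

```python
INT16_MIN, INT16_MAX = -0x8000, 0x7FFF

def q15mulr_wrap(a, b):
    # Model of one x86 PMULHRSW lane: the Q15 result wraps to 16 bits.
    r = (a * b + 0x4000) >> 15
    return ((r + 0x8000) & 0xFFFF) - 0x8000

def q15mulr_sat(a, b):
    # Model of one Arm SQRDMULH lane: the Q15 result saturates.
    r = (a * b + 0x4000) >> 15
    return max(INT16_MIN, min(INT16_MAX, r))

q15mulr_wrap(INT16_MIN, INT16_MIN)  # -32768 (wrapped)
q15mulr_sat(INT16_MIN, INT16_MIN)   # 32767 (saturated)
```

For every other input pair the Q15 result fits in 16 bits, so the two models agree.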
- `i16x8.dot_i8x16_i7x16_s(a: v128, b: v128) -> v128`
- `i32x4.dot_i8x16_i7x16_add_s(a: v128, b: v128, c: v128) -> v128`

Returns the multiplication of 8-bit elements (signed or unsigned) by 7-bit elements (unsigned) with accumulation of adjacent products. The `i32x4` version allows for accumulation into another vector.

When the second operand of the product has the high bit set in a lane, that lane's result is implementation defined.
```python
def i16x8_dot_i8x16_i7x16_s(a, b):
  intermediate = []
  result = []
  for i in range(16):
    if b[i] & 0x80:
      lhs = as_signed(a[i])
      rhs = IMPLEMENTATION_DEFINED_ONE_OF(as_signed(b[i]), as_unsigned(b[i]))
      intermediate[i] = lhs * rhs
    else:
      intermediate[i] = as_signed(a[i]) * b[i]
  for i in range(0, 16, 2):
    result[i/2] = IMPLEMENTATION_DEFINED_ONE_OF(
        intermediate[i] + intermediate[i+1],
        saturate(intermediate[i] + intermediate[i+1]))
  return result

def i32x4_dot_i8x16_i7x16_add_s(a, b, c):
  intermediate = []
  tmp = []
  result = []
  for i in range(16):
    if b[i] & 0x80:
      lhs = as_signed(a[i])
      rhs = IMPLEMENTATION_DEFINED_ONE_OF(as_signed(b[i]), as_unsigned(b[i]))
      intermediate[i] = lhs * rhs
    else:
      intermediate[i] = as_signed(a[i]) * b[i]
  for i in range(0, 16, 2):
    tmp[i/2] = IMPLEMENTATION_DEFINED_ONE_OF(
        intermediate[i] + intermediate[i+1],
        saturate(intermediate[i] + intermediate[i+1]))
  for i in range(0, 8, 2):
    result[i/2] = tmp[i] + tmp[i+1]
  for i in range(0, 4):
    result[i] += c[i]
  return result
```

where `saturate(x) = min(INT16_MAX, max(INT16_MIN, x))`.
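When every lane of `b` stays within 7 bits, no lane is implementation defined and the intermediate pair sums always fit in 16 bits (the magnitude is at most `2 * 128 * 127 = 32512`), so saturation never triggers. A runnable reference for that well-defined case (the function name is ours):

```python
def dot_i8x16_i7x16_s(a, b):
    # Reference for the well-defined case: every b[i] in [0, 127], so
    # adjacent pair sums fit in i16 and no result depends on hardware.
    assert all(0 <= x < 128 for x in b)
    return [a[i] * b[i] + a[i + 1] * b[i + 1] for i in range(0, 16, 2)]

a = [-128] * 16   # most negative i8 lanes
b = [127] * 16    # largest 7-bit lanes
dot_i8x16_i7x16_s(a, b)  # [-32512] * 8 -- the extreme pair sum, still in i16
```

Only when some `b[i]` has its high bit set does the signed-vs-unsigned interpretation (and possible intermediate saturation) make the lane implementation defined.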
All opcodes have the `0xfd` prefix (same as the SIMD proposal), which is omitted in the table below. Opcodes `0x100` to `0x12F` (48 opcodes) are reserved for this proposal.

Note: "prototype opcode" refers to the opcodes that were used in prototyping, which were chosen to fit into the holes in the opcode space of the SIMD proposal. Going forward, the opcodes for the relaxed-simd specification will be the ones in the "opcode" column, and it will take some time for tools and engines to update.
| instruction | opcode | prototype opcode |
| --- | --- | --- |
| i8x16.relaxed_swizzle | 0x100 | 0xa2 |
| i32x4.relaxed_trunc_f32x4_s | 0x101 | 0xa5 |
| i32x4.relaxed_trunc_f32x4_u | 0x102 | 0xa6 |
| i32x4.relaxed_trunc_f64x2_s_zero | 0x103 | 0xc5 |
| i32x4.relaxed_trunc_f64x2_u_zero | 0x104 | 0xc6 |
| f32x4.relaxed_madd | 0x105 | 0xaf |
| f32x4.relaxed_nmadd | 0x106 | 0xb0 |
| f64x2.relaxed_madd | 0x107 | 0xcf |
| f64x2.relaxed_nmadd | 0x108 | 0xd0 |
| i8x16.relaxed_laneselect | 0x109 | 0xb2 |
| i16x8.relaxed_laneselect | 0x10a | 0xb3 |
| i32x4.relaxed_laneselect | 0x10b | 0xd2 |
| i64x2.relaxed_laneselect | 0x10c | 0xd3 |
| f32x4.relaxed_min | 0x10d | 0xb4 |
| f32x4.relaxed_max | 0x10e | 0xe2 |
| f64x2.relaxed_min | 0x10f | 0xd4 |
| f64x2.relaxed_max | 0x110 | 0xee |
| i16x8.relaxed_q15mulr_s | 0x111 | 0x111 |
| i16x8.relaxed_dot_i8x16_i7x16_s | 0x112 | 0x112 |
| i32x4.relaxed_dot_i8x16_i7x16_add_s | 0x113 | 0x113 |
| Reserved | 0x114 - 0x12F | |
- Poll for phase 1 presentation and meeting notes
- Poll for phase 2 slides and meeting notes
- Poll for phase 3 slides and meeting notes
- SIMD proposal