.. index:: value, integer, floating-point, bit width, determinism, non-determinism, NaN
Numeric primitives are defined in a generic manner, by operators indexed over a bit width N.
Some operators are non-deterministic, because they can return one of several possible results (such as different :ref:`NaN <syntax-nan>` values). Technically, each operator thus returns a set of allowed values. For convenience, deterministic results are expressed as plain values, which are assumed to be identified with a respective singleton set.
Some operators are partial, because they are not defined on certain inputs. Technically, an empty set of results is returned for these inputs.
In formal notation, each operator is defined by equational clauses that apply in decreasing order of precedence. That is, the first clause that is applicable to the given arguments defines the result. In some cases, similar clauses are combined into one by using the notation \pm or \mp. When several of these placeholders occur in a single clause, then they must be resolved consistently: either the upper sign is chosen for all of them or the lower sign.
Note
For example, the |fcopysign| operator is defined as follows:
\begin{array}{@{}lcll} \fcopysign_N(\pm p_1, \pm p_2) &=& \pm p_1 \\ \fcopysign_N(\pm p_1, \mp p_2) &=& \mp p_1 \\ \end{array}
This definition is to be read as a shorthand for the following expansion of each clause into two separate ones:
\begin{array}{@{}lcll} \fcopysign_N(+ p_1, + p_2) &=& + p_1 \\ \fcopysign_N(- p_1, - p_2) &=& - p_1 \\ \fcopysign_N(+ p_1, - p_2) &=& - p_1 \\ \fcopysign_N(- p_1, + p_2) &=& + p_1 \\ \end{array}
Numeric operators are lifted to input sequences by applying the operator element-wise, returning a sequence of results. When there are multiple inputs, they must be of equal length.
\begin{array}{lll@{\qquad}l} op(c_1^n, \dots, c_k^n) &=& op(c_1^n[0], \dots, c_k^n[0])~\dots~op(c_1^n[n-1], \dots, c_k^n[n-1]) \end{array}
Note
For example, the unary operator |fabs|, when given a sequence of floating-point values, returns a sequence of floating-point results:
\begin{array}{lll@{\qquad}l} \fabs_N(z^n) &=& \fabs_N(z[0])~\dots~\fabs_N(z[n-1]) \end{array}
The binary operator |iadd|, when given two sequences of integers of the same length n, returns a sequence of integer results:
\begin{array}{lll@{\qquad}l} \iadd_N(i_1^n, i_2^n) &=& \iadd_N(i_1[0], i_2[0])~\dots~\iadd_N(i_1[n-1], i_2[n-1]) \end{array}
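The element-wise lifting can be sketched in Python as follows. This is an illustration only: the helper names ``lift`` and ``iadd`` are hypothetical and not part of the specification, and sequences are modelled as Python lists.

```python
def lift(op, *seqs):
    """Apply op element-wise to input sequences of equal length."""
    if len({len(s) for s in seqs}) > 1:
        raise ValueError("input sequences must have equal length")
    return [op(*args) for args in zip(*seqs)]


def iadd(n_bits, i1, i2):
    """Scalar addition modulo 2^N (hypothetical helper for this sketch)."""
    return (i1 + i2) % (1 << n_bits)


# Lift the 8-bit addition to two sequences of length 3.
result = lift(lambda a, b: iadd(8, a, b), [250, 1, 2], [10, 2, 3])
```

Each output element depends only on the input elements at the same index, which runs from 0 to n-1.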
Conventions:
The meta variable d is used to range over single bits.
The meta variable p is used to range over (signless) :ref:`magnitudes <syntax-float>` of floating-point values, including |NAN| and \infty.
The meta variable q is used to range over (signless) rational :ref:`magnitudes <syntax-float>`, excluding |NAN| and \infty.
The notation f^{-1} denotes the inverse of a bijective function f.
Truncation of rational values is written \trunc(\pm q), with the usual mathematical definition:
\begin{array}{lll@{\qquad}l} \trunc(\pm q) &=& \pm i & (\iff i \in \mathbb{N} \wedge +q - 1 < i \leq +q) \\ \end{array}
Saturation of integers is written \satu_N(i) and \sats_N(i). The arguments to these two functions range over arbitrary signed integers.
Unsigned saturation, \satu_N(i), clamps i to the range from 0 to 2^N-1:
\begin{array}{lll@{\qquad}l} \satu_N(i) &=& 2^N-1 & (\iff i > 2^N-1)\\ \satu_N(i) &=& 0 & (\iff i < 0) \\ \satu_N(i) &=& i & (\otherwise) \\ \end{array}
Signed saturation, \sats_N(i), clamps i to the range from -2^{N-1} to 2^{N-1}-1:
\begin{array}{lll@{\qquad}l} \sats_N(i) &=& \signed_N^{-1}(-2^{N-1}) & (\iff i < -2^{N-1})\\ \sats_N(i) &=& \signed_N^{-1}(2^{N-1}-1) & (\iff i > 2^{N-1}-1)\\ \sats_N(i) &=& i & (\otherwise) \end{array}
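A minimal Python sketch of the two clamping functions, assuming arbitrary-precision integers as arguments. The names ``sat_u`` and ``sat_s`` are illustrative. For simplicity, ``sat_s`` returns the clamped mathematical value and leaves any conversion back to the unsigned representation to the caller.

```python
def sat_u(n_bits, i):
    """Clamp an arbitrary integer to the unsigned range [0, 2^N - 1]."""
    return max(0, min(i, (1 << n_bits) - 1))


def sat_s(n_bits, i):
    """Clamp an arbitrary integer to the signed range [-2^(N-1), 2^(N-1) - 1]."""
    lo = -(1 << (n_bits - 1))
    hi = (1 << (n_bits - 1)) - 1
    return max(lo, min(i, hi))
```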
.. index:: bit, integer, floating-point
Numbers have an underlying binary representation as a sequence of bits:
\begin{array}{lll@{\qquad}l} \bits_{\K{i}N}(i) &=& \ibits_N(i) \\ \bits_{\K{f}N}(z) &=& \fbits_N(z) \\ \end{array}
Each of these functions is a bijection, hence they are invertible.
.. index:: Boolean
:ref:`Integers <syntax-int>` are represented as base two unsigned numbers:
\begin{array}{lll@{\qquad}l} \ibits_N(i) &=& d_{N-1}~\dots~d_0 & (i = 2^{N-1}\cdot d_{N-1} + \dots + 2^0\cdot d_0) \\ \end{array}
Boolean operators like \wedge, \vee, or \veebar are lifted to bit sequences of equal length by applying them pointwise.
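The bit representation and the pointwise lifting of Boolean operators might be sketched as follows. The function names are hypothetical, and bits are modelled as the Python integers 0 and 1, most significant bit first.

```python
def ibits(n_bits, i):
    """Return the bits d_{N-1} ... d_0 of i, most significant bit first."""
    return [(i >> k) & 1 for k in range(n_bits - 1, -1, -1)]


def ibits_inv(bits):
    """Inverse of ibits: rebuild the unsigned integer from its bit sequence."""
    value = 0
    for d in bits:
        value = (value << 1) | d
    return value


def pointwise(op, bits1, bits2):
    """Lift a Boolean operator on single bits to equally long bit sequences."""
    return [op(a, b) for a, b in zip(bits1, bits2)]


# Exclusive disjunction of 1100 and 1010, computed bit by bit.
xor_bits = pointwise(lambda a, b: a ^ b, ibits(4, 0b1100), ibits(4, 0b1010))
```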
.. index:: IEEE 754, significand, exponent
:ref:`Floating-point values <syntax-float>` are represented in the respective binary format defined by |IEEE754|_ (Section 3.4):
\begin{array}{lll@{\qquad}l} \fbits_N(\pm (1+m\cdot 2^{-M})\cdot 2^e) &=& \fsign({\pm})~\ibits_E(e+\fbias_N)~\ibits_M(m) \\ \fbits_N(\pm (0+m\cdot 2^{-M})\cdot 2^e) &=& \fsign({\pm})~(0)^E~\ibits_M(m) \\ \fbits_N(\pm \infty) &=& \fsign({\pm})~(1)^E~(0)^M \\ \fbits_N(\pm \NAN(n)) &=& \fsign({\pm})~(1)^E~\ibits_M(n) \\[1ex] \fbias_N &=& 2^{E-1}-1 \\ \fsign({+}) &=& 0 \\ \fsign({-}) &=& 1 \\ \end{array}
where M = \significand(N) and E = \exponent(N).
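For N = 32 (so E = 8 and M = 23), the three fields can be inspected with Python's ``struct`` module. This is a sketch under the assumption that the host uses IEEE 754 binary32 for the ``f`` format; the function name is illustrative.

```python
import struct


def fbits32_fields(x):
    """Split the binary32 encoding of x into (sign, biased exponent, significand)."""
    (raw,) = struct.unpack("<I", struct.pack("<f", x))
    sign = raw >> 31                # 1 bit
    exponent = (raw >> 23) & 0xFF   # E = 8 bits, biased by 2^(E-1) - 1 = 127
    significand = raw & 0x7FFFFF    # M = 23 bits
    return sign, exponent, significand
```

For instance, -2.5 = -(1 + 0.25) * 2^1 yields sign 1, biased exponent 1 + 127 = 128, and significand field 0.25 * 2^23. A subnormal value has an all-zero exponent field.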
.. index:: byte, little endian, memory
When a number is stored into :ref:`memory <syntax-mem>`, it is converted into a sequence of :ref:`bytes <syntax-byte>` in |LittleEndian|_ byte order:
\begin{array}{lll@{\qquad}l} \bytes_t(i) &=& \littleendian(\bits_t(i)) \\[1ex] \littleendian(\epsilon) &=& \epsilon \\ \littleendian(d^8~{d'}^\ast~) &=& \littleendian({d'}^\ast)~\ibits_8^{-1}(d^8) \\ \end{array}
Again these functions are invertible bijections.
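For integers, the byte conversion corresponds to Python's built-in little-endian encoding. The sketch below uses hypothetical helper names and covers only integer types whose width is a multiple of 8.

```python
def bytes_i(n_bits, i):
    """Encode an N-bit unsigned integer as a list of bytes, least significant first."""
    return list(i.to_bytes(n_bits // 8, "little"))


def bytes_i_inv(byte_list):
    """Inverse of bytes_i: decode a little-endian byte list."""
    return int.from_bytes(bytes(byte_list), "little")
```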
.. index:: numeric vectors, shape
Numeric vectors have the same underlying representation as an |i128|. They can also be interpreted as a sequence of numeric values packed into a |V128| with a particular |shape|.
\begin{array}{l} \begin{array}{lll@{\qquad}l} \lanes_{t\K{x}N}(c) &=& c_0~\dots~c_{N-1} \\ \end{array} \\ \qquad \begin{array}[t]{@{}r@{~}l@{}} (\where & B = |t| / 8 \\ \wedge & b^{16} = \bytes_{\i128}(c) \\ \wedge & c_i = \bytes_{t}^{-1}(b^{16}[i \cdot B \slice B])) \end{array} \end{array}
These functions are bijections, so they are invertible.
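As an illustration, the lane decomposition for integer lanes can be sketched as follows. The function name and parameter order are assumptions made for this example, and floating-point lanes are not handled.

```python
def lanes(lane_bits, n_lanes, c):
    """Split a 128-bit value c into n_lanes unsigned integers of lane_bits each."""
    b16 = c.to_bytes(16, "little")   # b^16 = bytes_i128(c)
    width = lane_bits // 8           # B = |t| / 8
    return [
        int.from_bytes(b16[i * width:(i + 1) * width], "little")
        for i in range(n_lanes)
    ]
```

Lane 0 is taken from the lowest-addressed bytes, so it holds the least significant bits of the 128-bit value.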
.. index:: integer
.. index:: sign, signed integer, unsigned integer, uninterpreted integer, two's complement
Integer operators are defined on |iN| values. Operators that use a signed interpretation convert the value using the following definition, which takes the two's complement when the value lies in the upper half of the value range (i.e., its most significant bit is 1):
\begin{array}{lll@{\qquad}l} \signed_N(i) &=& i & (0 \leq i < 2^{N-1}) \\ \signed_N(i) &=& i - 2^N & (2^{N-1} \leq i < 2^N) \\ \end{array}
This function is bijective, and hence invertible.
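A minimal sketch of the signed interpretation and its inverse, with illustrative function names:

```python
def signed(n_bits, i):
    """Interpret the N-bit unsigned value i as a two's complement number."""
    return i - (1 << n_bits) if i >= (1 << (n_bits - 1)) else i


def signed_inv(n_bits, j):
    """Map a signed value in [-2^(N-1), 2^(N-1) - 1] back to its unsigned form."""
    return j % (1 << n_bits)
```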
.. index:: Boolean
The integer result of predicates -- i.e., :ref:`tests <syntax-testop>` and :ref:`relational <syntax-relop>` operators -- is defined with the help of the following auxiliary function producing the value 1 or 0 depending on a condition.
\begin{array}{lll@{\qquad}l} \bool(C) &=& 1 & (\iff C) \\ \bool(C) &=& 0 & (\otherwise) \\ \end{array}
- Return the result of adding i_1 and i_2 modulo 2^N.
\begin{array}{@{}lcll} \iadd_N(i_1, i_2) &=& (i_1 + i_2) \mod 2^N \end{array}
- Return the result of subtracting i_2 from i_1 modulo 2^N.
\begin{array}{@{}lcll} \isub_N(i_1, i_2) &=& (i_1 - i_2 + 2^N) \mod 2^N \end{array}
- Return the result of multiplying i_1 and i_2 modulo 2^N.
\begin{array}{@{}lcll} \imul_N(i_1, i_2) &=& (i_1 \cdot i_2) \mod 2^N \end{array}
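The three modular operators translate directly to Python, whose ``%`` operator yields a non-negative result for a positive modulus. The function names are illustrative.

```python
def iadd(n_bits, i1, i2):
    return (i1 + i2) % (1 << n_bits)


def isub(n_bits, i1, i2):
    # Adding 2^N first mirrors the formula; Python's % would also work without it.
    return (i1 - i2 + (1 << n_bits)) % (1 << n_bits)


def imul(n_bits, i1, i2):
    return (i1 * i2) % (1 << n_bits)
```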
- If i_2 is 0, then the result is undefined.
- Else, return the result of dividing i_1 by i_2, truncated toward zero.
\begin{array}{@{}lcll} \idivu_N(i_1, 0) &=& \{\} \\ \idivu_N(i_1, i_2) &=& \trunc(i_1 / i_2) \\ \end{array}
Note
This operator is :ref:`partial <exec-op-partial>`.
- Let j_1 be the :ref:`signed interpretation <aux-signed>` of i_1.
- Let j_2 be the :ref:`signed interpretation <aux-signed>` of i_2.
- If j_2 is 0, then the result is undefined.
- Else if j_1 divided by j_2 is 2^{N-1}, then the result is undefined.
- Else, return the result of dividing j_1 by j_2, truncated toward zero.
\begin{array}{@{}lcll} \idivs_N(i_1, 0) &=& \{\} \\ \idivs_N(i_1, i_2) &=& \{\} \qquad\qquad (\iff \signed_N(i_1) / \signed_N(i_2) = 2^{N-1}) \\ \idivs_N(i_1, i_2) &=& \signed_N^{-1}(\trunc(\signed_N(i_1) / \signed_N(i_2))) \\ \end{array}
Note
This operator is :ref:`partial <exec-op-partial>`. Besides division by 0, the result of (-2^{N-1})/(-1) = +2^{N-1} is not representable as an N-bit signed integer.
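The unsigned and signed division operators might be sketched as below. Returning ``None`` to model the empty result set is an assumption of this sketch, as are the helper names. Note that Python's ``//`` rounds toward negative infinity, so truncation toward zero is implemented explicitly.

```python
def signed(n_bits, i):
    return i - (1 << n_bits) if i >= (1 << (n_bits - 1)) else i


def trunc_div(a, b):
    """Integer division truncated toward zero (b must be non-zero)."""
    q = abs(a) // abs(b)
    return q if (a < 0) == (b < 0) else -q


def idiv_u(n_bits, i1, i2):
    return None if i2 == 0 else i1 // i2


def idiv_s(n_bits, i1, i2):
    j1, j2 = signed(n_bits, i1), signed(n_bits, i2)
    if j2 == 0:
        return None                  # division by zero
    q = trunc_div(j1, j2)
    if q == 1 << (n_bits - 1):
        return None                  # (-2^(N-1)) / (-1) is not representable
    return q % (1 << n_bits)         # back to the unsigned representation
```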
- If i_2 is 0, then the result is undefined.
- Else, return the remainder of dividing i_1 by i_2.
\begin{array}{@{}lcll} \iremu_N(i_1, 0) &=& \{\} \\ \iremu_N(i_1, i_2) &=& i_1 - i_2 \cdot \trunc(i_1 / i_2) \\ \end{array}
Note
This operator is :ref:`partial <exec-op-partial>`.
As long as both operators are defined, it holds that i_1 = i_2\cdot\idivu(i_1, i_2) + \iremu(i_1, i_2).
- Let j_1 be the :ref:`signed interpretation <aux-signed>` of i_1.
- Let j_2 be the :ref:`signed interpretation <aux-signed>` of i_2.
- If i_2 is 0, then the result is undefined.
- Else, return the remainder of dividing j_1 by j_2, with the sign of the dividend j_1.
\begin{array}{@{}lcll} \irems_N(i_1, 0) &=& \{\} \\ \irems_N(i_1, i_2) &=& \signed_N^{-1}(j_1 - j_2 \cdot \trunc(j_1 / j_2)) \\ && (\where j_1 = \signed_N(i_1) \wedge j_2 = \signed_N(i_2)) \\ \end{array}
Note
This operator is :ref:`partial <exec-op-partial>`.
As long as both operators are defined, it holds that i_1 = i_2\cdot\idivs(i_1, i_2) + \irems(i_1, i_2).
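A corresponding sketch for the remainder operators, again using ``None`` for the empty result set (an assumption of this example). The signed remainder takes the sign of the dividend, unlike Python's ``%``, so it is computed on magnitudes.

```python
def signed(n_bits, i):
    return i - (1 << n_bits) if i >= (1 << (n_bits - 1)) else i


def irem_u(n_bits, i1, i2):
    return None if i2 == 0 else i1 % i2


def irem_s(n_bits, i1, i2):
    j1, j2 = signed(n_bits, i1), signed(n_bits, i2)
    if j2 == 0:
        return None
    r = abs(j1) % abs(j2)            # magnitude of the remainder
    if j1 < 0:
        r = -r                       # sign follows the dividend j_1
    return r % (1 << n_bits)
```

Unlike signed division, the signed remainder is defined for the operands -2^{N-1} and -1, where it yields 0.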
- Return the bitwise negation of i.
\begin{array}{@{}lcll} \inot_N(i) &=& \ibits_N^{-1}(\ibits_N(i) \veebar \ibits_N(2^N-1)) \end{array}
- Return the bitwise conjunction of i_1 and i_2.
\begin{array}{@{}lcll} \iand_N(i_1, i_2) &=& \ibits_N^{-1}(\ibits_N(i_1) \wedge \ibits_N(i_2)) \end{array}
- Return the bitwise conjunction of i_1 and the bitwise negation of i_2.
\begin{array}{@{}lcll} \iandnot_N(i_1, i_2) &=& \iand_N(i_1, \inot_N(i_2)) \end{array}
- Return the bitwise disjunction of i_1 and i_2.
\begin{array}{@{}lcll} \ior_N(i_1, i_2) &=& \ibits_N^{-1}(\ibits_N(i_1) \vee \ibits_N(i_2)) \end{array}
- Return the bitwise exclusive disjunction of i_1 and i_2.
\begin{array}{@{}lcll} \ixor_N(i_1, i_2) &=& \ibits_N^{-1}(\ibits_N(i_1) \veebar \ibits_N(i_2)) \end{array}
- Let k be i_2 modulo N.
- Return the result of shifting i_1 left by k bits, modulo 2^N.
\begin{array}{@{}lcll} \ishl_N(i_1, i_2) &=& \ibits_N^{-1}(d_2^{N-k}~0^k) & (\iff \ibits_N(i_1) = d_1^k~d_2^{N-k} \wedge k = i_2 \mod N) \end{array}
- Let k be i_2 modulo N.
- Return the result of shifting i_1 right by k bits, extended with 0 bits.
\begin{array}{@{}lcll} \ishru_N(i_1, i_2) &=& \ibits_N^{-1}(0^k~d_1^{N-k}) & (\iff \ibits_N(i_1) = d_1^{N-k}~d_2^k \wedge k = i_2 \mod N) \end{array}
- Let k be i_2 modulo N.
- Return the result of shifting i_1 right by k bits, extended with the most significant bit of the original value.
\begin{array}{@{}lcll} \ishrs_N(i_1, i_2) &=& \ibits_N^{-1}(d_0^{k+1}~d_1^{N-k-1}) & (\iff \ibits_N(i_1) = d_0~d_1^{N-k-1}~d_2^k \wedge k = i_2 \mod N) \end{array}
- Let k be i_2 modulo N.
- Return the result of rotating i_1 left by k bits.
\begin{array}{@{}lcll} \irotl_N(i_1, i_2) &=& \ibits_N^{-1}(d_2^{N-k}~d_1^k) & (\iff \ibits_N(i_1) = d_1^k~d_2^{N-k} \wedge k = i_2 \mod N) \end{array}
- Let k be i_2 modulo N.
- Return the result of rotating i_1 right by k bits.
\begin{array}{@{}lcll} \irotr_N(i_1, i_2) &=& \ibits_N^{-1}(d_2^k~d_1^{N-k}) & (\iff \ibits_N(i_1) = d_1^{N-k}~d_2^k \wedge k = i_2 \mod N) \end{array}
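The shift and rotate operators can be sketched with Python's unbounded integers. The names are illustrative; in every case the shift count is first reduced modulo N.

```python
def ishl(n, i1, i2):
    k = i2 % n
    return (i1 << k) % (1 << n)


def ishr_u(n, i1, i2):
    return i1 >> (i2 % n)


def ishr_s(n, i1, i2):
    k = i2 % n
    j = i1 - (1 << n) if i1 >> (n - 1) else i1   # signed interpretation
    return (j >> k) % (1 << n)                   # >> on negative ints sign-extends


def irotl(n, i1, i2):
    k = i2 % n
    return ((i1 << k) | (i1 >> (n - k))) % (1 << n)


def irotr(n, i1, i2):
    k = i2 % n
    return ((i1 >> k) | (i1 << (n - k))) % (1 << n)
```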
- Return the count of leading zero bits in i; all bits are considered leading zeros if i is 0.
\begin{array}{@{}lcll} \iclz_N(i) &=& k & (\iff \ibits_N(i) = 0^k~(1~d^\ast)^?) \end{array}
- Return the count of trailing zero bits in i; all bits are considered trailing zeros if i is 0.
\begin{array}{@{}lcll} \ictz_N(i) &=& k & (\iff \ibits_N(i) = (d^\ast~1)^?~0^k) \end{array}
- Return the count of non-zero bits in i.
\begin{array}{@{}lcll} \ipopcnt_N(i) &=& k & (\iff \ibits_N(i) = (0^\ast~1)^k~0^\ast) \end{array}
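The three bit-counting operators might be expressed in Python as follows, using hypothetical function names and assuming 0 <= i < 2^N.

```python
def iclz(n, i):
    """Leading zeros: N minus the position of the highest set bit."""
    return n - i.bit_length()


def ictz(n, i):
    """Trailing zeros: index of the lowest set bit, or N if i is 0."""
    return n if i == 0 else (i & -i).bit_length() - 1


def ipopcnt(n, i):
    """Number of bits set to 1."""
    return bin(i).count("1")
```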
- Return 1 if i is zero, 0 otherwise.
\begin{array}{@{}lcll} \ieqz_N(i) &=& \bool(i = 0) \end{array}
- Return 1 if i_1 equals i_2, 0 otherwise.
\begin{array}{@{}lcll} \ieq_N(i_1, i_2) &=& \bool(i_1 = i_2) \end{array}
- Return 1 if i_1 does not equal i_2, 0 otherwise.
\begin{array}{@{}lcll} \ine_N(i_1, i_2) &=& \bool(i_1 \neq i_2) \end{array}
- Return 1 if i_1 is less than i_2, 0 otherwise.
\begin{array}{@{}lcll} \iltu_N(i_1, i_2) &=& \bool(i_1 < i_2) \end{array}
- Let j_1 be the :ref:`signed interpretation <aux-signed>` of i_1.
- Let j_2 be the :ref:`signed interpretation <aux-signed>` of i_2.
- Return 1 if j_1 is less than j_2, 0 otherwise.
\begin{array}{@{}lcll} \ilts_N(i_1, i_2) &=& \bool(\signed_N(i_1) < \signed_N(i_2)) \end{array}
- Return 1 if i_1 is greater than i_2, 0 otherwise.
\begin{array}{@{}lcll} \igtu_N(i_1, i_2) &=& \bool(i_1 > i_2) \end{array}
- Let j_1 be the :ref:`signed interpretation <aux-signed>` of i_1.
- Let j_2 be the :ref:`signed interpretation <aux-signed>` of i_2.
- Return 1 if j_1 is greater than j_2, 0 otherwise.
\begin{array}{@{}lcll} \igts_N(i_1, i_2) &=& \bool(\signed_N(i_1) > \signed_N(i_2)) \end{array}
- Return 1 if i_1 is less than or equal to i_2, 0 otherwise.
\begin{array}{@{}lcll} \ileu_N(i_1, i_2) &=& \bool(i_1 \leq i_2) \end{array}
- Let j_1 be the :ref:`signed interpretation <aux-signed>` of i_1.
- Let j_2 be the :ref:`signed interpretation <aux-signed>` of i_2.
- Return 1 if j_1 is less than or equal to j_2, 0 otherwise.
\begin{array}{@{}lcll} \iles_N(i_1, i_2) &=& \bool(\signed_N(i_1) \leq \signed_N(i_2)) \end{array}
- Return 1 if i_1 is greater than or equal to i_2, 0 otherwise.
\begin{array}{@{}lcll} \igeu_N(i_1, i_2) &=& \bool(i_1 \geq i_2) \end{array}
- Let j_1 be the :ref:`signed interpretation <aux-signed>` of i_1.
- Let j_2 be the :ref:`signed interpretation <aux-signed>` of i_2.
- Return 1 if j_1 is greater than or equal to j_2, 0 otherwise.
\begin{array}{@{}lcll} \iges_N(i_1, i_2) &=& \bool(\signed_N(i_1) \geq \signed_N(i_2)) \end{array}
- Return \extends_{M,N}(i).
\begin{array}{lll@{\qquad}l} \iextendMs_{N}(i) &=& \extends_{M,N}(i) \\ \end{array}
- Let j_1 be the bitwise conjunction of i_1 and i_3.
- Let j_3' be the bitwise negation of i_3.
- Let j_2 be the bitwise conjunction of i_2 and j_3'.
- Return the bitwise disjunction of j_1 and j_2.
\begin{array}{@{}lcll} \ibitselect_N(i_1, i_2, i_3) &=& \ior_N(\iand_N(i_1, i_3), \iand_N(i_2, \inot_N(i_3))) \end{array}
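As a sketch, bit selection combines the bitwise operators so that each result bit comes from i_1 where the mask i_3 has a 1 and from i_2 where it has a 0. The function names are illustrative.

```python
def inot(n, i):
    return i ^ ((1 << n) - 1)        # xor with the all-ones value 2^N - 1


def ibitselect(n, i1, i2, i3):
    return (i1 & i3) | (i2 & inot(n, i3))
```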
- Let j be the :ref:`signed interpretation <aux-signed>` of i.
- If j is greater than or equal to 0, then return i.
- Else return the negation of j, modulo 2^N.
\begin{array}{@{}lcll} \iabs_N(i) &=& i & (\iff \signed_N(i) \ge 0) \\ \iabs_N(i) &=& -\signed_N(i) \mod 2^N & (\otherwise) \\ \end{array}
- Return the result of negating i, modulo 2^N.
\begin{array}{@{}lcll} \ineg_N(i) &=& (2^N - i) \mod 2^N \end{array}
- Return i_1 if \iltu_N(i_1, i_2) is 1, return i_2 otherwise.
\begin{array}{@{}lcll} \iminu_N(i_1, i_2) &=& i_1 & (\iff \iltu_N(i_1, i_2) = 1)\\ \iminu_N(i_1, i_2) &=& i_2 & (\otherwise) \end{array}
- Return i_1 if \ilts_N(i_1, i_2) is 1, return i_2 otherwise.
\begin{array}{@{}lcll} \imins_N(i_1, i_2) &=& i_1 & (\iff \ilts_N(i_1, i_2) = 1)\\ \imins_N(i_1, i_2) &=& i_2 & (\otherwise) \end{array}
- Return i_1 if \igtu_N(i_1, i_2) is 1, return i_2 otherwise.
\begin{array}{@{}lcll} \imaxu_N(i_1, i_2) &=& i_1 & (\iff \igtu_N(i_1, i_2) = 1)\\ \imaxu_N(i_1, i_2) &=& i_2 & (\otherwise) \end{array}
- Return i_1 if \igts_N(i_1, i_2) is 1, return i_2 otherwise.
\begin{array}{@{}lcll} \imaxs_N(i_1, i_2) &=& i_1 & (\iff \igts_N(i_1, i_2) = 1)\\ \imaxs_N(i_1, i_2) &=& i_2 & (\otherwise) \end{array}
- Let i be the result of adding i_1 and i_2.
- Return \satu_N(i).
\begin{array}{lll@{\qquad}l} \iaddsatu_N(i_1, i_2) &=& \satu_N(i_1 + i_2) \end{array}
- Let j_1 be the :ref:`signed interpretation <aux-signed>` of i_1.
- Let j_2 be the :ref:`signed interpretation <aux-signed>` of i_2.
- Let j be the result of adding j_1 and j_2.
- Return \sats_N(j).
\begin{array}{lll@{\qquad}l} \iaddsats_N(i_1, i_2) &=& \sats_N(\signed_N(i_1) + \signed_N(i_2)) \end{array}
- Let i be the result of subtracting i_2 from i_1.
- Return \satu_N(i).
\begin{array}{lll@{\qquad}l} \isubsatu_N(i_1, i_2) &=& \satu_N(i_1 - i_2) \end{array}
- Let j_1 be the :ref:`signed interpretation <aux-signed>` of i_1.
- Let j_2 be the :ref:`signed interpretation <aux-signed>` of i_2.
- Let j be the result of subtracting j_2 from j_1.
- Return \sats_N(j).
\begin{array}{lll@{\qquad}l} \isubsats_N(i_1, i_2) &=& \sats_N(\signed_N(i_1) - \signed_N(i_2)) \end{array}
- Let j be the result of adding i_1, i_2, and 1.
- Return the result of dividing j by 2, truncated toward zero.
\begin{array}{lll@{\qquad}l} \iavgru_N(i_1, i_2) &=& \trunc((i_1 + i_2 + 1) / 2) \end{array}
- Return the result of \sats_N(\ishrs_N(i_1 \cdot i_2 + 2^{14}, 15)).
\begin{array}{lll@{\qquad}l} \iq15mulrsats_N(i_1, i_2) &=& \sats_N(\ishrs_N(i_1 \cdot i_2 + 2^{14}, 15)) \end{array}
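The rounding average and the Q15 rounding multiplication might be sketched as below. Two assumptions are made here that the formulas do not state explicitly: the Q15 operands are read through their signed interpretation, and the width is fixed to N = 16. The function names are illustrative.

```python
def signed(n_bits, i):
    return i - (1 << n_bits) if i >= (1 << (n_bits - 1)) else i


def iavgr_u(n_bits, i1, i2):
    """Unsigned average, rounding halves up; the sum is computed without wrapping."""
    return (i1 + i2 + 1) // 2


def iq15mulr_sat_s(i1, i2):
    """Q15 multiplication with rounding and signed saturation (N = 16 assumed)."""
    j1, j2 = signed(16, i1), signed(16, i2)
    product = (j1 * j2 + (1 << 14)) >> 15            # add 2^14, then shift right by 15
    clamped = max(-(1 << 15), min(product, (1 << 15) - 1))
    return clamped % (1 << 16)                       # back to the unsigned form
```

In Q15 fixed point, 0x4000 represents 0.5, so multiplying it by itself yields 0x2000 (0.25). The only saturating case is (-1) * (-1), which would otherwise produce +1.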
.. index:: floating-point, IEEE 754
Floating-point arithmetic follows the |IEEE754|_ standard, with the following qualifications:
- All operators use round-to-nearest ties-to-even, except where otherwise specified. Non-default directed rounding attributes are not supported.
- Following the recommendation that operators propagate :ref:`NaN <syntax-nan>` payloads from their operands is permitted but not required.
- All operators use "non-stop" mode, and floating-point exceptions are not otherwise observable. In particular, neither alternate floating-point exception handling attributes nor operators on status flags are supported. There is no observable difference between quiet and signalling NaNs.
Note
Some of these limitations may be lifted in future versions of WebAssembly.
.. index:: rounding
Rounding is always round-to-nearest ties-to-even, in correspondence with |IEEE754|_ (Section 4.3.1).
An exact floating-point number is a rational number that is exactly representable as a :ref:`floating-point number <syntax-float>` of given bit width N.
A limit number for a given floating-point bit width N is a positive or negative number whose magnitude is the smallest power of 2 that is not exactly representable as a floating-point number of width N (that magnitude is 2^{128} for N = 32 and 2^{1024} for N = 64).
A candidate number is either an exact floating-point number or a positive or negative limit number for the given bit width N.
A candidate pair is a pair z_1,z_2 of candidate numbers, such that no candidate number exists that lies between the two.
A real number r is converted to a floating-point value of bit width N as follows:
- If r is 0, then return +0.
- Else if r is an exact floating-point number, then return r.
- Else if r is greater than or equal to the positive limit, then return +\infty.
- Else if r is less than or equal to the negative limit, then return -\infty.
- Else if z_1 and z_2 are a candidate pair such that z_1 < r < z_2, then:

  - If |r - z_1| < |r - z_2|, then let z be z_1.
  - Else if |r - z_1| > |r - z_2|, then let z be z_2.
  - Else if |r - z_1| = |r - z_2| and the :ref:`significand <syntax-float>` of z_1 is even, then let z be z_1.
  - Else, let z be z_2.

- If z is 0, then:

  - If r < 0, then return -0.
  - Else, return +0.

- Else if z is a limit number, then:

  - If r < 0, then return -\infty.
  - Else, return +\infty.

- Else, return z.
\begin{array}{lll@{\qquad}l} \ieee_N(0) &=& +0 \\ \ieee_N(r) &=& r & (\iff r \in \F{exact}_N) \\ \ieee_N(r) &=& +\infty & (\iff r \geq +\F{limit}_N) \\ \ieee_N(r) &=& -\infty & (\iff r \leq -\F{limit}_N) \\ \ieee_N(r) &=& \F{closest}_N(r, z_1, z_2) & (\iff z_1 < r < z_2 \wedge (z_1,z_2) \in \F{candidatepair}_N) \\[1ex] \F{closest}_N(r, z_1, z_2) &=& \F{rectify}_N(r, z_1) & (\iff |r-z_1|<|r-z_2|) \\ \F{closest}_N(r, z_1, z_2) &=& \F{rectify}_N(r, z_2) & (\iff |r-z_1|>|r-z_2|) \\ \F{closest}_N(r, z_1, z_2) &=& \F{rectify}_N(r, z_1) & (\iff |r-z_1|=|r-z_2| \wedge \F{even}_N(z_1)) \\ \F{closest}_N(r, z_1, z_2) &=& \F{rectify}_N(r, z_2) & (\iff |r-z_1|=|r-z_2| \wedge \F{even}_N(z_2)) \\[1ex] \F{rectify}_N(r, \pm \F{limit}_N) &=& \pm \infty \\ \F{rectify}_N(r, 0) &=& +0 \qquad (r \geq 0) \\ \F{rectify}_N(r, 0) &=& -0 \qquad (r < 0) \\ \F{rectify}_N(r, z) &=& z \\ \end{array}
where:
\begin{array}{lll@{\qquad}l} \F{exact}_N &=& \fN \cap \mathbb{Q} \\ \F{limit}_N &=& 2^{2^{\exponent(N)-1}} \\ \F{candidate}_N &=& \F{exact}_N \cup \{+\F{limit}_N, -\F{limit}_N\} \\ \F{candidatepair}_N &=& \{ (z_1, z_2) \in \F{candidate}_N^2 ~|~ z_1 < z_2 \wedge \forall z \in \F{candidate}_N, z \leq z_1 \vee z \geq z_2\} \\[1ex] \F{even}_N((d + m\cdot 2^{-M}) \cdot 2^e) &\Leftrightarrow& m \mod 2 = 0 \\ \F{even}_N(\pm \F{limit}_N) &\Leftrightarrow& \F{true} \\ \end{array}
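The effect of ties-to-even can be observed for N = 64 with exact rationals. The sketch below relies on an assumption about the host: CPython's true division of two integers is correctly rounded to the nearest double. It also does not cover the limit cases, because CPython raises ``OverflowError`` where this definition returns an infinity. The name ``ieee64`` is hypothetical.

```python
from fractions import Fraction


def ieee64(r):
    """Round an exact rational to the nearest double (finite results only)."""
    return r.numerator / r.denominator


# 2^53 + 1 lies exactly halfway between the candidates 2^53 and 2^53 + 2;
# the significand of 2^53 is even, so the tie resolves downwards.
tie_down = ieee64(Fraction(2**53 + 1))

# 2^53 + 3 lies halfway between 2^53 + 2 (odd significand) and 2^53 + 4 (even),
# so the tie resolves upwards.
tie_up = ieee64(Fraction(2**53 + 3))
```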
.. index:: NaN, determinism, non-determinism
When the result of a floating-point operator other than |fneg|, |fabs|, or |fcopysign| is a :ref:`NaN <syntax-nan>`, then its sign is non-deterministic and the :ref:`payload <syntax-payload>` is computed as follows:
- If the payload of all NaN inputs to the operator is :ref:`canonical <canonical-nan>` (including the case that there are no NaN inputs), then the payload of the output is canonical as well.
- Otherwise the payload is picked non-deterministically among all :ref:`arithmetic NaNs <arithmetic-nan>`; that is, its most significant bit is 1 and all others are unspecified.
- In the :ref:`deterministic profile <profile-deterministic>`, only positive canonical NaN outputs are produced.
This non-deterministic result is expressed by the following auxiliary function producing a set of allowed outputs from a set of inputs:
\begin{array}{llcl@{\qquad}l} & \nans_N\{z^\ast\} &=& \{ + \NAN(\canon_N) \} \\ \exprofiles{\PROFDET} & \nans_N\{z^\ast\} &=& \{ + \NAN(n), - \NAN(n) ~|~ n = \canon_N \} & (\iff \forall \,{\pm \NAN(n)} \in z^\ast,~ n = \canon_N) \\ \exprofiles{\PROFDET} & \nans_N\{z^\ast\} &=& \{ + \NAN(n), - \NAN(n) ~|~ n \geq \canon_N \} & (\iff \exists \,{\pm \NAN(n)} \in z^\ast,~ n \neq \canon_N) \\ \end{array}
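The payload rule can be phrased as a membership test. In the sketch below, ``input_payloads`` holds the payloads of the NaN operands only, ``m_bits`` is the significand width M, and the canonical payload is taken to be 2^{M-1}. All names are hypothetical, and the sign, which may be either one outside the deterministic profile, is not modelled.

```python
def nan_payload_allowed(input_payloads, n, m_bits=23, deterministic=False):
    """Return True if payload n is an allowed NaN output payload."""
    canon = 1 << (m_bits - 1)
    if deterministic or all(p == canon for p in input_payloads):
        return n == canon                      # only the canonical payload
    return canon <= n < (1 << m_bits)          # any arithmetic NaN payload
```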
- If either z_1 or z_2 is a NaN, then return an element of \nans_N\{z_1, z_2\}.
- Else if both z_1 and z_2 are infinities of opposite signs, then return an element of \nans_N\{\}.
- Else if both z_1 and z_2 are infinities of equal sign, then return that infinity.
- Else if either z_1 or z_2 is an infinity, then return that infinity.
- Else if both z_1 and z_2 are zeroes of opposite sign, then return positive zero.
- Else if both z_1 and z_2 are zeroes of equal sign, then return that zero.
- Else if either z_1 or z_2 is a zero, then return the other operand.
- Else if both z_1 and z_2 are values with the same magnitude but opposite signs, then return positive zero.
- Else return the result of adding z_1 and z_2, :ref:`rounded <aux-ieee>` to the nearest representable value.
\begin{array}{@{}lcll} \fadd_N(\pm \NAN(n), z_2) &=& \nans_N\{\pm \NAN(n), z_2\} \\ \fadd_N(z_1, \pm \NAN(n)) &=& \nans_N\{\pm \NAN(n), z_1\} \\ \fadd_N(\pm \infty, \mp \infty) &=& \nans_N\{\} \\ \fadd_N(\pm \infty, \pm \infty) &=& \pm \infty \\ \fadd_N(z_1, \pm \infty) &=& \pm \infty \\ \fadd_N(\pm \infty, z_2) &=& \pm \infty \\ \fadd_N(\pm 0, \mp 0) &=& +0 \\ \fadd_N(\pm 0, \pm 0) &=& \pm 0 \\ \fadd_N(z_1, \pm 0) &=& z_1 \\ \fadd_N(\pm 0, z_2) &=& z_2 \\ \fadd_N(\pm q, \mp q) &=& +0 \\ \fadd_N(z_1, z_2) &=& \ieee_N(z_1 + z_2) \\ \end{array}
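Python floats are IEEE 754 doubles on mainstream platforms, so several of the special cases for N = 64 can be observed directly. This is a sketch under that assumption; ``sign_bit`` is a hypothetical helper, and NaN signs and payloads are not examined.

```python
import math


def sign_bit(x):
    """Return 1 for a negative sign (including -0.0), else 0."""
    return 1 if math.copysign(1.0, x) < 0 else 0


opposite_infinities = math.inf + -math.inf   # a NaN
opposite_zeroes = 0.0 + -0.0                 # positive zero
equal_zeroes = -0.0 + -0.0                   # that zero, i.e. negative zero
cancellation = 1.5 + -1.5                    # positive zero
```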
- If either z_1 or z_2 is a NaN, then return an element of \nans_N\{z_1, z_2\}.
- Else if both z_1 and z_2 are infinities of equal signs, then return an element of \nans_N\{\}.
- Else if both z_1 and z_2 are infinities of opposite sign, then return z_1.
- Else if z_1 is an infinity, then return that infinity.
- Else if z_2 is an infinity, then return that infinity negated.
- Else if both z_1 and z_2 are zeroes of equal sign, then return positive zero.
- Else if both z_1 and z_2 are zeroes of opposite sign, then return z_1.
- Else if z_2 is a zero, then return z_1.
- Else if z_1 is a zero, then return z_2 negated.
- Else if both z_1 and z_2 are the same value, then return positive zero.
- Else return the result of subtracting z_2 from z_1, :ref:`rounded <aux-ieee>` to the nearest representable value.
\begin{array}{@{}lcll} \fsub_N(\pm \NAN(n), z_2) &=& \nans_N\{\pm \NAN(n), z_2\} \\ \fsub_N(z_1, \pm \NAN(n)) &=& \nans_N\{\pm \NAN(n), z_1\} \\ \fsub_N(\pm \infty, \pm \infty) &=& \nans_N\{\} \\ \fsub_N(\pm \infty, \mp \infty) &=& \pm \infty \\ \fsub_N(z_1, \pm \infty) &=& \mp \infty \\ \fsub_N(\pm \infty, z_2) &=& \pm \infty \\ \fsub_N(\pm 0, \pm 0) &=& +0 \\ \fsub_N(\pm 0, \mp 0) &=& \pm 0 \\ \fsub_N(z_1, \pm 0) &=& z_1 \\ \fsub_N(\pm 0, \pm q_2) &=& \mp q_2 \\ \fsub_N(\pm q, \pm q) &=& +0 \\ \fsub_N(z_1, z_2) &=& \ieee_N(z_1 - z_2) \\ \end{array}
Note
Up to the non-determinism regarding NaNs, it always holds that \fsub_N(z_1, z_2) = \fadd_N(z_1, \fneg_N(z_2)).
- If either z_1 or z_2 is a NaN, then return an element of \nans_N\{z_1, z_2\}.
- Else if one of z_1 and z_2 is a zero and the other an infinity, then return an element of \nans_N\{\}.
- Else if both z_1 and z_2 are infinities of equal sign, then return positive infinity.
- Else if both z_1 and z_2 are infinities of opposite sign, then return negative infinity.
- Else if either z_1 or z_2 is an infinity and the other a value with equal sign, then return positive infinity.
- Else if either z_1 or z_2 is an infinity and the other a value with opposite sign, then return negative infinity.
- Else if both z_1 and z_2 are zeroes of equal sign, then return positive zero.
- Else if both z_1 and z_2 are zeroes of opposite sign, then return negative zero.
- Else return the result of multiplying z_1 and z_2, :ref:`rounded <aux-ieee>` to the nearest representable value.
\begin{array}{@{}lcll} \fmul_N(\pm \NAN(n), z_2) &=& \nans_N\{\pm \NAN(n), z_2\} \\ \fmul_N(z_1, \pm \NAN(n)) &=& \nans_N\{\pm \NAN(n), z_1\} \\ \fmul_N(\pm \infty, \pm 0) &=& \nans_N\{\} \\ \fmul_N(\pm \infty, \mp 0) &=& \nans_N\{\} \\ \fmul_N(\pm 0, \pm \infty) &=& \nans_N\{\} \\ \fmul_N(\pm 0, \mp \infty) &=& \nans_N\{\} \\ \fmul_N(\pm \infty, \pm \infty) &=& +\infty \\ \fmul_N(\pm \infty, \mp \infty) &=& -\infty \\ \fmul_N(\pm q_1, \pm \infty) &=& +\infty \\ \fmul_N(\pm q_1, \mp \infty) &=& -\infty \\ \fmul_N(\pm \infty, \pm q_2) &=& +\infty \\ \fmul_N(\pm \infty, \mp q_2) &=& -\infty \\ \fmul_N(\pm 0, \pm 0) &=& + 0 \\ \fmul_N(\pm 0, \mp 0) &=& - 0 \\ \fmul_N(z_1, z_2) &=& \ieee_N(z_1 \cdot z_2) \\ \end{array}
- If either z_1 or z_2 is a NaN, then return an element of \nans_N\{z_1, z_2\}.
- Else if both z_1 and z_2 are infinities, then return an element of \nans_N\{\}.
- Else if both z_1 and z_2 are zeroes, then return an element of \nans_N\{\}.
- Else if z_1 is an infinity and z_2 a value with equal sign, then return positive infinity.
- Else if z_1 is an infinity and z_2 a value with opposite sign, then return negative infinity.
- Else if z_2 is an infinity and z_1 a value with equal sign, then return positive zero.
- Else if z_2 is an infinity and z_1 a value with opposite sign, then return negative zero.
- Else if z_1 is a zero and z_2 a value with equal sign, then return positive zero.
- Else if z_1 is a zero and z_2 a value with opposite sign, then return negative zero.
- Else if z_2 is a zero and z_1 a value with equal sign, then return positive infinity.
- Else if z_2 is a zero and z_1 a value with opposite sign, then return negative infinity.
- Else return the result of dividing z_1 by z_2, :ref:`rounded <aux-ieee>` to the nearest representable value.
\begin{array}{@{}lcll} \fdiv_N(\pm \NAN(n), z_2) &=& \nans_N\{\pm \NAN(n), z_2\} \\ \fdiv_N(z_1, \pm \NAN(n)) &=& \nans_N\{\pm \NAN(n), z_1\} \\ \fdiv_N(\pm \infty, \pm \infty) &=& \nans_N\{\} \\ \fdiv_N(\pm \infty, \mp \infty) &=& \nans_N\{\} \\ \fdiv_N(\pm 0, \pm 0) &=& \nans_N\{\} \\ \fdiv_N(\pm 0, \mp 0) &=& \nans_N\{\} \\ \fdiv_N(\pm \infty, \pm q_2) &=& +\infty \\ \fdiv_N(\pm \infty, \mp q_2) &=& -\infty \\ \fdiv_N(\pm q_1, \pm \infty) &=& +0 \\ \fdiv_N(\pm q_1, \mp \infty) &=& -0 \\ \fdiv_N(\pm 0, \pm q_2) &=& +0 \\ \fdiv_N(\pm 0, \mp q_2) &=& -0 \\ \fdiv_N(\pm q_1, \pm 0) &=& +\infty \\ \fdiv_N(\pm q_1, \mp 0) &=& -\infty \\ \fdiv_N(z_1, z_2) &=& \ieee_N(z_1 / z_2) \\ \end{array}
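Python raises ``ZeroDivisionError`` for float division by zero, so the case analysis has to be written out to obtain the behaviour described here. The following is a sketch for N = 64 with a hypothetical function name; every NaN result is represented by the single value ``math.nan``, which ignores the non-determinism of sign and payload.

```python
import math


def fdiv(z1, z2):
    if math.isnan(z1) or math.isnan(z2):
        return math.nan
    if math.isinf(z1) and math.isinf(z2):
        return math.nan                       # infinity / infinity
    if z1 == 0 and z2 == 0:
        return math.nan                       # zero / zero
    negative = math.copysign(1.0, z1) != math.copysign(1.0, z2)
    if math.isinf(z1) or z2 == 0:
        return -math.inf if negative else math.inf
    if math.isinf(z2) or z1 == 0:
        return -0.0 if negative else 0.0
    return z1 / z2                            # ordinary rounded quotient
```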
- If either z_1 or z_2 is a NaN, then return an element of \nans_N\{z_1, z_2\}.
- Else if either z_1 or z_2 is a negative infinity, then return negative infinity.
- Else if either z_1 or z_2 is a positive infinity, then return the other value.
- Else if both z_1 and z_2 are zeroes of opposite signs, then return negative zero.
- Else return the smaller value of z_1 and z_2.
\begin{array}{@{}lcll} \fmin_N(\pm \NAN(n), z_2) &=& \nans_N\{\pm \NAN(n), z_2\} \\ \fmin_N(z_1, \pm \NAN(n)) &=& \nans_N\{\pm \NAN(n), z_1\} \\ \fmin_N(+ \infty, z_2) &=& z_2 \\ \fmin_N(- \infty, z_2) &=& - \infty \\ \fmin_N(z_1, + \infty) &=& z_1 \\ \fmin_N(z_1, - \infty) &=& - \infty \\ \fmin_N(\pm 0, \mp 0) &=& -0 \\ \fmin_N(z_1, z_2) &=& z_1 & (\iff z_1 \leq z_2) \\ \fmin_N(z_1, z_2) &=& z_2 & (\iff z_2 \leq z_1) \\ \end{array}
- If either z_1 or z_2 is a NaN, then return an element of \nans_N\{z_1, z_2\}.
- Else if either z_1 or z_2 is a positive infinity, then return positive infinity.
- Else if either z_1 or z_2 is a negative infinity, then return the other value.
- Else if both z_1 and z_2 are zeroes of opposite signs, then return positive zero.
- Else return the larger value of z_1 and z_2.
\begin{array}{@{}lcll} \fmax_N(\pm \NAN(n), z_2) &=& \nans_N\{\pm \NAN(n), z_2\} \\ \fmax_N(z_1, \pm \NAN(n)) &=& \nans_N\{\pm \NAN(n), z_1\} \\ \fmax_N(+ \infty, z_2) &=& + \infty \\ \fmax_N(- \infty, z_2) &=& z_2 \\ \fmax_N(z_1, + \infty) &=& + \infty \\ \fmax_N(z_1, - \infty) &=& z_1 \\ \fmax_N(\pm 0, \mp 0) &=& +0 \\ \fmax_N(z_1, z_2) &=& z_1 & (\iff z_1 \geq z_2) \\ \fmax_N(z_1, z_2) &=& z_2 & (\iff z_2 \geq z_1) \\ \end{array}
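Python's built-in ``min`` and ``max`` neither propagate NaNs nor order the two zeroes, so a sketch of the minimum and maximum operators needs explicit cases. The function names are illustrative, and all NaN results are collapsed to ``math.nan``.

```python
import math


def _is_negative(x):
    return math.copysign(1.0, x) < 0


def fmin(z1, z2):
    if math.isnan(z1) or math.isnan(z2):
        return math.nan
    if z1 == 0 and z2 == 0:
        # Zeroes compare equal; a negative zero wins if either operand is one.
        return -0.0 if _is_negative(z1) or _is_negative(z2) else 0.0
    return z1 if z1 <= z2 else z2


def fmax(z1, z2):
    if math.isnan(z1) or math.isnan(z2):
        return math.nan
    if z1 == 0 and z2 == 0:
        # A positive zero wins unless both operands are negative zeroes.
        return -0.0 if _is_negative(z1) and _is_negative(z2) else 0.0
    return z1 if z1 >= z2 else z2
```

The infinities need no separate cases in this sketch because the ordinary comparison already orders them correctly.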
- If z_1 and z_2 have the same sign, then return z_1.
- Else return z_1 with negated sign.
\begin{array}{@{}lcll} \fcopysign_N(\pm p_1, \pm p_2) &=& \pm p_1 \\ \fcopysign_N(\pm p_1, \mp p_2) &=& \mp p_1 \\ \end{array}
- If z is a NaN, then return z with positive sign.
- Else if z is an infinity, then return positive infinity.
- Else if z is a zero, then return positive zero.
- Else if z is a positive value, then return z.
- Else return z negated.
\begin{array}{@{}lcll} \fabs_N(\pm \NAN(n)) &=& +\NAN(n) \\ \fabs_N(\pm \infty) &=& +\infty \\ \fabs_N(\pm 0) &=& +0 \\ \fabs_N(\pm q) &=& +q \\ \end{array}
- If z is a NaN, then return z with negated sign.
- Else if z is an infinity, then return that infinity negated.
- Else if z is a zero, then return that zero negated.
- Else return z negated.
\begin{array}{@{}lcll} \fneg_N(\pm \NAN(n)) &=& \mp \NAN(n) \\ \fneg_N(\pm \infty) &=& \mp \infty \\ \fneg_N(\pm 0) &=& \mp 0 \\ \fneg_N(\pm q) &=& \mp q \\ \end{array}
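The sign copy, absolute value, and negation operators only affect the sign bit, which is why they preserve NaN payloads exactly. A sketch on raw 32-bit patterns makes this visible. The function names are hypothetical, and the arguments are assumed to be the bit patterns of binary32 values.

```python
SIGN_MASK = 0x8000_0000     # sign bit of a binary32 pattern
MAGNITUDE_MASK = 0x7FFF_FFFF


def fabs_bits(b):
    return b & MAGNITUDE_MASK


def fneg_bits(b):
    return b ^ SIGN_MASK


def fcopysign_bits(b1, b2):
    """Magnitude of b1 combined with the sign of b2."""
    return (b1 & MAGNITUDE_MASK) | (b2 & SIGN_MASK)
```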
- If z is a NaN, then return an element of \nans_N\{z\}.
- Else if z is negative infinity, then return an element of \nans_N\{\}.
- Else if z is positive infinity, then return positive infinity.
- Else if z is a zero, then return that zero.
- Else if z has a negative sign, then return an element of \nans_N\{\}.
- Else return the square root of z.
\begin{array}{@{}lcll} \fsqrt_N(\pm \NAN(n)) &=& \nans_N\{\pm \NAN(n)\} \\ \fsqrt_N(- \infty) &=& \nans_N\{\} \\ \fsqrt_N(+ \infty) &=& + \infty \\ \fsqrt_N(\pm 0) &=& \pm 0 \\ \fsqrt_N(- q) &=& \nans_N\{\} \\ \fsqrt_N(+ q) &=& \ieee_N\left(\sqrt{q}\right) \\ \end{array}
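Python's ``math.sqrt`` raises ``ValueError`` for negative arguments, so a sketch of the square root operator for N = 64 handles those cases explicitly. It assumes that the platform's square root is correctly rounded, as IEEE 754 requires; the function name is illustrative and NaN results are collapsed to ``math.nan``.

```python
import math


def fsqrt(z):
    if math.isnan(z):
        return math.nan
    if z == 0:
        return z                 # keeps the sign of the zero
    if z < 0:
        return math.nan          # covers negative infinity and negative values
    return math.sqrt(z)          # also maps +inf to +inf
```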
- If z is a NaN, then return an element of \nans_N\{z\}.
- Else if z is an infinity, then return z.
- Else if z is a zero, then return z.
- Else if z is smaller than 0 but greater than -1, then return negative zero.
- Else return the smallest integral value that is not smaller than z.
\begin{array}{@{}lcll} \fceil_N(\pm \NAN(n)) &=& \nans_N\{\pm \NAN(n)\} \\ \fceil_N(\pm \infty) &=& \pm \infty \\ \fceil_N(\pm 0) &=& \pm 0 \\ \fceil_N(- q) &=& -0 & (\iff -1 < -q < 0) \\ \fceil_N(\pm q) &=& \ieee_N(i) & (\iff \pm q \leq i < \pm q + 1) \\ \end{array}
- If z is a NaN, then return an element of \nans_N\{z\}.
- Else if z is an infinity, then return z.
- Else if z is a zero, then return z.
- Else if z is greater than 0 but smaller than 1, then return positive zero.
- Else return the largest integral value that is not larger than z.
\begin{array}{@{}lcll} \ffloor_N(\pm \NAN(n)) &=& \nans_N\{\pm \NAN(n)\} \\ \ffloor_N(\pm \infty) &=& \pm \infty \\ \ffloor_N(\pm 0) &=& \pm 0 \\ \ffloor_N(+ q) &=& +0 & (\iff 0 < +q < 1) \\ \ffloor_N(\pm q) &=& \ieee_N(i) & (\iff \pm q - 1 < i \leq \pm q) \\ \end{array}
- If z is a NaN, then return an element of \nans_N\{z\}.
- Else if z is an infinity, then return z.
- Else if z is a zero, then return z.
- Else if z is greater than 0 but smaller than 1, then return positive zero.
- Else if z is smaller than 0 but greater than -1, then return negative zero.
- Else return the integral value with the same sign as z and the largest magnitude that is not larger than the magnitude of z.
\begin{array}{@{}lcll} \ftrunc_N(\pm \NAN(n)) &=& \nans_N\{\pm \NAN(n)\} \\ \ftrunc_N(\pm \infty) &=& \pm \infty \\ \ftrunc_N(\pm 0) &=& \pm 0 \\ \ftrunc_N(+ q) &=& +0 & (\iff 0 < +q < 1) \\ \ftrunc_N(- q) &=& -0 & (\iff -1 < -q < 0) \\ \ftrunc_N(\pm q) &=& \ieee_N(\pm i) & (\iff +q - 1 < i \leq +q) \\ \end{array}
- If z is a NaN, then return an element of \nans_N\{z\}.
- Else if z is an infinity, then return z.
- Else if z is a zero, then return z.
- Else if z is greater than 0 but smaller than or equal to 0.5, then return positive zero.
- Else if z is smaller than 0 but greater than or equal to -0.5, then return negative zero.
- Else return the integral value that is nearest to z; if two values are equally near, return the even one.
\begin{array}{@{}lcll} \fnearest_N(\pm \NAN(n)) &=& \nans_N\{\pm \NAN(n)\} \\ \fnearest_N(\pm \infty) &=& \pm \infty \\ \fnearest_N(\pm 0) &=& \pm 0 \\ \fnearest_N(+ q) &=& +0 & (\iff 0 < +q \leq 0.5) \\ \fnearest_N(- q) &=& -0 & (\iff -0.5 \leq -q < 0) \\ \fnearest_N(\pm q) &=& \ieee_N(\pm i) & (\iff |i - q| < 0.5) \\ \fnearest_N(\pm q) &=& \ieee_N(\pm i) & (\iff |i - q| = 0.5 \wedge i~\mbox{even}) \\ \end{array}
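The four rounding operators share one pattern: NaNs, infinities and zeroes pass through, and a result of zero keeps the sign of the input. The following sketch illustrates that pattern for binary64 values. It is a simplification: a NaN input is returned unchanged, which is only one of the results allowed by \nans_N\{z\}, and the helper ``_round_keep_sign`` is a name chosen here for illustration.

```python
import math


def _round_keep_sign(round_fn, z: float) -> float:
    """Apply an integer rounding function while preserving the sign of zero."""
    if math.isnan(z) or math.isinf(z) or z == 0.0:
        return z
    # round_fn yields a Python int, which cannot represent -0;
    # copysign restores the sign of the input on the float result.
    return math.copysign(float(round_fn(z)), z)


def fceil(z: float) -> float:
    return _round_keep_sign(math.ceil, z)


def ffloor(z: float) -> float:
    return _round_keep_sign(math.floor, z)


def ftrunc(z: float) -> float:
    return _round_keep_sign(math.trunc, z)


def fnearest(z: float) -> float:
    # Python's built-in round() rounds ties to the even integer.
    return _round_keep_sign(round, z)
```

For instance, ``fceil(-0.5)`` and ``fnearest(-0.5)`` both produce negative zero, while ``fnearest(2.5)`` produces ``2.0`` because ties go to the even neighbour.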
- If either z_1 or z_2 is a NaN, then return 0.
- Else if both z_1 and z_2 are zeroes, then return 1.
- Else if both z_1 and z_2 are the same value, then return 1.
- Else return 0.
\begin{array}{@{}lcll} \feq_N(\pm \NAN(n), z_2) &=& 0 \\ \feq_N(z_1, \pm \NAN(n)) &=& 0 \\ \feq_N(\pm 0, \mp 0) &=& 1 \\ \feq_N(z_1, z_2) &=& \bool(z_1 = z_2) \\ \end{array}
- If either z_1 or z_2 is a NaN, then return 1.
- Else if both z_1 and z_2 are zeroes, then return 0.
- Else if both z_1 and z_2 are the same value, then return 0.
- Else return 1.
\begin{array}{@{}lcll} \fne_N(\pm \NAN(n), z_2) &=& 1 \\ \fne_N(z_1, \pm \NAN(n)) &=& 1 \\ \fne_N(\pm 0, \mp 0) &=& 0 \\ \fne_N(z_1, z_2) &=& \bool(z_1 \neq z_2) \\ \end{array}
- If either z_1 or z_2 is a NaN, then return 0.
- Else if z_1 and z_2 are the same value, then return 0.
- Else if z_1 is positive infinity, then return 0.
- Else if z_1 is negative infinity, then return 1.
- Else if z_2 is positive infinity, then return 1.
- Else if z_2 is negative infinity, then return 0.
- Else if both z_1 and z_2 are zeroes, then return 0.
- Else if z_1 is smaller than z_2, then return 1.
- Else return 0.
\begin{array}{@{}lcll} \flt_N(\pm \NAN(n), z_2) &=& 0 \\ \flt_N(z_1, \pm \NAN(n)) &=& 0 \\ \flt_N(z, z) &=& 0 \\ \flt_N(+ \infty, z_2) &=& 0 \\ \flt_N(- \infty, z_2) &=& 1 \\ \flt_N(z_1, + \infty) &=& 1 \\ \flt_N(z_1, - \infty) &=& 0 \\ \flt_N(\pm 0, \mp 0) &=& 0 \\ \flt_N(z_1, z_2) &=& \bool(z_1 < z_2) \\ \end{array}
- If either z_1 or z_2 is a NaN, then return 0.
- Else if z_1 and z_2 are the same value, then return 0.
- Else if z_1 is positive infinity, then return 1.
- Else if z_1 is negative infinity, then return 0.
- Else if z_2 is positive infinity, then return 0.
- Else if z_2 is negative infinity, then return 1.
- Else if both z_1 and z_2 are zeroes, then return 0.
- Else if z_1 is larger than z_2, then return 1.
- Else return 0.
\begin{array}{@{}lcll} \fgt_N(\pm \NAN(n), z_2) &=& 0 \\ \fgt_N(z_1, \pm \NAN(n)) &=& 0 \\ \fgt_N(z, z) &=& 0 \\ \fgt_N(+ \infty, z_2) &=& 1 \\ \fgt_N(- \infty, z_2) &=& 0 \\ \fgt_N(z_1, + \infty) &=& 0 \\ \fgt_N(z_1, - \infty) &=& 1 \\ \fgt_N(\pm 0, \mp 0) &=& 0 \\ \fgt_N(z_1, z_2) &=& \bool(z_1 > z_2) \\ \end{array}
- If either z_1 or z_2 is a NaN, then return 0.
- Else if z_1 and z_2 are the same value, then return 1.
- Else if z_1 is positive infinity, then return 0.
- Else if z_1 is negative infinity, then return 1.
- Else if z_2 is positive infinity, then return 1.
- Else if z_2 is negative infinity, then return 0.
- Else if both z_1 and z_2 are zeroes, then return 1.
- Else if z_1 is smaller than or equal to z_2, then return 1.
- Else return 0.
\begin{array}{@{}lcll} \fle_N(\pm \NAN(n), z_2) &=& 0 \\ \fle_N(z_1, \pm \NAN(n)) &=& 0 \\ \fle_N(z, z) &=& 1 \\ \fle_N(+ \infty, z_2) &=& 0 \\ \fle_N(- \infty, z_2) &=& 1 \\ \fle_N(z_1, + \infty) &=& 1 \\ \fle_N(z_1, - \infty) &=& 0 \\ \fle_N(\pm 0, \mp 0) &=& 1 \\ \fle_N(z_1, z_2) &=& \bool(z_1 \leq z_2) \\ \end{array}
- If either z_1 or z_2 is a NaN, then return 0.
- Else if z_1 and z_2 are the same value, then return 1.
- Else if z_1 is positive infinity, then return 1.
- Else if z_1 is negative infinity, then return 0.
- Else if z_2 is positive infinity, then return 0.
- Else if z_2 is negative infinity, then return 1.
- Else if both z_1 and z_2 are zeroes, then return 1.
- Else if z_1 is larger than or equal to z_2, then return 1.
- Else return 0.
\begin{array}{@{}lcll} \fge_N(\pm \NAN(n), z_2) &=& 0 \\ \fge_N(z_1, \pm \NAN(n)) &=& 0 \\ \fge_N(z, z) &=& 1 \\ \fge_N(+ \infty, z_2) &=& 1 \\ \fge_N(- \infty, z_2) &=& 0 \\ \fge_N(z_1, + \infty) &=& 0 \\ \fge_N(z_1, - \infty) &=& 1 \\ \fge_N(\pm 0, \mp 0) &=& 1 \\ \fge_N(z_1, z_2) &=& \bool(z_1 \geq z_2) \\ \end{array}
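The six comparison operators can be sketched directly on top of a host language whose float comparisons follow IEEE 754, as Python's do. The function names mirror the operators above; returning a Python ``int`` for the result is a choice made for this sketch.

```python
import math


def feq(z1: float, z2: float) -> int:
    return int(z1 == z2)


def fne(z1: float, z2: float) -> int:
    return int(z1 != z2)


def flt(z1: float, z2: float) -> int:
    return int(z1 < z2)


def fgt(z1: float, z2: float) -> int:
    return int(z1 > z2)


def fle(z1: float, z2: float) -> int:
    return int(z1 <= z2)


def fge(z1: float, z2: float) -> int:
    return int(z1 >= z2)
```

Every comparison involving a NaN yields 0 except ``fne``, which yields 1, and zeroes of opposite sign compare as equal.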
- If z_2 is less than z_1 then return z_2.
- Else return z_1.
\begin{array}{@{}lcll} \fpmin_N(z_1, z_2) &=& z_2 & (\iff \flt_N(z_2, z_1) = 1) \\ \fpmin_N(z_1, z_2) &=& z_1 & (\otherwise) \end{array}
- If z_1 is less than z_2 then return z_2.
- Else return z_1.
\begin{array}{@{}lcll} \fpmax_N(z_1, z_2) &=& z_2 & (\iff \flt_N(z_1, z_2) = 1) \\ \fpmax_N(z_1, z_2) &=& z_1 & (\otherwise) \end{array}
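Since |fpmin| and |fpmax| are defined purely through a single less-than comparison, they are not symmetric in their arguments. A short sketch, with function names chosen to mirror the operators:

```python
import math


def fpmin(z1: float, z2: float) -> float:
    # Pick z2 only when it is strictly less than z1.
    return z2 if z2 < z1 else z1


def fpmax(z1: float, z2: float) -> float:
    # Pick z2 only when z1 is strictly less than z2.
    return z2 if z1 < z2 else z1
```

Whenever the comparison is false, which includes every case where either argument is a NaN and the case of zeroes with opposite signs, the first argument is returned unchanged.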
- Return i.
\begin{array}{lll@{\qquad}l} \extendu_{M,N}(i) &=& i \\ \end{array}
Note
In the abstract syntax, unsigned extension just reinterprets the same value.
- Let j be the :ref:`signed interpretation <aux-signed>` of i of size M.
- Return the two's complement of j relative to size N.
\begin{array}{lll@{\qquad}l} \extends_{M,N}(i) &=& \signed_N^{-1}(\signed_M(i)) \\ \end{array}
- Return i modulo 2^N.
\begin{array}{lll@{\qquad}l} \wrap_{M,N}(i) &=& i \mod 2^N \\ \end{array}
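The three integer width conversions above can be sketched on plain non-negative Python integers. The helpers ``signed`` and ``signed_inv`` stand in for \signed_M and its inverse; their names and the explicit width parameters are assumptions of this sketch.

```python
def signed(i: int, width: int) -> int:
    """Signed interpretation of an unsigned value of the given bit width."""
    return i - (1 << width) if i >= 1 << (width - 1) else i


def signed_inv(j: int, width: int) -> int:
    """Two's complement encoding of j; Python's % is non-negative here."""
    return j % (1 << width)


def extend_u(i: int, m: int, n: int) -> int:
    # Unsigned extension leaves the value as is.
    return i


def extend_s(i: int, m: int, n: int) -> int:
    # Reinterpret at width m, then re-encode at width n.
    return signed_inv(signed(i, m), n)


def wrap(i: int, m: int, n: int) -> int:
    # Keep only the low n bits.
    return i % (1 << n)
```

For example, ``extend_s(0xFF, 8, 32)`` gives ``0xFFFFFFFF``, whereas ``extend_u(0xFF, 8, 32)`` stays ``0xFF``.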
- If z is a NaN, then the result is undefined.
- Else if z is an infinity, then the result is undefined.
- Else if z is a number and \trunc(z) is a value within range of the target type, then return that value.
- Else the result is undefined.
\begin{array}{lll@{\qquad}l} \truncu_{M,N}(\pm \NAN(n)) &=& \{\} \\ \truncu_{M,N}(\pm \infty) &=& \{\} \\ \truncu_{M,N}(\pm q) &=& \trunc(\pm q) & (\iff -1 < \trunc(\pm q) < 2^N) \\ \truncu_{M,N}(\pm q) &=& \{\} & (\otherwise) \\ \end{array}
Note
This operator is :ref:`partial <exec-op-partial>`. It is not defined for NaNs, infinities, or values for which the result is out of range.
- If z is a NaN, then the result is undefined.
- Else if z is an infinity, then the result is undefined.
- Else if z is a number and \trunc(z) is a value within range of the target type, then return that value.
- Else the result is undefined.
\begin{array}{lll@{\qquad}l} \truncs_{M,N}(\pm \NAN(n)) &=& \{\} \\ \truncs_{M,N}(\pm \infty) &=& \{\} \\ \truncs_{M,N}(\pm q) &=& \trunc(\pm q) & (\iff -2^{N-1} - 1 < \trunc(\pm q) < 2^{N-1}) \\ \truncs_{M,N}(\pm q) &=& \{\} & (\otherwise) \\ \end{array}
Note
This operator is :ref:`partial <exec-op-partial>`. It is not defined for NaNs, infinities, or values for which the result is out of range.
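The partiality of the two truncation operators can be sketched by returning ``None`` for the empty result set. In this sketch the target width ``n`` is an explicit parameter, the source width is implied by the Python float, and the signed variant returns the mathematical (possibly negative) integer rather than its two's complement encoding; all of these are simplifying assumptions.

```python
import math
from typing import Optional


def trunc_u(z: float, n: int) -> Optional[int]:
    """Unsigned truncation; None models an undefined result."""
    if math.isnan(z) or math.isinf(z):
        return None
    t = math.trunc(z)  # round toward zero
    return t if -1 < t < (1 << n) else None


def trunc_s(z: float, n: int) -> Optional[int]:
    """Signed truncation; None models an undefined result."""
    if math.isnan(z) or math.isinf(z):
        return None
    t = math.trunc(z)
    return t if -(1 << (n - 1)) - 1 < t < (1 << (n - 1)) else None
```

Note that ``trunc_u(-0.9, 32)`` is defined and yields 0, because the range check applies to the truncated value rather than to the input.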
- If z is a NaN, then return 0.
- Else if z is negative infinity, then return 0.
- Else if z is positive infinity, then return 2^N - 1.
- Else, return \satu_N(\trunc(z)).
\begin{array}{lll@{\qquad}l} \truncsatu_{M,N}(\pm \NAN(n)) &=& 0 \\ \truncsatu_{M,N}(- \infty) &=& 0 \\ \truncsatu_{M,N}(+ \infty) &=& 2^N - 1 \\ \truncsatu_{M,N}(z) &=& \satu_N(\trunc(z)) \\ \end{array}
- If z is a NaN, then return 0.
- Else if z is negative infinity, then return -2^{N-1}.
- Else if z is positive infinity, then return 2^{N-1} - 1.
- Else, return \sats_N(\trunc(z)).
\begin{array}{lll@{\qquad}l} \truncsats_{M,N}(\pm \NAN(n)) &=& 0 \\ \truncsats_{M,N}(- \infty) &=& -2^{N-1} \\ \truncsats_{M,N}(+ \infty) &=& 2^{N-1}-1 \\ \truncsats_{M,N}(z) &=& \sats_N(\trunc(z)) \\ \end{array}
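In contrast to the partial operators, saturating truncation is total. A sketch under the same assumptions as before (explicit target width ``n``, mathematical integers as results); ``sat_u`` and ``sat_s`` are illustrative stand-ins for \satu_N and \sats_N:

```python
import math


def sat_u(i: int, n: int) -> int:
    """Clamp i to the unsigned range [0, 2^n - 1]."""
    return max(0, min(i, (1 << n) - 1))


def sat_s(i: int, n: int) -> int:
    """Clamp i to the signed range [-2^(n-1), 2^(n-1) - 1]."""
    return max(-(1 << (n - 1)), min(i, (1 << (n - 1)) - 1))


def trunc_sat_u(z: float, n: int) -> int:
    if math.isnan(z):
        return 0
    if math.isinf(z):
        return 0 if z < 0 else (1 << n) - 1
    return sat_u(math.trunc(z), n)


def trunc_sat_s(z: float, n: int) -> int:
    if math.isnan(z):
        return 0
    if math.isinf(z):
        return -(1 << (n - 1)) if z < 0 else (1 << (n - 1)) - 1
    return sat_s(math.trunc(z), n)
```

Every input maps to exactly one result: NaNs become 0 and out-of-range values are clamped to the nearest bound.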
- If z is a :ref:`canonical NaN <canonical-nan>`, then return an element of \nans_N\{\} (i.e., a canonical NaN of size N).
- Else if z is a NaN, then return an element of \nans_N\{\pm \NAN(1)\} (i.e., any :ref:`arithmetic NaN <arithmetic-nan>` of size N).
- Else, return z.
\begin{array}{lll@{\qquad}l} \promote_{M,N}(\pm \NAN(n)) &=& \nans_N\{\} & (\iff n = \canon_N) \\ \promote_{M,N}(\pm \NAN(n)) &=& \nans_N\{+ \NAN(1)\} & (\otherwise) \\ \promote_{M,N}(z) &=& z \\ \end{array}
- If z is a :ref:`canonical NaN <canonical-nan>`, then return an element of \nans_N\{\} (i.e., a canonical NaN of size N).
- Else if z is a NaN, then return an element of \nans_N\{\pm \NAN(1)\} (i.e., any :ref:`arithmetic NaN <arithmetic-nan>` of size N).
- Else if z is an infinity, then return that infinity.
- Else if z is a zero, then return that zero.
- Else, return \ieee_N(z).
\begin{array}{lll@{\qquad}l} \demote_{M,N}(\pm \NAN(n)) &=& \nans_N\{\} & (\iff n = \canon_N) \\ \demote_{M,N}(\pm \NAN(n)) &=& \nans_N\{+ \NAN(1)\} & (\otherwise) \\ \demote_{M,N}(\pm \infty) &=& \pm \infty \\ \demote_{M,N}(\pm 0) &=& \pm 0 \\ \demote_{M,N}(\pm q) &=& \ieee_N(\pm q) \\ \end{array}
- Return \ieee_N(i).
\begin{array}{lll@{\qquad}l} \convertu_{M,N}(i) &=& \ieee_N(i) \\ \end{array}
- Let j be the :ref:`signed interpretation <aux-signed>` of i.
- Return \ieee_N(j).
\begin{array}{lll@{\qquad}l} \converts_{M,N}(i) &=& \ieee_N(\signed_M(i)) \\ \end{array}
- Let d^\ast be the bit sequence \bits_{t_1}(c).
- Return the constant c' for which \bits_{t_2}(c') = d^\ast.
\begin{array}{lll@{\qquad}l} \reinterpret_{t_1,t_2}(c) &=& \bits_{t_2}^{-1}(\bits_{t_1}(c)) \\ \end{array}
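Reinterpretation changes the type but not the bits. A sketch for the 32-bit case using Python's ``struct`` module; the two function names are chosen for illustration, and integers are represented as unsigned values:

```python
import struct


def reinterpret_f32_to_i32(z: float) -> int:
    """Bit pattern of z, encoded as binary32, read as an unsigned 32-bit integer."""
    return struct.unpack("<I", struct.pack("<f", z))[0]


def reinterpret_i32_to_f32(i: int) -> float:
    """Float whose binary32 bit pattern is the unsigned 32-bit integer i."""
    return struct.unpack("<f", struct.pack("<I", i))[0]
```

For example, ``reinterpret_f32_to_i32(1.0)`` gives ``0x3F800000``, and applying ``reinterpret_i32_to_f32`` to that value gives ``1.0`` again.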
- Let j be the :ref:`signed interpretation <aux-signed>` of i of size M.
- Return \sats_N(j).
\begin{array}{lll@{\qquad}l} \narrows_{M,N}(i) &=& \sats_N(\signed_M(i)) \end{array}
- Let j be the :ref:`signed interpretation <aux-signed>` of i of size M.
- Return \satu_N(j).
\begin{array}{lll@{\qquad}l} \narrowu_{M,N}(i) &=& \satu_N(\signed_M(i)) \end{array}
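Both narrowing operators first take the signed interpretation of the input and differ only in the range they clamp to. A sketch with explicit widths ``m`` and ``n``; the helpers are illustrative, and the signed variant returns the clamped mathematical integer rather than an N-bit encoding:

```python
def signed(i: int, width: int) -> int:
    """Signed interpretation of an unsigned value of the given bit width."""
    return i - (1 << width) if i >= 1 << (width - 1) else i


def narrow_s(i: int, m: int, n: int) -> int:
    j = signed(i, m)
    # Clamp to the signed n-bit range.
    return max(-(1 << (n - 1)), min(j, (1 << (n - 1)) - 1))


def narrow_u(i: int, m: int, n: int) -> int:
    j = signed(i, m)
    # Clamp to the unsigned n-bit range; negative inputs become 0.
    return max(0, min(j, (1 << n) - 1))
```

Because the input is read as signed in both cases, ``narrow_u(0xFFFF, 16, 8)`` is 0 rather than 255: the input denotes -1, which clamps to the lower bound.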
The results of relaxed operators are host-dependent, because the set of possible results may depend on properties of the host environment (such as hardware). Technically, each such operator produces a fixed-size list of sets of allowed values. For each execution of the operator in the same environment, only values from the set at the same position in the list are returned, i.e., each environment globally chooses a fixed projection for each operator.
Note
Each operator can be thought of as a family of operations that is fixed to one particular choice by the execution environment. The fixed operation itself can still be non-deterministic or partial.
The function \fma is the same as fusedMultiplyAdd defined by |IEEE754|_ (Section 5.4.1). It computes (z_1 \cdot z_2) + z_3 as if with unbounded range and precision, rounding only once for the final result.
- If either z_1 or z_2 or z_3 is a NaN, return an element of \nans_N\{z_1, z_2, z_3\}.
- Else if either z_1 or z_2 is a zero and the other is an infinity, then return an element of \nans_N\{\}.
- Else if both z_1 and z_2 are infinities of equal sign, and z_3 is a negative infinity, then return an element of \nans_N\{\}.
- Else if both z_1 and z_2 are infinities of opposite sign, and z_3 is a positive infinity, then return an element of \nans_N\{\}.
- Else if either z_1 or z_2 is an infinity and the other is a value of the same sign, and z_3 is a negative infinity, then return an element of \nans_N\{\}.
- Else if either z_1 or z_2 is an infinity and the other is a value of the opposite sign, and z_3 is a positive infinity, then return an element of \nans_N\{\}.
- Else if both z_1 and z_2 are zeroes of the same sign and z_3 is a zero, then return positive zero.
- Else if both z_1 and z_2 are zeroes of the opposite sign and z_3 is a positive zero, then return positive zero.
- Else if both z_1 and z_2 are zeroes of the opposite sign and z_3 is a negative zero, then return negative zero.
- Else return the result of multiplying z_1 and z_2, adding z_3 to the intermediate, and :ref:`rounding <aux-ieee>` the final result to the nearest representable value.
\begin{array}{@{}llcll} & \fma_N(\pm \NAN(n), z_2, z_3) &=& \nans_N\{\pm \NAN(n), z_2, z_3\} \\ & \fma_N(z_1, \pm \NAN(n), z_3) &=& \nans_N\{\pm \NAN(n), z_1, z_3\} \\ & \fma_N(z_1, z_2, \pm \NAN(n)) &=& \nans_N\{\pm \NAN(n), z_1, z_2\} \\ & \fma_N(\pm \infty, \pm 0, z_3) &=& \nans_N\{\} \\ & \fma_N(\pm \infty, \mp 0, z_3) &=& \nans_N\{\} \\ & \fma_N(\pm \infty, \pm \infty, - \infty) &=& \nans_N\{\} \\ & \fma_N(\pm \infty, \mp \infty, + \infty) &=& \nans_N\{\} \\ & \fma_N(\pm q_1, \pm \infty, - \infty) &=& \nans_N\{\} \\ & \fma_N(\pm q_1, \mp \infty, + \infty) &=& \nans_N\{\} \\ & \fma_N(\pm \infty, \pm q_1, - \infty) &=& \nans_N\{\} \\ & \fma_N(\mp \infty, \pm q_1, + \infty) &=& \nans_N\{\} \\ & \fma_N(\pm 0, \pm 0, \mp 0) &=& + 0 \\ & \fma_N(\pm 0, \pm 0, \pm 0) &=& + 0 \\ & \fma_N(\pm 0, \mp 0, + 0) &=& + 0 \\ & \fma_N(\pm 0, \mp 0, - 0) &=& - 0 \\ & \fma_N(z_1, z_2, z_3) &=& \ieee_N(z_1 \cdot z_2 + z_3) \\ \end{array}
Relaxed multiply-add allows for fused or unfused results.
- \EXPROFDET Return either \fadd_N(\fmul_N(z_1, z_2), z_3) or \fma_N(z_1, z_2, z_3)
- Return \fma_N(z_1, z_2, z_3)
\begin{array}{@{}llcll} \EXPROFDET & \relaxedmadd_N(z_1, z_2, z_3) &=& [ \fadd_N(\fmul_N(z_1, z_2), z_3), \fma_N(z_1, z_2, z_3) ] \\ & \relaxedmadd_N(z_1, z_2, z_3) &=& \fma_N(z_1, z_2, z_3) \\ \end{array}
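The two results permitted for relaxed multiply-add can differ observably. The sketch below emulates the fused result for finite binary64 inputs with exact rational arithmetic and a single final rounding. It relies on the assumption that converting a ``Fraction`` to ``float`` rounds correctly, and it does not model NaNs, infinities or the sign of a zero result; ``fma_exact`` and ``madd_unfused`` are names chosen for this illustration.

```python
from fractions import Fraction


def fma_exact(z1: float, z2: float, z3: float) -> float:
    """Fused result: exact product and sum, rounded once (finite inputs only)."""
    return float(Fraction(z1) * Fraction(z2) + Fraction(z3))


def madd_unfused(z1: float, z2: float, z3: float) -> float:
    """Unfused result: the product is rounded before the addition."""
    return z1 * z2 + z3


# The exact product is 1 + 2^-29 + 2^-60; the last term is lost
# when the product is rounded to binary64 on its own.
z1 = z2 = 1.0 + 2.0 ** -30
z3 = -(1.0 + 2.0 ** -29)

unfused = madd_unfused(z1, z2, z3)
fused = fma_exact(z1, z2, z3)
```

With these inputs the unfused computation yields ``0.0`` while the fused one yields ``2.0 ** -60``, so a host is free to produce either value, but must do so consistently.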
Relaxed negative multiply-add allows for fused or unfused results.
- Return \relaxedmadd_N(-z_1, z_2, z_3).
\begin{array}{@{}llcll} & \relaxednmadd_N(z_1, z_2, z_3) &=& \relaxedmadd_N(-z_1, z_2, z_3) \\ \end{array}
- Let k be the :ref:`signed interpretation <aux-signed>` of j.
- If j is less than 16, return i[j].
- If k is less than 0, return 0.
- \EXPROFDET Otherwise, return either 0 or i[j \mod n].
- Otherwise, return 0.
\begin{array}{@{}llcll} & \relaxedswizzlelane(i^n, j) &=& i[j] & (\iff j < 16) \\ & \relaxedswizzlelane(i^n, j) &=& 0 & (\iff \signed_8(j) < 0) \\ \EXPROFDET & \relaxedswizzlelane(i^n, j) &=& [ 0, i[j \mod n] ] & (\otherwise) \\ & \relaxedswizzlelane(i^n, j) &=& 0 & (\otherwise) \\ \end{array}
Relaxed swizzle lane is deterministic if the signed interpretation of the index is less than 16 (including negative values). The index j is an 8-bit integer.
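The case analysis for a single lane can be sketched as a function that returns the list of results a host may choose from. The name ``relaxed_swizzle_lane_choices`` and the list-based return value are assumptions of this sketch; the index ``j`` is taken as an unsigned 8-bit value.

```python
from typing import List


def signed8(j: int) -> int:
    """Signed interpretation of an unsigned 8-bit value."""
    return j - 256 if j >= 128 else j


def relaxed_swizzle_lane_choices(lanes: List[int], j: int) -> List[int]:
    """Results a host may pick from when selecting lane j of lanes."""
    n = len(lanes)
    if j < 16:
        return [lanes[j]]        # in range: deterministic
    if signed8(j) < 0:
        return [0]               # top bit set: deterministic zero
    return [0, lanes[j % n]]     # 16..127: host-dependent
```

Only indices from 16 to 127 produce more than one choice, and a given host always picks the same position in that list.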
- Return \X{rsl}_0 \dots \X{rsl}_{n-1} where \X{rsl}_i = \relaxedswizzlelane(a^n, s^n[i])
\begin{array}{@{}llcll} & \relaxedswizzle(a^n, s^n) &=& \X{rsl}_0 \dots \X{rsl}_{n-1} \\ & \qquad \where \X{rsl}_i &=& \relaxedswizzlelane(a^n, s^n[i]) \end{array}
Relaxed unsigned truncation converts floating-point numbers to integers. The result for NaNs and out-of-range values is host-dependent.
- \EXPROFDET If z is a NaN, return either 0 or 2^N-1 or 2^N-2 or 2^{N-1}.
- \EXPROFDET Else if \trunc(z) is larger than -1 and less than 2^N, return \truncu_{M,N}(z).
- \EXPROFDET Else return either \truncsatu_{M,N}(z) or 2^N-1 or 2^N-2 or 2^{N-1}.
- Return \truncsatu_{M,N}(z).
\begin{array}{@{}llcll} \EXPROFDET & \relaxedtrunc^u_{M,N}(\pm \NAN(n)) &=& [ 0, 2^{N}-1, 2^{N}-2, 2^{N-1}] \\ \EXPROFDET & \relaxedtrunc^u_{M,N}(\pm q) &=& \truncu_{M,N}(\pm q) & (\iff -1 < \trunc(\pm q) < 2^N) \\ \EXPROFDET & \relaxedtrunc^u_{M,N}(\pm p) &=& [ \truncsatu_{M,N}(\pm p), 2^{N}-1, 2^{N}-2, 2^{N-1}] & (\otherwise) \\ & \relaxedtrunc^u_{M,N}(z) &=& \truncsatu_{M,N}(z) & \\ \end{array}
Relaxed signed truncation converts floating-point numbers to integers. The result for NaNs and out-of-range values is host-dependent.
- \EXPROFDET If z is a NaN, return either 0 or -2^{N-1}.
- \EXPROFDET Else if \trunc(z) is larger than -2^{N-1}-1 and less than 2^{N-1}, return \truncs_{M,N}(z).
- \EXPROFDET Else return either \truncsats_{M,N}(z) or -2^{N-1}.
- Return \truncsats_{M,N}(z).
\begin{array}{@{}llcll} \EXPROFDET & \relaxedtrunc^s_{M,N}(\pm \NAN(n)) &=& [ 0, -2^{N-1} ] \\ \EXPROFDET & \relaxedtrunc^s_{M,N}(\pm q) &=& \truncs_{M,N}(\pm q) & (\iff -2^{N-1} - 1 < \trunc(\pm q) < 2^{N-1}) \\ \EXPROFDET & \relaxedtrunc^s_{M,N}(\pm p) &=& [ \truncsats_{M,N}(\pm p), -2^{N-1} ] & (\otherwise) \\ & \relaxedtrunc^s_{M,N}(z) &=& \truncsats_{M,N}(z) & \\ \end{array}
- \EXPROFDET If i_3 is 2^N - 1, return i_1.
- \EXPROFDET Else if i_3 is 0, return i_2.
- \EXPROFDET Otherwise return either \ibitselect_N(i_1, i_2, i_3) or i_1 or i_2 or \F{top\_bit\_byteselect}_N(i_1, i_2, i_3).
- Return \ibitselect_N(i_1, i_2, i_3).
\begin{array}{@{}llcll} \EXPROFDET & \relaxedlane_N(i_1, i_2, 2^N-1) &=& i_1 \\ \EXPROFDET & \relaxedlane_N(i_1, i_2, 0) &=& i_2 \\ \EXPROFDET & \relaxedlane_N(i_1, i_2, i_3) &=& [ \ibitselect_N(i_1, i_2, i_3), i_1, i_2, \\ & & & \qquad \F{top\_bit\_byteselect}_N(i_1, i_2, i_3)] & (\otherwise) \\ & \relaxedlane_N(i_1, i_2, i_3) &=& \ibitselect_N(i_1, i_2, i_3) & (\otherwise) \\ \end{array}
where:
\begin{array}{@{}llcll} & \F{top\_bit\_byteselect}_N(i_1, i_2, i_3) &=& tbb_0 ... tbb_{N/8 - 1} \\ & \F{tbb_j} &=& \F{byteselect}(\bytes_8(i_1)[j], \bytes_8(i_2)[j], \bytes_8(i_3)[j]) \\ & \F{byteselect}(a, b, 0~c^7) &=& a \\ & \F{byteselect}(a, b, c) &=& b \\ \end{array}
Relaxed lane selection is deterministic when all bits are set or unset in the mask. Otherwise, depending on the host, either only the top bit is examined, or all bits are examined (i.e., it becomes a bit select), or the top bit of each byte in the lane is examined.
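Two of these host behaviours can be sketched for an N-bit lane as follows. The sketch assumes that bit select takes bits from the first operand where the mask bit is set and from the second operand elsewhere, and it leaves out the per-byte variant; the function names are illustrative.

```python
def bitselect(i1: int, i2: int, mask: int, n: int) -> int:
    """Examine all mask bits: i1 where the mask is 1, i2 where it is 0."""
    full = (1 << n) - 1
    return (i1 & mask) | (i2 & ~mask & full)


def top_bit_select(i1: int, i2: int, mask: int, n: int) -> int:
    """Examine only the top mask bit and select a whole lane."""
    return i1 if mask >> (n - 1) & 1 else i2
```

When the mask is all ones or all zeroes, both functions agree, which is why those cases are deterministic. For a mixed mask such as ``0xF0`` on an 8-bit lane they may disagree.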
- Return rll_0 \dots rll_{n-1} where rll_i = \relaxedlane_B(a^n[i], b^n[i], c^n[i]).
\begin{array}{@{}llcll} & \relaxedlaneselect_B(a^n, b^n, c^n) &=& rll_0 \dots rll_{n-1} \\ & \qquad \where rll_i &=& \relaxedlane_B(a^n[i], b^n[i], c^n[i]) \\ \end{array}
Relaxed minimum differs from regular minimum when inputs are NaNs or zeroes with different signs. It allows for implementations to return the first or second input when either input is a NaN.
- \EXPROFDET If z_1 is a NaN, return either an element of \nans_N\{z_1, z_2\}, \NAN(n), or z_2.
- \EXPROFDET If z_2 is a NaN, return either an element of \nans_N\{z_1, z_2\}, \NAN(n), or z_1.
- \EXPROFDET If both z_1 and z_2 are zeroes of opposite sign, return either + 0 or - 0.
- Return \fmin_N(z_1, z_2).
\begin{array}{@{}llcll} \EXPROFDET & \relaxedmin_N(\pm \NAN(n), z_2) &=& [ \nans_N\{\pm \NAN(n), z_2\}, \NAN(n), z_2, z_2 ] \\ \EXPROFDET & \relaxedmin_N(z_1, \pm \NAN(n)) &=& [ \nans_N\{\pm \NAN(n), z_1\}, z_1, \NAN(n), z_1 ] \\ \EXPROFDET & \relaxedmin_N(\pm 0, \mp 0) &=& [ -0, \pm 0, \mp 0, -0 ] \\ & \relaxedmin_N(z_1, z_2) &=& \fmin_N(z_1, z_2) & (\otherwise) \\ \end{array}
Relaxed maximum differs from regular maximum when inputs are NaNs or zeroes with different signs. It allows for implementations to return the first or second input when either input is a NaN.
- \EXPROFDET If z_1 is a NaN, return either an element of \nans_N\{z_1, z_2\}, \NAN(n), or z_2.
- \EXPROFDET If z_2 is a NaN, return either an element of \nans_N\{z_1, z_2\}, \NAN(n), or z_1.
- \EXPROFDET If both z_1 and z_2 are zeroes of opposite sign, return either + 0 or - 0.
- Return \fmax_N(z_1, z_2).
\begin{array}{@{}llcll} \EXPROFDET & \relaxedmax_N(\pm \NAN(n), z_2) &=& [ \nans_N\{\pm \NAN(n), z_2\}, \NAN(n), z_2, z_2 ] \\ \EXPROFDET & \relaxedmax_N(z_1, \pm \NAN(n)) &=& [ \nans_N\{\pm \NAN(n), z_1\}, z_1, \NAN(n), z_1 ] \\ \EXPROFDET & \relaxedmax_N(\pm 0, \mp 0) &=& [ +0, \pm 0, \mp 0, +0 ] \\ & \relaxedmax_N(z_1, z_2) &=& \fmax_N(z_1, z_2) & (\otherwise) \\ \end{array}
Relaxed Q15 multiply differs from regular Q15 multiply when the multiplication result overflows (i.e., when both inputs are -32768). It allows for implementations to either wrap around or saturate.
- \EXPROFDET If both i_1 and i_2 are -2^{N-1}, return either 2^{N-1} - 1 or -2^{N-1}.
- Return \iq15mulrsats(i_1, i_2).
\begin{array}{@{}llcll} \EXPROFDET & \relaxedq15mulrs_N(-2^{N-1}, -2^{N-1}) &=& [ 2^{N-1}-1, -2^{N-1}] & \\ & \relaxedq15mulrs_N(i_1, i_2) &=& \iq15mulrsats(i_1, i_2) \end{array}
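The single overflowing case can be made concrete for N = 16. The sketch assumes that the regular Q15 multiply computes the product, adds 2^14 for rounding, shifts right by 15, and then saturates to the signed range; inputs and results are mathematical signed integers, and all function names are illustrative.

```python
def sat_s16(i: int) -> int:
    """Clamp i to the signed 16-bit range."""
    return max(-32768, min(i, 32767))


def wrap_s16(i: int) -> int:
    """Wrap i into the signed 16-bit range (two's complement)."""
    return (i + 32768) % 65536 - 32768


def q15mulr_raw(i1: int, i2: int) -> int:
    """Rounded Q15 product before any range handling."""
    return (i1 * i2 + (1 << 14)) >> 15


def q15mulr_sat_s(i1: int, i2: int) -> int:
    return sat_s16(q15mulr_raw(i1, i2))


def relaxed_q15mulr_choices(i1: int, i2: int) -> list:
    """Results a host may pick from: saturating first, wrapping second."""
    s, w = q15mulr_sat_s(i1, i2), wrap_s16(q15mulr_raw(i1, i2))
    return [s] if s == w else [s, w]
```

The raw result for two inputs of -32768 is 32768, which is one past the largest signed 16-bit value; saturating gives 32767 and wrapping gives -32768. For every other pair of inputs the two strategies coincide.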
Relaxed integer dot product differs from regular integer dot product when the elements of the input have their most significant bit set.
- \EXPROFDET Return either \imul_N(\signed_M(i_1), i_2) or \imul_N(\signed_M(i_1), \signed_M(i_2)).
- Return \imul_N(\extends_{M,N}(i_1), \extends_{M,N}(i_2)).
\begin{array}{@{}llcll} \EXPROFDET & \relaxeddotmul_{M,N}(i_1, i_2) &=& [ \imul_N(\signed_M(i_1), i_2), \imul_N(\signed_M(i_1), \signed_M(i_2)) ] \\ & \relaxeddotmul_{M,N}(i_1, i_2) &=& \imul_N(\extends_{M,N}(i_1), \extends_{M,N}(i_2)) \\ \end{array}