relaxed fused multiply-add and fused multiply-subtract #27
Implement the fused multiply-add and fused multiply-subtract relaxed SIMD operations. See WebAssembly/relaxed-simd#27 for the proposed spec of these operations. There's no wat support for this yet - it will come in separately - so the test cases are a little rudimentary for now. More tests will appear later. Differential Revision: https://phabricator.services.mozilla.com/D121870
Power ISA has vmaddfp and vnmsubfp.
Example of an algorithm requiring fused multiply-add: https://hal.inria.fr/inria-00000895/document (look for Fast2Mult).
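For context, Fast2Mult splits an exact product into a rounded product and its rounding error, and it only works when the multiply-add is genuinely fused. A minimal C sketch (my own illustration, not taken from the paper):

```c
#include <math.h>

/* Fast2Mult sketch: computes p and e such that a * b == p + e exactly
 * (barring overflow/underflow), where p = round(a * b). The residual e
 * is only recoverable with a genuinely fused multiply-add; an unfused
 * 2-rounding mul+add would round it away and always yield e == 0. */
void fast2mult(double a, double b, double *p, double *e) {
    *p = a * b;             /* rounded product */
    *e = fma(a, b, -(*p));  /* exact residual a*b - p, needs fused FMA */
}
```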
@yurydelendik brought up a good point at #77 (comment). This ordering does change the implementation, e.g. on AArch64, …
I agree with this point; however, we'd need to rename …
I was under the impression that we have …
I'm not suggesting removing the quasi-fused …
Adding @dtig, FYI, for comments.
Keeping …
fnma seems to be different on different software: …
All the instructions take 3 operands, `a`, `b`, `c`, and perform `(a * b) + c` or `-(a * b) + c`:

- relaxed f32x4.fma(a, b, c) = (a * b) + c
- relaxed f32x4.fms(a, b, c) = -(a * b) + c
- relaxed f64x2.fma(a, b, c) = (a * b) + c
- relaxed f64x2.fms(a, b, c) = -(a * b) + c

where either:

- `a * b` is rounded first, and the final result is rounded again (for a total of 2 roundings), or
- the entire expression is evaluated at higher precision and rounded only once (fused, 1 rounding).

Below are lowerings for the most relevant ISAs, x86-64 and ARM64 among them, along with a reference implementation in terms of 128-bit Wasm SIMD.
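To make the relaxation concrete, here is a per-lane sketch in C of the two results an implementation may produce. The helper names are illustrative, not part of the proposal, and using `FP_FAST_FMAF` as the switch is my own stand-in for "hardware FMA is available":

```c
#include <math.h>

/* One lane of relaxed f32x4.fma/fms: either branch below is an
 * allowed result under the relaxed semantics. */
float relaxed_fma_lane(float a, float b, float c) {
#ifdef FP_FAST_FMAF
    return fmaf(a, b, c);   /* fused: 1 rounding */
#else
    return (a * b) + c;     /* unfused: 2 roundings */
#endif
}

float relaxed_fms_lane(float a, float b, float c) {
#ifdef FP_FAST_FMAF
    return fmaf(-a, b, c);  /* -(a * b) + c, 1 rounding */
#else
    return -(a * b) + c;    /* 2 roundings */
#endif
}
```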
x86/x86-64 with FMA3

| relaxed instruction | x86-64 instruction |
| --- | --- |
| relaxed f32x4.fma | VFMADD213PS |
| relaxed f32x4.fms | VFNMADD213PS |
| relaxed f64x2.fma | VFMADD213PD |
| relaxed f64x2.fms | VFNMADD213PD |
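As a sketch of how the f32x4 rows could be expressed with FMA3 intrinsics (assumes a compiler targeting FMA-capable x86-64, e.g. with -mfma; function names are mine):

```c
#include <immintrin.h>

/* Illustrative lowering of the f32x4 rows above. */
__m128 relaxed_f32x4_fma(__m128 a, __m128 b, __m128 c) {
    return _mm_fmadd_ps(a, b, c);   /* VFMADD213PS:  (a * b) + c  */
}

__m128 relaxed_f32x4_fms(__m128 a, __m128 b, __m128 c) {
    return _mm_fnmadd_ps(a, b, c);  /* VFNMADD213PS: -(a * b) + c */
}
```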
ARM64

| relaxed instruction | ARM64 instruction |
| --- | --- |
| relaxed f32x4.fma | FMLA |
| relaxed f32x4.fms | FMLS |
| relaxed f64x2.fma | FMLA |
| relaxed f64x2.fms | FMLS |
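The analogous AArch64 lowering via NEON intrinsics (again a sketch with illustrative names). Note that FMLA/FMLS accumulate into their destination register, which is the operand-ordering point raised in the comments above:

```c
#include <arm_neon.h>

/* FMLA/FMLS accumulate into the first operand. */
float32x4_t relaxed_f32x4_fma(float32x4_t a, float32x4_t b, float32x4_t c) {
    return vfmaq_f32(c, a, b);  /* FMLA: c + (a * b) */
}

float32x4_t relaxed_f32x4_fms(float32x4_t a, float32x4_t b, float32x4_t c) {
    return vfmsq_f32(c, a, b);  /* FMLS: c - (a * b) */
}
```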
ARMv7 with FMA (Neon v2)

| relaxed instruction | ARMv7 instruction |
| --- | --- |
| relaxed f32x4.fma | VFMA |
| relaxed f32x4.fms | VFMS |
| relaxed f64x2.fma | VFMA |
| relaxed f64x2.fms | VFMS |
ARMv7 without FMA (2 roundings)

| relaxed instruction | ARMv7 instruction |
| --- | --- |
| relaxed f32x4.fma | VMLA |
| relaxed f32x4.fms | VMLS |
| relaxed f64x2.fma | VMLA |
| relaxed f64x2.fms | VMLS |
Note: Armv8-M will require MVE-F (floating point extension)
RISC-V V

| relaxed instruction | RISC-V V instruction |
| --- | --- |
| relaxed f32x4.fma | vfmacc.vv |
| relaxed f32x4.fms | vfnmsac.vv |
| relaxed f64x2.fma | vfmadd.vv |
| relaxed f64x2.fms | vfnmsac.vv |
simd128

| relaxed instruction | simd128 lowering |
| --- | --- |
| relaxed f32x4.fma | f32x4.add(f32x4.mul(a, b), c) |
| relaxed f32x4.fms | f32x4.sub(c, f32x4.mul(a, b)) |
| relaxed f64x2.fma | f64x2.add(f64x2.mul(a, b), c) |
| relaxed f64x2.fms | f64x2.sub(c, f64x2.mul(a, b)) |
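The simd128 rows correspond to the following unfused (2-rounding) reference lowering, sketched with the standard wasm_simd128.h intrinsics; function names are mine:

```c
#include <wasm_simd128.h>

/* Reference lowering in plain 128-bit Wasm SIMD: two roundings per lane. */
v128_t fallback_f32x4_fma(v128_t a, v128_t b, v128_t c) {
    return wasm_f32x4_add(wasm_f32x4_mul(a, b), c);      /*  (a * b) + c */
}

v128_t fallback_f32x4_fms(v128_t a, v128_t b, v128_t c) {
    return wasm_f32x4_sub(c, wasm_f32x4_mul(a, b));      /* -(a * b) + c */
}
```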
The difference depends on whether the hardware supports FMA, and the dividing line is between newer and older hardware: newer parts (Intel Haswell from 2013 onwards, AMD Zen from 2017, ARM Cortex-A5 since 2011) tend to come with hardware FMA support, so we will probably see less and less hardware without FMA.
Many, especially machine learning (neural nets). Fused multiply-add improves accuracy in numerical algorithms, improves floating-point throughput, and reduces register pressure in some cases. An early prototype and evaluation also showed significant speedups on multiple neural-network models.