relaxed fused multiply-add and fused multiply-subtract #27

Open
ngzhian opened this issue Jun 16, 2021 · 9 comments
Labels
in-overview Instruction has been added to Overview.md instruction-proposal

Comments

@ngzhian
Member

ngzhian commented Jun 16, 2021

Note: this instruction proposal is migrated from WebAssembly/simd#79

  1. What are the instructions being proposed?
  • relaxed f32x4.fma
  • relaxed f32x4.fms
  • relaxed f64x2.fma
  • relaxed f64x2.fms
  2. What are the semantics of these instructions?

All the instructions take 3 operands, a, b, c, and compute (a * b) + c or -(a * b) + c:

  • relaxed f32x4.fma(a, b, c) = (a * b) + c
  • relaxed f32x4.fms(a, b, c) = -(a * b) + c
  • relaxed f64x2.fma(a, b, c) = (a * b) + c
  • relaxed f64x2.fms(a, b, c) = -(a * b) + c

where:

  • the intermediate a * b is rounded first, and the final result is rounded again (for a total of 2 roundings), or
  • the entire expression is evaluated with higher precision and then rounded only once.
  3. How will these instructions be implemented? Give examples for at least
    x86-64 and ARM64. Also provide reference implementation in terms of 128-bit
    Wasm SIMD.

Detailed implementation guidance is available at WebAssembly/simd#79; below is an overview.

x86/x86-64 with FMA3

  • relaxed f32x4.fma = VFMADD213PS
  • relaxed f32x4.fms = VFNMADD213PS
  • relaxed f64x2.fma = VFMADD213PD
  • relaxed f64x2.fms = VFNMADD213PD

ARM64

  • relaxed f32x4.fma = FMLA
  • relaxed f32x4.fms = FMLS
  • relaxed f64x2.fma = FMLA
  • relaxed f64x2.fms = FMLS

ARMv7 with FMA (Neon v2)

  • relaxed f32x4.fma = VFMA
  • relaxed f32x4.fms = VFMS
  • relaxed f64x2.fma = VFMA
  • relaxed f64x2.fms = VFMS

ARMv7 without FMA (2 rounding)

  • relaxed f32x4.fma = VMLA
  • relaxed f32x4.fms = VMLS
  • relaxed f64x2.fma = VMLA
  • relaxed f64x2.fms = VMLS

Note: Armv8-M will require MVE-F (floating point extension)

RISC-V V

  • relaxed f32x4.fma = vfmacc.vv
  • relaxed f32x4.fms = vfnmsac.vv
  • relaxed f64x2.fma = vfmadd.vv
  • relaxed f64x2.fms = vfnmsac.vv

simd128

  • relaxed f32x4.fma(a, b, c) = f32x4.add(f32x4.mul(a, b), c)
  • relaxed f32x4.fms(a, b, c) = f32x4.sub(c, f32x4.mul(a, b))
  • relaxed f64x2.fma(a, b, c) = f64x2.add(f64x2.mul(a, b), c)
  • relaxed f64x2.fms(a, b, c) = f64x2.sub(c, f64x2.mul(a, b))
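
The simd128 fallback can be modelled lane by lane in plain C. This is only an illustrative sketch: the `f32x4` struct and all function names are hypothetical stand-ins for a v128 value and the corresponding Wasm SIMD operations, not a real API.

```c
/* Hypothetical stand-in for a v128 interpreted as four f32 lanes. */
typedef struct { float lane[4]; } f32x4;

static f32x4 f32x4_mul(f32x4 a, f32x4 b) {
    f32x4 r;
    for (int i = 0; i < 4; i++) r.lane[i] = a.lane[i] * b.lane[i];
    return r;
}

static f32x4 f32x4_add(f32x4 a, f32x4 b) {
    f32x4 r;
    for (int i = 0; i < 4; i++) r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

static f32x4 f32x4_sub(f32x4 a, f32x4 b) {
    f32x4 r;
    for (int i = 0; i < 4; i++) r.lane[i] = a.lane[i] - b.lane[i];
    return r;
}

/* fma(a, b, c) = (a * b) + c, written as a separate multiply and add. */
static f32x4 relaxed_f32x4_fma_ref(f32x4 a, f32x4 b, f32x4 c) {
    return f32x4_add(f32x4_mul(a, b), c);
}

/* fms(a, b, c) = -(a * b) + c, i.e. c minus the product. */
static f32x4 relaxed_f32x4_fms_ref(f32x4 a, f32x4 b, f32x4 c) {
    return f32x4_sub(c, f32x4_mul(a, b));
}
```

Note the operand order in the subtract: the product is subtracted from c, not the other way round. A C compiler may itself contract the multiply and add into an FMA, which is harmless here because either rounding behaviour is permitted.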
  4. How does behavior differ across processors? What new fingerprinting surfaces will be exposed?

Results differ depending on whether the hardware supports FMA: with hardware FMA the result is rounded once, without it the result is rounded twice. The dividing line is roughly between newer and older hardware. Newer hardware (Intel Haswell from 2013 onwards, AMD Zen from 2017, Cortex-A5 since 2011) tends to come with FMA support, so hardware without FMA should become less and less common.

  5. What use cases are there?

Many, especially machine learning (neural nets). Fused multiply-add improves accuracy in numerical algorithms, improves floating-point throughput, and reduces register pressure in some cases. An early prototype and evaluation also showed significant speedup on multiple neural-network models.

@ngzhian ngzhian changed the title relaxed fused multiply-add relaxed fused multiply-add and fused multiply-subtract Jun 16, 2021
ngzhian added a commit to ngzhian/relaxed-simd that referenced this issue Jun 29, 2021
@ngzhian ngzhian mentioned this issue Jun 29, 2021
ngzhian added a commit that referenced this issue Jul 2, 2021
moz-v2v-gh pushed a commit to mozilla/gecko-dev that referenced this issue Aug 13, 2021
Implement the fused multiply-add and fused multiply-sub relaxed SIMD
operations.

See WebAssembly/relaxed-simd#27 for proposed
spec of these operations.

There's no wat support for this yet - it will come separately - so
the test cases are a little rudimentary for now.  More tests will
appear later.

Differential Revision: https://phabricator.services.mozilla.com/D121870
jamienicol pushed a commit to jamienicol/gecko that referenced this issue Aug 20, 2021
Implement the fused multiply-add and fused multiply-sub relaxed SIMD
operations.

See WebAssembly/relaxed-simd#27 for proposed
spec of these operations.

There's no wat support for this yet - it will come separately - so
the test cases are a little rudimentary for now.  More tests will
appear later.

Differential Revision: https://phabricator.services.mozilla.com/D121870
@ngzhian
Member Author

ngzhian commented Nov 1, 2021

Power ISA has vmaddfp and vnmsubfp.

@ngzhian ngzhian added the in-overview Instruction has been added to Overview.md label Feb 18, 2022
@ngzhian
Member Author

ngzhian commented Apr 12, 2022

An example of an algorithm that requires fused multiply-add: https://hal.inria.fr/inria-00000895/document (look for Fast2Mult)

@ngzhian
Member Author

ngzhian commented Jul 20, 2022

@yurydelendik brought up a good point at #77 (comment)
fma(x,y,z) should be (x*y)+z, this follows https://en.cppreference.com/w/cpp/numeric/math/fma

This ordering does change the implementation, e.g. on AArch64, f32.qfma(a, b, c) will be FMLA c, a, b.

@Maratyszcza
Collaborator

Maratyszcza commented Jul 21, 2022

I agree with this point, however we'd need to rename fms into Fused Negative-Multiply-Add (fnma), as Fused Multiply-Subtract is supposed to do fms(x, y, z) = (x*y) - z in this notation.

@yurydelendik
Contributor

I was under the impression that we have fms only so that a negative product -x*y can be added to the total sum, e.g. during determinant calculations. Is it okay to still have fms(x, y, z) = -(x*y) + z?

@Maratyszcza
Collaborator

I'm not suggesting to remove the quasi-fused -(x*y) + z operation, I suggest that it should be called fnma (Fused Negative Multiply-Add) to better reflect the actual operations involved. In my experience, this instruction is most useful in Newton-Raphson iterations.

@ngzhian
Member Author

ngzhian commented Aug 4, 2022

Adding @dtig fyi, for comments.

@dtig
Member

dtig commented Aug 4, 2022

Keeping qfma(x,y,z) consistent with cpp intrinsics, and the name edits for fnma sgtm. I can make the necessary changes in V8 when merged into the overview doc.

@ngzhian
Member Author

ngzhian commented Aug 4, 2022

i3roly pushed a commit to i3roly/firefox-dynasty that referenced this issue Jun 1, 2024
Implement the fused multiply-add and fused multiply-sub relaxed SIMD
operations.

See WebAssembly/relaxed-simd#27 for proposed
spec of these operations.

There's no wat support for this yet - it will come separately - so
the test cases are a little rudimentary for now.  More tests will
appear later.

Differential Revision: https://phabricator.services.mozilla.com/D121870