relaxed fused multiply-add and fused multiply-subtract #27
Implement the fused multiply-add and fused multiply-subtract relaxed SIMD operations. See WebAssembly/relaxed-simd#27 for the proposed spec of these operations. There's no wat support for this yet - it will come in separately - so the test cases are a little rudimentary for now. More tests will appear later. Differential Revision: https://phabricator.services.mozilla.com/D121870
Power ISA has vmaddfp and vnmsubfp.
Example of an algorithm requiring fused multiply-add: https://hal.inria.fr/inria-00000895/document (look for Fast2Mult).
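For context, Fast2Mult splits an exact product into a rounded product and its rounding error, and it only works when the multiply-add is genuinely fused. A minimal C sketch (my own illustration, not taken from the paper):

```c
#include <math.h>

/* Fast2Mult sketch: computes p and e such that a * b == p + e exactly
 * (barring overflow/underflow), where p = round(a * b). The residual e
 * is only recoverable with a genuinely fused multiply-add; an unfused
 * 2-rounding mul+add would round it away and always yield e == 0. */
void fast2mult(double a, double b, double *p, double *e) {
    *p = a * b;             /* rounded product */
    *e = fma(a, b, -(*p));  /* exact residual a*b - p, needs fused FMA */
}
```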
@yurydelendik brought up a good point at #77 (comment). This ordering does change the implementation, e.g. on AArch64, …
I agree with this point; however, we'd need to rename …
I was under the impression that we have …
I'm not suggesting removing the quasi-fused …
Adding @dtig, FYI, for comments.
Keeping …
fnma seems to be different on different software: …
All the instructions take 3 operands, `a`, `b`, `c`, and perform `(a * b) + c` or `-(a * b) + c`:

- relaxed f32x4.fma(a, b, c) = (a * b) + c
- relaxed f32x4.fms(a, b, c) = -(a * b) + c
- relaxed f64x2.fma(a, b, c) = (a * b) + c
- relaxed f64x2.fms(a, b, c) = -(a * b) + c

where either:

- `a * b` is rounded first, and the final result is rounded again (for a total of 2 roundings), or
- the entire expression is evaluated at higher precision and rounded only once (fused, 1 rounding).

Below are lowerings for the most relevant ISAs, x86-64 and ARM64 among them, along with a reference implementation in terms of 128-bit Wasm SIMD.
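To make the relaxation concrete, here is a per-lane sketch in C of the two results an implementation may produce. The helper names are illustrative, not part of the proposal, and using `FP_FAST_FMAF` as the switch is my own stand-in for "hardware FMA is available":

```c
#include <math.h>

/* One lane of relaxed f32x4.fma/fms: either branch below is an
 * allowed result under the relaxed semantics. */
float relaxed_fma_lane(float a, float b, float c) {
#ifdef FP_FAST_FMAF
    return fmaf(a, b, c);   /* fused: 1 rounding */
#else
    return (a * b) + c;     /* unfused: 2 roundings */
#endif
}

float relaxed_fms_lane(float a, float b, float c) {
#ifdef FP_FAST_FMAF
    return fmaf(-a, b, c);  /* -(a * b) + c, 1 rounding */
#else
    return -(a * b) + c;    /* 2 roundings */
#endif
}
```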
x86/x86-64 with FMA3

| relaxed instruction | x86-64 instruction |
| --- | --- |
| relaxed f32x4.fma | VFMADD213PS |
| relaxed f32x4.fms | VFNMADD213PS |
| relaxed f64x2.fma | VFMADD213PD |
| relaxed f64x2.fms | VFNMADD213PD |
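As a sketch of how the f32x4 rows could be expressed with FMA3 intrinsics (assumes a compiler targeting FMA-capable x86-64, e.g. with -mfma; function names are mine):

```c
#include <immintrin.h>

/* Illustrative lowering of the f32x4 rows above. */
__m128 relaxed_f32x4_fma(__m128 a, __m128 b, __m128 c) {
    return _mm_fmadd_ps(a, b, c);   /* VFMADD213PS:  (a * b) + c  */
}

__m128 relaxed_f32x4_fms(__m128 a, __m128 b, __m128 c) {
    return _mm_fnmadd_ps(a, b, c);  /* VFNMADD213PS: -(a * b) + c */
}
```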
ARM64

| relaxed instruction | ARM64 instruction |
| --- | --- |
| relaxed f32x4.fma | FMLA |
| relaxed f32x4.fms | FMLS |
| relaxed f64x2.fma | FMLA |
| relaxed f64x2.fms | FMLS |
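The analogous AArch64 lowering via NEON intrinsics (again a sketch with illustrative names). Note that FMLA/FMLS accumulate into their destination register, which is the operand-ordering point raised in the comments above:

```c
#include <arm_neon.h>

/* FMLA/FMLS accumulate into the first operand. */
float32x4_t relaxed_f32x4_fma(float32x4_t a, float32x4_t b, float32x4_t c) {
    return vfmaq_f32(c, a, b);  /* FMLA: c + (a * b) */
}

float32x4_t relaxed_f32x4_fms(float32x4_t a, float32x4_t b, float32x4_t c) {
    return vfmsq_f32(c, a, b);  /* FMLS: c - (a * b) */
}
```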
ARMv7 with FMA (Neon v2)

| relaxed instruction | ARMv7 instruction |
| --- | --- |
| relaxed f32x4.fma | VFMA |
| relaxed f32x4.fms | VFMS |
| relaxed f64x2.fma | VFMA |
| relaxed f64x2.fms | VFMS |
ARMv7 without FMA (2 roundings)

| relaxed instruction | ARMv7 instruction |
| --- | --- |
| relaxed f32x4.fma | VMLA |
| relaxed f32x4.fms | VMLS |
| relaxed f64x2.fma | VMLA |
| relaxed f64x2.fms | VMLS |
Note: Armv8-M will require MVE-F (floating point extension)
RISC-V V

| relaxed instruction | RISC-V V instruction |
| --- | --- |
| relaxed f32x4.fma | vfmacc.vv |
| relaxed f32x4.fms | vfnmsac.vv |
| relaxed f64x2.fma | vfmadd.vv |
| relaxed f64x2.fms | vfnmsac.vv |
simd128

| relaxed instruction | simd128 lowering |
| --- | --- |
| relaxed f32x4.fma | f32x4.add(f32x4.mul(a, b), c) |
| relaxed f32x4.fms | f32x4.sub(c, f32x4.mul(a, b)) |
| relaxed f64x2.fma | f64x2.add(f64x2.mul(a, b), c) |
| relaxed f64x2.fms | f64x2.sub(c, f64x2.mul(a, b)) |
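The simd128 rows correspond to the following unfused (2-rounding) reference lowering, sketched with the standard wasm_simd128.h intrinsics; function names are mine:

```c
#include <wasm_simd128.h>

/* Reference lowering in plain 128-bit Wasm SIMD: two roundings per lane. */
v128_t fallback_f32x4_fma(v128_t a, v128_t b, v128_t c) {
    return wasm_f32x4_add(wasm_f32x4_mul(a, b), c);      /*  (a * b) + c */
}

v128_t fallback_f32x4_fms(v128_t a, v128_t b, v128_t c) {
    return wasm_f32x4_sub(c, wasm_f32x4_mul(a, b));      /* -(a * b) + c */
}
```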
The difference depends on whether the hardware supports FMA, and the dividing line is between newer and older hardware: newer parts (Intel Haswell from 2013 onwards, AMD Zen from 2017, ARM Cortex-A5 since 2011) tend to come with hardware FMA support, so we will probably see less and less hardware without FMA.
Many, especially machine learning (neural nets). Fused multiply-add improves accuracy in numerical algorithms, improves floating-point throughput, and reduces register pressure in some cases. An early prototype and evaluation also showed significant speedups on multiple neural-network models.