Faster EltwiseFMAModAVX512 #42

fboemer · 2021-08-11T18:01:42Z

1.5x - 1.7x speedup on EltwiseFMAModAVX512IFMA using a few tricks also used in the NTT:

- Switch to IFMA 52-bit mullo when computing vq_times_mod. The speedup is likely due to better pipelining on the same port.
- Use the negative modulus, neg_p. This helps compute q := a * b - q * p in two steps (tmp = a * b; q = fma(tmp, q, neg_p)) rather than the previous three steps (tmp = a * b; tmp2 = q * p; q = tmp - tmp2). See line 2 of Algorithm 4 in https://arxiv.org/pdf/2012.01968.pdf for background.
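To illustrate the negative-modulus trick, here is a minimal scalar sketch of one 52-bit lane. It is not the code in this PR: `MulLoAdd52`, `RemainderThreeStep`, and `RemainderTwoStep` are hypothetical names, the quotient estimate `q` is assumed to be supplied by the caller, and the final mask to 52 bits is an assumption of the sketch. The result equals a * b - q * p whenever that value lies in [0, 2^52).

```cpp
#include <cassert>
#include <cstdint>

// Mask for the low 52 bits, the lane width used by the IFMA instructions.
constexpr uint64_t kMask52 = (uint64_t{1} << 52) - 1;

// Scalar stand-in (hypothetical helper) for one lane of a 52-bit
// multiply-add-low: returns acc + (low 52 bits of x * y).
// Unsigned wraparound keeps the low bits of the product exact.
inline uint64_t MulLoAdd52(uint64_t acc, uint64_t x, uint64_t y) {
  return acc + ((x * y) & kMask52);
}

// Previous shape: two multiplies followed by a subtraction.
inline uint64_t RemainderThreeStep(uint64_t a, uint64_t b, uint64_t q,
                                   uint64_t p) {
  uint64_t tmp = (a * b) & kMask52;   // low 52 bits of a * b
  uint64_t tmp2 = (q * p) & kMask52;  // low 52 bits of q * p
  return (tmp - tmp2) & kMask52;      // a * b - q * p (mod 2^52)
}

// New shape: one multiply, then one fused multiply-add with neg_p.
inline uint64_t RemainderTwoStep(uint64_t a, uint64_t b, uint64_t q,
                                 uint64_t p) {
  assert(p > 0 && p <= kMask52);
  // neg_p = 2^52 - p, i.e. -p modulo 2^52. In a vectorized loop this would
  // be computed once per modulus, outside the loop.
  const uint64_t neg_p = (kMask52 + 1) - p;
  uint64_t tmp = (a * b) & kMask52;            // low 52 bits of a * b
  return MulLoAdd52(tmp, q, neg_p) & kMask52;  // tmp + q * (-p) (mod 2^52)
}
```

Both functions return the same value for the same inputs. The two-step form saves the separate subtraction by folding the sign into the precomputed constant.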

Additionally, for the case when the addition argument arg3 != nullptr, merging the two modular reductions led to a slight speedup.
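The description does not spell out how the two reductions are merged. One plausible reading, sketched below in scalar C++ as an assumption rather than as the code in this PR, is that the product is left lazily reduced in [0, 2p), arg3 is added to give a value in [0, 3p), and a single reduction step brings the sum back to [0, p). The names `CondSub`, `AddThenReduceSeparately`, and `AddThenReduceMerged` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Conditional subtraction: maps x in [0, 2 * m) to [0, m).
inline uint64_t CondSub(uint64_t x, uint64_t m) { return x >= m ? x - m : x; }

// Separate reductions: fully reduce the lazy product t, then reduce the sum.
// Assumes t in [0, 2p) and c in [0, p).
inline uint64_t AddThenReduceSeparately(uint64_t t, uint64_t c, uint64_t p) {
  assert(t < 2 * p && c < p);
  uint64_t r = CondSub(t, p);  // t: [0, 2p) -> [0, p)
  return CondSub(r + c, p);    // r + c: [0, 2p) -> [0, p)
}

// Merged reduction: add first, then reduce the sum once from [0, 3p).
// Assumes t in [0, 2p) and c in [0, p).
inline uint64_t AddThenReduceMerged(uint64_t t, uint64_t c, uint64_t p) {
  assert(t < 2 * p && c < p);
  uint64_t s = t + c;     // s in [0, 3p)
  s = CondSub(s, 2 * p);  // s in [0, 2p)
  return CondSub(s, p);   // s in [0, p)
}
```

Both variants return (t + c) mod p. Whether the merged form is faster depends on the surrounding vector code; the description above reports only a slight gain.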

On ICX with clang-12, I see

| Benchmark | Before | After | Speedup |
|---|---|---|---|
| BM_EltwiseFMAModAVX512IFMA/1024/0 | 0.196us | 0.116us | 1.69x |
| BM_EltwiseFMAModAVX512IFMA/8192/0 | 1.53us | 1.03us | 1.48x |
| BM_EltwiseFMAModAVX512IFMA/16384/0 | 3.05us | 2.06us | 1.51x |
| BM_EltwiseFMAModAVX512IFMA/1024/1 | 0.268us | 0.177us | 1.51x |
| BM_EltwiseFMAModAVX512IFMA/8192/1 | 2.22us | 1.47us | 1.51x |
| BM_EltwiseFMAModAVX512IFMA/16384/1 | 4.43us | 2.93us | 1.51x |

hamishun

LGTM

fboemer added 3 commits August 11, 2021 10:27

Faster EltwiseFMAModAVX512

e1686a1

Add addition argument to EltwiseFMAMod benchmarks

601884c

Cleanup documentation

1011f8c

fboemer marked this pull request as ready for review August 11, 2021 18:02

fboemer requested a review from a team as a code owner August 11, 2021 18:02

fboemer marked this pull request as draft August 11, 2021 18:02

hamishun approved these changes Aug 11, 2021

fboemer marked this pull request as ready for review August 11, 2021 18:18

jlhcrawford approved these changes Aug 11, 2021

fboemer merged commit 0be1221 into main Aug 11, 2021

fboemer deleted the fboemer/eltwise-fma-mod-avx512-speedup branch August 11, 2021 18:24
