
Faster EltwiseFMAModAVX512 #42

Merged 3 commits on Aug 11, 2021

@fboemer fboemer commented Aug 11, 2021

1.5x - 1.7x speedup on EltwiseFMAModAVX512IFMA using a few tricks also used in the NTT:

  1. Switch to the IFMA 52-bit mullo when computing vq_times_mod. The speedup is likely due to better pipelining on the same port.

  2. Use the negative modulus, neg_p. This computes q := a * b - q * p in two steps, tmp = a * b; q = fma(tmp, q, neg_p) (that is, tmp + q * neg_p), rather than the previous three steps, tmp = a * b; tmp2 = q * p; q = tmp - tmp2. See line 2 of Algorithm 4 in https://arxiv.org/pdf/2012.01968.pdf for background.
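
The negative-modulus trick in item 2 can be illustrated with a scalar sketch. This is not the code in this PR: `MulLoAdd52`, `Quotient`, `MulModThreeStep`, and `MulModTwoStep` are hypothetical names, the 52-bit mullo-add is emulated with plain 64-bit arithmetic instead of AVX512-IFMA intrinsics, and the quotient is computed exactly with a 128-bit division (a GCC/Clang extension) where a real kernel would use a Barrett estimate followed by a conditional subtraction. It assumes a, b < p < 2^52.

```cpp
#include <cassert>
#include <cstdint>

using u64 = std::uint64_t;

// Low 52 bits, the width that the IFMA multiply instructions operate on.
constexpr u64 kMask52 = (u64{1} << 52) - 1;

// Hypothetical scalar stand-in for a 52-bit "mullo + add" operation:
// returns the low 52 bits of acc + lo52(x * y).
inline u64 MulLoAdd52(u64 acc, u64 x, u64 y) {
  // Unsigned wraparound mod 2^64 keeps the low 52 bits of the product exact.
  u64 lo = ((x & kMask52) * (y & kMask52)) & kMask52;
  return (acc + lo) & kMask52;
}

// Assumption for illustration: exact q = floor(a * b / p) via 128-bit math.
inline u64 Quotient(u64 a, u64 b, u64 p) {
  return static_cast<u64>((static_cast<unsigned __int128>(a) * b) / p);
}

// Three steps: tmp = a * b; tmp2 = q * p; result = tmp - tmp2 (all mod 2^52).
inline u64 MulModThreeStep(u64 a, u64 b, u64 p) {
  assert(p > 0 && p <= kMask52 && a < p && b < p);
  u64 q = Quotient(a, b, p);
  u64 tmp = (a * b) & kMask52;
  u64 tmp2 = (q * p) & kMask52;
  return (tmp - tmp2) & kMask52;
}

// Two steps: tmp = a * b; result = tmp + q * neg_p (all mod 2^52).
inline u64 MulModTwoStep(u64 a, u64 b, u64 p) {
  assert(p > 0 && p <= kMask52 && a < p && b < p);
  // neg_p = 2^52 - p; in a vectorized loop this is computed once per modulus.
  u64 neg_p = (u64{0} - p) & kMask52;
  u64 q = Quotient(a, b, p);
  u64 tmp = (a * b) & kMask52;
  return MulLoAdd52(tmp, q, neg_p);
}
```

Both variants agree because q * neg_p is congruent to -(q * p) modulo 2^52, and the true remainder a * b - q * p is smaller than p, so its low 52 bits are the whole value. The two-step form drops the separate subtraction by folding it into the multiply-add.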

Additionally, for the case when the addition argument arg3 != nullptr, merging the two modular reductions led to a slight speedup.

On ICX with clang-12, I see:

| Benchmark | Before | After | Speedup |
| --- | --- | --- | --- |
| BM_EltwiseFMAModAVX512IFMA/1024/0 | 0.196us | 0.116us | 1.69x |
| BM_EltwiseFMAModAVX512IFMA/8192/0 | 1.53us | 1.03us | 1.48x |
| BM_EltwiseFMAModAVX512IFMA/16384/0 | 3.05us | 2.06us | 1.51x |
| BM_EltwiseFMAModAVX512IFMA/1024/1 | 0.268us | 0.177us | 1.51x |
| BM_EltwiseFMAModAVX512IFMA/8192/1 | 2.22us | 1.47us | 1.51x |
| BM_EltwiseFMAModAVX512IFMA/16384/1 | 4.43us | 2.93us | 1.51x |

@fboemer fboemer marked this pull request as ready for review August 11, 2021 18:02
@fboemer fboemer requested a review from a team as a code owner August 11, 2021 18:02
@fboemer fboemer marked this pull request as draft August 11, 2021 18:02

@hamishun hamishun left a comment


LGTM

@fboemer fboemer marked this pull request as ready for review August 11, 2021 18:18
@fboemer fboemer merged commit 0be1221 into main Aug 11, 2021
@fboemer fboemer deleted the fboemer/eltwise-fma-mod-avx512-speedup branch August 11, 2021 18:24