
Re-enable AVX512 ATen kernels for compute-intensive ops #104165

@sanchitintel (Collaborator) commented Jun 25, 2023:

Summary

Enables AVX512 dispatch by default for kernels for which AVX512 performs better than AVX2.
All other kernels continue to use their AVX2 counterparts.

Implementation details

REGISTER_DISPATCH should now only be used for non-AVX512 dispatch.
ALSO_REGISTER_AVX512_DISPATCH should be used when a kernel should also be dispatched with AVX512.
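
For illustration, registration in a kernel translation unit then looks like the sketch below (the first kernel name is hypothetical; the second is borrowed from the discussion further down):

// opt this kernel into AVX512 dispatch (benchmarked faster than AVX2):
ALSO_REGISTER_AVX512_DISPATCH(some_op_kernel, &some_op_kernel_impl);

// keep this kernel on AVX2 (its AVX512 slot resolves to nullptr):
REGISTER_DISPATCH(reflection_pad1d_kernel, &reflection_pad1d_kernel_impl);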

Benchmarking results with #104655

Raw data is available at a GitHub Gist (click on "Download ZIP").

| Op | Speedup of AVX512 over AVX2 |
| --- | --- |
| sigmoid | ~27% with FP32 |
| sign | ~16.6% |
| sgn | ~15% |
| sqrt | ~4% |
| cosh | ~37% |
| sinh | ~37.5% |
| acos | ~8% with FP32 |
| expm1 | ~30% with FP32 |
| log | ~2% |
| log1p | ~16% |
| erfinv | ~6% with FP32 |
| LogSigmoid | ~33% with FP32 |
| atan2 | ~40% with FP32 |
| logaddexp | ~24% with FP32 |
| logaddexp2 | ~21% with FP32 |
| hypot | ~24% with FP32 |
| igamma | ~4% with FP32 |
| lgamma | ~40% with FP32 |
| igammac | 3.5% |
| gelu | ~3% with FP32 |
| glu | ~20% with FP32 |
| SiLU | ~35% with FP32 |
| Softplus | ~33% with FP32 |
| Mish | ~36% with FP32 |
| Hardswish | ~7% faster with FP32 when the tensor fits in L2 cache |
| Hardshrink | ~8% faster with FP32 when the tensor fits in L2 cache |
| Softshrink | ~10% faster with FP32 when the tensor fits in L2 cache |
| Hardtanh | ~12.5% faster with FP32 when the tensor fits in L2 cache |
| Hardsigmoid | ~7% faster with FP32 when the tensor fits in L2 cache |
| hypot | ~35% |
| atan2 | ~37% |
| dequantize per channel | ~10% |

Insights gleaned from the collected data (future action items):

  1. In-place variants of some ops are faster with AVX512 even though the functional variant may be slower for FP32. We will enable AVX512 dispatch for the in-place variants of such kernels.
  2. Almost all BF16 kernels are faster with AVX512, so after the PyTorch 2.1 release, we will enable AVX512 dispatch for BF16 kernels whose corresponding FP32 kernels don't perform well with AVX512.
  3. Some kernels rely on auto-vectorization and might perform better with AVX512 once explicit vectorization is enabled for them.

Data was collected with 26 physical threads on one socket of an Intel Xeon 8371HC. Intel OpenMP and tcmalloc were preloaded.
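
A single comparison of this shape can be approximated with a minimal libtorch microbenchmark like the sketch below (illustrative only, not the harness used for the table above); running it once with ATEN_CPU_CAPABILITY=avx2 and once with ATEN_CPU_CAPABILITY=avx512 isolates the two dispatch paths:

// Hypothetical microbenchmark sketch, not the harness used for the table above.
// Build against libtorch; run under ATEN_CPU_CAPABILITY=avx2 and then avx512.
#include <torch/torch.h>
#include <chrono>
#include <iostream>

int main() {
  torch::NoGradGuard no_grad;
  auto x = torch::randn({1 << 22});   // FP32 input, larger than L2 cache
  for (int i = 0; i < 10; ++i) {
    (void)torch::sigmoid(x);          // warmup iterations
  }
  constexpr int iters = 100;
  auto t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < iters; ++i) {
    (void)torch::sigmoid(x);          // timed op under the forced ISA
  }
  auto t1 = std::chrono::steady_clock::now();
  std::cout << "sigmoid: "
            << std::chrono::duration<double, std::milli>(t1 - t0).count() / iters
            << " ms/iter" << std::endl;
  return 0;
}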

cc @jgong5 @mingfeima @XiaobingSuper @ashokei @jingxu10

@pytorch-bot (bot) commented Jun 25, 2023:

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/104165

Note: Links to docs will display an error until the docs builds have been completed.

✅ 1 Unrelated Failure

As of commit d5002f4 with merge base 3336aa1:

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@github-actions github-actions bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label Jun 25, 2023
@sanchitintel sanchitintel added the release notes: intel release notes category label Jun 25, 2023
@@ -424,6 +424,8 @@ void replication_pad3d_backward_kernel_impl(

} // anonymous namespace

// These kernels are slower with AVX512 than with AVX2.
#ifndef CPU_CAPABILITY_AVX512
// reflection padding
REGISTER_DISPATCH(reflection_pad1d_kernel, &reflection_pad1d_kernel_impl);
@mingfeima (Collaborator) commented on the diff:

Is it possible to add a default argument that decides whether this kernel is explicitly dispatched to AVX512? Probably something like:

// if you want this one to go avx512:
REGISTER_DISPATCH(reflection_pad1d_kernel, &reflection_pad1d_kernel_impl, /* enable_avx512_dispatch */ true);

//  if you want to skip avx512 (use avx2 only), just use the old way:
REGISTER_DISPATCH(reflection_pad1d_kernel, &reflection_pad1d_kernel_impl);

@jgong5 (Collaborator) commented:

Agree with @mingfeima. I think it is better to explicitly turn on AVX512 for particular kernels instead of turning off the others, since there are more kernels to turn off than to turn on. This would reduce the number of changed lines.

@sanchitintel (Collaborator, Author) commented Jul 19, 2023:

Sorry, @mingfeima @jgong5, I had missed your comments. I started with such an approach when I first picked up this task, so that the number of code changes could be reduced, but it led to symbol resolution issues, so I abandoned it.

Basically, since kernels are compiled separately for each AVX-n capability, if we only registered AVX512 dispatch for some kernels, we would have missing definitions for the kernels that aren't dispatched to AVX512.
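
A simplified model of the problem (illustrative only; the type and stub names below are assumptions, not the actual ATen macros):

// Each kernel .cpp is compiled once per ISA, and the dispatch stub declares
// one function-pointer slot per ISA. Every slot declared here must be
// defined by exactly one of those per-ISA translation units:

using unary_fn = void (*)(float*, const float*, long);

struct sigmoid_stub_t {
  static unary_fn DEFAULT;   // defined by the baseline build of the kernel TU
  static unary_fn AVX2;      // defined by the -mavx2 build
  static unary_fn AVX512;    // defined by the AVX512 build
};

// Old scheme: each per-ISA build emits its own definition, e.g. the AVX512
// TU emits
//   unary_fn sigmoid_stub_t::AVX512 = &sigmoid_kernel_impl;
// If the AVX512 build simply skipped that line for "AVX2-only" kernels, the
// AVX512 slot would be declared but never defined, and linking would fail.
// Hence the approach below: the AVX512 TU always defines the slot, but
// REGISTER_DISPATCH defines it as nullptr unless the kernel opts in via
// ALSO_REGISTER_AVX512_DISPATCH.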

@jgong5 (Collaborator) commented:

Is it possible to always register null AVX512 kernels by default while checking whether registrations already exist (e.g., via some macro-def checks)? The real AVX512 kernels could be registered beforehand.

@sanchitintel (Collaborator, Author) commented Jul 20, 2023:

Thanks again for your inputs, @jgong5! Yes, such an approach is feasible.

The approach below, which improves readability and decreases the number of changes, is also similar in nature to @mingfeima's suggestion above (his snippet is edited in place below):

// if you want this one to go avx512:
ALSO_REGISTER_AVX512_DISPATCH(reflection_pad1d_kernel, &reflection_pad1d_kernel_impl);

//  if you want to skip avx512 (use avx2 only), just use the old way:
REGISTER_DISPATCH(reflection_pad1d_kernel, &reflection_pad1d_kernel_impl);

@jgong5, its implementation is a bit different from your original suggestion, though.
The code snippet below leverages the fact that we compile separately for each AVX-n vectorization ISA.

In DispatchStub.h:

#elif defined(CPU_CAPABILITY)
#ifdef CPU_CAPABILITY_AVX512
// REGISTER_DISPATCH now dispatches an AVX512 kernel to nullptr
// ALSO_REGISTER_AVX512_DISPATCH should be used for ensuring AVX512 dispatch
#define REGISTER_DISPATCH(name, fn) REGISTER_ARCH_DISPATCH(name, CPU_CAPABILITY, nullptr)
#else
#define REGISTER_DISPATCH(name, fn) REGISTER_ARCH_DISPATCH(name, CPU_CAPABILITY, fn)
#endif
#define ALSO_REGISTER_AVX512_DISPATCH(name, fn) REGISTER_ARCH_DISPATCH(name, CPU_CAPABILITY, fn)

I'll push changes & then re-request review to seek feedback on it. Thanks!
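
The runtime side (in DispatchStub.cpp, which this PR also touches) then presumably falls back to the AVX2 slot whenever the AVX512 slot is nullptr. A minimal sketch of that selection logic, with all names assumed rather than taken from the actual code:

// Illustrative sketch of the runtime fallback, assuming a simplified
// dispatch stub with one function-pointer slot per ISA (not the real ATen code).
enum class CPUCapability { DEFAULT, AVX2, AVX512 };

template <typename FnPtr>
FnPtr choose_impl(CPUCapability cap, FnPtr def, FnPtr avx2, FnPtr avx512) {
  if (cap == CPUCapability::AVX512) {
    if (avx512 != nullptr) {
      return avx512;           // kernel opted in via ALSO_REGISTER_AVX512_DISPATCH
    }
    cap = CPUCapability::AVX2; // AVX512 slot is nullptr -> fall back to AVX2
  }
  if (cap == CPUCapability::AVX2 && avx2 != nullptr) {
    return avx2;
  }
  return def;                  // scalar/default build as the last resort
}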


Disable AVX512 dispatch for fmod & remainder
@jgong5 (Collaborator) left a comment:

LGTM now.


@sanchitintel sanchitintel marked this pull request as ready for review August 1, 2023 06:05
@sanchitintel (Collaborator, Author) commented Aug 1, 2023:

Hi @mingfeima, I'll add benchmarking data soon.
Can you please take a look at the new implementation in the meantime? Thanks!

@jgong5, I was wondering if we should rename the macro REGISTER_DISPATCH for x86_64 machines to REGISTER_NON_AVX512_DISPATCH to prevent any confusion, since I added the macro ALSO_REGISTER_AVX512_DISPATCH in this PR.

@lezcano lezcano removed their request for review August 1, 2023 08:09
@IvanYashchuk IvanYashchuk removed their request for review August 4, 2023 17:16
@colesbury colesbury added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Aug 14, 2023
@mingfeima (Collaborator) left a comment:

LGTM :)

@sanchitintel sanchitintel added the intel This tag is for PR from Intel label Aug 21, 2023
@sanchitintel (Collaborator, Author) commented Aug 21, 2023:

Hi @ezyang,
This PR modifies aten/src/ATen/native/DispatchStub.h and aten/src/ATen/native/DispatchStub.cpp.
Can you please help review it, since these files seem to require approval from a core-maintainer/core-reviewer for merging? Alternatively, can you please suggest reviewers? Thank you!

@sanchitintel (Collaborator, Author) commented:

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Aug 22, 2023
@pytorchmergebot commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

Labels

- ciflow/trunk: Trigger trunk jobs on your pull request
- intel: This tag is for PRs from Intel
- Merged
- module: cpu: CPU specific problem (e.g., perf, algorithm)
- open source
- release notes: intel: release notes category
- release notes: quantization: release notes category
- triaged: This issue has been looked at by a team member, and triaged and prioritized into an appropriate module