
Re-enable AVX512 ATen kernels for compute-intensive ops #104165

@sanchitintel (Collaborator) commented Jun 25, 2023:

Summary

Enables AVX512 dispatch by default for kernels for which AVX512 performs better than AVX2.
All other kernels continue to use their AVX2 counterparts.

Implementation details

REGISTER_DISPATCH should now only be used for non-AVX512 dispatch.
ALSO_REGISTER_AVX512_DISPATCH should be used when a kernel should also be dispatched with AVX512.
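
For illustration, registration in a kernel translation unit then looks like the sketch below (the first kernel name is hypothetical; the second is borrowed from the discussion further down):

// opt this kernel into AVX512 dispatch (benchmarked faster than AVX2):
ALSO_REGISTER_AVX512_DISPATCH(some_op_kernel, &some_op_kernel_impl);

// keep this kernel on AVX2 (its AVX512 slot resolves to nullptr):
REGISTER_DISPATCH(reflection_pad1d_kernel, &reflection_pad1d_kernel_impl);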

Benchmarking results with #104655

Raw data is available at a GitHub Gist (click on "Download ZIP").

| Op | Speedup of AVX512 over AVX2 |
| --- | --- |
| sigmoid | ~27% with FP32 |
| sign | ~16.6% |
| sgn | ~15% |
| sqrt | ~4% |
| cosh | ~37% |
| sinh | ~37.5% |
| acos | ~8% with FP32 |
| expm1 | ~30% with FP32 |
| log | ~2% |
| log1p | ~16% |
| erfinv | ~6% with FP32 |
| LogSigmoid | ~33% with FP32 |
| atan2 | ~40% with FP32 |
| logaddexp | ~24% with FP32 |
| logaddexp2 | ~21% with FP32 |
| hypot | ~24% with FP32 |
| igamma | ~4% with FP32 |
| lgamma | ~40% with FP32 |
| igammac | 3.5% |
| gelu | ~3% with FP32 |
| glu | ~20% with FP32 |
| SiLU | ~35% with FP32 |
| Softplus | ~33% with FP32 |
| Mish | ~36% with FP32 |
| Hardswish | ~7% faster with FP32 when the tensor fits in L2 cache |
| Hardshrink | ~8% faster with FP32 when the tensor fits in L2 cache |
| Softshrink | ~10% faster with FP32 when the tensor fits in L2 cache |
| Hardtanh | ~12.5% faster with FP32 when the tensor fits in L2 cache |
| Hardsigmoid | ~7% faster with FP32 when the tensor fits in L2 cache |
| hypot | ~35% |
| atan2 | ~37% |
| dequantize per channel | ~10% |

Insights gleaned from the collected data (future action items):

  1. In-place variants of some ops are faster with AVX512 even though the functional variant may be slower for FP32. We will enable AVX512 dispatch for the in-place variants of such kernels.
  2. Almost all BF16 kernels are faster with AVX512, so after the PyTorch 2.1 release, we will enable AVX512 dispatch for BF16 kernels whose corresponding FP32 kernels don't perform well with AVX512.
  3. Some kernels rely on auto-vectorization and might perform better with AVX512 once explicit vectorization is enabled for them.

Data was collected with 26 physical threads on one socket of an Intel Xeon 8371HC. Intel OpenMP and tcmalloc were preloaded.
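
A single comparison of this shape can be approximated with a minimal libtorch microbenchmark like the sketch below (illustrative only, not the harness used for the table above); running it once with ATEN_CPU_CAPABILITY=avx2 and once with ATEN_CPU_CAPABILITY=avx512 isolates the two dispatch paths:

// Hypothetical microbenchmark sketch, not the harness used for the table above.
// Build against libtorch; run under ATEN_CPU_CAPABILITY=avx2 and then avx512.
#include <torch/torch.h>
#include <chrono>
#include <iostream>

int main() {
  torch::NoGradGuard no_grad;
  auto x = torch::randn({1 << 22});   // FP32 input, larger than L2 cache
  for (int i = 0; i < 10; ++i) {
    (void)torch::sigmoid(x);          // warmup iterations
  }
  constexpr int iters = 100;
  auto t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < iters; ++i) {
    (void)torch::sigmoid(x);          // timed op under the forced ISA
  }
  auto t1 = std::chrono::steady_clock::now();
  std::cout << "sigmoid: "
            << std::chrono::duration<double, std::milli>(t1 - t0).count() / iters
            << " ms/iter" << std::endl;
  return 0;
}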

cc @jgong5 @mingfeima @XiaobingSuper @ashokei @jingxu10

@pytorch-bot (bot) commented Jun 25, 2023:

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/104165

Note: Links to docs will display an error until the docs builds have been completed.

✅ 1 Unrelated Failure

As of commit d5002f4 with merge base 3336aa1:

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@github-actions github-actions bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label Jun 25, 2023
@sanchitintel sanchitintel added the release notes: intel release notes category label Jun 25, 2023
@@ -424,6 +424,8 @@ void replication_pad3d_backward_kernel_impl(

} // anonymous namespace

// These kernels are slower with AVX512 than with AVX2.
#ifndef CPU_CAPABILITY_AVX512
// reflection padding
REGISTER_DISPATCH(reflection_pad1d_kernel, &reflection_pad1d_kernel_impl);
@mingfeima (Collaborator) commented on the diff:

Is it possible to add a default argument that decides whether this kernel is explicitly dispatched to AVX512? Probably something like:

// if you want this one to go avx512:
REGISTER_DISPATCH(reflection_pad1d_kernel, &reflection_pad1d_kernel_impl, /* enable_avx512_dispatch */ true);

//  if you want to skip avx512 (use avx2 only), just use the old way:
REGISTER_DISPATCH(reflection_pad1d_kernel, &reflection_pad1d_kernel_impl);

@jgong5 (Collaborator) commented:

Agree with @mingfeima. I think it is better to explicitly turn on AVX512 for particular kernels instead of turning off the others, since there are more kernels to turn off than to turn on. This would reduce the number of changed lines.

@sanchitintel (Collaborator, Author) commented Jul 19, 2023:

Sorry, @mingfeima @jgong5, I had missed your comments. I started with such an approach when I first picked up this task, so that the number of code changes could be reduced, but it led to symbol resolution issues, so I abandoned it.

Basically, since kernels are compiled separately for each AVX-n capability, if we only registered AVX512 dispatch for some kernels, we would have missing definitions for the kernels that aren't dispatched to AVX512.
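
A simplified model of the problem (illustrative only; the type and stub names below are assumptions, not the actual ATen macros):

// Each kernel .cpp is compiled once per ISA, and the dispatch stub declares
// one function-pointer slot per ISA. Every slot declared here must be
// defined by exactly one of those per-ISA translation units:

using unary_fn = void (*)(float*, const float*, long);

struct sigmoid_stub_t {
  static unary_fn DEFAULT;   // defined by the baseline build of the kernel TU
  static unary_fn AVX2;      // defined by the -mavx2 build
  static unary_fn AVX512;    // defined by the AVX512 build
};

// Old scheme: each per-ISA build emits its own definition, e.g. the AVX512
// TU emits
//   unary_fn sigmoid_stub_t::AVX512 = &sigmoid_kernel_impl;
// If the AVX512 build simply skipped that line for "AVX2-only" kernels, the
// AVX512 slot would be declared but never defined, and linking would fail.
// Hence the approach below: the AVX512 TU always defines the slot, but
// REGISTER_DISPATCH defines it as nullptr unless the kernel opts in via
// ALSO_REGISTER_AVX512_DISPATCH.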

@jgong5 (Collaborator) commented:

Is it possible to always register null AVX512 kernels by default while checking whether registrations already exist (e.g., via some macro-def checks)? The real AVX512 kernels could be registered beforehand.

@sanchitintel (Collaborator, Author) commented Jul 20, 2023:

Thanks again for your inputs, @jgong5! Yes, such an approach is feasible.

The approach below, which improves readability and decreases the number of changes, is also similar in nature to @mingfeima's suggestion above (his snippet is edited in place below):

// if you want this one to go avx512:
ALSO_REGISTER_AVX512_DISPATCH(reflection_pad1d_kernel, &reflection_pad1d_kernel_impl);

//  if you want to skip avx512 (use avx2 only), just use the old way:
REGISTER_DISPATCH(reflection_pad1d_kernel, &reflection_pad1d_kernel_impl);

@jgong5, its implementation is a bit different from your original suggestion, though.
The code snippet below leverages the fact that we compile separately for each AVX-n vectorization ISA.

In DispatchStub.h:

#elif defined(CPU_CAPABILITY)
#ifdef CPU_CAPABILITY_AVX512
// REGISTER_DISPATCH now dispatches an AVX512 kernel to nullptr
// ALSO_REGISTER_AVX512_DISPATCH should be used for ensuring AVX512 dispatch
#define REGISTER_DISPATCH(name, fn) REGISTER_ARCH_DISPATCH(name, CPU_CAPABILITY, nullptr)
#else
#define REGISTER_DISPATCH(name, fn) REGISTER_ARCH_DISPATCH(name, CPU_CAPABILITY, fn)
#endif
#define ALSO_REGISTER_AVX512_DISPATCH(name, fn) REGISTER_ARCH_DISPATCH(name, CPU_CAPABILITY, fn)

I'll push changes & then re-request review to seek feedback on it. Thanks!
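
The runtime side (in DispatchStub.cpp, which this PR also touches) then presumably falls back to the AVX2 slot whenever the AVX512 slot is nullptr. A minimal sketch of that selection logic, with all names assumed rather than taken from the actual code:

// Illustrative sketch of the runtime fallback, assuming a simplified
// dispatch stub with one function-pointer slot per ISA (not the real ATen code).
enum class CPUCapability { DEFAULT, AVX2, AVX512 };

template <typename FnPtr>
FnPtr choose_impl(CPUCapability cap, FnPtr def, FnPtr avx2, FnPtr avx512) {
  if (cap == CPUCapability::AVX512) {
    if (avx512 != nullptr) {
      return avx512;           // kernel opted in via ALSO_REGISTER_AVX512_DISPATCH
    }
    cap = CPUCapability::AVX2; // AVX512 slot is nullptr -> fall back to AVX2
  }
  if (cap == CPUCapability::AVX2 && avx2 != nullptr) {
    return avx2;
  }
  return def;                  // scalar/default build as the last resort
}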


Disable AVX512 dispatch for fmod & remainder
@jgong5 (Collaborator) left a comment:

LGTM now.


@sanchitintel sanchitintel marked this pull request as ready for review August 1, 2023 06:05
@sanchitintel (Collaborator, Author) commented Aug 1, 2023:

Hi @mingfeima, I'll add benchmarking data soon.
Can you please take a look at the new implementation in the meantime? Thanks!

@jgong5, I was wondering if we should rename the macro REGISTER_DISPATCH for x86_64 machines to REGISTER_NON_AVX512_DISPATCH to prevent any confusion, since I added the macro ALSO_REGISTER_AVX512_DISPATCH in this PR.

@lezcano lezcano removed their request for review August 1, 2023 08:09
@IvanYashchuk IvanYashchuk removed their request for review August 4, 2023 17:16
@colesbury colesbury added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Aug 14, 2023
@mingfeima (Collaborator) left a comment:

LGTM :)

@sanchitintel sanchitintel added the intel This tag is for PR from Intel label Aug 21, 2023
@sanchitintel (Collaborator, Author) commented Aug 21, 2023:

Hi @ezyang,
This PR modifies aten/src/ATen/native/DispatchStub.h and aten/src/ATen/native/DispatchStub.cpp.
Can you please help review it, since these files seem to require approval from a core-maintainer/core-reviewer for merging? Alternatively, can you please suggest reviewers? Thank you!

@sanchitintel (Collaborator, Author) commented:

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Aug 22, 2023
@pytorchmergebot commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

Labels

- ciflow/trunk: Trigger trunk jobs on your pull request
- intel: This tag is for PRs from Intel
- Merged
- module: cpu: CPU specific problem (e.g., perf, algorithm)
- open source
- release notes: intel: release notes category
- release notes: quantization: release notes category
- triaged: This issue has been looked at by a team member, and triaged and prioritized into an appropriate module