Add a document for leveraging Advanced Matrix Extensions #2439

Merged (18 commits) on Jun 13, 2023

122 changes: 122 additions & 0 deletions recipes_source/amx.rst
@@ -0,0 +1,122 @@
==============================================
Leverage Advanced Matrix Extensions
==============================================

Introduction
============

Advanced Matrix Extensions (AMX), also known as Intel® Advanced Matrix Extensions (Intel® AMX), is an x86 extension
that introduces two new components: a 2-dimensional register file called 'tiles' and an accelerator for Tile Matrix Multiplication (TMUL) that is able to operate on those tiles.
AMX is designed to work on matrices to accelerate deep learning training and inference on the CPU and is ideal for workloads like natural language processing, recommendation systems, and image recognition.

Intel advances AI capabilities with 4th Gen Intel® Xeon® Scalable processors and Intel® AMX, delivering 3x to 10x higher inference and training performance versus the previous generation; see `Accelerate AI Workloads with Intel® AMX`_.
Compared to 3rd Gen Intel Xeon Scalable processors running Intel® Advanced Vector Extensions 512 Neural Network Instructions (Intel® AVX-512 VNNI),
4th Gen Intel Xeon Scalable processors running Intel AMX can perform 2,048 INT8 operations per cycle, rather than 256 INT8 operations per cycle. They can also perform 1,024 BF16 operations per cycle, as compared to 64 FP32 operations per cycle; see page 4 of `Accelerate AI Workloads with Intel® AMX`_.
For more detailed information about AMX, see `Intel® AMX Overview`_.


AMX in PyTorch
==============

PyTorch leverages AMX for compute-intensive operators with BFloat16 and for quantization with INT8 through its backend oneDNN,
achieving higher performance out of the box on x86 CPUs with AMX support.
For more detailed information about oneDNN, see `oneDNN`_.

The operation is fully handled by oneDNN according to the execution code path generated. For example, when a supported operation is executed by the oneDNN implementation on a hardware platform with AMX support, AMX instructions are invoked automatically inside oneDNN.

Since oneDNN is the default acceleration library for PyTorch on CPU, no manual operations are required to enable AMX support.
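
For example, a quick way to confirm that oneDNN (exposed in PyTorch under the historical name ``mkldnn``) is available in your build:

::

    import torch

    # oneDNN is exposed in PyTorch as the "mkldnn" backend
    print(torch.backends.mkldnn.is_available())  # True when the build includes oneDNN
    print(torch.__config__.show())               # build details, including the oneDNN version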

Guidelines for leveraging AMX with workloads
---------------------------------------------

- BFloat16 data type:

  - Using ``torch.cpu.amp`` or ``torch.autocast("cpu")`` would utilize AMX acceleration for supported operators.

  ::

    model = model.to(memory_format=torch.channels_last)
    with torch.cpu.amp.autocast():
        output = model(input)

.. note:: Use the channels last memory format to get better performance.

- Quantization:

  - Applying quantization would utilize AMX acceleration for supported operators.
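
  For illustration, a minimal sketch of post-training dynamic quantization; ``MyModel`` and ``example_input`` are placeholders, and whether the AMX kernel is actually chosen also depends on the quantization backend in use (for example, the ``onednn`` backend selected through ``torch.backends.quantized.engine``) and on oneDNN's internal dispatching:

  ::

    import torch

    # Assumption: select the oneDNN quantization backend so that quantized linear
    # can be dispatched to oneDNN (and thus AMX) kernels on supported CPUs.
    torch.backends.quantized.engine = "onednn"

    model_fp32 = MyModel().eval()  # MyModel is a placeholder for your own module
    # Quantize Linear modules to INT8 with post-training dynamic quantization
    model_int8 = torch.ao.quantization.quantize_dynamic(
        model_fp32, {torch.nn.Linear}, dtype=torch.qint8
    )
    output = model_int8(example_input)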

- torch.compile:

  - When the generated graph model runs into oneDNN implementations with the supported operators, AMX acceleration will be activated.
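
  For illustration, a minimal sketch with ``torch.compile`` (``MyModel`` and ``example_input`` are placeholders; this assumes PyTorch 2.0 or later):

  ::

    import torch

    model = MyModel().eval().to(memory_format=torch.channels_last)
    # Compile the model; supported operators that are lowered to oneDNN can use AMX
    compiled_model = torch.compile(model)
    with torch.no_grad(), torch.cpu.amp.autocast():
        output = compiled_model(example_input)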

.. note:: When using PyTorch on CPUs that support AMX, the framework will automatically enable AMX usage by default.

This means that PyTorch will attempt to leverage the AMX feature whenever possible to speed up matrix multiplication operations.
However, it's important to note that the decision to dispatch to the AMX kernel ultimately depends on
the internal optimization strategy of the oneDNN library and the quantization backend, which PyTorch relies on for performance enhancements.
The specific details of how AMX utilization is handled internally by PyTorch and the oneDNN library may be subject to change with updates and improvements to the framework.


CPU operators that can leverage AMX
------------------------------------

BF16 CPU ops that can leverage AMX:

- ``conv1d``
- ``conv2d``
- ``conv3d``
- ``conv_transpose1d``
- ``conv_transpose2d``
- ``conv_transpose3d``
- ``bmm``
- ``mm``
- ``baddbmm``
- ``addmm``
- ``addbmm``
- ``linear``
- ``matmul``

Quantization CPU ops that can leverage AMX:

- ``conv1d``
- ``conv2d``
- ``conv3d``
- ``conv_transpose1d``
- ``conv_transpose2d``
- ``conv_transpose3d``
- ``linear``

Contributor: I guess we need a special note for quantized linear here: whether the AMX kernel is chosen also depends on the policy of the quantization backend. Currently, the x86 quant backend uses fbgemm, not onednn, while users can use the onednn backend to turn on AMX for the linear op. cc @Xia-Weiwen

In general, it is also true that whether to dispatch to AMX kernels is a backend/library choice. The backend/library would choose the most optimal kernels. It is worth noting in this tutorial.

Contributor Author: Added a note.

Contributor: Yes. However, I am not sure if it's OK to give such details in a tutorial. 🤔



Confirm AMX is being utilized
------------------------------

Set the environment variable ``ONEDNN_VERBOSE=1`` (for example, with ``export ONEDNN_VERBOSE=1``), or use ``torch.backends.mkldnn.verbose`` to flexibly enable oneDNN to dump verbose messages.

::

    with torch.backends.mkldnn.verbose(torch.backends.mkldnn.VERBOSE_ON):
        with torch.cpu.amp.autocast():
            model(input)

For example, the oneDNN verbose output looks similar to the following:

::

onednn_verbose,info,oneDNN v2.7.3 (commit 6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)
onednn_verbose,info,cpu,runtime:OpenMP,nthr:128
onednn_verbose,info,cpu,isa:Intel AVX-512 with float16, Intel DL Boost and bfloat16 support and Intel AMX with bfloat16 and 8-bit integer support
onednn_verbose,info,gpu,runtime:none
onednn_verbose,info,prim_template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,exec,cpu,reorder,simple:any,undef,src_f32::blocked:a:f0 dst_f32::blocked:a:f0,attr-scratchpad:user ,,2,5.2561
...
onednn_verbose,exec,cpu,convolution,jit:avx512_core_amx_bf16,forward_training,src_bf16::blocked:acdb:f0 wei_bf16:p:blocked:ABcd16b16a2b:f0 bia_f32::blocked:a:f0 dst_bf16::blocked:acdb:f0,attr-scratchpad:user ,alg:convolution_direct,mb7_ic2oc1_ih224oh111kh3sh2dh1ph1_iw224ow111kw3sw2dw1pw1,0.628906
...
onednn_verbose,exec,cpu,matmul,brg:avx512_core_amx_int8,undef,src_s8::blocked:ab:f0 wei_s8:p:blocked:BA16a64b4a:f0 dst_s8::blocked:ab:f0,attr-scratchpad:user ,,1x30522:30522x768:1x768,7.66382
...

If you see ``avx512_core_amx_bf16`` in the verbose output for BFloat16, or ``avx512_core_amx_int8`` for quantization with INT8, it indicates that AMX is activated.
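
For a self-contained check, a minimal sketch that runs a BFloat16 convolution under verbose mode (the ``avx512_core_amx_bf16`` primitive lines will only appear on CPUs with AMX support):

::

    import torch

    conv = torch.nn.Conv2d(3, 16, kernel_size=3).eval()
    x = torch.randn(1, 3, 64, 64).to(memory_format=torch.channels_last)
    # Dump oneDNN verbose output while running the convolution in BFloat16
    with torch.backends.mkldnn.verbose(torch.backends.mkldnn.VERBOSE_ON):
        with torch.no_grad(), torch.cpu.amp.autocast():
            conv(x)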

.. _Accelerate AI Workloads with Intel® AMX: https://www.intel.com/content/www/us/en/products/docs/accelerator-engines/advanced-matrix-extensions/ai-solution-brief.html

.. _Intel® AMX Overview: https://www.intel.com/content/www/us/en/products/docs/accelerator-engines/advanced-matrix-extensions/overview.html

.. _oneDNN: https://oneapi-src.github.io/oneDNN/index.html
9 changes: 9 additions & 0 deletions recipes_source/recipes_index.rst
@@ -253,6 +253,15 @@ Recipes are bite-sized, actionable examples of how to use specific PyTorch featu
:link: ../recipes/recipes/tuning_guide.html
:tags: Model-Optimization

.. Leverage Advanced Matrix Extensions

.. customcarditem::
:header: Leverage Advanced Matrix Extensions
:card_description: Learn to leverage Advanced Matrix Extensions.
:image: ../_static/img/thumbnails/cropped/generic-pytorch-logo.png
:link: ../recipes/amx.html
:tags: Model-Optimization

.. Intel(R) Extension for PyTorch*

.. customcarditem::