Add a document for leveraging Advanced Matrix Extensions #2439

CaoE · 2023-06-07T06:51:40Z

Description

Add a document on how to leverage AMX with PyTorch on 4th Gen Intel Xeon processors.

Checklist

The issue that is being fixed is referred to in the description (see above "Fixes #ISSUE_NUMBER")
Only one issue is addressed in this pull request
Labels from the issue that this PR is fixing are added to this pull request
No unnecessary issues are included in this pull request.

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @ZailiWang @ZhaoqiongZ @leslie-fang-intel @Xia-Weiwen @sekahler2 @zhuhaozhe @Valentine233

pytorch-bot · 2023-06-07T06:51:43Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/tutorials/2439

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 52c96e2:

NEW FAILURE - The following job has failed:

pytorch_tutorial_build_worker (6, 6, linux.4xlarge.nvidia.gpu) (gh)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

netlify · 2023-06-07T06:55:56Z

✅ Deploy Preview for pytorch-tutorials-preview ready!

Name	Link
🔨 Latest commit	`52c96e2`
🔍 Latest deploy log	https://app.netlify.com/sites/pytorch-tutorials-preview/deploys/6488c820e404990008cb834f
😎 Deploy Preview	https://deploy-preview-2439--pytorch-tutorials-preview.netlify.app

jgong5 · 2023-06-07T08:11:37Z

recipes_source/amx.rst

+For more detailed information of oneDNN, see `oneDNN`_.
+
+The operation is fully handled by oneDNN according to the execution code path generated. I.e. when a supported operation gets executed into oneDNN implementation on a hardware platform with AMX support, AMX instructions will be invoked automatically inside oneDNN.
+No manual operations are required to enable this feature. 


Suggested change

No manual operations are required to enable this feature.

Since oneDNN is the default acceleration library for CPU, no manual operations are required to enable the AMX support.

Applied the change.

jgong5 · 2023-06-07T08:14:24Z

recipes_source/amx.rst

+``conv_transpose1d``,
+``conv_transpose2d``,
+``conv_transpose3d``,
+``linear``


I think we need a special note for quantized linear here: whether an AMX kernel is chosen also depends on the policy of the quantization backend. Currently, the x86 quantization backend uses fbgemm rather than onednn, although users can select the onednn backend to turn on AMX for the linear op. cc @Xia-Weiwen

In general, whether to dispatch to AMX kernels is a backend/library choice. The backend/library chooses the optimal kernels. This is worth noting in the tutorial.

Yes. However, I am not sure if it's OK to give such details in a tutorial. 🤔
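
A minimal sketch of the backend choice discussed in this thread, assuming a PyTorch build that exposes `torch.backends.quantized.supported_engines` and accepts `onednn` as an engine name. The helper names `choose_engine` and `use_onednn_for_quantized_ops` are hypothetical and not part of this PR. Whether AMX kernels are actually dispatched still depends on the hardware and on the library's own kernel selection.

```python
def choose_engine(supported, preferred="onednn"):
    """Return `preferred` if it is in the supported engine list, else None."""
    return preferred if preferred in supported else None


def use_onednn_for_quantized_ops():
    """Hypothetical helper: switch the quantized engine to onednn when possible.

    torch is imported lazily so that the pure selection logic above can be
    exercised on machines without PyTorch installed.
    """
    import torch

    engine = choose_engine(torch.backends.quantized.supported_engines)
    if engine is not None:
        # Assumption: this build lists "onednn" as a supported engine.
        torch.backends.quantized.engine = engine
    # Report whichever engine is active after the attempt.
    return torch.backends.quantized.engine
```

Calling `use_onednn_for_quantized_ops()` before preparing and converting a model would leave the default engine untouched when `onednn` is not available in the build.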

msaroufim · 2023-06-07T16:18:33Z

recipes_source/amx.rst

+
+Advanced Matrix Extensions (AMX), also known as Intel® Advanced Matrix Extensions (Intel® AMX), is an extension to the x86 instruction set architecture (ISA).
+Intel advances AI capabilities with 4th Gen Intel® Xeon® Scalable processors and Intel® AMX, delivering 3x to 10x higher inference and training performance versus the previous generation, see `Accelerate AI Workloads with Intel® AMX`_.
+AMX supports two data types, INT8 and BFloat16, compared to AVX512 FP32, it can achieve up to 32x and 16x acceleration, respectively, see figure 6 of `Accelerate AI Workloads with Intel® AMX`_.


Are the speedups only on some particular newer hardware? Is the hardware consumer or enterprise centric?

AMX is only available starting with 4th Gen Xeon (codename Sapphire Rapids); it is enterprise-centric.
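
As a rough way to check whether a given Linux machine exposes AMX, the sketch below looks for the kernel's CPU feature flags `amx_tile`, `amx_bf16`, and `amx_int8` in `/proc/cpuinfo`. This is an illustration only: the helper names `amx_flags_in` and `is_amx_available` are hypothetical, are not PyTorch APIs, and are not part of this PR.

```python
from pathlib import Path

# Feature flag names the Linux kernel reports for AMX-capable CPUs.
AMX_FLAGS = ("amx_tile", "amx_bf16", "amx_int8")


def amx_flags_in(cpuinfo_text):
    """Report which AMX flags appear on the first 'flags' line of cpuinfo text."""
    present = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            present = set(line.split(":", 1)[1].split())
            break
    return {flag: flag in present for flag in AMX_FLAGS}


def is_amx_available(cpuinfo_path="/proc/cpuinfo"):
    """Hypothetical helper: True only if all three AMX flags are listed."""
    path = Path(cpuinfo_path)
    if not path.exists():
        # Non-Linux systems have no /proc/cpuinfo; report "not detected".
        return False
    return all(amx_flags_in(path.read_text()).values())
```

A `False` result only means the flags were not found this way; it does not by itself say anything about what oneDNN will dispatch to.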

msaroufim · 2023-06-07T16:26:19Z

recipes_source/amx.rst

+Confirm AMX is being utilized
+------------------------------
+
+Set environment variable ``export ONEDNN_VERBOSE=1`` to get oneDNN verbose at runtime.


It would be nice to have a Python function like is_x_available().

Added the Python function torch.backends.mkldnn.verbose to the document.
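
A small sketch of how the verbose output could be inspected, under two assumptions that are not stated in this thread: that oneDNN verbose lines begin with `onednn_verbose`, and that AMX kernels carry `amx` in their implementation name. The helpers `amx_verbose_lines` and `run_linear_with_verbose` are hypothetical names used only for illustration.

```python
def amx_verbose_lines(verbose_output):
    """Return the oneDNN verbose lines whose text mentions an AMX kernel."""
    return [
        line
        for line in verbose_output.splitlines()
        if line.startswith("onednn_verbose") and "amx" in line
    ]


def run_linear_with_verbose():
    """Run one BF16 linear layer with oneDNN verbose enabled.

    Not called here; it needs PyTorch built with oneDNN. torch is imported
    lazily so the filter above stays usable without it.
    """
    import torch

    model = torch.nn.Linear(64, 64)
    x = torch.randn(8, 64)
    # Enable oneDNN verbose messages only for the duration of this block.
    with torch.backends.mkldnn.verbose(torch.backends.mkldnn.VERBOSE_ON):
        with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
            model(x)
```

The verbose lines are printed by oneDNN itself rather than returned to Python, so one option is to run the script with its standard output redirected to a file and pass that file's contents to `amx_verbose_lines`.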

msaroufim · 2023-06-07T16:27:43Z

recipes_source/amx.rst

+Note: For quantized linear, whether to leverage AMX depends on which quantization backend to choose.
+At present, x86 quantization backend is used by default for quantized linear, using fbgemm, while users can specify onednn backend to turn on AMX for quantized linear.
+
+Guidelines of leveraging AMX with workloads


I would start with this section on how to use it and have the supported ops show up at the bottom.

Thanks for your suggestion. Do you mean moving this section above the supported ops? Like this:
AMX in PyTorch
Guidelines of leveraging AMX with workloads
List supported ops
...

msaroufim · 2023-06-07T16:29:11Z

recipes_source/amx.rst

+Introduction
+============
+
+Advanced Matrix Extensions (AMX), also known as Intel® Advanced Matrix Extensions (Intel® AMX), is an extension to the x86 instruction set architecture (ISA).


I realize AMX is lower level than other Intel technologies but it's still worth rationalizing to an end user in a few lines why it's interesting for them to know about AMX vs Intel compiler technologies

Added more introduction to AMX and the benefits it can bring.

mingfeima · 2023-06-08T01:31:29Z

recipes_source/amx.rst

+
+Advanced Matrix Extensions (AMX), also known as Intel® Advanced Matrix Extensions (Intel® AMX), is an extension to the x86 instruction set architecture (ISA).
+Intel advances AI capabilities with 4th Gen Intel® Xeon® Scalable processors and Intel® AMX, delivering 3x to 10x higher inference and training performance versus the previous generation, see `Accelerate AI Workloads with Intel® AMX`_.
+AMX supports two data types, INT8 and BFloat16, compared to AVX512 FP32, it can achieve up to 32x and 16x acceleration, respectively, see figure 6 of `Accelerate AI Workloads with Intel® AMX`_.


we can directly copy the wording from

Compared to 3rd Gen Intel Xeon Scalable processors running Intel® Advanced Vector Extensions 512 Neural Network Instructions (Intel® AVX-512 VNNI), 4th Gen Intel Xeon Scalable processors running Intel AMX can perform 2,048 INT8 operations per cycle, rather than 256 INT8 operations per cycle. They can also perform 1,024 BF16 operations per cycle, as compared to 64 FP32 operations per cycle.

which is a quote from https://www.intel.com/content/www/us/en/products/docs/accelerator-engines/advanced-matrix-extensions/ai-solution-brief.html

Quoted this.
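
For reference, the per-cycle figures in the quoted passage can be turned into ratios with a few lines of arithmetic. The dictionary keys below are labels chosen for this sketch, not official names. The resulting 32x and 16x ratios against the FP32 figure appear consistent with the "up to 32x and 16x" wording in the draft under review.

```python
# Operations per cycle, taken from the quoted solution brief text.
OPS_PER_CYCLE = {
    "amx_int8": 2048,
    "amx_bf16": 1024,
    "vnni_int8": 256,
    "fp32": 64,
}


def ratio(new, old):
    """Ratio of per-cycle throughput between two entries of OPS_PER_CYCLE."""
    return OPS_PER_CYCLE[new] / OPS_PER_CYCLE[old]


print(ratio("amx_int8", "vnni_int8"))  # 8.0
print(ratio("amx_int8", "fp32"))       # 32.0
print(ratio("amx_bf16", "fp32"))       # 16.0
```

These are peak per-cycle ratios only; measured speedups depend on the workload.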

mingfeima · 2023-06-08T01:32:58Z

recipes_source/amx.rst

+``conv1d``,
+``conv2d``,
+``conv3d``,
+``conv1d``,


Why do we have two sets of conv1d, conv2d, conv3d here?

Fixed typos.

mingfeima · 2023-06-08T01:33:43Z

recipes_source/amx.rst

+``addbmm``,
+``linear``,
+``matmul``,
+``_convolution``


_convolution is not intended to be used directly; its name starts with a _.

Removed _convolution.

jgong5

Please add a "summary" or "conclusion" section to summarize the document.

CaoE · 2023-06-08T09:27:15Z

Please add a "summary" or "conclusion" section to summarize the document.

Added conclusion section.

CaoE · 2023-06-09T01:56:08Z

@msaroufim Could you please review this doc? Thanks.

CaoE · 2023-06-09T09:01:28Z

@ngimel Could you please review this doc? Thank you.

CaoE · 2023-06-10T02:20:32Z

@msaroufim Could you please review this doc? Thank you.

CaoE · 2023-06-10T02:29:41Z

@kit1980 Could you please review this doc? Thank you.

svekars

A couple editorial fixes for proper HTML rendering.

svekars · 2023-06-12T19:26:10Z

recipes_source/amx.rst

+to get higher performance out-of-box on x86 CPUs with AMX support.
+For more detailed information of oneDNN, see `oneDNN`_.
+
+The operation is fully handled by oneDNN according to the execution code path generated. I.e. when a supported operation gets executed into oneDNN implementation on a hardware platform with AMX support, AMX instructions will be invoked automatically inside oneDNN.


Suggested change

The operation is fully handled by oneDNN according to the execution code path generated. I.e. when a supported operation gets executed into oneDNN implementation on a hardware platform with AMX support, AMX instructions will be invoked automatically inside oneDNN.

The operation is fully handled by oneDNN according to the execution code path generated. For example, when a supported operation gets executed into oneDNN implementation on a hardware platform with AMX support, AMX instructions will be invoked automatically inside oneDNN.

Thanks for your comments.
Fixed.

svekars · 2023-06-12T19:28:37Z

recipes_source/amx.rst

+   with torch.cpu.amp.autocast():
+       output = model(input)
+
+Note: Use channels last format to get better performance. 


Suggested change

Note: Use channels last format to get better performance.

.. note:: Use channels' last format to get better performance.

svekars · 2023-06-12T19:29:12Z

recipes_source/amx.rst

+
+When the generated graph model runs into oneDNN implementations with the supported operators, AMX accelerations will be activated.
+
+Note: When using PyTorch on CPUs that support AMX, the framework will automatically enable AMX usage by default.


Suggested change

Note: When using PyTorch on CPUs that support AMX, the framework will automatically enable AMX usage by default.

.. note:: When using PyTorch on CPUs that support AMX, the framework will automatically enable AMX usage by default.

svekars · 2023-06-12T19:31:44Z

recipes_source/amx.rst

+- BF16 CPU ops that can leverage AMX:
+
+``conv1d``,
+``conv2d``,
+``conv3d``,
+``conv_transpose1d``,
+``conv_transpose2d``,
+``conv_transpose3d``,
+``bmm``,
+``mm``,
+``baddbmm``,
+``addmm``,
+``addbmm``,
+``linear``,
+``matmul``,


Suggested change

- BF16 CPU ops that can leverage AMX:

``conv1d``,
``conv2d``,
``conv3d``,
``conv_transpose1d``,
``conv_transpose2d``,
``conv_transpose3d``,
``bmm``,
``mm``,
``baddbmm``,
``addmm``,
``addbmm``,
``linear``,
``matmul``,

BF16 CPU ops that can leverage AMX:

- ``conv1d``
- ``conv2d``
- ``conv3d``
- ``conv_transpose1d``
- ``conv_transpose2d``
- ``conv_transpose3d``
- ``bmm``
- ``mm``
- ``baddbmm``
- ``addmm``
- ``addbmm``
- ``linear``
- ``matmul``

svekars · 2023-06-12T19:31:54Z

recipes_source/amx.rst

+``linear``,
+``matmul``,
+
+- Quantization CPU ops that can leverage AMX:


Suggested change

- Quantization CPU ops that can leverage AMX:

Quantization CPU ops that can leverage AMX:

svekars · 2023-06-12T19:33:53Z

recipes_source/amx.rst

+Confirm AMX is being utilized
+------------------------------
+
+Set environment variable ``export ONEDNN_VERBOSE=1``, or use ``torch.backends.mkldnn.verbose`` to flexibly enable oneDNN to dump verbose messages.


Suggested change

Set environment variable ``export ONEDNN_VERBOSE=1``, or use ``torch.backends.mkldnn.verbose`` to flexibly enable oneDNN to dump verbose messages.

Set environment variable to ``export ONEDNN_VERBOSE=1`` or use ``torch.backends.mkldnn.verbose`` to flexibly enable oneDNN to dump verbose messages.

I think to should not be added here because the specific environment variable we want to use here is ONEDNN_VERBOSE, whose value we set to 1.

Thanks for your comments. I will keep the original version for this sentence.

svekars · 2023-06-12T19:38:25Z

recipes_source/amx.rst

+- BFloat16 data type: 
+
+Using ``torch.cpu.amp`` or ``torch.autocast("cpu")`` would utilize AMX acceleration for supported operators.
+
+::
+
+   model = model.to(memory_format=torch.channels_last)
+   with torch.cpu.amp.autocast():
+       output = model(input)


Suggested change

- BFloat16 data type:

Using ``torch.cpu.amp`` or ``torch.autocast("cpu")`` would utilize AMX acceleration for supported operators.

::

   model = model.to(memory_format=torch.channels_last)
   with torch.cpu.amp.autocast():
       output = model(input)

- BFloat16 data type:

- Using ``torch.cpu.amp`` or ``torch.autocast("cpu")`` would utilize AMX acceleration for supported operators.

::

   model = model.to(memory_format=torch.channels_last)
   with torch.cpu.amp.autocast():
       output = model(input)

svekars · 2023-06-12T19:39:16Z

recipes_source/amx.rst

+
+- Quantization:
+
+Applying quantization would utilize AMX acceleration for supported operators.


Suggested change

Applying quantization would utilize AMX acceleration for supported operators.

- Applying quantization would utilize AMX acceleration for supported operators.

svekars · 2023-06-12T19:39:38Z

recipes_source/amx.rst

+
+- torch.compile:
+
+When the generated graph model runs into oneDNN implementations with the supported operators, AMX accelerations will be activated.


Suggested change

When the generated graph model runs into oneDNN implementations with the supported operators, AMX accelerations will be activated.

- When the generated graph model runs into oneDNN implementations with the supported operators, AMX accelerations will be activated.

CaoE added 3 commits June 6, 2023 06:01

[Doc] Add AMX document for oneDNN backend

bbbb580

Update AMX document

7945bdb

Update AMX document

a2b148c

facebook-github-bot added the cla signed label Jun 7, 2023

github-actions bot added docathon-h1-2023 A label for the docathon in H1 2023 advanced intel and removed cla signed labels Jun 7, 2023

CaoE changed the title ~~Add amx doc~~ Add a document for leveraging Advanced Matrix Extensions Jun 7, 2023

facebook-github-bot added the cla signed label Jun 7, 2023

jgong5 reviewed Jun 7, 2023

CaoE added 2 commits June 7, 2023 02:46

Update AMX document

1ae03c7

Merge branch 'main' into add_amx_doc

22cd08a

github-actions bot removed the cla signed label Jun 7, 2023

facebook-github-bot added the cla signed label Jun 7, 2023

msaroufim reviewed Jun 7, 2023

msaroufim requested changes Jun 7, 2023

Update AMX document

3776101

github-actions bot removed the cla signed label Jun 8, 2023

mingfeima suggested changes Jun 8, 2023

facebook-github-bot added the cla signed label Jun 8, 2023

CaoE added 2 commits June 7, 2023 19:56

Update AMX document

d3d0aae

Merge branch 'main' into add_amx_doc

7baf4b5

github-actions bot removed the cla signed label Jun 8, 2023

facebook-github-bot added the cla signed label Jun 8, 2023

CaoE requested review from jgong5 and Xia-Weiwen June 8, 2023 03:09

jgong5 approved these changes Jun 8, 2023

add conclusion

a608d00

github-actions bot removed the cla signed label Jun 8, 2023

facebook-github-bot added the cla signed label Jun 8, 2023

CaoE marked this pull request as ready for review June 8, 2023 15:11

Merge branch 'main' into add_amx_doc

215c6e3

github-actions bot removed the cla signed label Jun 8, 2023

facebook-github-bot added the cla signed label Jun 8, 2023

Merge branch 'main' into add_amx_doc

3d01c9a

github-actions bot removed the cla signed label Jun 9, 2023

facebook-github-bot added the cla signed label Jun 9, 2023

Merge branch 'main' into add_amx_doc

88d32df

github-actions bot removed the cla signed label Jun 10, 2023

facebook-github-bot added the cla signed label Jun 10, 2023

svekars reviewed Jun 12, 2023

CaoE added 3 commits June 12, 2023 18:38

Editorial fixes for proper HTML rendering

2bbcabf

Merge branch 'main' into add_amx_doc

96c384c

Update AMX document

6d02a18

mingfeima approved these changes Jun 13, 2023

Merge branch 'main' into add_amx_doc

622542a

msaroufim approved these changes Jun 13, 2023

Merge branch 'main' into add_amx_doc

52c96e2

svekars approved these changes Jun 13, 2023

svekars merged commit f87d5aa into pytorch:main Jun 13, 2023

ZailiWang mentioned this pull request Jun 15, 2023

To add brief intro for CPU backend optimization pytorch/pytorch#103666

Closed

