Peakflops for zen3/4/5 style architectures by issuing 2xFMA and 2xADD simultaneously. #659

will-saunders-ukaea · 2024-12-19T11:12:57Z

Introduction

Is this of interest to you? These two benchmarks are intended to reach the peak flop rate on architectures like zen3/4/5 by issuing FMA and ADD instructions simultaneously (2xFMA and 2xADD). The example outputs below are from a zen4 7970X.

Existing `peakflops_avx_fma`

$ likwid-bench -t peakflops_avx_fma -w D0:2MB
Warning: Sanitizing vector length to a multiple of the loop stride 4 and thread count 64 from 250000 elements (2000000 bytes) to 249856 elements (1998848 bytes)
Allocate: Process running on hwthread 0 (Domain D0) - Vector length 249856/1998848 Offset 0 Alignment 512
Initialization: First thread in domain initializes the whole stream
--------------------------------------------------------------------------------
LIKWID MICRO BENCHMARK
Test: peakflops_avx_fma
--------------------------------------------------------------------------------
Using 1 work groups
Using 64 threads
--------------------------------------------------------------------------------
Running without Marker API. Activate Marker API with -m on commandline.
--------------------------------------------------------------------------------
<pinning info omitted for brevity>
--------------------------------------------------------------------------------
Cycles:			12895531040
CPU Clock:		3993978156
Cycle Clock:		3993978156
Time:			3.228744e+00 sec
Iterations:		67108864
Iterations per thread:	1048576
Inner loop executions:	976
Size (Byte):		1998848
Size per thread:	31232
Number of Flops:	7859790151680
MFlops/s:		2434318.53
Data volume (Byte):	2095944040448
MByte/s:		649151.61
Cycles per update:	0.049221
Cycles per cacheline:	0.393767
Loads per update:	1
Stores per update:	0
Load bytes per element:	8
Store bytes per elem.:	0
Instructions:		1244466774048
UOPs:			1178968522752

Proposed `peakflops_avx_fma_add`

$ likwid-bench -t peakflops_avx_fma_add -w D0:2MB
CMD /usr/bin/gcc -shared -fPIC /tmp/10315/peakflops_avx_fma_add.S -o /tmp/10315/peakflops_avx_fma_add.o
Warning: Sanitizing vector length to a multiple of the loop stride 4 and thread count 64 from 250000 elements (2000000 bytes) to 249856 elements (1998848 bytes)
Allocate: Process running on hwthread 0 (Domain D0) - Vector length 249856/1998848 Offset 0 Alignment 512
Initialization: First thread in domain initializes the whole stream
--------------------------------------------------------------------------------
LIKWID MICRO BENCHMARK
Test: peakflops_avx_fma_add
--------------------------------------------------------------------------------
Using 1 work groups
Using 64 threads
--------------------------------------------------------------------------------
Running without Marker API. Activate Marker API with -m on commandline.
--------------------------------------------------------------------------------
<pinning info omitted for brevity>
--------------------------------------------------------------------------------
Cycles:			13278222800
CPU Clock:		3989137177
Cycle Clock:		3989137177
Time:			3.328595e+00 sec
Iterations:		67108864
Iterations per thread:	1048576
Inner loop executions:	976
Size (Byte):		1998848
Size per thread:	31232
Number of Flops:	11789685227520
MFlops/s:		3541940.24
Data volume (Byte):	2095944040448
MByte/s:		629678.26
Cycles per update:	0.050682
Cycles per cacheline:	0.405453
Loads per update:	1
Stores per update:	0
Load bytes per element:	8
Store bytes per elem.:	0
Instructions:		2226940543008
UOPs:			2161442291712
--------------------------------------------------------------------------------

Proposed `peakflops_avx512_fma_add`

(Expected to perform roughly the same as the AVX2 version on zen4.)

$ likwid-bench -t peakflops_avx512_fma_add -w D0:2MB
CMD /usr/bin/gcc -shared -fPIC /tmp/10798/peakflops_avx512_fma_add.S -o /tmp/10798/peakflops_avx512_fma_add.o
Warning: Sanitizing vector length to a multiple of the loop stride 8 and thread count 64 from 250000 elements (2000000 bytes) to 249856 elements (1998848 bytes)
Allocate: Process running on hwthread 0 (Domain D0) - Vector length 249856/1998848 Offset 0 Alignment 512
Initialization: First thread in domain initializes the whole stream
--------------------------------------------------------------------------------
LIKWID MICRO BENCHMARK
Test: peakflops_avx512_fma_add
--------------------------------------------------------------------------------
Using 1 work groups
Using 64 threads
--------------------------------------------------------------------------------
Running without Marker API. Activate Marker API with -m on commandline.
--------------------------------------------------------------------------------
<pinning info omitted for brevity>
--------------------------------------------------------------------------------
Cycles:			13100020440
CPU Clock:		3993980471
Cycle Clock:		3993980471
Time:			3.279941e+00 sec
Iterations:		67108864
Iterations per thread:	1048576
Inner loop executions:	488
Size (Byte):		1998848
Size per thread:	31232
Number of Flops:	11789685227520
MFlops/s:		3594480.85
Data volume (Byte):	2095944040448
MByte/s:		639018.82
Cycles per update:	0.050001
Cycles per cacheline:	0.400011
Loads per update:	1
Stores per update:	0
Load bytes per element:	8
Store bytes per elem.:	0
Instructions:		1113470271520
UOPs:			1080721145856
--------------------------------------------------------------------------------
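To compare the three runs above, here is a minimal Python sketch that recomputes the throughput from the reported `Number of Flops`, `Time` and `Cycles` values. The figures are copied from the outputs; the dictionary layout and helper names are made up for illustration. The per-thread value is relative to the reported `Cycles` counter, which appears to be `Time` multiplied by `CPU Clock` rather than the actual core clock, so treat it as an assumption-laden estimate. The ideal gain of 1.5x assumes an FMA counts as two flops and an ADD as one, so 2xFMA plus 2xADD gives (2*2 + 2*1) / (2*2) = 1.5 times the flops of 2xFMA alone.

```python
# Figures copied from the three likwid-bench runs above (zen4 7970X, 64 threads).
RUNS = {
    "peakflops_avx_fma": {
        "flops": 7859790151680, "time_s": 3.228744, "cycles": 12895531040},
    "peakflops_avx_fma_add": {
        "flops": 11789685227520, "time_s": 3.328595, "cycles": 13278222800},
    "peakflops_avx512_fma_add": {
        "flops": 11789685227520, "time_s": 3.279941, "cycles": 13100020440},
}
THREADS = 64  # "Using 64 threads" in each run


def mflops_per_s(run):
    """Throughput in MFlops/s, recomputed from flop count and wall time."""
    return run["flops"] / run["time_s"] / 1e6


def flops_per_cycle_per_thread(run):
    """Flops per reported cycle per thread (reported clock, not boost clock)."""
    return run["flops"] / run["cycles"] / THREADS


baseline = mflops_per_s(RUNS["peakflops_avx_fma"])
summary = {
    name: {
        "mflops_per_s": mflops_per_s(run),
        "speedup": mflops_per_s(run) / baseline,
        "flops_per_cycle_per_thread": flops_per_cycle_per_thread(run),
    }
    for name, run in RUNS.items()
}

for name, row in summary.items():
    print(f"{name:26s} {row['mflops_per_s']:12.2f} MFlops/s  "
          f"speedup {row['speedup']:.3f}  "
          f"{row['flops_per_cycle_per_thread']:.2f} flops/cycle/thread")
```

With these numbers the measured gain over `peakflops_avx_fma` is about 1.46x for the AVX2 variant and about 1.48x for the AVX-512 variant, close to the ideal 1.5x.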

TomTheBear · 2024-12-19T12:42:13Z

Thanks for the kernels. I have similar kernels lying around. The biggest problem I see with these kernels is that their names do not make clear that they target AMD Zen platforms. They run on Intel but do not achieve peak flops there, likely due to port congestion. likwid-bench lacks a filtering mechanism like "only for vendor X arch Y", "only for systems with AVX512 support", etc.

To overcome this issue, please rename them to contain amd or even amd_zen.
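For illustration only, the kind of filter described above could be sketched as follows. This is not part of likwid-bench: the function names `parse_cpuinfo` and `kernel_applicable` are hypothetical, and the sketch only assumes the Linux `/proc/cpuinfo` text format with its `vendor_id` and `flags` fields.

```python
def parse_cpuinfo(text):
    """Return (vendor_id, set of flags) from the first processor entry."""
    vendor, flags = None, set()
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or a line without "key: value" layout
        key, value = key.strip(), value.strip()
        if key == "vendor_id" and vendor is None:
            vendor = value
        elif key == "flags" and not flags:
            flags = set(value.split())
    return vendor, flags


def kernel_applicable(cpuinfo_text, vendor=None, required_flags=()):
    """Hypothetical filter: match the vendor (if given) and all required flags."""
    found_vendor, flags = parse_cpuinfo(cpuinfo_text)
    if vendor is not None and found_vendor != vendor:
        return False
    return set(required_flags) <= flags


# Shortened sample inputs; real /proc/cpuinfo entries carry many more fields.
SAMPLE_AMD = (
    "processor\t: 0\n"
    "vendor_id\t: AuthenticAMD\n"
    "flags\t\t: fpu sse2 avx avx2 fma avx512f\n"
)
SAMPLE_INTEL = (
    "processor\t: 0\n"
    "vendor_id\t: GenuineIntel\n"
    "flags\t\t: fpu sse2 avx avx2 fma\n"
)

print(kernel_applicable(SAMPLE_AMD, "AuthenticAMD", ["avx512f"]))
print(kernel_applicable(SAMPLE_INTEL, "AuthenticAMD", ["avx2", "fma"]))
```

On a real system the text would come from reading `/proc/cpuinfo`; a vendor check alone cannot distinguish Zen generations, which would need the CPU family and model as well.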

will-saunders-ukaea · 2024-12-19T12:48:50Z

> To overcome this issue, please rename them to contain amd or even amd_zen.

Done - Does that naming fit with your naming convention?

will-saunders-ukaea added 3 commits December 13, 2024 16:32

Added peakflops_avx_fma_add.ptt for Zen3/4 like architectures

e8f2db8

Added additional metric information to x86-64/peakflops_avx_fma_add.

10064f2

Added peakflops_avx512_fma_add.ptt for Zen4/5 like architectures.

e1983d4

renamed peakflops_avx(512)_fma_add.ptt to peakflops_amd_zen_avx(512)_fma_add.ptt

15257e7

TomTheBear merged commit 7fd5696 into RRZE-HPC:master Dec 19, 2024
2 checks passed
