Peakflops for zen3/4/5 style architectures by issuing 2xFMA and 2xADD simultaneously. #659

Merged
merged 4 commits into from
Dec 19, 2024

Conversation

will-saunders-ukaea
Contributor

Introduction

Is this of interest to you? These two benchmarks are intended to reach the peak FLOP rate on architectures such as zen3/4/5 by issuing FMA and ADD instructions at the same time. The example outputs below are from a zen4 7970X.
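As a rough sanity check on why mixing FMA and ADD should help, the sketch below counts double-precision flops per cycle per core. The pipe counts come from the PR title (2xFMA and 2xADD); the 256-bit pipe width and the helper name `flops_per_cycle` are assumptions made for illustration and are not taken from likwid.

```python
def flops_per_cycle(fma_pipes, add_pipes, vector_bits, element_bits=64):
    """Upper bound on flops per cycle for one core.

    Assumes every pipe retires one full-width vector instruction per cycle.
    An FMA counts as 2 flops per lane, an ADD as 1 flop per lane.
    """
    lanes = vector_bits // element_bits
    return lanes * (2 * fma_pipes + add_pipes)


# Assumed zen4-style core: 256-bit pipes, double precision.
fma_only = flops_per_cycle(fma_pipes=2, add_pipes=0, vector_bits=256)
fma_add = flops_per_cycle(fma_pipes=2, add_pipes=2, vector_bits=256)

print(fma_only, fma_add, fma_add / fma_only)  # 16 24 1.5
```

Under these assumptions the combined kernel has a ceiling 1.5 times that of an FMA-only kernel, which is the ratio to hold the measured MFlops/s against.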

Existing peakflops_avx_fma

$ likwid-bench -t peakflops_avx_fma -w D0:2MB
Warning: Sanitizing vector length to a multiple of the loop stride 4 and thread count 64 from 250000 elements (2000000 bytes) to 249856 elements (1998848 bytes)
Allocate: Process running on hwthread 0 (Domain D0) - Vector length 249856/1998848 Offset 0 Alignment 512
Initialization: First thread in domain initializes the whole stream
--------------------------------------------------------------------------------
LIKWID MICRO BENCHMARK
Test: peakflops_avx_fma
--------------------------------------------------------------------------------
Using 1 work groups
Using 64 threads
--------------------------------------------------------------------------------
Running without Marker API. Activate Marker API with -m on commandline.
--------------------------------------------------------------------------------
<pinning info omitted for brevity>
--------------------------------------------------------------------------------
Cycles:			12895531040
CPU Clock:		3993978156
Cycle Clock:		3993978156
Time:			3.228744e+00 sec
Iterations:		67108864
Iterations per thread:	1048576
Inner loop executions:	976
Size (Byte):		1998848
Size per thread:	31232
Number of Flops:	7859790151680
MFlops/s:		2434318.53
Data volume (Byte):	2095944040448
MByte/s:		649151.61
Cycles per update:	0.049221
Cycles per cacheline:	0.393767
Loads per update:	1
Stores per update:	0
Load bytes per element:	8
Store bytes per elem.:	0
Instructions:		1244466774048
UOPs:			1178968522752

Proposed peakflops_avx_fma_add

$ likwid-bench -t peakflops_avx_fma_add -w D0:2MB
CMD /usr/bin/gcc -shared -fPIC /tmp/10315/peakflops_avx_fma_add.S -o /tmp/10315/peakflops_avx_fma_add.o
Warning: Sanitizing vector length to a multiple of the loop stride 4 and thread count 64 from 250000 elements (2000000 bytes) to 249856 elements (1998848 bytes)
Allocate: Process running on hwthread 0 (Domain D0) - Vector length 249856/1998848 Offset 0 Alignment 512
Initialization: First thread in domain initializes the whole stream
--------------------------------------------------------------------------------
LIKWID MICRO BENCHMARK
Test: peakflops_avx_fma_add
--------------------------------------------------------------------------------
Using 1 work groups
Using 64 threads
--------------------------------------------------------------------------------
Running without Marker API. Activate Marker API with -m on commandline.
--------------------------------------------------------------------------------
<pinning info omitted for brevity>
--------------------------------------------------------------------------------
Cycles:			13278222800
CPU Clock:		3989137177
Cycle Clock:		3989137177
Time:			3.328595e+00 sec
Iterations:		67108864
Iterations per thread:	1048576
Inner loop executions:	976
Size (Byte):		1998848
Size per thread:	31232
Number of Flops:	11789685227520
MFlops/s:		3541940.24
Data volume (Byte):	2095944040448
MByte/s:		629678.26
Cycles per update:	0.050682
Cycles per cacheline:	0.405453
Loads per update:	1
Stores per update:	0
Load bytes per element:	8
Store bytes per elem.:	0
Instructions:		2226940543008
UOPs:			2161442291712
--------------------------------------------------------------------------------

Proposed peakflops_avx512_fma_add

(Expected to be roughly the same performance as the avx2 version on zen4).

$ likwid-bench -t peakflops_avx512_fma_add -w D0:2MB
CMD /usr/bin/gcc -shared -fPIC /tmp/10798/peakflops_avx512_fma_add.S -o /tmp/10798/peakflops_avx512_fma_add.o
Warning: Sanitizing vector length to a multiple of the loop stride 8 and thread count 64 from 250000 elements (2000000 bytes) to 249856 elements (1998848 bytes)
Allocate: Process running on hwthread 0 (Domain D0) - Vector length 249856/1998848 Offset 0 Alignment 512
Initialization: First thread in domain initializes the whole stream
--------------------------------------------------------------------------------
LIKWID MICRO BENCHMARK
Test: peakflops_avx512_fma_add
--------------------------------------------------------------------------------
Using 1 work groups
Using 64 threads
--------------------------------------------------------------------------------
Running without Marker API. Activate Marker API with -m on commandline.
--------------------------------------------------------------------------------
<pinning info omitted for brevity>
--------------------------------------------------------------------------------
Cycles:			13100020440
CPU Clock:		3993980471
Cycle Clock:		3993980471
Time:			3.279941e+00 sec
Iterations:		67108864
Iterations per thread:	1048576
Inner loop executions:	488
Size (Byte):		1998848
Size per thread:	31232
Number of Flops:	11789685227520
MFlops/s:		3594480.85
Data volume (Byte):	2095944040448
MByte/s:		639018.82
Cycles per update:	0.050001
Cycles per cacheline:	0.400011
Loads per update:	1
Stores per update:	0
Load bytes per element:	8
Store bytes per elem.:	0
Instructions:		1113470271520
UOPs:			1080721145856
--------------------------------------------------------------------------------
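The three runs above can be compared directly from the reported numbers. The sketch below is plain arithmetic on the printed `Number of Flops`, `Time`, and `Data volume (Byte)` values; the dictionary layout and the `summarize` helper are illustrative, and the 8 bytes per update come from the `Load bytes per element` line.

```python
# Values copied from the likwid-bench outputs above.
runs = {
    "peakflops_avx_fma": {
        "flops": 7859790151680, "time": 3.228744, "bytes": 2095944040448},
    "peakflops_avx_fma_add": {
        "flops": 11789685227520, "time": 3.328595, "bytes": 2095944040448},
    "peakflops_avx512_fma_add": {
        "flops": 11789685227520, "time": 3.279941, "bytes": 2095944040448},
}

BYTES_PER_UPDATE = 8  # "Load bytes per element: 8", no stores


def summarize(run):
    """Derive flops per update and GFlops/s from the raw counters."""
    updates = run["bytes"] // BYTES_PER_UPDATE
    return {
        "flops_per_update": run["flops"] / updates,
        "gflops": run["flops"] / run["time"] / 1e9,
    }


baseline = summarize(runs["peakflops_avx_fma"])
for name, run in runs.items():
    s = summarize(run)
    speedup = s["gflops"] / baseline["gflops"]
    print(f"{name}: {s['flops_per_update']:.0f} flops/update, "
          f"{s['gflops']:.1f} GFlops/s, {speedup:.3f}x vs FMA only")
```

The existing kernel works out to 30 flops per update and both proposed kernels to 45, a ratio of exactly 1.5. The measured rates improve by roughly 1.46x (AVX2) and 1.48x (AVX-512), since each run takes slightly longer than the FMA-only run.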

@TomTheBear
Member

Thanks for the kernels. I have similar kernels lying around. The biggest problem I see with these kernels is that it is not clear that they target AMD Zen platforms. They are executable on Intel but do not achieve peak flops there, likely due to port congestion. likwid-bench lacks a filtering mechanism like "only for vendor X arch Y", "only for systems with AVX512 support", etc.

To overcome this issue, please rename them to contain amd or even amd_zen.
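For illustration only, a filter of the kind described ("only for vendor X", "only for systems with AVX512 support") could look like the sketch below. Nothing here exists in likwid-bench: the `KERNEL_REQUIREMENTS` table, the kernel name in it, and both helper functions are hypothetical. The only real interface relied on is the `vendor_id` and `flags` fields of Linux `/proc/cpuinfo`.

```python
def parse_cpuinfo(text):
    """Return (vendor, flags) from the first entry of /proc/cpuinfo-style text."""
    vendor, flags = None, set()
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key = key.strip()
        if key == "vendor_id" and vendor is None:
            vendor = value.strip()
        elif key == "flags" and not flags:
            flags = set(value.split())
    return vendor, flags


# Hypothetical requirement table; the kernel name is a placeholder.
KERNEL_REQUIREMENTS = {
    "example_amd_zen_kernel": {
        "vendor": "AuthenticAMD",
        "flags": {"avx2", "fma"},
    },
}


def kernel_allowed(name, vendor, flags):
    """True if the CPU satisfies the kernel's vendor and feature requirements."""
    req = KERNEL_REQUIREMENTS.get(name, {})
    vendor_ok = req.get("vendor") in (None, vendor)
    flags_ok = req.get("flags", set()) <= flags
    return vendor_ok and flags_ok


sample = "vendor_id\t: AuthenticAMD\nflags\t\t: fpu sse2 avx avx2 fma\n"
vendor, flags = parse_cpuinfo(sample)
print(kernel_allowed("example_amd_zen_kernel", vendor, flags))  # True
```

Without such a mechanism, encoding the target in the kernel name is the only signal a user gets, which is what the rename achieves.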

@will-saunders-ukaea
Contributor Author

> To overcome this issue, please rename them to contain amd or even amd_zen.

Done - Does that naming fit with your naming convention?

@TomTheBear TomTheBear merged commit 7fd5696 into RRZE-HPC:master Dec 19, 2024
2 checks passed