An Alternative SGEMM (DGEMM) on VEGA (MI25, MI50, MI60)
to Verify the Power Impact of LDS, SGPR, and Data Forwarding
https://github.com/NervanaSystems/maxas/wiki/SGEMM gives a detailed explanation of SGEMM on the Maxwell architecture. Most SGEMM/DGEMM implementations on GPUs use similar algorithms. The top-level idea of the legacy SGEMM/DGEMM is implemented as follows:
- Use a work-group size of (64, 1, 1).
- Each work group computes matrix C's region from (m, n) to (m+63, n+63), which we call a 64x64 macro-tile per work group. In this example, only the 64x64 macro-tile size is discussed.
- Each work group loads a 64xK tile of matrix A and a Kx64 tile of matrix B, and performs 64xKx64 FMA operations.
- Matrix A and matrix B are loaded into LDS.
- Each thread multiplies an 8xK slice of matrix A by a Kx8 slice of matrix B; that is, each thread computes an 8x8 micro-tile of matrix C (see the sketch after this list).
- For each thread: matrix A is read 8xK times from LDS.
- For each thread: matrix B is read 8xK times from LDS.
- For each work group: matrix A is read 64xKx64 times from VGPRs.
- For each work group: matrix B is read 64xKx64 times from VGPRs.
- For each work group: matrix C is read and written 64xKx64 times in VGPRs.
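A minimal HIP sketch of this legacy scheme follows. It is an illustration, not the exact production kernel: the names, the one-element-deep LDS staging, and the layouts (A and B K-contiguous, C column-major, lda/ldb/ldc in elements) are simplifying assumptions.

```cpp
#include <hip/hip_runtime.h>

// Legacy LDS-based SGEMM sketch: 64 threads per work group, 64x64 macro-tile,
// 8x8 micro-tile per thread.
__global__ void sgemm_legacy_64x64(const float* A, const float* B, float* C,
                                   int K, int lda, int ldb, int ldc)
{
    __shared__ float lds_a[64];                  // one k-slice of the 64xK A tile
    __shared__ float lds_b[64];                  // one k-slice of the Kx64 B tile

    const int m0 = hipBlockIdx_y * 64;           // macro-tile origin in M
    const int n0 = hipBlockIdx_x * 64;           // macro-tile origin in N
    const int tm = (hipThreadIdx_x % 8) * 8;     // micro-tile origin in M
    const int tn = (hipThreadIdx_x / 8) * 8;     // micro-tile origin in N

    float c[8][8] = {};                          // 8x8 accumulator in VGPRs

    for (int k = 0; k < K; ++k) {
        // stage one k-slice of A and one k-slice of B through LDS
        lds_a[hipThreadIdx_x] = A[(size_t)(m0 + hipThreadIdx_x) * lda + k];
        lds_b[hipThreadIdx_x] = B[(size_t)(n0 + hipThreadIdx_x) * ldb + k];
        __syncthreads();
        for (int i = 0; i < 8; ++i)              // 8x8 FMAs per thread per k
            for (int j = 0; j < 8; ++j)
                c[i][j] += lds_a[tm + i] * lds_b[tn + j];
        __syncthreads();
    }
    for (int i = 0; i < 8; ++i)                  // write the 8x8 micro-tile of C
        for (int j = 0; j < 8; ++j)
            C[(m0 + tm + i) + (size_t)(n0 + tn + j) * ldc] = c[i][j];
}
```

Per k step, each thread performs 8 LDS reads of A and 8 LDS reads of B, matching the per-thread LDS read counts in the list above.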
Memory reads and writes account for a large share of the total energy. SGEMM/DGEMM involves the following memory accesses on modern GPUs:
- External video memory reads from GDDR or HBM into the L2 cache
- From L2 cache to L1 cache
- From L1 cache to LDS
- From LDS to VGPRs
- FMAs read VGPRs; matrix C's running sum is read and written in VGPRs
In general, LDS and VGPR accesses account for almost 50% of the total SGEMM/DGEMM energy.
The VLP SGEMM uses a work-group size of 64 for macro-tile M=64, N=64.
A work-group size of 128 uses macro-tile size M=64, N=128.
A work-group size of 256 uses macro-tile size M=64, N=256.
The micro-tile size for each thread is M=64, N=1: each thread multiplies a 64xK slice of matrix A by a Kx1 slice of matrix B, producing a 64x1 column of matrix C.
Across the 64 threads of a work group, matrix C's addresses are contiguous along M.
In this paper, the algorithm is described for macro-tile size M=64, N=64 unless otherwise noted.
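A minimal HIP sketch of this thread mapping for the M=64, N=64 macro-tile (a simplified illustration, not the assembly kernel; the names and layouts are assumptions, with A and B K-contiguous and C column-major). Each thread owns one column of C, and the A values it needs at each k step are identical across the wave, which is what lets the real kernel keep them in SGPRs:

```cpp
#include <hip/hip_runtime.h>

// VLP thread-mapping sketch: one thread per N value, a full 64x1 column of C
// per thread. lda/ldb/ldc are leading dimensions in elements.
__global__ void vlp_sgemm_64x64(const float* A, const float* B, float* C,
                                int K, int lda, int ldb, int ldc)
{
    const int m0 = hipBlockIdx_y * 64;                  // macro-tile origin in M
    const int n  = hipBlockIdx_x * 64 + hipThreadIdx_x; // one N value per thread

    float c[64] = {};                                   // 64x1 column of C

    for (int k = 0; k < K; ++k) {
        const float b = B[(size_t)n * ldb + k];         // per-thread B value (VGPR)
        for (int m = 0; m < 64; ++m)
            // A[(m0+m)*lda + k] is the same for every thread in the wave,
            // so the assembly kernel holds it in an SGPR instead of a VGPR.
            c[m] += A[(size_t)(m0 + m) * lda + k] * b;
    }
    for (int m = 0; m < 64; ++m)                        // contiguous along M
        C[(m0 + m) + (size_t)n * ldc] = c[m];
}
```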
To make the best use of matrix A as SQC constants, the block indices are mapped as:
- hipBlockIdx_x = N/64
- hipBlockIdx_y = M/64
Every block has one base address for its Matrix A.
matrix_A_base_offset = hipBlockIdx_y * 64 * lda;
Every block has one base address for its Matrix B.
matrix_B_base_offset = hipBlockIdx_x * 64 * ldb;
matrix_A_koffset = k * sizeof (float)
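Putting these together, a hedged sketch of the address arithmetic (hypothetical helper names; lda/ldb are element strides, and A and B are assumed K-contiguous so that consecutive k values of one row share a cache line):

```cpp
#include <hip/hip_runtime.h>

// Per-block base addresses for A and B, matching the offsets above.
__device__ const float* matrix_A_base(const float* A, int lda) {
    return A + (size_t)hipBlockIdx_y * 64 * lda;  // matrix_A_base_offset
}
__device__ const float* matrix_B_base(const float* B, int ldb) {
    return B + (size_t)hipBlockIdx_x * 64 * ldb;  // matrix_B_base_offset
}
// matrix_A_koffset: element k of one row of A sits k * sizeof(float) bytes
// past the base; this byte offset is what the scalar loads below consume.
```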
The algorithm reads matrix A's data with the scalar assembly instruction s_load_dwordx8:
s_load_dwordx8 s[32:39], s[12:13], s18 // 8 consecutive DWORDs of matrix A into SGPRs s32-s39
The AMD GCN architecture has 96 SGPRs available to a program. This algorithm uses SGPRs s32 through s95, so only 64 SGPRs are available to hold matrix A's data.
Each group of s_load_dwordx8 instructions reads 64 values covering 8 M values and 8 K values. The algorithm issues 8 such groups to cover 64 different M values.
The AMD GCN architecture does not support in-order return of s_load_dword results, so this algorithm cannot double-buffer the loading of matrix A.
We postpone the performance analysis of the limited SGPR count and of the latency left unhidden by out-of-order scalar-register return.
Each thread uses micro-tile size M=64, N=1 and needs 8 VGPRs to load 8 K values of its single N. The algorithm uses global_load_dwordx4 to get the best cache-line hit rate: each subsequent load instruction reads the next 4 DWORDs of the same cache line.
global_load_dwordx4 v[68:71], v[2:3], s[20:21] // load 4 DWORDs of matrix B for this thread
s_add_u32 s20, s20, 16                         // advance the base by 16 bytes (4 DWORDs)
s_addc_u32 s21, s21, 0                         // propagate the carry to the high half
global_load_dwordx4 v[72:75], v[2:3], s[20:21] // next 4 DWORDs of the same cache line
s_add_u32 s20, s20, 16
s_addc_u32 s21, s21, 0
Double buffering hides latency better; it needs 16 VGPRs for matrix B.
Every thread needs v[2:3] for matrix B's per-thread offset.
Double-buffered loading of matrix B needs 16 VGPRs.
The 64 M values of matrix C's 64x1 column need 64 VGPRs.
Together with matrix B's 2 offset VGPRs and 1 VGPR for hipThreadIdx_x, the total is 16 + 64 + 2 + 1 = 83 VGPRs.
83 VGPRs allows 3 waves per SIMD, i.e. 3 work groups per CU, which is enough occupancy for good performance (see the sketch below).
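A quick check of that occupancy arithmetic, as a sketch assuming 256 VGPRs per SIMD (the GCN register-file size) and ignoring allocation granularity:

```cpp
#include <cstdio>

int main() {
    // Per-thread VGPR budget from the text and the resulting occupancy.
    const int b_double_buffer = 16; // double-buffered loads of matrix B
    const int c_column        = 64; // 64x1 accumulator column of matrix C
    const int b_address       = 2;  // v[2:3], per-thread offset of matrix B
    const int thread_id       = 1;  // hipThreadIdx_x
    const int vgprs = b_double_buffer + c_column + b_address + thread_id; // 83
    const int waves_per_simd = 256 / vgprs; // 256 VGPRs per SIMD on GCN -> 3
    std::printf("%d VGPRs -> %d waves per SIMD\n", vgprs, waves_per_simd);
    return 0;
}
```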
Modern GPUs usually have a constant cache that is independent of the texture/buffer L1 cache. SIMD FMA instructions allow one operand to come from constant data. The AMD GCN architecture goes further and promotes constants into scalar registers: constant-cache data can be stored in SGPRs. The GCN FMA instruction supports an SGPR operand with the following syntax:
v_fma_f32 v4, v68, s32, v4 // v4 += v68 * s32: matrix B from a VGPR, matrix A from an SGPR
v_fma_f32 v4, v69, s33, v4
v_fma_f32 v4, v70, s34, v4
v_fma_f32 v4, v71, s35, v4
v_fma_f32 v4, v72, s36, v4
v_fma_f32 v4, v73, s37, v4
v_fma_f32 v4, v74, s38, v4
v_fma_f32 v4, v75, s39, v4
v_fma_f32/v_fma_f64 with an SGPR operand means 25% fewer VGPR accesses: each FMA performs three source reads and one destination write, and sourcing the matrix A operand from an SGPR removes one of those four VGPR accesses. In other words, it can save 25% of the dynamic power of VGPR access.
Matrix C's addressing is very similar to matrix B's, since every thread has a different N value.
The following table gives an example for macro-tile size M=64, N=256. It shows that this new SGEMM algorithm reduces VGPR traffic by about 70% (68416 to 20864 accesses) through SQC constant loading and data forwarding of the accumulator.
| Costs for Matrix Multiply 64x1x256 (Unit in FP64) | Legacy (LDS) | SQC (Non-LDS) |
|---|---|---|
| Matrix A L2-L1 | 64 | 64 |
| Matrix A VGPR Write | 576 | 64 |
| Matrix A VGPR Read | 16384 | 64 |
| Matrix A LDS Write | 64 | 0 |
| Matrix A LDS Read | 512 | 0 |
| Matrix B L2-L1 | 256 | 256 |
| Matrix B L1 Read | 256 | 256 |
| Matrix B VGPR Write | 2304 | 256 |
| Matrix B VGPR Read | 16384 | 16384 |
| Matrix B LDS Write | 256 | 0 |
| Matrix B LDS Read | 2304 | 0 |
| Matrix C VGPR Read/Write | 32768 | 4096 |
| SUM-L2-L1 | 320 | 320 |
| SUM-L1-Read | 320 | 320 |
| VGPR Read/Write (total) | 68416 | 20864 |
| LDS Read/Write (total) | 3136 | 0 |
| Barrier | 1 | 0 |
However, several performance limits prevent this kernel from achieving more than 78% of peak performance on the AMD GCN architecture:
- AMD GCN exposes only 96 SGPRs to a program. This limitation prevents the SGEMM kernel from double-buffering matrix A's loads.
- AMD GCN returns constants out of order, so the SGEMM kernel has to use "s_waitcnt lgkmcnt(0)" to avoid consuming stale data. This makes latency hiding very hard.
The following results were measured on an MI60 at different GPU engine frequencies with the memory frequency fixed at 800 MHz. Values are in TFLOPS.
| K=640 | GFX 1700 MHz | GFX 1500 MHz | GFX 1300 MHz | GFX 1100 MHz |
|---|---|---|---|---|
| M=N=256 | 0.423 | 0.378 | 0.329 | 0.282 |
| M=N=512 | 1.125 | 1.052 | 1.033 | 0.896 |
| M=N=768 | 2.458 | 2.264 | 2.092 | 1.853 |
| M=N=1024 | 4.368 | 3.903 | 3.622 | 3.331 |
| M=N=1280 | 5.687 | 5.213 | 4.753 | 4.241 |
| M=N=1536 | 7.058 | 6.435 | 5.739 | 4.995 |
| M=N=1792 | 6.493 | 5.972 | 5.463 | 4.807 |
| M=N=2048 | 8.13 | 7.448 | 6.797 | 6.047 |
| M=N=2304 | 8.366 | 7.63 | 6.828 | 5.95 |
| M=N=2560 | 8.561 | 7.856 | 7.11 | 6.226 |
| M=N=2816 | 9.35 | 8.558 | 7.711 | 6.741 |
| M=N=3072 | 9.825 | 8.918 | 8.048 | 7.071 |
| M=N=3328 | 9.758 | 8.896 | 8.026 | 6.998 |
| M=N=3584 | 9.66 | 8.875 | 7.966 | 6.968 |
| M=N=3840 | 9.868 | 9.002 | 8.139 | 7.089 |
| M=N=4096 | 9.954 | 9.145 | 8.226 | 7.185 |
| M=N=4352 | 9.821 | 9.07 | 8.192 | 7.229 |
| M=N=4608 | 9.8 | 9.074 | 8.203 | 7.245 |
| M=N=4864 | 9.856 | 9.088 | 8.252 | 7.258 |
| M=N=5120 | 9.781 | 9.088 | 8.228 | 7.281 |
| M=N=5376 | 9.76 | 9.101 | 8.285 | 7.304 |
| M=N=5632 | 9.8 | 9.122 | 8.285 | 7.346 |
| M=N=5888 | 9.737 | 9.13 | 8.37 | 7.372 |
| M=N=6144 | 9.678 | 9.092 | 8.302 | 7.347 |
| M=N=6400 | 9.672 | 9.121 | 8.328 | 7.383 |
| M=N=6656 | 9.674 | 9.173 | 8.343 | 7.414 |
| M=N=6912 | 9.684 | 9.166 | 8.375 | 7.408 |
| M=N=7168 | 9.638 | 9.18 | 8.359 | 7.413 |
| M=N=7424 | 9.657 | 9.155 | 8.377 | 7.452 |
| M=N=7680 | 9.655 | 9.16 | 8.4 | 7.444 |
| M=N=7936 | 9.67 | 9.168 | 8.398 | 7.466 |
| M=N=8192 | 9.61 | 9.133 | 8.414 | 7.42 |
| M=N=8448 | 9.666 | 9.211 | 8.413 | 7.489 |
| M=N=8704 | 9.662 | 9.236 | 8.417 | 7.465 |
| M=N=8960 | 9.651 | 9.217 | 8.471 | 7.511 |
| M=N=9216 | 9.608 | 9.199 | 8.459 | 7.477 |
| M=N=9472 | 9.643 | 9.234 | 8.454 | 7.509 |
| M=N=9728 | 9.689 | 9.227 | 8.449 | 7.527 |
| M=N=9984 | 9.682 | 9.258 | 8.484 | 7.517 |
| M=N=10240 | 9.605 | 9.258 | 8.453 | 7.498 |
| M=N=10496 | 9.716 | 9.297 | 8.493 | 7.518 |
| M=N=10752 | 9.664 | 9.299 | 8.523 | 7.539 |
| M=N=11008 | 9.672 | 9.299 | 8.521 | 7.537 |
| M=N=11264 | 9.62 | 9.253 | 8.517 | 7.527 |
| M=N=11520 | 9.672 | 9.297 | 8.5 | 7.532 |
| M=N=11776 | 9.652 | 9.275 | 8.497 | 7.548 |
| M=N=12032 | 9.675 | 9.318 | 8.515 | 7.534 |
| M=N=12288 | 9.634 | 9.277 | 8.493 | 7.521 |
| M=N=12544 | 9.681 | 9.339 | 8.531 | 7.556 |
| M=N=12800 | 9.675 | 9.326 | 8.524 | 7.553 |
| M=N=13056 | 9.675 | 9.362 | 8.54 | 7.567 |
| M=N=13312 | 9.666 | 9.344 | 8.57 | 7.581 |
| M=N=13568 | 9.698 | 9.403 | 8.552 | 7.556 |
| M=N=13824 | 9.714 | 9.392 | 8.565 | 7.581 |
| M=N=14080 | 9.703 | 9.429 | 8.57 | 7.591 |
| M=N=14336 | 9.604 | 9.353 | 8.559 | 7.58 |
| M=N=14592 | 9.674 | 9.391 | 8.558 | 7.605 |
| M=N=14848 | 9.657 | 9.312 | 8.545 | 7.587 |
| M=N=15104 | 9.601 | 9.266 | 8.495 | 7.535 |
| M=N=15360 | 9.61 | 9.322 | 8.499 | 7.516 |
| M=N=15616 | 9.661 | 9.351 | 8.541 | 7.554 |
| M=N=15872 | 9.663 | 9.363 | 8.562 | 7.591 |
| M=N=16128 | 9.71 | 9.426 | 8.575 | 7.583 |
| M=N=16384 | 9.532 | 9.228 | 8.508 | 7.532 |
Power measurements at GFX 1700 MHz (non-workload power = 42 W):
- Data forwarding: M=N=4096, K=640, max power = 265 W, at 9.5 TFLOPS
- No forwarding: M=N=4096, K=640, max power = 284 W, at 9.18 TFLOPS

Power measurements at GFX 1500 MHz (non-workload power = 36 W):
- Data forwarding: M=N=4096, K=640, max power = 223 W, at 9.132 TFLOPS
- No forwarding: M=N=4096, K=640, max power = 240 W, at 8.986 TFLOPS

At both frequencies, data forwarding lowers the maximum power while delivering higher throughput.
Hardware: MI60/MI50
Software: ROCm
Command line to build the test:
hipcc sgemm_sqc_test.cpp -o sgemm_sqc_test.exe
Command line to run the test:
./sgemm_sqc_test.exe <M> <N> <K> 64 256 <iterations=10> <verify=0>
For example:
./sgemm_sqc_test.exe 16384 16384 640 64 256 10 0
The GCN LLVM assembly is written in sgemm_64x256_sqc.cpp as inline assembly.
Command line to compile sgemm_64x256_sqc.cpp:
hipcc sgemm_64x256_sqc.cpp -o sgemm_64x256_sqc.out
Extract the kernel with the following command line, which generates sgemm_64x256_sqc.out-000-gfx906.isa:
extractkernel -i sgemm_64x256_sqc.out
Extract the correct kernel from sgemm_64x256_sqc.out-000-gfx906.isa and fill it into sgemm_64x256_sqc.s.
Compile sgemm_64x256_sqc.s into an LLVM code object:
/opt/rocm/hcc/bin/clang -x assembler -target amdgcn--amdhsa -mcpu=gfx906 -mno-code-object-v3 sgemm_64x256_sqc.s -o sgemm_sqc.co