An Alternative SGEMM (DGEMM) on VEGA (MI25, MI50, MI60)
to Verify the Power Impact of LDS, SGPR, and Data Forwarding
https://github.com/NervanaSystems/maxas/wiki/SGEMM gives a detailed explanation of SGEMM on the Maxwell architecture. Most SGEMM/DGEMM implementations on GPUs use similar algorithms. The top-level idea of the legacy SGEMM/DGEMM is implemented as follows:
- Use a work-group size of (64, 1, 1).
- Each work group computes matrix C's region from (m, n) to (m+63, n+63), which we call a 64x64 macro-tile per work group. In this example, only the 64x64 macro-tile size is discussed.
- Each work group loads a 64xK tile of matrix A and a Kx64 tile of matrix B, and performs 64xKx64 FMA operations.
- Matrix A and matrix B are loaded into LDS.
- Each thread multiplies an 8xK slice of matrix A by a Kx8 slice of matrix B; that is, each thread computes an 8x8 micro-tile of matrix C (see the sketch after this list).
- For each thread: matrix A is read 8xK times from LDS.
- For each thread: matrix B is read 8xK times from LDS.
- For each work group: matrix A is read 64xKx64 times from VGPRs.
- For each work group: matrix B is read 64xKx64 times from VGPRs.
- For each work group: matrix C is read and written 64xKx64 times in VGPRs.
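A minimal HIP sketch of this legacy scheme follows. It is an illustration, not the exact production kernel: the names, the one-element-deep LDS staging, and the layouts (A and B K-contiguous, C column-major, lda/ldb/ldc in elements) are simplifying assumptions.

```cpp
#include <hip/hip_runtime.h>

// Legacy LDS-based SGEMM sketch: 64 threads per work group, 64x64 macro-tile,
// 8x8 micro-tile per thread.
__global__ void sgemm_legacy_64x64(const float* A, const float* B, float* C,
                                   int K, int lda, int ldb, int ldc)
{
    __shared__ float lds_a[64];                  // one k-slice of the 64xK A tile
    __shared__ float lds_b[64];                  // one k-slice of the Kx64 B tile

    const int m0 = hipBlockIdx_y * 64;           // macro-tile origin in M
    const int n0 = hipBlockIdx_x * 64;           // macro-tile origin in N
    const int tm = (hipThreadIdx_x % 8) * 8;     // micro-tile origin in M
    const int tn = (hipThreadIdx_x / 8) * 8;     // micro-tile origin in N

    float c[8][8] = {};                          // 8x8 accumulator in VGPRs

    for (int k = 0; k < K; ++k) {
        // stage one k-slice of A and one k-slice of B through LDS
        lds_a[hipThreadIdx_x] = A[(size_t)(m0 + hipThreadIdx_x) * lda + k];
        lds_b[hipThreadIdx_x] = B[(size_t)(n0 + hipThreadIdx_x) * ldb + k];
        __syncthreads();
        for (int i = 0; i < 8; ++i)              // 8x8 FMAs per thread per k
            for (int j = 0; j < 8; ++j)
                c[i][j] += lds_a[tm + i] * lds_b[tn + j];
        __syncthreads();
    }
    for (int i = 0; i < 8; ++i)                  // write the 8x8 micro-tile of C
        for (int j = 0; j < 8; ++j)
            C[(m0 + tm + i) + (size_t)(n0 + tn + j) * ldc] = c[i][j];
}
```

Per k step, each thread performs 8 LDS reads of A and 8 LDS reads of B, matching the per-thread LDS read counts in the list above.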
Memory reads and writes account for a large share of the total energy. SGEMM/DGEMM involves the following memory accesses on modern GPUs:
- External video memory reads from GDDR or HBM into the L2 cache
- From L2 cache to L1 cache
- From L1 cache to LDS
- From LDS to VGPRs
- FMAs read VGPRs; matrix C's running sum is read and written in VGPRs
In general, LDS and VGPR accesses account for almost 50% of the total SGEMM/DGEMM energy.
The VLP SGEMM uses a work-group size of 64 for macro-tile M=64, N=64.
A work-group size of 128 uses macro-tile size M=64, N=128.
A work-group size of 256 uses macro-tile size M=64, N=256.
The micro-tile size for each thread is M=64, N=1: each thread multiplies a 64xK slice of matrix A by a Kx1 slice of matrix B, producing a 64x1 column of matrix C.
Across the 64 threads of a work group, matrix C's addresses are contiguous along M.
In this paper, the algorithm is described for macro-tile size M=64, N=64 unless otherwise noted.
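A minimal HIP sketch of this thread mapping for the M=64, N=64 macro-tile (a simplified illustration, not the assembly kernel; the names and layouts are assumptions, with A and B K-contiguous and C column-major). Each thread owns one column of C, and the A values it needs at each k step are identical across the wave, which is what lets the real kernel keep them in SGPRs:

```cpp
#include <hip/hip_runtime.h>

// VLP thread-mapping sketch: one thread per N value, a full 64x1 column of C
// per thread. lda/ldb/ldc are leading dimensions in elements.
__global__ void vlp_sgemm_64x64(const float* A, const float* B, float* C,
                                int K, int lda, int ldb, int ldc)
{
    const int m0 = hipBlockIdx_y * 64;                  // macro-tile origin in M
    const int n  = hipBlockIdx_x * 64 + hipThreadIdx_x; // one N value per thread

    float c[64] = {};                                   // 64x1 column of C

    for (int k = 0; k < K; ++k) {
        const float b = B[(size_t)n * ldb + k];         // per-thread B value (VGPR)
        for (int m = 0; m < 64; ++m)
            // A[(m0+m)*lda + k] is the same for every thread in the wave,
            // so the assembly kernel holds it in an SGPR instead of a VGPR.
            c[m] += A[(size_t)(m0 + m) * lda + k] * b;
    }
    for (int m = 0; m < 64; ++m)                        // contiguous along M
        C[(m0 + m) + (size_t)n * ldc] = c[m];
}
```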
To make the best use of matrix A as SQC constants, the block indices are mapped as:
- hipBlockIdx_x = N/64
- hipBlockIdx_y = M/64
Every block has one base address for its Matrix A.
matrix_A_base_offset = hipBlockIdx_y * 64 * lda;
Every block has one base address for its Matrix B.
matrix_B_base_offset = hipBlockIdx_x * 64 * ldb;
matrix_A_koffset = k * sizeof (float)
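Putting these together, a hedged sketch of the address arithmetic (hypothetical helper names; lda/ldb are element strides, and A and B are assumed K-contiguous so that consecutive k values of one row share a cache line):

```cpp
#include <hip/hip_runtime.h>

// Per-block base addresses for A and B, matching the offsets above.
__device__ const float* matrix_A_base(const float* A, int lda) {
    return A + (size_t)hipBlockIdx_y * 64 * lda;  // matrix_A_base_offset
}
__device__ const float* matrix_B_base(const float* B, int ldb) {
    return B + (size_t)hipBlockIdx_x * 64 * ldb;  // matrix_B_base_offset
}
// matrix_A_koffset: element k of one row of A sits k * sizeof(float) bytes
// past the base; this byte offset is what the scalar loads below consume.
```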
The algorithm reads matrix A's data with the scalar assembly instruction s_load_dwordx8:
s_load_dwordx8 s[32:39], s[12:13], s18 // 8 consecutive DWORDs of matrix A into SGPRs s32-s39
The AMD GCN architecture has 96 SGPRs available to a program. This algorithm uses SGPRs s32 through s95, so only 64 SGPRs are available to hold matrix A's data.
Each group of s_load_dwordx8 instructions reads 64 values covering 8 M values and 8 K values. The algorithm issues 8 such groups to cover 64 different M values.
The AMD GCN architecture does not support in-order return of s_load_dword results, so this algorithm cannot double-buffer the loading of matrix A.
We postpone the performance analysis of the limited SGPR count and of the latency left unhidden by out-of-order scalar-register return.
Each thread uses micro-tile size M=64, N=1 and needs 8 VGPRs to load 8 K values of its single N. The algorithm uses global_load_dwordx4 to get the best cache-line hit rate: each subsequent load instruction reads the next 4 DWORDs of the same cache line.
global_load_dwordx4 v[68:71], v[2:3], s[20:21] // load 4 DWORDs of matrix B for this thread
s_add_u32 s20, s20, 16                         // advance the base by 16 bytes (4 DWORDs)
s_addc_u32 s21, s21, 0                         // propagate the carry to the high half
global_load_dwordx4 v[72:75], v[2:3], s[20:21] // next 4 DWORDs of the same cache line
s_add_u32 s20, s20, 16
s_addc_u32 s21, s21, 0
Double buffering hides latency better; it needs 16 VGPRs for matrix B.
Every thread needs v[2:3] for matrix B's per-thread offset.
Double-buffered loading of matrix B needs 16 VGPRs.
The 64 M values of matrix C's 64x1 column need 64 VGPRs.
Together with matrix B's 2 offset VGPRs and 1 VGPR for hipThreadIdx_x, the total is 16 + 64 + 2 + 1 = 83 VGPRs.
83 VGPRs allows 3 waves per SIMD, i.e. 3 work groups per CU, which is enough occupancy for good performance (see the sketch below).
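A quick check of that occupancy arithmetic, as a sketch assuming 256 VGPRs per SIMD (the GCN register-file size) and ignoring allocation granularity:

```cpp
#include <cstdio>

int main() {
    // Per-thread VGPR budget from the text and the resulting occupancy.
    const int b_double_buffer = 16; // double-buffered loads of matrix B
    const int c_column        = 64; // 64x1 accumulator column of matrix C
    const int b_address       = 2;  // v[2:3], per-thread offset of matrix B
    const int thread_id       = 1;  // hipThreadIdx_x
    const int vgprs = b_double_buffer + c_column + b_address + thread_id; // 83
    const int waves_per_simd = 256 / vgprs; // 256 VGPRs per SIMD on GCN -> 3
    std::printf("%d VGPRs -> %d waves per SIMD\n", vgprs, waves_per_simd);
    return 0;
}
```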
Modern GPUs usually have a constant cache that is independent of the texture/buffer L1 cache. SIMD FMA instructions allow one operand to come from constant data. The AMD GCN architecture goes further and promotes constants into scalar registers: constant-cache data can be stored in SGPRs. The GCN FMA instruction supports an SGPR operand with the following syntax:
v_fma_f32 v4, v68, s32, v4 // v4 += v68 * s32: matrix B from a VGPR, matrix A from an SGPR
v_fma_f32 v4, v69, s33, v4
v_fma_f32 v4, v70, s34, v4
v_fma_f32 v4, v71, s35, v4
v_fma_f32 v4, v72, s36, v4
v_fma_f32 v4, v73, s37, v4
v_fma_f32 v4, v74, s38, v4
v_fma_f32 v4, v75, s39, v4
v_fma_f32/v_fma_f64 with an SGPR operand means 25% fewer VGPR accesses: each FMA performs three source reads and one destination write, and sourcing the matrix A operand from an SGPR removes one of those four VGPR accesses. In other words, it can save 25% of the dynamic power of VGPR access.
Matrix C's addressing is very similar to matrix B's, since every thread has a different N value.
The following table gives an example for macro-tile size M=64, N=256. It shows that this new SGEMM algorithm reduces VGPR traffic by about 70% (68416 to 20864 accesses) through SQC constant loading and data forwarding of the accumulator.
| Costs for Matrix Multiply 64x1x256 (Unit in FP64) | Legacy (LDS) | SQC (Non-LDS) |
|---|---|---|
| Matrix A L2-L1 | 64 | 64 |
| Matrix A VGPR Write | 576 | 64 |
| Matrix A VGPR Read | 16384 | 64 |
| Matrix A LDS Write | 64 | 0 |
| Matrix A LDS Read | 512 | 0 |
| Matrix B L2-L1 | 256 | 256 |
| Matrix B L1 Read | 256 | 256 |
| Matrix B VGPR Write | 2304 | 256 |
| Matrix B VGPR Read | 16384 | 16384 |
| Matrix B LDS Write | 256 | 0 |
| Matrix B LDS Read | 2304 | 0 |
| Matrix C VGPR Read/Write | 32768 | 4096 |
| SUM-L2-L1 | 320 | 320 |
| SUM-L1-Read | 320 | 320 |
| VGPR Read/Write (total) | 68416 | 20864 |
| LDS Read/Write (total) | 3136 | 0 |
| Barrier | 1 | 0 |
However, several performance limits prevent this kernel from achieving more than 78% of peak performance on the AMD GCN architecture:
- AMD GCN exposes only 96 SGPRs to a program. This limitation prevents the SGEMM kernel from double-buffering matrix A's loads.
- AMD GCN returns constants out of order, so the SGEMM kernel has to use "s_waitcnt lgkmcnt(0)" to avoid consuming stale data. This makes latency hiding very hard.
The following results were measured on an MI60 at different GPU engine frequencies with the memory frequency fixed at 800 MHz. Values are in TFLOPS.
| K=640 | GFX 1700 MHz | GFX 1500 MHz | GFX 1300 MHz | GFX 1100 MHz |
|---|---|---|---|---|
| M=N=256 | 0.423 | 0.378 | 0.329 | 0.282 |
| M=N=512 | 1.125 | 1.052 | 1.033 | 0.896 |
| M=N=768 | 2.458 | 2.264 | 2.092 | 1.853 |
| M=N=1024 | 4.368 | 3.903 | 3.622 | 3.331 |
| M=N=1280 | 5.687 | 5.213 | 4.753 | 4.241 |
| M=N=1536 | 7.058 | 6.435 | 5.739 | 4.995 |
| M=N=1792 | 6.493 | 5.972 | 5.463 | 4.807 |
| M=N=2048 | 8.13 | 7.448 | 6.797 | 6.047 |
| M=N=2304 | 8.366 | 7.63 | 6.828 | 5.95 |
| M=N=2560 | 8.561 | 7.856 | 7.11 | 6.226 |
| M=N=2816 | 9.35 | 8.558 | 7.711 | 6.741 |
| M=N=3072 | 9.825 | 8.918 | 8.048 | 7.071 |
| M=N=3328 | 9.758 | 8.896 | 8.026 | 6.998 |
| M=N=3584 | 9.66 | 8.875 | 7.966 | 6.968 |
| M=N=3840 | 9.868 | 9.002 | 8.139 | 7.089 |
| M=N=4096 | 9.954 | 9.145 | 8.226 | 7.185 |
| M=N=4352 | 9.821 | 9.07 | 8.192 | 7.229 |
| M=N=4608 | 9.8 | 9.074 | 8.203 | 7.245 |
| M=N=4864 | 9.856 | 9.088 | 8.252 | 7.258 |
| M=N=5120 | 9.781 | 9.088 | 8.228 | 7.281 |
| M=N=5376 | 9.76 | 9.101 | 8.285 | 7.304 |
| M=N=5632 | 9.8 | 9.122 | 8.285 | 7.346 |
| M=N=5888 | 9.737 | 9.13 | 8.37 | 7.372 |
| M=N=6144 | 9.678 | 9.092 | 8.302 | 7.347 |
| M=N=6400 | 9.672 | 9.121 | 8.328 | 7.383 |
| M=N=6656 | 9.674 | 9.173 | 8.343 | 7.414 |
| M=N=6912 | 9.684 | 9.166 | 8.375 | 7.408 |
| M=N=7168 | 9.638 | 9.18 | 8.359 | 7.413 |
| M=N=7424 | 9.657 | 9.155 | 8.377 | 7.452 |
| M=N=7680 | 9.655 | 9.16 | 8.4 | 7.444 |
| M=N=7936 | 9.67 | 9.168 | 8.398 | 7.466 |
| M=N=8192 | 9.61 | 9.133 | 8.414 | 7.42 |
| M=N=8448 | 9.666 | 9.211 | 8.413 | 7.489 |
| M=N=8704 | 9.662 | 9.236 | 8.417 | 7.465 |
| M=N=8960 | 9.651 | 9.217 | 8.471 | 7.511 |
| M=N=9216 | 9.608 | 9.199 | 8.459 | 7.477 |
| M=N=9472 | 9.643 | 9.234 | 8.454 | 7.509 |
| M=N=9728 | 9.689 | 9.227 | 8.449 | 7.527 |
| M=N=9984 | 9.682 | 9.258 | 8.484 | 7.517 |
| M=N=10240 | 9.605 | 9.258 | 8.453 | 7.498 |
| M=N=10496 | 9.716 | 9.297 | 8.493 | 7.518 |
| M=N=10752 | 9.664 | 9.299 | 8.523 | 7.539 |
| M=N=11008 | 9.672 | 9.299 | 8.521 | 7.537 |
| M=N=11264 | 9.62 | 9.253 | 8.517 | 7.527 |
| M=N=11520 | 9.672 | 9.297 | 8.5 | 7.532 |
| M=N=11776 | 9.652 | 9.275 | 8.497 | 7.548 |
| M=N=12032 | 9.675 | 9.318 | 8.515 | 7.534 |
| M=N=12288 | 9.634 | 9.277 | 8.493 | 7.521 |
| M=N=12544 | 9.681 | 9.339 | 8.531 | 7.556 |
| M=N=12800 | 9.675 | 9.326 | 8.524 | 7.553 |
| M=N=13056 | 9.675 | 9.362 | 8.54 | 7.567 |
| M=N=13312 | 9.666 | 9.344 | 8.57 | 7.581 |
| M=N=13568 | 9.698 | 9.403 | 8.552 | 7.556 |
| M=N=13824 | 9.714 | 9.392 | 8.565 | 7.581 |
| M=N=14080 | 9.703 | 9.429 | 8.57 | 7.591 |
| M=N=14336 | 9.604 | 9.353 | 8.559 | 7.58 |
| M=N=14592 | 9.674 | 9.391 | 8.558 | 7.605 |
| M=N=14848 | 9.657 | 9.312 | 8.545 | 7.587 |
| M=N=15104 | 9.601 | 9.266 | 8.495 | 7.535 |
| M=N=15360 | 9.61 | 9.322 | 8.499 | 7.516 |
| M=N=15616 | 9.661 | 9.351 | 8.541 | 7.554 |
| M=N=15872 | 9.663 | 9.363 | 8.562 | 7.591 |
| M=N=16128 | 9.71 | 9.426 | 8.575 | 7.583 |
| M=N=16384 | 9.532 | 9.228 | 8.508 | 7.532 |
Power measurements at GFX 1700 MHz (non-workload power = 42 W):
- Data forwarding: M=N=4096, K=640, max power = 265 W, at 9.5 TFLOPS
- No forwarding: M=N=4096, K=640, max power = 284 W, at 9.18 TFLOPS

Power measurements at GFX 1500 MHz (non-workload power = 36 W):
- Data forwarding: M=N=4096, K=640, max power = 223 W, at 9.132 TFLOPS
- No forwarding: M=N=4096, K=640, max power = 240 W, at 8.986 TFLOPS

At both frequencies, data forwarding lowers the maximum power while delivering higher throughput.
Hardware: MI60/MI50
Software: ROCm
Command line to build the test:
hipcc sgemm_sqc_test.cpp -o sgemm_sqc_test.exe
Command line to run the test:
./sgemm_sqc_test.exe <M> <N> <K> 64 256 <iterations=10> <verify=0>
For example:
./sgemm_sqc_test.exe 16384 16384 640 64 256 10 0
The GCN LLVM assembly is written in sgemm_64x256_sqc.cpp as inline assembly.
Command line to compile sgemm_64x256_sqc.cpp:
hipcc sgemm_64x256_sqc.cpp -o sgemm_64x256_sqc.out
Extract the kernel with the following command line, which generates sgemm_64x256_sqc.out-000-gfx906.isa:
extractkernel -i sgemm_64x256_sqc.out
Extract the correct kernel from sgemm_64x256_sqc.out-000-gfx906.isa and fill it into sgemm_64x256_sqc.s.
Compile sgemm_64x256_sqc.s into an LLVM code object:
/opt/rocm/hcc/bin/clang -x assembler -target amdgcn--amdhsa -mcpu=gfx906 -mno-code-object-v3 sgemm_64x256_sqc.s -o sgemm_sqc.co