-
@MaheshRavishankar @hanhanW @dcaballe @bjacob @benvanik @mattwalsh @stellaraccident
-
As a reality check, can we compare to Eigen or MKL, etc.? There are square-matrix and single-threaded data points here. For instance, if I look at 10x10x10 I see 2 GFLOPS; it looks like we are 10x off of that. For 400x400x400 and larger (where it levels off), Eigen is at ~17 GFLOPS, or 5x the numbers in the other table. Also, I would suggest we stick with FLOPS rather than latency.
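For a quick way to generate those Eigen reference numbers, here is a minimal sketch of a GFLOPS measurement (a hypothetical harness, not the one behind the numbers above; the 2*n^3 flop count is the standard one for square GEMM):

```cpp
// Hypothetical GFLOPS reality check against Eigen. For an n x n x n
// matmul the flop count is the usual 2*n^3 (one multiply + one add
// per inner-product step).
#include <Eigen/Dense>
#include <chrono>
#include <cstdio>

int main() {
  for (int n : {10, 400, 1000}) {
    Eigen::MatrixXf A = Eigen::MatrixXf::Random(n, n);
    Eigen::MatrixXf B = Eigen::MatrixXf::Random(n, n);
    Eigen::MatrixXf C = Eigen::MatrixXf::Zero(n, n);
    const int reps = 10;
    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r) C.noalias() += A * B;  // keep C live
    auto t1 = std::chrono::steady_clock::now();
    double sec = std::chrono::duration<double>(t1 - t0).count() / reps;
    std::printf("n=%4d  %.2f GFLOPS\n", n, 2.0 * n * n * n / sec / 1e9);
  }
  return 0;
}
```

For a single-threaded comparison, compile without OpenMP or define EIGEN_DONT_PARALLELIZE so Eigen's GEMM doesn't fan out across cores.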
-
I did a lot of experiments using my matmul repo at https://github.com/vmurali/matmul for f32 matmuls. I implemented 4 algorithms for matmul: "Basic", "Optimized", "Packing + DataTile", and "IREE".

The measurements of time were done using:

Threading support:

Unaligned vs aligned:

Observations: Except for small matrices, "Packing + DataTile" seems to dominate the "Optimized", "Basic", and "IREE" versions. This table is a representation of the data from http://sheets/1GCowDsxuwP0EyKw6AbEuswI5nhiEM1iEoPVGhQERw8c#gid=1788867146

Reproduction: just running perf.sh on each of the branches will give the values for filling in the table below on the machine it was run on. A sketch of the packing idea appears below.
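For readers skimming the branches, here is an illustrative sketch of the "Packing + DataTile" idea: copy B into contiguous panels up front so the hot loop streams linearly through memory. The tile width NR and the layout are assumptions for illustration, not the repo's actual kernels:

```cpp
// Illustrative "Packing + DataTile" sketch (hypothetical tile size and
// layout; see the repo branches for the real kernels). B is repacked
// into KC x NR panels so the micro-kernel reads memory sequentially.
#include <cstddef>
#include <cstdio>
#include <vector>

constexpr int NR = 16;  // register-tile width, e.g. one AVX512 f32 vector

// Pack B (K x N, row-major, leading dimension ldb) into panels of NR columns.
void packB(const float* B, int K, int N, int ldb, std::vector<float>& Bp) {
  Bp.resize((size_t)K * ((N + NR - 1) / NR) * NR);
  float* dst = Bp.data();
  for (int j0 = 0; j0 < N; j0 += NR)
    for (int k = 0; k < K; ++k)
      for (int j = 0; j < NR; ++j)  // zero-pad the ragged last panel
        *dst++ = (j0 + j < N) ? B[(size_t)k * ldb + j0 + j] : 0.0f;
}

// Micro-kernel over one packed panel: C[i, j0:j0+NR] += A[i, :] * panel.
void microKernel(const float* Arow, const float* panel, int K, float* Crow) {
  for (int k = 0; k < K; ++k)
    for (int j = 0; j < NR; ++j)  // vectorizes to one FMA per lane
      Crow[j] += Arow[k] * panel[(size_t)k * NR + j];
}

int main() {
  const int K = 64, N = 48;
  std::vector<float> B((size_t)K * N, 1.0f), Bp;
  packB(B.data(), K, N, N, Bp);
  std::vector<float> Arow(K, 1.0f), Crow(NR, 0.0f);
  microKernel(Arow.data(), Bp.data(), K, Crow.data());  // panel 0
  std::printf("C[0,0] = %f (expect %d)\n", Crow[0], K);
}
```

The payoff is that the inner loop walks the packed panel with unit stride, so cache lines and hardware prefetchers are fully used regardless of N.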
-
I performed an experiment on x86 AVX512-based standalone f32 GEMM kernels (https://github.com/vmurali/matmul) with A transposed and B, C, D kept as-is. Each size is run 10 times and averaged. The kernels don't use a threadpool (yet), and instead launch new threads (default number of threads = total number of hardware threads = 176 on the machine I ran on). Here are the results.
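For concreteness, here is a minimal sketch of that measurement setup (one std::thread per hardware thread, no threadpool, 10 timed runs averaged); gemmSlice is a hypothetical placeholder for the real kernels:

```cpp
// Hypothetical harness matching the setup above: threads are created
// fresh for each run (no pool), and each configuration is timed 10
// times and averaged.
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

void gemmSlice(int tid, int nthreads /*, matrix args... */) {
  // placeholder: each thread would compute its slice of C here
}

double timeOnce(int nthreads) {
  auto t0 = std::chrono::steady_clock::now();
  std::vector<std::thread> ts;
  for (int t = 0; t < nthreads; ++t) ts.emplace_back(gemmSlice, t, nthreads);
  for (auto& th : ts) th.join();
  auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(t1 - t0).count();
}

int main() {
  int nthreads = (int)std::thread::hardware_concurrency();  // 176 above
  double total = 0.0;
  for (int r = 0; r < 10; ++r) total += timeOnce(nthreads);
  std::printf("avg over 10 runs: %.6f s (%d threads)\n", total / 10, nthreads);
}
```

Note that thread creation cost sits inside the timed region, which matches the launch-new-threads setup described above.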
The aligned cases run at around half the speed of the unaligned cases! This is because of cache set conflicts: the L1$ has only 64 sets per way, and each cache line holds 64 bytes (= 16 f32s). So a row size of 512 f32s (= 32x16, i.e. 32 cache lines) exhausts half the sets in a way, and two rows cover the entire L1 cache; with only 12 ways, about 24 row fetches from the transposed A matrix start thrashing the cache, leading to abysmal performance, since the transposed A's columns don't stay resident in the cache while iterating over B's columns. Row sizes of 1024 and higher map every row onto the same sets, so a single row already creates set conflicts and the 12 ways start thrashing after just 12 rows.
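To make that set arithmetic concrete, here is a small worked example under the assumed geometry (64 sets x 12 ways x 64-byte lines = 48 KiB L1D; set index = (byte address / 64) mod 64):

```cpp
// Worked version of the set-conflict arithmetic above, under the
// assumed L1D geometry: 64 sets, 12 ways, 64-byte lines (48 KiB).
#include <cstdio>

int main() {
  for (long rowFloats : {512L, 1024L}) {
    long rowBytes = rowFloats * 4;            // f32 = 4 bytes
    long linesPerRow = rowBytes / 64;         // cache lines per matrix row
    long setsPerRow = linesPerRow < 64 ? linesPerRow : 64;
    long nextRowSet = linesPerRow % 64;       // set of the next row's first line
    std::printf("row=%ld floats: covers %ld sets, next row starts at set %ld\n",
                rowFloats, setsPerRow, nextRowSet);
  }
}
```

With 512-float rows, every other row lands on the same 32 sets, so 12 ways absorb roughly 24 rows before eviction begins; with 1024-float rows, every row hits the same 64 sets and eviction begins after just 12 rows.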
This shows that there's a lot of performance left to be squeezed out, both through data-layout changes and through tighter codegen.
(Reposting from #11821 (comment) so as not to hijack the other discussion, though the performance difference w.r.t. IREE's current codegen stack must be addressed.)