-
@MaheshRavishankar @hanhanW @dcaballe @bjacob @benvanik @mattwalsh @stellaraccident
-
As a reality check, can we compare to Eigen or MKL, etc.? There are square-matrix and single-threaded data points here. For instance, if I look at 10x10x10 I see 2 GFLOPS; it looks like we are 10x off of that. For 400x400x400 and larger (where it levels off), Eigen is at ~17 GFLOPS, or 5x the numbers in the other table. Also, I would suggest we stick with FLOPS rather than latency.
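For a quick way to generate those Eigen reference numbers, here is a minimal sketch of a GFLOPS measurement (a hypothetical harness, not the one behind the numbers above; the 2*n^3 flop count is the standard one for square GEMM):

```cpp
// Hypothetical GFLOPS reality check against Eigen. For an n x n x n
// matmul the flop count is the usual 2*n^3 (one multiply + one add
// per inner-product step).
#include <Eigen/Dense>
#include <chrono>
#include <cstdio>

int main() {
  for (int n : {10, 400, 1000}) {
    Eigen::MatrixXf A = Eigen::MatrixXf::Random(n, n);
    Eigen::MatrixXf B = Eigen::MatrixXf::Random(n, n);
    Eigen::MatrixXf C = Eigen::MatrixXf::Zero(n, n);
    const int reps = 10;
    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r) C.noalias() += A * B;  // keep C live
    auto t1 = std::chrono::steady_clock::now();
    double sec = std::chrono::duration<double>(t1 - t0).count() / reps;
    std::printf("n=%4d  %.2f GFLOPS\n", n, 2.0 * n * n * n / sec / 1e9);
  }
  return 0;
}
```

For a single-threaded comparison, compile without OpenMP or define EIGEN_DONT_PARALLELIZE so Eigen's GEMM doesn't fan out across cores.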
-
I did a lot of experiments using my matmul repo at https://github.com/vmurali/matmul for f32 matmuls. I implemented 4 algorithms for matmul: "Basic", "Optimized", "Packing + DataTile", and "IREE".

The measurements of time were done using:

Threading support:

Unaligned vs aligned:

Observations: Except for small matrices, "Packing + DataTile" seems to dominate the "Optimized", "Basic", and "IREE" versions. This table is a representation of the data from http://sheets/1GCowDsxuwP0EyKw6AbEuswI5nhiEM1iEoPVGhQERw8c#gid=1788867146

Reproduction: just running perf.sh on each of the branches will give the values for filling in the table below on the machine it was run on. A sketch of the packing idea appears below.
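For readers skimming the branches, here is an illustrative sketch of the "Packing + DataTile" idea: copy B into contiguous panels up front so the hot loop streams linearly through memory. The tile width NR and the layout are assumptions for illustration, not the repo's actual kernels:

```cpp
// Illustrative "Packing + DataTile" sketch (hypothetical tile size and
// layout; see the repo branches for the real kernels). B is repacked
// into KC x NR panels so the micro-kernel reads memory sequentially.
#include <cstddef>
#include <cstdio>
#include <vector>

constexpr int NR = 16;  // register-tile width, e.g. one AVX512 f32 vector

// Pack B (K x N, row-major, leading dimension ldb) into panels of NR columns.
void packB(const float* B, int K, int N, int ldb, std::vector<float>& Bp) {
  Bp.resize((size_t)K * ((N + NR - 1) / NR) * NR);
  float* dst = Bp.data();
  for (int j0 = 0; j0 < N; j0 += NR)
    for (int k = 0; k < K; ++k)
      for (int j = 0; j < NR; ++j)  // zero-pad the ragged last panel
        *dst++ = (j0 + j < N) ? B[(size_t)k * ldb + j0 + j] : 0.0f;
}

// Micro-kernel over one packed panel: C[i, j0:j0+NR] += A[i, :] * panel.
void microKernel(const float* Arow, const float* panel, int K, float* Crow) {
  for (int k = 0; k < K; ++k)
    for (int j = 0; j < NR; ++j)  // vectorizes to one FMA per lane
      Crow[j] += Arow[k] * panel[(size_t)k * NR + j];
}

int main() {
  const int K = 64, N = 48;
  std::vector<float> B((size_t)K * N, 1.0f), Bp;
  packB(B.data(), K, N, N, Bp);
  std::vector<float> Arow(K, 1.0f), Crow(NR, 0.0f);
  microKernel(Arow.data(), Bp.data(), K, Crow.data());  // panel 0
  std::printf("C[0,0] = %f (expect %d)\n", Crow[0], K);
}
```

The payoff is that the inner loop walks the packed panel with unit stride, so cache lines and hardware prefetchers are fully used regardless of N.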
-
I performed an experiment on x86 AVX512-based standalone f32 GEMM kernels (https://github.com/vmurali/matmul) with A transposed and B, C, D kept as-is. Each size is run 10 times and averaged. The kernels don't use a threadpool (yet), and instead launch new threads (default number of threads = total number of hardware threads = 176 on the machine I ran on). Here are the results.
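For concreteness, here is a minimal sketch of that measurement setup (one std::thread per hardware thread, no threadpool, 10 timed runs averaged); gemmSlice is a hypothetical placeholder for the real kernels:

```cpp
// Hypothetical harness matching the setup above: threads are created
// fresh for each run (no pool), and each configuration is timed 10
// times and averaged.
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

void gemmSlice(int tid, int nthreads /*, matrix args... */) {
  // placeholder: each thread would compute its slice of C here
}

double timeOnce(int nthreads) {
  auto t0 = std::chrono::steady_clock::now();
  std::vector<std::thread> ts;
  for (int t = 0; t < nthreads; ++t) ts.emplace_back(gemmSlice, t, nthreads);
  for (auto& th : ts) th.join();
  auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(t1 - t0).count();
}

int main() {
  int nthreads = (int)std::thread::hardware_concurrency();  // 176 above
  double total = 0.0;
  for (int r = 0; r < 10; ++r) total += timeOnce(nthreads);
  std::printf("avg over 10 runs: %.6f s (%d threads)\n", total / 10, nthreads);
}
```

Note that thread creation cost sits inside the timed region, which matches the launch-new-threads setup described above.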
The aligned cases run at around half the speed of the unaligned cases! This is because of cache set conflicts: the L1$ has only 64 sets per way, and each cache line holds 64 bytes (= 16 f32s). So a row size of 512 f32s (= 32x16, i.e. 32 cache lines) exhausts half the sets in a way, and two rows cover the entire L1 cache; with only 12 ways, about 24 row fetches from the transposed A matrix start thrashing the cache, leading to abysmal performance, since the transposed A's columns don't stay resident in the cache while iterating over B's columns. Row sizes of 1024 and higher map every row onto the same sets, so a single row already creates set conflicts and the 12 ways start thrashing after just 12 rows.
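To make that set arithmetic concrete, here is a small worked example under the assumed geometry (64 sets x 12 ways x 64-byte lines = 48 KiB L1D; set index = (byte address / 64) mod 64):

```cpp
// Worked version of the set-conflict arithmetic above, under the
// assumed L1D geometry: 64 sets, 12 ways, 64-byte lines (48 KiB).
#include <cstdio>

int main() {
  for (long rowFloats : {512L, 1024L}) {
    long rowBytes = rowFloats * 4;            // f32 = 4 bytes
    long linesPerRow = rowBytes / 64;         // cache lines per matrix row
    long setsPerRow = linesPerRow < 64 ? linesPerRow : 64;
    long nextRowSet = linesPerRow % 64;       // set of the next row's first line
    std::printf("row=%ld floats: covers %ld sets, next row starts at set %ld\n",
                rowFloats, setsPerRow, nextRowSet);
  }
}
```

With 512-float rows, every other row lands on the same 32 sets, so 12 ways absorb roughly 24 rows before eviction begins; with 1024-float rows, every row hits the same 64 sets and eviction begins after just 12 rows.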
This shows that there's a lot of performance left to be squeezed out, both through data-layout changes and through tighter codegen.
(Reposting from #11821 (comment) so as not to hijack the other discussion, though the performance difference w.r.t. IREE's current codegen stack must be addressed.)