Idea for matmul tiling #67

Open
wants to merge 7 commits into master
Conversation

andresnowak

I implemented an idea for matmul tiling, but I didn't see any performance improvement: I get the same results as the original matmul_parallel implementation, so maybe I don't understand the concept of cache misses and tiling completely.
I did this implementation because it could potentially use autotune to pick the tile sizes for the hardware, but if you feel this implementation doesn't work or isn't necessary, you can close this pull request.
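For reference, a minimal sketch of the loop-tiling idea in C (the PR itself is in Mojo and also vectorizes and parallelizes; the tile sizes and names here are hypothetical, standing in for the tiles_i/tiles_j values the PR would autotune):

```c
#include <stddef.h>

/* Hypothetical tile sizes; the idea in the PR is that autotune could
 * pick these per hardware. */
#define TILE_I 16
#define TILE_J 64

/* c[rows] += w[rows x cols] * x[cols], processed in TILE_I x TILE_J
 * blocks so the slice of x (and the touched rows of w) stays in cache
 * while it is reused; c is assumed zero-initialized. */
static void matmul_tiled(float *c, const float *w, const float *x,
                         size_t rows, size_t cols) {
    for (size_t ii = 0; ii < rows; ii += TILE_I)
        for (size_t jj = 0; jj < cols; jj += TILE_J) {
            size_t i_end = ii + TILE_I < rows ? ii + TILE_I : rows;
            size_t j_end = jj + TILE_J < cols ? jj + TILE_J : cols;
            for (size_t i = ii; i < i_end; i++)
                for (size_t j = jj; j < j_end; j++)
                    c[i] += w[i * cols + j] * x[j];
        }
}
```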

@chadbrewbaker

I'm still a Mojo noob. Can you emit LLVM IR as a target? I have an idea for fuzzing this to find where the performance bottleneck actually is.

@andresnowak
Author

andresnowak commented Oct 29, 2023

I'm not sure; maybe the debug-level option can do it, but probably not. Here is the link to the compilation options: cli-options. I didn't find anything in the docs for an option to emit LLVM IR or MLIR, so I'm not sure. But you could ask in the Discord, maybe someone knows.

@andresnowak
Author

andresnowak commented Nov 17, 2023

Performance Comparison:

For this version of matmul tiling with batching, I have seen in isolated benchmarks that it can be up to 1.6 times faster than the present version of batch matmul, but only once we get to bigger sizes of rows times cols; at smaller sizes it can sometimes be up to 10% slower. The other problem is that as we increase the batch size it gets slower, and it can end up slower than the present version.

Uncertain Performance Factors:

Right now I don't know whether my version is sometimes much faster thanks to reduced cache misses, or whether it's the unrolling from vectorize_unroll that is helping.

Proposed Solution for Batch Size Impact:

For the batch problem, where increasing the batch size makes the batch_matmul_tiling version slower, I don't know if a good idea could be to divide the work between cores by batch (so instead of one core working on three batches, have each core work on only one batch), as sketched below.
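A minimal sketch of that idea in C, using OpenMP for illustration (batch_matmul_per_core is a hypothetical name, matmul_tiled is the single-batch kernel sketched earlier, and in llama2-style batch matmul the input vector x is shared while the weight matrices w[b] differ per batch):

```c
#include <stddef.h>
#include <omp.h>

/* Parallelize across batches instead of across rows, so each core
 * works on exactly one batch. Compile with -fopenmp. */
void batch_matmul_per_core(float **c, const float *const *w, const float *x,
                           int n_batches, size_t rows, size_t cols) {
    #pragma omp parallel for num_threads(n_batches)
    for (int b = 0; b < n_batches; b++)
        matmul_tiled(c[b], w[b], x, rows, cols);
}
```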

Request for Input:

But if somebody has an idea of how to make this better, can explain what is going wrong, or wants to say that this idea doesn't help make the matmul operation faster, I would appreciate the help.

Isolated benchmarks

In the benchmark results, matmul_no_tiling is the matmul version currently implemented in llama2.mojo, v1 is an old version of my batch_matmul_tiling, and v2 is the version implemented in the most recent commit of this branch. "size" refers to the batch size, and each speedup is the matmul_no_tiling time divided by the corresponding tiled time, so values above 1.0 mean the tiled version is faster. (Tests were done on a Ryzen 3600X.)

Benchmarking Batch Matmul size rows: 256 cols: 257

---Benchmarking Batch Matmul size 1 

matmul_no_tiling:  0.0094472170020269263
matmul_tiling_v1:  0.0096521414371533564
matmul_tiling_v2:  0.0097045264451799864
Speedup v1:  0.97876901862030041
Speedup v2:  0.97348562605227795

---Benchmarking Batch Matmul size 2 

matmul_no_tiling:  0.010007997733096786
matmul_tiling_v1:  0.0099102151930539652
matmul_tiling_v2:  0.0098821378719491804
Speedup v1:  1.0098668432660631
Speedup v2:  1.0127360964579197

---Benchmarking Batch Matmul size 3 

matmul_no_tiling:  0.0093809422369242629
matmul_tiling_v1:  0.0093954622202673706
matmul_tiling_v2:  0.0094042028123165599
Speedup v1:  0.99845457487852107
Speedup v2:  0.99752657658958255

Benchmarking Batch Matmul size rows: 288 cols: 288

---Benchmarking Batch Matmul size 1 

matmul_no_tiling:  0.0096099319234752272
matmul_tiling_v1:  0.009766561562512913
matmul_tiling_v2:  0.0097548760312336173
Speedup v1:  0.98396266300732915
Speedup v2:  0.98514136855309065

---Benchmarking Batch Matmul size 2 

matmul_no_tiling:  0.009754444146042502
matmul_tiling_v1:  0.0099567952152303935
matmul_tiling_v2:  0.009847029910130492
Speedup v1:  0.97967708837896295
Speedup v2:  0.99059759491613419

---Benchmarking Batch Matmul size 3 

matmul_no_tiling:  0.0095397642057742215
matmul_tiling_v1:  0.0097165812167425529
matmul_tiling_v2:  0.0098387078752803694
Speedup v1:  0.98180254895995112
Speedup v2:  0.96961555589456616

Benchmarking Batch Matmul size rows: 288 cols: 526

---Benchmarking Batch Matmul size 1 

matmul_no_tiling:  0.0090944199434028102
matmul_tiling_v1:  0.0096735606589226832
matmul_tiling_v2:  0.0096711324584785735
Speedup v1:  0.94013158794991525
Speedup v2:  0.94036763351636588

---Benchmarking Batch Matmul size 2 

matmul_no_tiling:  0.01085652218522592
matmul_tiling_v1:  0.010175540370016948
matmul_tiling_v2:  0.010179389057707397
Speedup v1:  1.0669234055830135
Speedup v2:  1.0665200164449777

---Benchmarking Batch Matmul size 3 

matmul_no_tiling:  0.012223272864538394
matmul_tiling_v1:  0.01202974127535472
matmul_tiling_v2:  0.012091105477308293
Speedup v1:  1.0160877598905773
Speedup v2:  1.0109309597437672

Benchmarking Batch Matmul size rows: 768 cols: 288

---Benchmarking Batch Matmul size 1 

matmul_no_tiling:  0.0090251176036460629
matmul_tiling_v1:  0.0093134737965503995
matmul_tiling_v2:  0.0092777183719242699
Speedup v1:  0.9690388141735965
Speedup v2:  0.97277339555352171

---Benchmarking Batch Matmul size 2 

matmul_no_tiling:  0.01100808389583873
matmul_tiling_v1:  0.011133949814441244
matmul_tiling_v2:  0.011094472951398482
Speedup v1:  0.98869530393973393
Speedup v2:  0.99221332496476433

---Benchmarking Batch Matmul size 3 

matmul_no_tiling:  0.013831300005510755
matmul_tiling_v1:  0.014546778026006759
matmul_tiling_v2:  0.014764149776927981
Speedup v1:  0.95081536136614775
Speedup v2:  0.93681656001112934

Benchmarking Batch Matmul size rows: 512 cols: 512

---Benchmarking Batch Matmul size 1 

matmul_no_tiling:  0.0099327076303092846
matmul_tiling_v1:  0.0094948465312779916
matmul_tiling_v2:  0.0094461088817192826
Speedup v1:  1.0461156583825644
Speedup v2:  1.0515131420443078

---Benchmarking Batch Matmul size 2 

matmul_no_tiling:  0.011745999389175548
matmul_tiling_v1:  0.011952078164257514
matmul_tiling_v2:  0.012129423746449097
Speedup v1:  0.98275791270356305
Speedup v2:  0.96838890574782688

---Benchmarking Batch Matmul size 3 

matmul_no_tiling:  0.015740909763133304
matmul_tiling_v1:  0.015292832909531566
matmul_tiling_v2:  0.015172045509576263
Speedup v1:  1.0292997939788164
Speedup v2:  1.0374942359089279

Benchmarking Batch Matmul size rows: 517 cols: 517

---Benchmarking Batch Matmul size 1 

matmul_no_tiling:  0.0094876283399390644
matmul_tiling_v1:  0.0096819575516779999
matmul_tiling_v2:  0.0095680270574208057
Speedup v1:  0.97992872715030077
Speedup v2:  0.99159714777150576

---Benchmarking Batch Matmul size 2 

matmul_no_tiling:  0.011453700257744186
matmul_tiling_v1:  0.012526445297113753
matmul_tiling_v2:  0.012414011340793353
Speedup v1:  0.91436157553677733
Speedup v2:  0.92264296715329119

---Benchmarking Batch Matmul size 3 

matmul_no_tiling:  0.015405388329939713
matmul_tiling_v1:  0.016434303311453336
matmul_tiling_v2:  0.016403064678389537
Speedup v1:  0.93739223610431022
Speedup v2:  0.93917744226393085

Benchmarking Batch Matmul size rows: 32000 cols: 288

---Benchmarking Batch Matmul size 1 

matmul_no_tiling:  0.10100235806277473
matmul_tiling_v1:  0.069087426123155105
matmul_tiling_v2:  0.070266022223472488
Speedup v1:  1.4619499340260287
Speedup v2:  1.4374281461607303

---Benchmarking Batch Matmul size 2 

matmul_no_tiling:  0.16044938790990781
matmul_tiling_v1:  0.15474855639607091
matmul_tiling_v2:  0.15481869386606639
Speedup v1:  1.0368393195167904
Speedup v2:  1.0363696004870868

---Benchmarking Batch Matmul size 3 

matmul_no_tiling:  0.24910548707551952
matmul_tiling_v1:  0.26800968083961246
matmul_tiling_v2:  0.2748345656232839
Speedup v1:  0.92946451148753106
Speedup v2:  0.90638339653742372

Benchmarking Batch Matmul size rows: 288 cols: 32000

---Benchmarking Batch Matmul size 1 

matmul_no_tiling:  0.10660833646145133
matmul_tiling_v1:  0.077345285095372765
matmul_tiling_v2:  0.077644624796772435
Speedup v1:  1.3783430538783967
Speedup v2:  1.3730291921751019

---Benchmarking Batch Matmul size 2 

matmul_no_tiling:  0.19172089169381107
matmul_tiling_v1:  0.14988448039538715
matmul_tiling_v2:  0.14738350027047495
Speedup v1:  1.2791243709025892
Speedup v2:  1.3008300884560966

---Benchmarking Batch Matmul size 3 

matmul_no_tiling:  0.30375463576945927
matmul_tiling_v1:  0.27437178288633463
matmul_tiling_v2:  0.26725780300590446
Speedup v1:  1.1070913800756881
Speedup v2:  1.1365604010549637

Benchmarking Batch Matmul size rows: 2000 cols: 3210

---Benchmarking Batch Matmul size 1 

matmul_no_tiling:  0.074251905068803345
matmul_tiling_v1:  0.046662753243243239
matmul_tiling_v2:  0.046237162072072069
Speedup v1:  1.5912456918635638
Speedup v2:  1.6058923545753816

---Benchmarking Batch Matmul size 2 

matmul_no_tiling:  0.12969313362276894
matmul_tiling_v1:  0.10596534143788745
matmul_tiling_v2:  0.10431080262849096
Speedup v1:  1.2239203107630219
Speedup v2:  1.2433336754648379

---Benchmarking Batch Matmul size 3 

matmul_no_tiling:  0.20629136233352635
matmul_tiling_v1:  0.19977977412063852
matmul_tiling_v2:  0.19267467536548707
Speedup v1:  1.0325938310900069
Speedup v2:  1.0706719081903713

Benchmarking Batch Matmul size rows: 5300 cols: 2000

---Benchmarking Batch Matmul size 1 

matmul_no_tiling:  0.12560178377365561
matmul_tiling_v1:  0.074967934189781024
matmul_tiling_v2:  0.076834332744924974
Speedup v1:  1.6754067606517633
Speedup v2:  1.6347091109729432

---Benchmarking Batch Matmul size 2 

matmul_no_tiling:  0.21137467584052363
matmul_tiling_v1:  0.17257673335817189
matmul_tiling_v2:  0.17061621113973746
Speedup v1:  1.2248156036296569
Speedup v2:  1.2388897539601575

---Benchmarking Batch Matmul size 3 

matmul_no_tiling:  0.33228994695523018
matmul_tiling_v1:  0.31707031299897648
matmul_tiling_v2:  0.30688218883349949
Speedup v1:  1.0480008166400077
Speedup v2:  1.0827931989742023

And here is information on which values of tiles_i and tiles_j were best for each rows/cols size and each batch size:

Batch size 1

test_i_j_1_matmul_cache.txt

Batch size 2

test_i_j_2_matmul_cache.txt

Batch size 3

test_i_j_3_matmul_cache.txt

@mikowals
Contributor

The speedup for the 32000 by 288 size could be really significant. That matmul, which produces the logits, is usually about 15% of the total time, so a separate matmul function for the logits could speed up the whole network by about 5%.

The way you have implemented the _batch loop defeats the purpose. The speedup from batching comes from reusing the A vector's values across 2 to 3 multiplications at a time. By reorganising the loop so that the A values cannot be reused, there is no potential speedup, so the batching should just be removed for simplicity. From this, and from my understanding of batching, it is odd to define the aliases linked to the n parameter. Using n that way may be producing reasonable alias values, but that is probably a coincidence, and it would likely be more understandable to just hardcode good alias values.
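A hedged C sketch of the reuse argument as I understand it, continuing the hypothetical w/x naming from above (x playing the role of the shared A vector):

```c
#include <stddef.h>

/* With the batch loop innermost, each x[j] is loaded once and then
 * feeds every weight matrix before j advances; c is assumed
 * zero-initialized. */
void batch_matmul_fused(float **c, const float *const *w, const float *x,
                        int n_batches, size_t rows, size_t cols) {
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++) {
            float xj = x[j];                        /* loaded once ...   */
            for (int b = 0; b < n_batches; b++)     /* ... reused across */
                c[b][i] += w[b][i * cols + j] * xj; /* all batches       */
        }
}
```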

On a bit of a tangent: from looking at the recent improvements in examples/matmul.mojo, optimal performance for some of the tiling strategies comes with a huge number of workers and very small SIMD widths. The standard we have found in this repo, a SIMD width of 16 with 6 workers, is almost certainly a local optimum that is not likely to work well for other tiling strategies. There are a lot of magic numbers to fiddle with, but it is probably worth checking a SIMD width of 4 and a worker count equal to rows or rows * 2 in new tiling strategies.

Thanks for continuing to pursue this style of improvement in matmul and for sharing your benchmarking results.

@mikowals
Contributor

Also, testing sizes that aren't powers of 2 is probably not useful. Neural networks basically stick to powers of 2, so it is not a very common case. By testing sizes that aren't powers of 2, you end up with tails in the vectorize calls that use a SIMD width of 1. SIMD parallelism is really fast, so those scalar tail loops will dominate the time.

One way to measure the impact of these tail loops, if you are curious, is to replace vectorize, which only takes a single integer simd_width parameter, with tile, which allows a variadic list of descending SIMD widths that it will step through.

We probably should change to tile at some point anyway, just to avoid future mistakes like setting nelts to 64 for stories15M.bin, which has dim=288: that would lead to 4 loops of 64 values at a time followed by 32 loops of 1 value at a time.
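A C analogue of that descending-width stepping (tile_widths is a hypothetical name; Mojo's algorithm.tile would pass each width to a vectorized closure instead of a function pointer):

```c
#include <stddef.h>

/* Step through a descending list of widths: for size 288 and widths
 * {64, 32, 1} this runs 4 steps of 64 and then 1 step of 32, instead
 * of vectorize's 4 steps of 64 followed by 32 scalar steps. The list
 * should end in 1 so any remainder is still covered. */
static void tile_widths(void (*work)(size_t offset, size_t width),
                        const size_t *widths, size_t n_widths, size_t size) {
    size_t offset = 0;
    for (size_t k = 0; k < n_widths; k++)
        while (offset + widths[k] <= size) {
            work(offset, widths[k]);
            offset += widths[k];
        }
}
```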

@andresnowak
Author

andresnowak commented Nov 18, 2023

  1. Answering your first comment: the reason I moved the batch loop outside of calc_col specifically was the idea of optimizing the prefetching of values of that specific row (of the B tensor specifically), because if we have the batch inside calc_col we are always changing addresses for the B tensor. Also, from the testing I have done and from what I understand, moving the batch outside the calc_col loop shouldn't affect performance, because the values of A are still reused in this version: we keep using the same values of A for each batch until we finish all the batches, and only then move on to other columns of the A tensor. To clarify, the v1 version has the batch inside the calc_col function, and from the benchmarks it seems my idea of moving the batch outside didn't help; the results look the same for both versions. (A sketch contrasting the two orderings is below, after this list.)

  2. As for the tiles_i and tiles_j aliases using n: it is just an idea, since at least that formula helps reduce the degraded performance when having more batches. But yes, it would be better to have hardcoded values or to use autotune, because as you can see in the text files I added, where I show the speeds for each batch with different i and j values, the best i and j tile values change as we increase the number of rows and cols.

  3. As for the tile idea: I wanted to try it, it's just that if I use it I can't use vectorize_unroll, which is why I wasn't using it. But maybe I could try it, or test it in the matmul version that is currently being used and see if it helps, and I will also try the idea of reducing the SIMD width.
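To make the two orderings from point 1 concrete, a hedged C sketch continuing the earlier hypothetical w/x naming (function names are illustrative, mirroring calc_col; both variants compute the same result):

```c
#include <stddef.h>

/* v1-style: batch innermost; each x[j] feeds all batches at once. */
void calc_col_batch_inside(float **c, const float *const *w, const float *x,
                           int n_batches, size_t cols, size_t i,
                           size_t jj, size_t j_end) {
    for (size_t j = jj; j < j_end; j++)
        for (int b = 0; b < n_batches; b++)
            c[b][i] += w[b][i * cols + j] * x[j];
}

/* v2-style: batch hoisted out; each batch streams through a contiguous
 * run of w[b] addresses (friendlier to the prefetcher), while
 * x[jj..j_end] stays hot in cache across batches. */
void calc_col_batch_outside(float **c, const float *const *w, const float *x,
                            int n_batches, size_t cols, size_t i,
                            size_t jj, size_t j_end) {
    for (int b = 0; b < n_batches; b++)
        for (size_t j = jj; j < j_end; j++)
            c[b][i] += w[b][i * cols + j] * x[j];
}
```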

Lastly, I will also attach the file that I am using for testing the speeds of the matmul functions. Thank you for your comments, @mikowals.
test_matmul_cache.mojo.txt. Also, these tests were done on an AMD Ryzen 3600X.
