Idea for matmul tiling #67
base: master
Conversation
I'm still a Mojo noob. Can you emit LLVM IR as a target? I have an idea for fuzzing this to find what the performance bottleneck actually is.
I'm not sure; maybe the debug-level option can do it, but probably not. Here is the link to the compilation options: cli-options. I didn't find anything in the docs about an option to emit LLVM IR or MLIR, so I'm not sure. But you could ask in the Discord, maybe someone knows.
**Performance comparison:** For this version of the matmul tiling with batching, I have seen in isolated benchmarks that it can be up to 1.6x faster than the present version of batch matmul, but only once we get to bigger rows x cols sizes; at smaller sizes it can sometimes be up to 10% slower. The other problem is that when we increase the batch size it gets slower, and it can end up slower than the present version.

**Uncertain performance factors:** Right now I don't know if my version is sometimes much faster thanks to reduced cache misses, or if it's the `unroll` and `vectorize_unroll` calls that are helping.

**Proposed solution for the batch-size impact:** For the problem where increasing the batch size makes the batch_matmul_tiling version slower, I don't know if a good idea could be to divide the work between cores by batch, so that instead of one core working on three batches, each core works on only one batch (a sketch of this split follows this comment).

**Request for input:** If somebody has an idea of how to make this better, can explain what is going wrong, or wants to say that this idea doesn't help make the matmul operation faster, I would appreciate the help.

**Isolated benchmarks:** In the benchmark results, matmul_no_tiling is the matmul version implemented right now in llama2.mojo, v1 is an old version of my batch_matmul_tiling, and v2 is the version implemented in the most recent commit of this branch. "Size" refers to the batch size. (Tests were done on a Ryzen 3600X.)
And here is the information on the best values of tiles_j and tiles_i for each rows x cols size and for each batch size (attached results for batch size 1, batch size 2, and batch size 3).
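To make the per-batch split concrete, here is a minimal sketch of the idea, written against the Mojo 0.x-era API that llama2.mojo used at the time (`DTypePointer`, `parallelize`); the flattened buffer layout, the names, and the tile sizes are all illustrative assumptions, not the code in this PR:

```mojo
from algorithm import parallelize
from math import min

alias tile_i = 4   # rows per tile (illustrative)
alias tile_j = 64  # cols per tile (illustrative)

# Assumed layout: A is one shared input vector of length cols, B holds
# `batch` weight matrices of rows x cols each, and C holds `batch`
# output vectors of length rows, all zero-initialized beforehand.
fn batch_matmul_tiled(
    C: DTypePointer[DType.float32],
    A: DTypePointer[DType.float32],
    B: DTypePointer[DType.float32],
    batch: Int, rows: Int, cols: Int,
):
    @parameter
    fn do_batch(b: Int):
        var w = B.offset(b * rows * cols)  # this batch's weight matrix
        var out = C.offset(b * rows)       # this batch's output vector
        for ii in range(0, rows, tile_i):
            for jj in range(0, cols, tile_j):
                # The jj-slice of A stays hot in cache while every row
                # of this ii-tile reuses it.
                for i in range(ii, min(ii + tile_i, rows)):
                    var acc: Float32 = 0
                    for j in range(jj, min(jj + tile_j, cols)):
                        acc += A.load(j) * w.load(i * cols + j)
                    out.store(i, out.load(i) + acc)

    # One work item per batch entry: each core owns a whole batch
    # instead of several cores interleaving inside one. Note that with
    # a batch size smaller than the core count this leaves cores idle,
    # so rows would probably also need to be split.
    parallelize[do_batch](batch)
```

Splitting at batch granularity keeps each core's working set to a single weight matrix, which is the cache argument behind the proposal.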
The speedup for the 30000 x 288 size could be really significant. That matmul to produce the logits is usually 15% of the total time, so a separate matmul function for the logits could speed up the whole network by 5%. The way you have implemented the […]

On a bit of a tangent: from looking at the recent improvement in examples/matmul.mojo, the optimal performance for some of the tiling strategies comes with a huge number of workers and very small SIMD widths. The standard we have found in this repo, a SIMD width of 16 and 6 workers, is almost certainly a local optimum that is not likely to work well for other tiling strategies. There are a lot of magic numbers to fiddle with, but it is probably worth checking a SIMD width of 4 and workers equal to […].

Thanks for continuing to pursue this style of improvement in matmul and for sharing your benchmarking results.
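For anyone who wants to run that sweep, below is a minimal sketch of the row-parallel matvec with the two magic numbers, SIMD width and worker count, lifted out as knobs. It assumes the era's `vectorize`/`parallelize` signatures and `DTypePointer` buffers, and all names here are hypothetical rather than the repo's actual kernel:

```mojo
from algorithm import parallelize, vectorize

# nelts is the SIMD width; it must be a compile-time parameter.
fn matvec[nelts: Int](
    C: DTypePointer[DType.float32],
    A: DTypePointer[DType.float32],
    B: DTypePointer[DType.float32],
    rows: Int, cols: Int, workers: Int,
):
    @parameter
    fn compute_row(i: Int):
        var total: Float32 = 0

        @parameter
        fn dot[width: Int](j: Int):
            # width == nelts in the main body, smaller in the tail.
            total += (A.simd_load[width](j)
                      * B.simd_load[width](i * cols + j)).reduce_add()

        vectorize[nelts, dot](cols)
        C.store(i, total)

    parallelize[compute_row](rows, workers)

# Example sweep:
#   matvec[16](C, A, B, rows, cols, 6)   # the repo's current default
#   matvec[4](C, A, B, rows, cols, 24)   # small width, many workers
```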
And also, testing sizes that aren't powers of 2 is probably not useful. Neural networks basically stick to powers of 2, so it is not a very common case. By testing sizes that aren't powers of 2 you end up with tails in the vectorize calls that use a SIMD width of 1. SIMD parallelism is really fast, so those scalar tail loops will dominate the time. One way to measure the impact of these tail loops, if you are curious, is to replace vectorize […] which only takes an integer […].

We probably should change to […].
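To illustrate the tail, here is a small sketch (same assumed era API, hypothetical names) that sums a buffer with `vectorize`: with nelts = 16, a power-of-2 size like 1024 runs 64 full-width steps and no tail, while a size like 1000 runs 62 full-width steps plus 8 scalar width-1 steps at the end.

```mojo
from algorithm import vectorize

alias nelts = 16

fn sum_buffer(x: DTypePointer[DType.float32], size: Int) -> Float32:
    var total: Float32 = 0

    @parameter
    fn step[width: Int](offset: Int):
        # Main body: width == nelts. Tail: when size is not a multiple
        # of nelts, the last size % nelts elements arrive here one at a
        # time with width == 1, i.e. as scalar iterations.
        total += x.simd_load[width](offset).reduce_add()

    vectorize[nelts, step](size)
    return total
```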
Lastly, I will also attach the file that I am using for testing the speeds of the matmul functions. And thank you for your comments, @mikowals.
I implemented an idea for matmul tiling, but I didn't see any performance improvement; I get the same results as the original matmul_parallel implementation, so maybe I don't completely understand the concepts of cache misses and tiling.
I did this implementation because it could maybe use autotune to pick the tiling sizes for the hardware (a parameterized sketch follows), but if you feel that this implementation doesn't work or isn't necessary, you can close this pull request.
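As a minimal sketch of that autotune direction, the tile sizes can be lifted into compile-time parameters so that different shapes can be instantiated and benchmarked (or, eventually, autotuned) per machine. This is an illustration of the concept under the era's `DTypePointer` API, not the code in this PR:

```mojo
from math import min

# C[i] += sum over j of A[j] * B[i, j]; assumes C is zero-initialized.
fn matmul_tiled[tile_i: Int, tile_j: Int](
    C: DTypePointer[DType.float32],
    A: DTypePointer[DType.float32],
    B: DTypePointer[DType.float32],
    rows: Int, cols: Int,
):
    for ii in range(0, rows, tile_i):
        for jj in range(0, cols, tile_j):
            # Each jj-block of A is reused by every row of the ii-block,
            # which is what tiling buys over the naive loop order.
            for i in range(ii, min(ii + tile_i, rows)):
                var acc: Float32 = 0
                for j in range(jj, min(jj + tile_j, cols)):
                    acc += A.load(j) * B.load(i * cols + j)
                C.store(i, C.load(i) + acc)

# Instantiate candidate tile shapes and time each one per machine:
#   matmul_tiled[4, 64](C, A, B, rows, cols)
#   matmul_tiled[8, 32](C, A, B, rows, cols)
```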