Metal support #127
base: master
Conversation
Seems to be only marginally faster compared to pure AMX
Can't build it on M1 Max: `c++ -I. -I./examples -O3 -std=c++11 -pthread examples/main/main.cpp whisper.o ggml.o -o main -framework Accelerate`
@DiegoGiovany This may or may not be helpful, but Warren Moore writes:
Hi. Firstly, thanks for this repo. This project is awesome! Forgive me if I'm incorrect in understanding the ramifications of this, but one thought after a brief look at this PR: it might make sense to decouple the command buffer commit / wait / read-back cycle from each function call. Is it feasible to instead commit the first set of operations to a MTLBuffer as necessary, keep the compute on the GPU, encode all of the multiplies in a single command buffer, not read anything back in between, and do a single read-back at the very end of the calculation? This would remove any CPU / GPU pipeline stalls, keep compute on the GPU, and also allow for some work to be done on the CPU while waiting for the GPU to complete. Forgive me if I don't see the side effects of this proposed change (I'm not familiar enough with the internals of how Whisper works). Thank you!
@vade For example, if I have the following operations:

```cpp
auto c = ggml_mul_mat(ctx, a, b);
auto e = ggml_mul_mat(ctx, c, d);
auto g = ggml_mul_mat(ctx, e, f);
// do something with "g"
```

Ideally I would want this to be a single command buffer with 3 matrix multiplications that starts with [...]. The proposal in this PR is a very rough starting point and is for sure far from optimal.
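For reference, a minimal Swift sketch of what such a single command buffer could look like with Metal Performance Shaders follows. It is not code from this PR: the function name `chainedMatMul`, the square n×n float32 shapes, and the shared-storage buffers are assumptions made purely for illustration.

```swift
import Metal
import MetalPerformanceShaders

// Sketch: encode the three chained multiplications (c = a*b, e = c*d, g = e*f)
// into ONE command buffer, commit once, wait once, and read back only "g".
// All matrices are assumed square (n x n), row-major, float32.
func chainedMatMul(n: Int, a: [Float], b: [Float], d: [Float], f: [Float]) -> [Float]? {
    guard let device = MTLCreateSystemDefaultDevice(),
          let queue  = device.makeCommandQueue(),
          let cmdBuf = queue.makeCommandBuffer() else { return nil }

    let rowBytes = n * MemoryLayout<Float>.stride
    let desc = MPSMatrixDescriptor(rows: n, columns: n, rowBytes: rowBytes, dataType: .float32)

    // Wrap host data (or empty scratch space) in a shared MTLBuffer + MPSMatrix.
    func matrix(_ data: [Float]?) -> MPSMatrix? {
        let length = n * rowBytes
        let buf: MTLBuffer?
        if let data = data {
            buf = data.withUnsafeBytes {
                device.makeBuffer(bytes: $0.baseAddress!, length: length, options: .storageModeShared)
            }
        } else {
            buf = device.makeBuffer(length: length, options: .storageModeShared)
        }
        return buf.map { MPSMatrix(buffer: $0, descriptor: desc) }
    }

    guard let ma = matrix(a), let mb = matrix(b), let mc = matrix(nil),
          let md = matrix(d), let me = matrix(nil),
          let mf = matrix(f), let mg = matrix(nil) else { return nil }

    let mul = MPSMatrixMultiplication(device: device,
                                      transposeLeft: false, transposeRight: false,
                                      resultRows: n, resultColumns: n, interiorColumns: n,
                                      alpha: 1.0, beta: 0.0)

    // Intermediate results c and e never leave the GPU; no per-op commit or readback.
    mul.encode(commandBuffer: cmdBuf, leftMatrix: ma, rightMatrix: mb, resultMatrix: mc)
    mul.encode(commandBuffer: cmdBuf, leftMatrix: mc, rightMatrix: md, resultMatrix: me)
    mul.encode(commandBuffer: cmdBuf, leftMatrix: me, rightMatrix: mf, resultMatrix: mg)

    cmdBuf.commit()
    cmdBuf.waitUntilCompleted()   // single synchronization point at the very end

    // Single readback of the final result "g".
    let ptr = mg.data.contents().bindMemory(to: Float.self, capacity: n * n)
    return Array(UnsafeBufferPointer(start: ptr, count: n * n))
}
```

The structure is the point: the intermediate results `c` and `e` stay in their MTLBuffers on the GPU, so there is only one commit / waitUntilCompleted pair and one readback for `g`, and the CPU is free to do other work until that final wait.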
Thanks @ggerganov - and to be clear, I wasn't trying to point out any flaws. I'm aware this entire endeavor is a work in progress and there are a lot of moving pieces (and bravo on that!). I was hesitant to mention it only because I'm not entirely familiar with the code base or Whisper's internals as implemented here. Does it make sense to break down some changes that would benefit GPU pipelining on all supported platforms? My suspicion is that anything Metal benefits from would also benefit CUDA, etc. May I propose a few baby steps to break this potentially large change into manageable pieces for all platforms and make integration easier?
Apologies, I'm not intending to step in and try to manage your project, just to start a conversation and make a set of actionable proposals that the community can rally around :) Thank you, and again, this project is really awesome. My assumptions for the changes would be:
LMK - I'm happy to help, and potentially even sponsor some of this development.
Hi, just curious if this is still on the roadmap and being actively worked on? Thanks for your hard work.
ggerganov/llama.cpp#1642
Yes, it will come for sure.
It is already optimized for Apple silicon via ARM NEON, the Accelerate framework and Core ML. I am using the medium.en model and it is super fast on my M1 Pro 16GB; it is absolutely amazing. Only the first run is slow. Can Metal make it even faster? That would be unbelievable.
This is a quick-and-dirty implementation of GPU support for Apple hardware using Metal Performance Shaders. It demonstrates how part of the feed-forward layer in the encoder can be offloaded to the GPU.
On my MacBook M1 Pro, I don't observe a significant performance gain compared to the original implementation. Either I have a problem in my MPS integration, or the AMX coprocessor is simply doing a good enough job and adding Metal does not really help.
In any case, this PR can be a good starting point for anyone interested in adding GPU support to `ggml`. I think a similar approach can be taken for CUDA. For now, I don't plan to merge this into `master` unless the performance gets better.
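For readers unfamiliar with Metal Performance Shaders, the per-operation cycle discussed in this thread looks roughly like the Swift sketch below. It is not the PR's actual code: the function name, square float32 shapes, and buffer options are assumptions. The point is only that each offloaded call performs its own commit, wait, and readback, which is exactly the overhead the single-command-buffer proposal above would amortize.

```swift
import Metal
import MetalPerformanceShaders

// Sketch of the per-call offload pattern: a single matrix multiplication is sent
// to the GPU and every call pays a full commit / wait / read-back cycle.
// Not the PR's code; names and the square n x n float32 shapes are assumptions.
func mpsMatMulOnce(n: Int, a: [Float], b: [Float]) -> [Float]? {
    guard let device = MTLCreateSystemDefaultDevice(),
          let queue  = device.makeCommandQueue(),
          let cmdBuf = queue.makeCommandBuffer() else { return nil }

    let rowBytes = n * MemoryLayout<Float>.stride
    let desc = MPSMatrixDescriptor(rows: n, columns: n, rowBytes: rowBytes, dataType: .float32)

    guard let bufA = a.withUnsafeBytes({ device.makeBuffer(bytes: $0.baseAddress!, length: n * rowBytes, options: .storageModeShared) }),
          let bufB = b.withUnsafeBytes({ device.makeBuffer(bytes: $0.baseAddress!, length: n * rowBytes, options: .storageModeShared) }),
          let bufC = device.makeBuffer(length: n * rowBytes, options: .storageModeShared) else { return nil }

    let mul = MPSMatrixMultiplication(device: device,
                                      transposeLeft: false, transposeRight: false,
                                      resultRows: n, resultColumns: n, interiorColumns: n,
                                      alpha: 1.0, beta: 0.0)
    mul.encode(commandBuffer: cmdBuf,
               leftMatrix:   MPSMatrix(buffer: bufA, descriptor: desc),
               rightMatrix:  MPSMatrix(buffer: bufB, descriptor: desc),
               resultMatrix: MPSMatrix(buffer: bufC, descriptor: desc))

    cmdBuf.commit()
    cmdBuf.waitUntilCompleted()   // CPU stalls here on every offloaded call

    let ptr = bufC.contents().bindMemory(to: Float.self, capacity: n * n)
    return Array(UnsafeBufferPointer(start: ptr, count: n * n))
}
```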