vulkan: use kompute matmul shaders on embedded GPUs #11525
Conversation
I've used variants of … Also, if this PR makes sense and goes in, I think we should consider dropping the Kompute backend.
My test machine is an Apple M3 Pro with 36 GB of RAM. Functionally speaking, this change allows me to use various models with the Vulkan backend both natively, on macOS, and in a libkrun/krunkit microVM (which uses virtio-gpu+venus) running Linux. The performance numbers look like this:
The huge hit in prompt processing is expected, because the kompute shaders don't have proper mul_mm support, something I intend to fix soon. The ~7% drop in token generation between native Metal and native Vulkan is unexpected, given the similarities between the MSL and GLSL implementations of mat_mul for q4_k, and deserves further investigation. The ~17% drop in token generation between native Metal and microVM Vulkan is also unexpected, and requires investigation from the VMM (Virtual Machine Monitor) perspective as well, since Activity Monitor indicates that the GPU isn't fully saturated.
Very cool that you got it to work! I added a few comments; I think this can be simplified further. You're also still welcome to contact me on Discord (_occam), where I could assist more directly.
ggml/src/ggml-kompute/kompute
Outdated
This seems to be unrelated.
Ouch, yes. I'll drop it.
ggml/src/ggml-vulkan/ggml-vulkan.cpp
Outdated
@@ -166,6 +166,7 @@ struct vk_device_struct {
     uint32_t subgroup_size;
     uint32_t shader_core_count;
     bool uma;
+    bool embedded;
embedded doesn't quite fit, I think. Maybe something like matmul_fallback or simple_shaders?
I think I prefer simple_shaders over matmul_fallback, because we might need to go beyond matmul.
ggml/src/ggml-vulkan/ggml-vulkan.cpp
Outdated
@@ -4358,6 +4388,167 @@ static void ggml_vk_mul_mat(ggml_backend_vk_context * ctx, vk_context& subctx, c
     }
 }

+static void ggml_vkemb_mul_mat(ggml_backend_vk_context * ctx, vk_context& subctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst, bool dryrun = false) {
I think this can probably be handled with the main ggml_vk_mul_mat_q_f16 function. This would also give you the rest of the quants, since the function can do dequant in a separate shader + matmul in float16, and then you can avoid the separate backend and supports_op function.
Maybe switching the pipeline in ggml_vk_guess_matmul_pipeline and the parameter handling in ggml_vk_matmul?
The problem is that we need to reject OPs for MUL_MAT with unsupported quants (the complex implementations don't work) and whole OPs like MUL_MAT_ID. Other OPs also have trouble with certain params. I think the requirement of having a separate supports_op function is going to be very hard to avoid.
Maybe switching the pipeline in ggml_vk_guess_matmul_pipeline and the parameter handling in ggml_vk_matmul?
OK, let me see if I can make it fit.
I guess if the matmul shaders are problematic, maybe the dequants also don't work, so some quants may just not work.
The problem is that we need to reject OPs for MUL_MAT with unsupported quants (the complex implementations don't work) and whole OPs like MUL_MAT_ID. Other OPs also have trouble with certain params. I think the requirement of having a separate supports_op function is going to be very hard to avoid.
I understand. But at least you won't need to duplicate the backend if you add your function in front of the existing one and just jump directly to it if the device doesn't need the simple shaders.
-    std::string in_path = join_paths(input_dir, in_fname);
+    std::string in_path;
+    if (is_embed) {
+        in_path = join_paths(input_dir + "/../../ggml-kompute/kompute-shaders", in_fname);
I think it's better to copy shaders that we want to use, to keep the backends separated. This also means that we can do some Vulkan-backend-specific changes or optimizations.
OK. That's what I did at first, but it felt weird having duplicated files. But if you prefer that option, I'm fine with it.
What is the long-term plan for these shaders? If MoltenVK fixes spec constants (or we can otherwise work around it, which IMO is doable by using reasonable defaults for the spec constants), do we still keep these shaders?
It's not just MoltenVK: Qualcomm also suffers from shader compiler bugs, and there are other devices (for example Raspberry Pi) that have basic Vulkan support but can't handle the current shaders. I think it's a good idea to have fallbacks for these cases; maybe look into how far you can optimize for these tiny GPUs.
Changes look reasonable to me so far
Force-pushed from 301332a to 7855aeb.
I think you could completely avoid … It's not immediately necessary, though; we could also reduce redundancy later.
This could replace the Qualcomm OpenCL backend as well if their compiler was able to handle these shaders. Does the Kompute backend currently work on phones, ARM boards, and so forth?
I was actually blocked by that when I tried to run the Kompute backend in the original PR (wow, it's been a year already 😃). I only skimmed through the shaders, but there are only a few places which actually do fp16 math, and those can easily be switched to regular floats for compatibility. Then again, those devices that can't run the full shaders are probably slow and will need all the help they can get...
Yeah, I would rather do that later, after things settle down a bit.
I don't have the hardware to test (the only thing I have is powered by an old Adreno 618), but it seems like from 6xx Gen3 onwards, Adreno GPUs should support …
Note the difference between … But that is also not immediately necessary here; we can improve that later. But to allow testing on other GPUs, can you add support for an environment variable to force simple shaders? I think the easiest way to implement this would be (example code, not tested):
We should also add a CI build and test for simple shaders. Let me know if you need help with that.
We're not testing other shader variants either (e.g. the coopmat stuff); it's fine for now. Let's get it set up and figure out the code first.
Turns out … @0cc4m, do you think we could merge this one in its current state and improve on it later?
Yeah, sure, I'll take a look and make sure that it doesn't break anything else. Just FYI, there was a bug/missing feature in MoltenVK that may have been the source of the shader trouble; a (hacky) fix was just proposed that addresses it: ollama/ollama#1016 (comment)
Is soft_max only problematic because it also uses an array sized by a spec constant? There's a MoltenVK fix in progress for that, so a lot of this should become unnecessary soon. If you found a bug that test-backend-ops isn't catching, can you add a new case to test-backend-ops to cover it?
Yup, I hope that MoltenVK will eventually be able to deal with the regular shaders, but that's going to take a while to make it into people's computers. I also suspect some GPUs (namely, Qualcomm's) will still need the simpler shaders.
Given the way it fails, I suspect there's something else going on. I'd like to find some time next week to give that MoltenVK fix a try and see whether it fixes softmax too.
The problem is that I know it fails because the model derails, and I found it to be softmax by disabling/enabling OPs individually, but I don't know the parameters which made softmax fail. Is there a reasonable way to find that out? I've tried building with …
Maybe just print out the parameters for the softmax; usually there are only a handful of unique combinations in a model.
Ack, I'll give it a try, thanks!
Import simpler matmul shaders from the kompute backend and use them on GPUs known to not be able to use the regular ones. Signed-off-by: Sergio Lopez <slp@redhat.com>
Even though the regular softmax shaders successfully pass test-backend-ops with Apple GPUs, running long inference tests has shown the models end up derailing, with softmax OPs being the root cause. With this commit, we use simpler softmax shaders borrowed from the Kompute backend (which are basically reimplementations of the Metal shaders) on certain GPUs known to have problems with the regular ones. Signed-off-by: Sergio Lopez <slp@redhat.com>
It was found that softmax vulkan shaders may fail in some contexts when ne00 > 1024, so let's add a test for it. Signed-off-by: Sergio Lopez <slp@redhat.com>
I've confirmed the problem arises when ne00 > 1024, and added a test for that. I've also confirmed that KhronosGroup/MoltenVK#2441 fixes the problem with Apple GPUs, so unless this is needed for Qualcomm GPUs (I don't have an 8cx to test it), we can probably close this one without merging.
No, it actually doesn't even support Intel GPUs. We have also tried on a Qualcomm Adreno X1-85 without success. Only NVIDIA and AMD GPUs are well-supported by these shaders in our experience. Not surprising that they are compatible with MoltenVK, since they were based on the Metal shaders.
That's up to you. There could at least be a use for a simple shader replacement for matrix multiplication for small GPUs, but you can also leave working on that to me or others. If you are mainly interested in MoltenVK, maybe you can look into optimizing the Vulkan backend for Apple hardware? There's some straightforward matmul tuning that could help and more difficult shader optimization work yet to happen.
Yes, let's close this one and focus on fine-tuning instead.
@slp First step you can do is look into the matmul shader sizes and see which combination is optimal: https://github.com/ggerganov/llama.cpp/blob/198b1ec611a2c551ea40e5b9c0b862f37555a4cc/ggml/src/ggml-vulkan/ggml-vulkan.cpp#L2637-L2644 The values set here are very old and came from an attempt at making it work at all.
I was the one who created the PR to restrict Apple Silicon devices to M-sized MAT_MUL :-) I tested it with … For token generation, given that the simpler shaders' mat_mul (which is more or less a straightforward translation of the Metal shaders) gives almost the same numbers as the regular ones, I'm inclined to think there's some penalty introduced by MoltenVK. It could be the way it manages the buffers, or perhaps some semantics lost during the SPIR-V->MSL translation, or something else. In any case, the current performance of the Vulkan backend is so close to MSL that the difference is barely noticeable (if noticeable at all) when actually doing local inference, so I'd say we're good on that front.
This PR enables the Vulkan backend to make use of the simpler kompute MAT_MUL shaders when operating with GPUs that can't deal with the regular shaders.
For the moment, the only GPUs enabled to use these shaders are Apple's (since using them implies going through MoltenVK, which is unable to translate the regular shaders properly), but they can potentially be useful for other GPUs too.
cc/ @0cc4m @ericcurtin