
vulkan: use kompute matmul shaders on embedded GPUs #11525

Closed · wants to merge 3 commits

Conversation

slp (Collaborator) commented Jan 30, 2025

This PR enables the Vulkan backend to use the simpler kompute MAT_MUL shaders when running on GPUs that can't handle the regular shaders.

For the moment, the only GPUs enabled to use these shaders are Apple's (since using them implies going through MoltenVK, which is unable to translate the regular shaders properly), but they could potentially be useful for other GPUs as well.
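As a rough sketch of the detection this implies (the exact check isn't shown in this thread; only the "Apple-only for now" policy comes from the PR text):

```cpp
// Hypothetical detection helper: treat Apple GPUs (exposed via MoltenVK) as
// needing the simpler shaders. The function name and placement are
// assumptions, not code from this PR.
static bool ggml_vk_device_is_embedded(const vk::PhysicalDeviceProperties & props) {
    return props.vendorID == 0x106B; // Apple's PCI vendor ID
}
```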

cc/ @0cc4m @ericcurtin

github-actions bot added the Vulkan, ggml, and Kompute labels Jan 30, 2025
slp (Collaborator, Author) commented Jan 30, 2025

I've used variants of "embedded" to refer to the GPUs that may need the simpler shaders, but I'm not completely sold on that terminology. If someone comes up with a better one, I'd really appreciate it.

Also, if this PR makes sense and goes in, I think we should consider dropping the Kompute backend.

0cc4m self-requested a review January 30, 2025 20:04

slp (Collaborator, Author) commented Jan 30, 2025

My test machine is an Apple M3 Pro with 36 GB of RAM. Functionally speaking, this change allows me to use various models with the Vulkan backend both natively on macOS and in a libkrun/krunkit microVM (which uses virtio-gpu+venus) running Linux.

The performance numbers are as follows:

• macOS with the Metal backend:

| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 7B Q4_K - Medium         |   4.07 GiB |     7.24 B | Metal,BLAS |       6 |         pp512 |        326.71 ± 0.08 |
| llama 7B Q4_K - Medium         |   4.07 GiB |     7.24 B | Metal,BLAS |       6 |         tg128 |         28.24 ± 0.02 |

build: b4d92a59 (4484)

• macOS with the Vulkan backend:

| model                          |       size |     params | backend     | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ----------- | ------: | ------------: | -------------------: |
| llama 7B Q4_K - Medium         |   4.07 GiB |     7.24 B | Vulkan,BLAS |       6 |         pp512 |         44.53 ± 0.06 |
| llama 7B Q4_K - Medium         |   4.07 GiB |     7.24 B | Vulkan,BLAS |       6 |         tg128 |         26.38 ± 0.09 |

• microVM (Linux under libkrun) with the Vulkan backend:

| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| llama 7B Q4_K - Medium         |   4.07 GiB |     7.24 B | Vulkan     |  99 |         pp512 |         40.88 ± 0.06 |
| llama 7B Q4_K - Medium         |   4.07 GiB |     7.24 B | Vulkan     |  99 |         tg128 |         23.38 ± 0.15 |

The huge hit in prompt processing is expected, because the kompute shaders don't have proper mul_mm support, something I intend to fix soon.

The ~7% drop in token generation between native Metal and native Vulkan is unexpected, given the similarities between the MSL and GLSL implementations of mat_mul for q4_k, and deserves further investigation.

The ~17% drop in token generation between native Metal and microVM Vulkan is also unexpected, and requires investigation from the VMM (Virtual Machine Monitor) perspective too, since Activity Monitor indicates that the GPU isn't fully saturated.
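(For reference: 28.24 → 26.38 t/s is a 6.6% drop, and 28.24 → 23.38 t/s is a 17.2% drop, matching the approximate figures above.)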

0cc4m (Collaborator) left a comment

Very cool that you got it to work! I added a few comments; I think this can be simplified further. You're also still welcome to contact me on Discord (_occam), where I could assist more directly.

0cc4m (Collaborator):

This seems to be unrelated.

slp (Collaborator, Author):

Ouch, yes. I'll drop it.

```diff
@@ -166,6 +166,7 @@ struct vk_device_struct {
     uint32_t subgroup_size;
     uint32_t shader_core_count;
     bool uma;
+    bool embedded;
```
0cc4m (Collaborator):

embedded doesn't quite fit, I think. Maybe something like matmul_fallback or simple_shaders?

slp (Collaborator, Author):

I think I prefer simple_shaders over matmul_fallback, because we might need to go beyond matmul.
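For context, a minimal sketch of where such a flag (whatever its final name) might be set during device initialization. This is assumed, not taken from the PR; it reuses the hypothetical detection helper sketched earlier:

```cpp
// Hypothetical: set the per-device flag once, right after the physical
// device is selected. The field name follows the diff above; it may end up
// renamed to simple_shaders per this discussion.
device->embedded = ggml_vk_device_is_embedded(device->physical_device.getProperties());
```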

```diff
@@ -4358,6 +4388,167 @@ static void ggml_vk_mul_mat(ggml_backend_vk_context * ctx, vk_context& subctx, c
     }
 }
 
+static void ggml_vkemb_mul_mat(ggml_backend_vk_context * ctx, vk_context& subctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst, bool dryrun = false) {
```
0cc4m (Collaborator):

I think this can probably be handled with the main ggml_vk_mul_mat_q_f16 function. This would also give you the rest of the quants, since the function can do dequant in a separate shader + matmul in float16, and then you can avoid the separate backend and supports_op function.

Maybe switching the pipeline in ggml_vk_guess_matmul_pipeline and the parameter handling in ggml_vk_matmul?

slp (Collaborator, Author):

The problem is that we need to reject OPs for MUL_MAT with unsupported quants (the complex implementations don't work) and whole OPs like MUL_MAT_ID. Other OPs also have trouble with certain params. I think the requirement of having a separate supports_op function is going to be very hard to avoid.

> Maybe switching the pipeline in ggml_vk_guess_matmul_pipeline and the parameter handling in ggml_vk_matmul?

OK, let me see if I can make it fit.

0cc4m (Collaborator):

I guess if the matmul shaders are problematic, maybe the dequants also don't work, so some quants may just not work.

> The problem is that we need to reject OPs for MUL_MAT with unsupported quants (the complex implementations don't work) and whole OPs like MUL_MAT_ID. Other OPs also have trouble with certain params. I think the requirement of having a separate supports_op function is going to be very hard to avoid.

I understand. But at least you won't need to duplicate the backend if you add your function in front of the existing one and just jump directly to it if the device doesn't need the simple shaders.
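A minimal sketch of the dispatch 0cc4m is describing, assuming a device flag along the lines discussed above; the signatures follow the diff earlier in this thread, but the body is illustrative, not the actual PR code:

```cpp
// Illustrative front dispatch: devices that can handle the regular shaders
// jump straight to the existing path; only "simple shader" devices take the
// kompute-derived path.
static void ggml_vk_mul_mat(ggml_backend_vk_context * ctx, vk_context & subctx,
                            const ggml_tensor * src0, const ggml_tensor * src1,
                            ggml_tensor * dst, bool dryrun = false) {
    if (!ctx->device->simple_shaders) {
        ggml_vk_mul_mat_q_f16(ctx, subctx, src0, src1, dst, dryrun); // regular path
        return;
    }
    ggml_vkemb_mul_mat(ctx, subctx, src0, src1, dst, dryrun); // simple-shader path
}
```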

```diff
-    std::string in_path = join_paths(input_dir, in_fname);
+    std::string in_path;
+    if (is_embed) {
+        in_path = join_paths(input_dir + "/../../ggml-kompute/kompute-shaders", in_fname);
```
0cc4m (Collaborator):

I think it's better to copy shaders that we want to use, to keep the backends separated. This also means that we can do some Vulkan-backend-specific changes or optimizations.

slp (Collaborator, Author) commented Jan 30, 2025

OK. That's what I did at first, but it felt weird having duplicated files. If you prefer that option, though, I'm fine with it.

jeffbolznv (Collaborator) commented

What is the long-term plan for these shaders? If MoltenVK fixes spec constants (or we can otherwise work around it, which IMO is doable by using reasonable defaults for the spec constants), do we still keep these shaders?

0cc4m (Collaborator) commented Jan 30, 2025

> What is the long-term plan for these shaders? If MoltenVK fixes spec constants (or we can otherwise work around it, which IMO is doable by using reasonable defaults for the spec constants), do we still keep these shaders?

It's not just MoltenVK; Qualcomm also suffers from shader compiler bugs, and there are other devices (for example Raspberry Pi) that have basic Vulkan support but can't handle the current shaders. I think it's a good idea to have fallbacks for these cases, and maybe look into how far you can optimize for these tiny GPUs.

ericcurtin (Collaborator) left a comment

Changes look reasonable to me so far

slp force-pushed the vulkan-embedded branch 2 times, most recently from 301332a to 7855aeb, February 3, 2025 17:33
slp marked this pull request as ready for review February 3, 2025 17:39
0cc4m (Collaborator) commented Feb 3, 2025

I think you could completely avoid simpler_common.comp; it seems to serve the same purpose as types.comp and dequant_funcs.comp, but with some limitations like requiring 16-bit float math (which not all devices support). But there are a few differences, for example the number of values that get dequantized in one step.

It's not immediately necessary, though; we could also reduce redundancy later.

netrunnereve (Collaborator) commented

> Also, if this PR makes sense and goes in, I think we should consider dropping the Kompute backend.

This could replace the Qualcomm OpenCL backend as well if their compiler were able to handle these shaders. Does the Kompute backend currently work on phones, ARM boards, and so forth?

> but with some limitations like requiring 16-bit float math (which not all devices support)

I was actually blocked by that when I tried to run the Kompute backend in the original PR (wow, it's been a year already 😃). I only skimmed through the shaders, but there are only a few places which actually do fp16 math, and those can easily be switched to regular floats for compatibility. Then again, the devices that can't run the full shaders are probably slow and will need all the help they can get...

slp (Collaborator, Author) commented Feb 4, 2025

> I think you could completely avoid simpler_common.comp; it seems to serve the same purpose as types.comp and dequant_funcs.comp, but with some limitations like requiring 16-bit float math (which not all devices support). But there are a few differences, for example the number of values that get dequantized in one step.
>
> It's not immediately necessary, though; we could also reduce redundancy later.

Yeah, I would rather do that later, after things settle down a bit.

> but with some limitations like requiring 16-bit float math (which not all devices support)

> I was actually blocked by that when I tried to run the Kompute backend in the original PR (wow, it's been a year already 😃). I only skimmed through the shaders, but there are only a few places which actually do fp16 math, and those can easily be switched to regular floats for compatibility. Then again, the devices that can't run the full shaders are probably slow and will need all the help they can get...

I don't have the hardware to test (the only thing I have is powered by an old Adreno 618), but it seems like from 6xx Gen3 onwards, Adreno GPUs should support VK_KHR_16bit_storage.

0cc4m (Collaborator) commented Feb 4, 2025

> I don't have the hardware to test (the only thing I have is powered by an old Adreno 618), but it seems like from 6xx Gen3 onwards, Adreno GPUs should support VK_KHR_16bit_storage.

Note the difference between VK_KHR_16bit_storage (required to load/store 16-bit floats), which the Vulkan backend already requires to work at all, and VK_KHR_shader_float16_int8 (required to run shaders that do 16-bit float arithmetic using GL_EXT_shader_explicit_arithmetic_types_float16), which we don't require. In the places where that is used, we provide another shader variant that only uses 32-bit float arithmetic.

But that is also not immediately necessary here; we can improve that later.

But to allow testing on other GPUs, can you add support for an environment variable to force simple shaders? I think the easiest way to implement this would be (example code, not tested):

```cpp
static bool ggml_vk_device_use_simpler_shaders(vk::PhysicalDeviceProperties *props) {
    if (getenv("GGML_VK_FORCE_SIMPLE_SHADERS")) {
        return true;
    }
    [...]
```
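(Assuming that shape, setting GGML_VK_FORCE_SIMPLE_SHADERS to any value in the environment before running llama-bench or test-backend-ops would force the simple-shader path on any device.)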

netrunnereve (Collaborator) commented Feb 4, 2025

> But to allow testing on other GPUs, can you add support for an environment variable to force simple shaders?

We should also add a CI build and test for simple shaders. Let me know if you need help with that.

0cc4m (Collaborator) commented Feb 4, 2025

> > But to allow testing on other GPUs, can you add support for an environment variable to force simple shaders?
>
> We should also add a CI build and test for simple shaders. Let me know if you need help with that.

We're not testing other shader variants either (e.g. the coopmat stuff), so it's fine for now. Let's get it set up and figure out the code first.

slp (Collaborator, Author) commented Feb 7, 2025

Turns out softmax was also problematic, even though test-backend-ops didn't reveal any problem with it. I've added simpler shaders for it too.

@0cc4m do you think we could merge this one in its current state and improve on it later?

0cc4m (Collaborator) commented Feb 7, 2025

> @0cc4m do you think we could merge this one in its current state and improve on it later?

Yeah, sure, I'll take a look and make sure that it doesn't break anything else.

Just FYI, there was a bug / lacking feature in MoltenVK that may have been the source of the shader trouble; a (hacky) fix that addresses it was just proposed: ollama/ollama#1016 (comment)

jeffbolznv (Collaborator) commented

Is soft_max only problematic because it also uses an array sized by a spec constant? There's a MoltenVK fix in progress for that, so a lot of this should become unnecessary soon.

If you found a bug that test-backend-ops isn't catching, can you add a new case to test-backend-ops to cover it?

slp (Collaborator, Author) commented Feb 7, 2025

> Just FYI, there was a bug / lacking feature in MoltenVK that may have been the source of the shader trouble; a (hacky) fix that addresses it was just proposed: ollama/ollama#1016 (comment)

Yup, I hope that MoltenVK will eventually be able to deal with the regular shaders, but that's going to take a while to make it into people's computers. I also suspect some GPUs (namely, Qualcomm's) will still need the simpler shaders.

> Is soft_max only problematic because it also uses an array sized by a spec constant? There's a MoltenVK fix in progress for that, so a lot of this should become unnecessary soon.

From the way it fails, I suspect there's something else. I'd like to find some time next week to give that MoltenVK fix a try and see whether it fixes softmax too.

> If you found a bug that test-backend-ops isn't catching, can you add a new case to test-backend-ops to cover it?

The problem is that I know it fails because the model derails, and I found out it was softmax by disabling/enabling OPs individually, but I don't know the parameters that made softmax fail. Is there a reasonable way to find that out? I've tried building with GGML_VULKAN_CHECK_RESULTS, but it fails early when running the model, and the values from the CPU (the ones used as reference) look completely bogus.

jeffbolznv (Collaborator) commented

> Is there a reasonable way to find that out?

Maybe just print out the parameters for the softmax; usually there are only a handful of unique combinations in a model.
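A minimal sketch of that kind of instrumentation, assuming a hook somewhere in the backend's op dispatch (the placement and function name are made up; the ggml_tensor fields are real):

```cpp
// Log the source shape of every SOFT_MAX op that flows through the backend,
// so the failing parameter combination can be reproduced in test-backend-ops.
static void log_softmax_params(const ggml_tensor * dst) {
    if (dst->op != GGML_OP_SOFT_MAX) {
        return;
    }
    const ggml_tensor * src0 = dst->src[0];
    fprintf(stderr, "SOFT_MAX: ne = [%lld, %lld, %lld, %lld]\n",
            (long long) src0->ne[0], (long long) src0->ne[1],
            (long long) src0->ne[2], (long long) src0->ne[3]);
}
```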

slp (Collaborator, Author) commented Feb 7, 2025

> > Is there a reasonable way to find that out?
>
> Maybe just print out the parameters for the softmax; usually there are only a handful of unique combinations in a model.

Ack, I'll give it a try, thanks!

slp added 3 commits February 10, 2025 11:52
Import simpler matmul shaders from the kompute backend and use them on
GPUs known to not be able to use the regular ones.

Signed-off-by: Sergio Lopez <slp@redhat.com>
Even though the regular softmax shaders successfully pass
test-backend-ops with Apple GPUs, running long inference tests has shown
the models end up derailing, with softmax OPs being the root cause.

With this commit, we use simpler softmax shaders borrowed from the
Kompute backend (which are basically reimplementations of the Metal
shaders) on certain GPUs known to have problems with the regular ones.

Signed-off-by: Sergio Lopez <slp@redhat.com>
It was found that softmax Vulkan shaders may fail in some contexts when
ne00 > 1024, so let's add a test for it.

Signed-off-by: Sergio Lopez <slp@redhat.com>
github-actions bot added the testing label Feb 10, 2025
slp (Collaborator, Author) commented Feb 10, 2025

> > Is there a reasonable way to find that out?
>
> Maybe just print out the parameters for the softmax; usually there are only a handful of unique combinations in a model.

I've confirmed the problem arises when ne00 > 1024, and added a test for that.
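A hedged sketch of what such a case could look like in tests/test-backend-ops.cpp (the test_soft_max constructor signature here is an assumption; the real test may take additional parameters such as mask, scale, or bias):

```cpp
// Exercise SOFT_MAX with ne00 just past the 1024 boundary that triggered the
// failure; shape order follows ggml's {ne0, ne1, ne2, ne3} convention.
test_cases.emplace_back(new test_soft_max(GGML_TYPE_F32, {1025, 16, 1, 1}));
```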

I've also confirmed that KhronosGroup/MoltenVK#2441 fixes the problem with Apple GPUs, so unless this is needed for Qualcomm GPUs (I don't have an 8cx to test it), we can probably close this one without merging.

cebtenzzre (Collaborator) commented Feb 11, 2025

> Does the Kompute backend currently work on phones, ARM boards, and so forth?

No, it actually doesn't even support Intel GPUs. We have also tried on a Qualcomm Adreno X1-85 without success. Only NVIDIA and AMD GPUs are well supported by these shaders in our experience. It's not surprising that they are compatible with MoltenVK, since they were based on the Metal shaders.

0cc4m (Collaborator) commented Feb 12, 2025

> I've also confirmed that KhronosGroup/MoltenVK#2441 fixes the problem with Apple GPUs, so unless this is needed for Qualcomm GPUs (I don't have an 8cx to test it), we can probably close this one without merging.

That's up to you. There could at least be a use for a simple shader replacement for matrix multiplication for small GPUs, but you can also leave working on that to me or others.

If you are mainly interested in MoltenVK, maybe you can look into optimizing the Vulkan backend for Apple hardware? There's some straightforward matmul tuning that could help and more difficult shader optimization work yet to happen.

slp (Collaborator, Author) commented Feb 12, 2025

> If you are mainly interested in MoltenVK, maybe you can look into optimizing the Vulkan backend for Apple hardware? There's some straightforward matmul tuning that could help and more difficult shader optimization work yet to happen.

Yes, let's close this one and focus on fine-tuning instead.

slp closed this Feb 12, 2025
0cc4m (Collaborator) commented Feb 12, 2025

@slp First step you can do is look into the matmul shader sizes and see which combination is optimal: https://github.com/ggerganov/llama.cpp/blob/198b1ec611a2c551ea40e5b9c0b862f37555a4cc/ggml/src/ggml-vulkan/ggml-vulkan.cpp#L2637-L2644

The values set here are very old and came from an attempt at making it work at all.
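For orientation, the referenced block selects per-size-class matmul workgroup/tile shapes roughly like the schematic below; the names and numbers are illustrative, not the actual llama.cpp constants at that link:

```cpp
// Illustrative only: tuning means sweeping tile shapes like these with
// llama-bench and keeping whichever combination wins on the target GPU.
struct matmul_tile { uint32_t bm, bn, bk; }; // block rows, cols, depth
static constexpr matmul_tile tile_small  {  32,  32, 16 };
static constexpr matmul_tile tile_medium {  64,  64, 16 };
static constexpr matmul_tile tile_large  { 128, 128, 16 };
```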

slp (Collaborator, Author) commented Feb 12, 2025

> @slp First step you can do is look into the matmul shader sizes and see which combination is optimal:
>
> https://github.com/ggerganov/llama.cpp/blob/198b1ec611a2c551ea40e5b9c0b862f37555a4cc/ggml/src/ggml-vulkan/ggml-vulkan.cpp#L2637-L2644
>
> The values set here are very old and came from an attempt at making it work at all.

I was the one who created the PR to restrict Apple Silicon devices to M-sized MAT_MUL :-)

I tested it with llama-bench a couple of days ago, and enabling any other size reduces performance slightly for token generation and by a very large percentage for prompt processing.

For token generation, given that the simpler mat_mul shaders (which are more or less straightforward translations of the Metal shaders) are giving almost the same numbers as the regular ones, I'm inclined to think there's some penalty introduced by MoltenVK. It could be the way it manages the buffers, or perhaps some semantics lost during the SPIR-V->MSL translation, or something else.

In any case, the current performance of the Vulkan backend is so close to MSL's that the difference is barely noticeable (if noticeable at all) when actually doing local inference, so I'd say we're good on that front.

Labels: ggml (changes relating to the ggml tensor library for machine learning), Kompute (https://github.com/KomputeProject/kompute/), testing (everything test related), Vulkan (issues specific to the Vulkan backend)

6 participants