
vulkan: use kompute matmul shaders on embedded GPUs #11525

Closed · wants to merge 3 commits

Conversation

slp (Collaborator) commented Jan 30, 2025

This PR enables the Vulkan backend to use the simpler kompute MAT_MUL shaders when running on GPUs that can't handle the regular shaders.

For the moment, the only GPUs enabled to use these shaders are Apple's (since using them implies going through MoltenVK, which is unable to translate the regular shaders properly), but they could potentially be useful for other GPUs as well.
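As a rough sketch of the detection this implies (the exact check isn't shown in this thread; only the "Apple-only for now" policy comes from the PR text):

```cpp
// Hypothetical detection helper: treat Apple GPUs (exposed via MoltenVK) as
// needing the simpler shaders. The function name and placement are
// assumptions, not code from this PR.
static bool ggml_vk_device_is_embedded(const vk::PhysicalDeviceProperties & props) {
    return props.vendorID == 0x106B; // Apple's PCI vendor ID
}
```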

cc/ @0cc4m @ericcurtin

github-actions bot added the Vulkan, ggml, and Kompute labels Jan 30, 2025
slp (Collaborator, Author) commented Jan 30, 2025

I've used variants of "embedded" to refer to the GPUs that may need the simpler shaders, but I'm not completely sold on that terminology. If someone comes up with a better one, I'd really appreciate it.

Also, if this PR makes sense and goes in, I think we should consider dropping the Kompute backend.

0cc4m self-requested a review January 30, 2025 20:04

slp (Collaborator, Author) commented Jan 30, 2025

My test machine is an Apple M3 Pro with 36 GB of RAM. Functionally speaking, this change allows me to use various models with the Vulkan backend both natively on macOS and in a libkrun/krunkit microVM (which uses virtio-gpu+venus) running Linux.

The performance numbers are as follows:

• macOS with the Metal backend:

| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 7B Q4_K - Medium         |   4.07 GiB |     7.24 B | Metal,BLAS |       6 |         pp512 |        326.71 ± 0.08 |
| llama 7B Q4_K - Medium         |   4.07 GiB |     7.24 B | Metal,BLAS |       6 |         tg128 |         28.24 ± 0.02 |

build: b4d92a59 (4484)

• macOS with the Vulkan backend:

| model                          |       size |     params | backend     | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ----------- | ------: | ------------: | -------------------: |
| llama 7B Q4_K - Medium         |   4.07 GiB |     7.24 B | Vulkan,BLAS |       6 |         pp512 |         44.53 ± 0.06 |
| llama 7B Q4_K - Medium         |   4.07 GiB |     7.24 B | Vulkan,BLAS |       6 |         tg128 |         26.38 ± 0.09 |

• microVM (Linux under libkrun) with the Vulkan backend:

| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| llama 7B Q4_K - Medium         |   4.07 GiB |     7.24 B | Vulkan     |  99 |         pp512 |         40.88 ± 0.06 |
| llama 7B Q4_K - Medium         |   4.07 GiB |     7.24 B | Vulkan     |  99 |         tg128 |         23.38 ± 0.15 |

The huge hit in prompt processing is expected, because the kompute shaders don't have proper mul_mm support, something I intend to fix soon.

The ~7% drop in token generation between native Metal and native Vulkan is unexpected, given the similarities between the MSL and GLSL implementations of mat_mul for q4_k, and deserves further investigation.

The ~17% drop in token generation between native Metal and microVM Vulkan is also unexpected, and requires investigation from the VMM (Virtual Machine Monitor) perspective too, since Activity Monitor indicates that the GPU isn't fully saturated.
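(For reference: 28.24 → 26.38 t/s is a 6.6% drop, and 28.24 → 23.38 t/s is a 17.2% drop, matching the approximate figures above.)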

0cc4m (Collaborator) left a comment

Very cool that you got it to work! I added a few comments; I think this can be simplified further. You're also still welcome to contact me on Discord (_occam), where I could assist more directly.

0cc4m (Collaborator):

This seems to be unrelated.

slp (Collaborator, Author):

Ouch, yes. I'll drop it.

```diff
@@ -166,6 +166,7 @@ struct vk_device_struct {
     uint32_t subgroup_size;
     uint32_t shader_core_count;
     bool uma;
+    bool embedded;
```
0cc4m (Collaborator):

embedded doesn't quite fit, I think. Maybe something like matmul_fallback or simple_shaders?

slp (Collaborator, Author):

I think I prefer simple_shaders over matmul_fallback, because we might need to go beyond matmul.
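For context, a minimal sketch of where such a flag (whatever its final name) might be set during device initialization. This is assumed, not taken from the PR; it reuses the hypothetical detection helper sketched earlier:

```cpp
// Hypothetical: set the per-device flag once, right after the physical
// device is selected. The field name follows the diff above; it may end up
// renamed to simple_shaders per this discussion.
device->embedded = ggml_vk_device_is_embedded(device->physical_device.getProperties());
```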

```diff
@@ -4358,6 +4388,167 @@ static void ggml_vk_mul_mat(ggml_backend_vk_context * ctx, vk_context& subctx, c
     }
 }
 
+static void ggml_vkemb_mul_mat(ggml_backend_vk_context * ctx, vk_context& subctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst, bool dryrun = false) {
```
0cc4m (Collaborator):

I think this can probably be handled with the main ggml_vk_mul_mat_q_f16 function. This would also give you the rest of the quants, since the function can do dequant in a separate shader + matmul in float16, and then you can avoid the separate backend and supports_op function.

Maybe switching the pipeline in ggml_vk_guess_matmul_pipeline and the parameter handling in ggml_vk_matmul?

slp (Collaborator, Author):

The problem is that we need to reject OPs for MUL_MAT with unsupported quants (the complex implementations don't work) and whole OPs like MUL_MAT_ID. Other OPs also have trouble with certain params. I think the requirement of having a separate supports_op function is going to be very hard to avoid.

> Maybe switching the pipeline in ggml_vk_guess_matmul_pipeline and the parameter handling in ggml_vk_matmul?

OK, let me see if I can make it fit.

0cc4m (Collaborator):

I guess if the matmul shaders are problematic, maybe the dequants also don't work, so some quants may just not work.

> The problem is that we need to reject OPs for MUL_MAT with unsupported quants (the complex implementations don't work) and whole OPs like MUL_MAT_ID. Other OPs also have trouble with certain params. I think the requirement of having a separate supports_op function is going to be very hard to avoid.

I understand. But at least you won't need to duplicate the backend if you add your function in front of the existing one and just jump directly to it if the device doesn't need the simple shaders.
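A minimal sketch of the dispatch 0cc4m is describing, assuming a device flag along the lines discussed above; the signatures follow the diff earlier in this thread, but the body is illustrative, not the actual PR code:

```cpp
// Illustrative front dispatch: devices that can handle the regular shaders
// jump straight to the existing path; only "simple shader" devices take the
// kompute-derived path.
static void ggml_vk_mul_mat(ggml_backend_vk_context * ctx, vk_context & subctx,
                            const ggml_tensor * src0, const ggml_tensor * src1,
                            ggml_tensor * dst, bool dryrun = false) {
    if (!ctx->device->simple_shaders) {
        ggml_vk_mul_mat_q_f16(ctx, subctx, src0, src1, dst, dryrun); // regular path
        return;
    }
    ggml_vkemb_mul_mat(ctx, subctx, src0, src1, dst, dryrun); // simple-shader path
}
```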

```diff
-    std::string in_path = join_paths(input_dir, in_fname);
+    std::string in_path;
+    if (is_embed) {
+        in_path = join_paths(input_dir + "/../../ggml-kompute/kompute-shaders", in_fname);
```
0cc4m (Collaborator):

I think it's better to copy shaders that we want to use, to keep the backends separated. This also means that we can do some Vulkan-backend-specific changes or optimizations.

slp (Collaborator, Author) commented Jan 30, 2025

OK. That's what I did at first, but it felt weird having duplicated files. If you prefer that option, though, I'm fine with it.

jeffbolznv (Collaborator) commented

What is the long-term plan for these shaders? If MoltenVK fixes spec constants (or we can otherwise work around it, which IMO is doable by using reasonable defaults for the spec constants), do we still keep these shaders?

0cc4m (Collaborator) commented Jan 30, 2025

> What is the long-term plan for these shaders? If MoltenVK fixes spec constants (or we can otherwise work around it, which IMO is doable by using reasonable defaults for the spec constants), do we still keep these shaders?

It's not just MoltenVK; Qualcomm also suffers from shader compiler bugs, and there are other devices (for example Raspberry Pi) that have basic Vulkan support but can't handle the current shaders. I think it's a good idea to have fallbacks for these cases, and maybe look into how far you can optimize for these tiny GPUs.

ericcurtin (Collaborator) left a comment

Changes look reasonable to me so far

slp force-pushed the vulkan-embedded branch 2 times, most recently from 301332a to 7855aeb, February 3, 2025 17:33
slp marked this pull request as ready for review February 3, 2025 17:39
0cc4m (Collaborator) commented Feb 3, 2025

I think you could completely avoid simpler_common.comp; it seems to serve the same purpose as types.comp and dequant_funcs.comp, but with some limitations like requiring 16-bit float math (which not all devices support). But there are a few differences, for example the number of values that get dequantized in one step.

It's not immediately necessary, though; we could also reduce redundancy later.

netrunnereve (Collaborator) commented

> Also, if this PR makes sense and goes in, I think we should consider dropping the Kompute backend.

This could replace the Qualcomm OpenCL backend as well if their compiler were able to handle these shaders. Does the Kompute backend currently work on phones, ARM boards, and so forth?

> but with some limitations like requiring 16-bit float math (which not all devices support)

I was actually blocked by that when I tried to run the Kompute backend in the original PR (wow, it's been a year already 😃). I only skimmed through the shaders, but there are only a few places which actually do fp16 math, and those can easily be switched to regular floats for compatibility. Then again, the devices that can't run the full shaders are probably slow and will need all the help they can get...

slp (Collaborator, Author) commented Feb 4, 2025

> I think you could completely avoid simpler_common.comp; it seems to serve the same purpose as types.comp and dequant_funcs.comp, but with some limitations like requiring 16-bit float math (which not all devices support). But there are a few differences, for example the number of values that get dequantized in one step.
>
> It's not immediately necessary, though; we could also reduce redundancy later.

Yeah, I would rather do that later, after things settle down a bit.

> but with some limitations like requiring 16-bit float math (which not all devices support)

> I was actually blocked by that when I tried to run the Kompute backend in the original PR (wow, it's been a year already 😃). I only skimmed through the shaders, but there are only a few places which actually do fp16 math, and those can easily be switched to regular floats for compatibility. Then again, the devices that can't run the full shaders are probably slow and will need all the help they can get...

I don't have the hardware to test (the only thing I have is powered by an old Adreno 618), but it seems like from 6xx Gen3 onwards, Adreno GPUs should support VK_KHR_16bit_storage.

0cc4m (Collaborator) commented Feb 4, 2025

> I don't have the hardware to test (the only thing I have is powered by an old Adreno 618), but it seems like from 6xx Gen3 onwards, Adreno GPUs should support VK_KHR_16bit_storage.

Note the difference between VK_KHR_16bit_storage (required to load/store 16-bit floats), which the Vulkan backend already requires to work at all, and VK_KHR_shader_float16_int8 (required to run shaders that do 16-bit float arithmetic using GL_EXT_shader_explicit_arithmetic_types_float16), which we don't require. In the places where that is used, we provide another shader variant that only uses 32-bit float arithmetic.

But that is also not immediately necessary here; we can improve that later.

But to allow testing on other GPUs, can you add support for an environment variable to force simple shaders? I think the easiest way to implement this would be (example code, not tested):

```cpp
static bool ggml_vk_device_use_simpler_shaders(vk::PhysicalDeviceProperties *props) {
    if (getenv("GGML_VK_FORCE_SIMPLE_SHADERS")) {
        return true;
    }
    [...]
```
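(Assuming that shape, setting GGML_VK_FORCE_SIMPLE_SHADERS to any value in the environment before running llama-bench or test-backend-ops would force the simple-shader path on any device.)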

netrunnereve (Collaborator) commented Feb 4, 2025

> But to allow testing on other GPUs, can you add support for an environment variable to force simple shaders?

We should also add a CI build and test for simple shaders. Let me know if you need help with that.

0cc4m (Collaborator) commented Feb 4, 2025

> > But to allow testing on other GPUs, can you add support for an environment variable to force simple shaders?
>
> We should also add a CI build and test for simple shaders. Let me know if you need help with that.

We're not testing other shader variants either (e.g. the coopmat stuff), so it's fine for now. Let's get it set up and figure out the code first.

slp (Collaborator, Author) commented Feb 7, 2025

Turns out softmax was also problematic, even though test-backend-ops didn't reveal any problem with it. I've added simpler shaders for it too.

@0cc4m do you think we could merge this one in its current state and improve on it later?

0cc4m (Collaborator) commented Feb 7, 2025

> @0cc4m do you think we could merge this one in its current state and improve on it later?

Yeah, sure, I'll take a look and make sure that it doesn't break anything else.

Just FYI, there was a bug / lacking feature in MoltenVK that may have been the source of the shader trouble; a (hacky) fix that addresses it was just proposed: ollama/ollama#1016 (comment)

jeffbolznv (Collaborator) commented

Is soft_max only problematic because it also uses an array sized by a spec constant? There's a MoltenVK fix in progress for that, so a lot of this should become unnecessary soon.

If you found a bug that test-backend-ops isn't catching, can you add a new case to test-backend-ops to cover it?

slp (Collaborator, Author) commented Feb 7, 2025

> Just FYI, there was a bug / lacking feature in MoltenVK that may have been the source of the shader trouble; a (hacky) fix that addresses it was just proposed: ollama/ollama#1016 (comment)

Yup, I hope that MoltenVK will eventually be able to deal with the regular shaders, but that's going to take a while to make it into people's computers. I also suspect some GPUs (namely, Qualcomm's) will still need the simpler shaders.

> Is soft_max only problematic because it also uses an array sized by a spec constant? There's a MoltenVK fix in progress for that, so a lot of this should become unnecessary soon.

From the way it fails, I suspect there's something else. I'd like to find some time next week to give that MoltenVK fix a try and see whether it fixes softmax too.

> If you found a bug that test-backend-ops isn't catching, can you add a new case to test-backend-ops to cover it?

The problem is that I know it fails because the model derails, and I found out it was softmax by disabling/enabling OPs individually, but I don't know the parameters that made softmax fail. Is there a reasonable way to find that out? I've tried building with GGML_VULKAN_CHECK_RESULTS, but it fails early when running the model, and the values from the CPU (the ones used as reference) look completely bogus.

jeffbolznv (Collaborator) commented

> Is there a reasonable way to find that out?

Maybe just print out the parameters for the softmax; usually there are only a handful of unique combinations in a model.
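A minimal sketch of that kind of instrumentation, assuming a hook somewhere in the backend's op dispatch (the placement and function name are made up; the ggml_tensor fields are real):

```cpp
// Log the source shape of every SOFT_MAX op that flows through the backend,
// so the failing parameter combination can be reproduced in test-backend-ops.
static void log_softmax_params(const ggml_tensor * dst) {
    if (dst->op != GGML_OP_SOFT_MAX) {
        return;
    }
    const ggml_tensor * src0 = dst->src[0];
    fprintf(stderr, "SOFT_MAX: ne = [%lld, %lld, %lld, %lld]\n",
            (long long) src0->ne[0], (long long) src0->ne[1],
            (long long) src0->ne[2], (long long) src0->ne[3]);
}
```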

slp (Collaborator, Author) commented Feb 7, 2025

> > Is there a reasonable way to find that out?
>
> Maybe just print out the parameters for the softmax; usually there are only a handful of unique combinations in a model.

Ack, I'll give it a try, thanks!

slp added 3 commits February 10, 2025 11:52
Import simpler matmul shaders from the kompute backend and use them on
GPUs known to not be able to use the regular ones.

Signed-off-by: Sergio Lopez <slp@redhat.com>
Even though the regular softmax shaders successfully pass
test-backend-ops with Apple GPUs, running long inference tests has shown
the models end up derailing, with softmax OPs being the root cause.

With this commit, we use simpler softmax shaders borrowed from the
Kompute backend (which are basically reimplementations of the Metal
shaders) on certain GPUs known to have problems with the regular ones.

Signed-off-by: Sergio Lopez <slp@redhat.com>
It was found that softmax Vulkan shaders may fail in some contexts when
ne00 > 1024, so let's add a test for it.

Signed-off-by: Sergio Lopez <slp@redhat.com>
github-actions bot added the testing label Feb 10, 2025
slp (Collaborator, Author) commented Feb 10, 2025

> > Is there a reasonable way to find that out?
>
> Maybe just print out the parameters for the softmax; usually there are only a handful of unique combinations in a model.

I've confirmed the problem arises when ne00 > 1024, and added a test for that.
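A hedged sketch of what such a case could look like in tests/test-backend-ops.cpp (the test_soft_max constructor signature here is an assumption; the real test may take additional parameters such as mask, scale, or bias):

```cpp
// Exercise SOFT_MAX with ne00 just past the 1024 boundary that triggered the
// failure; shape order follows ggml's {ne0, ne1, ne2, ne3} convention.
test_cases.emplace_back(new test_soft_max(GGML_TYPE_F32, {1025, 16, 1, 1}));
```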

I've also confirmed that KhronosGroup/MoltenVK#2441 fixes the problem with Apple GPUs, so unless this is needed for Qualcomm GPUs (I don't have an 8cx to test it), we can probably close this one without merging.

cebtenzzre (Collaborator) commented Feb 11, 2025

> Does the Kompute backend currently work on phones, ARM boards, and so forth?

No, it actually doesn't even support Intel GPUs. We have also tried on a Qualcomm Adreno X1-85 without success. Only NVIDIA and AMD GPUs are well supported by these shaders in our experience. It's not surprising that they are compatible with MoltenVK, since they were based on the Metal shaders.

0cc4m (Collaborator) commented Feb 12, 2025

> I've also confirmed that KhronosGroup/MoltenVK#2441 fixes the problem with Apple GPUs, so unless this is needed for Qualcomm GPUs (I don't have an 8cx to test it), we can probably close this one without merging.

That's up to you. There could at least be a use for a simple shader replacement for matrix multiplication for small GPUs, but you can also leave working on that to me or others.

If you are mainly interested in MoltenVK, maybe you can look into optimizing the Vulkan backend for Apple hardware? There's some straightforward matmul tuning that could help and more difficult shader optimization work yet to happen.

slp (Collaborator, Author) commented Feb 12, 2025

> If you are mainly interested in MoltenVK, maybe you can look into optimizing the Vulkan backend for Apple hardware? There's some straightforward matmul tuning that could help and more difficult shader optimization work yet to happen.

Yes, let's close this one and focus on fine-tuning instead.

slp closed this Feb 12, 2025
0cc4m (Collaborator) commented Feb 12, 2025

@slp First step you can do is look into the matmul shader sizes and see which combination is optimal: https://github.com/ggerganov/llama.cpp/blob/198b1ec611a2c551ea40e5b9c0b862f37555a4cc/ggml/src/ggml-vulkan/ggml-vulkan.cpp#L2637-L2644

The values set here are very old and came from an attempt at making it work at all.
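For orientation, the referenced block selects per-size-class matmul workgroup/tile shapes roughly like the schematic below; the names and numbers are illustrative, not the actual llama.cpp constants at that link:

```cpp
// Illustrative only: tuning means sweeping tile shapes like these with
// llama-bench and keeping whichever combination wins on the target GPU.
struct matmul_tile { uint32_t bm, bn, bk; }; // block rows, cols, depth
static constexpr matmul_tile tile_small  {  32,  32, 16 };
static constexpr matmul_tile tile_medium {  64,  64, 16 };
static constexpr matmul_tile tile_large  { 128, 128, 16 };
```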

slp (Collaborator, Author) commented Feb 12, 2025

> @slp First step you can do is look into the matmul shader sizes and see which combination is optimal:
>
> https://github.com/ggerganov/llama.cpp/blob/198b1ec611a2c551ea40e5b9c0b862f37555a4cc/ggml/src/ggml-vulkan/ggml-vulkan.cpp#L2637-L2644
>
> The values set here are very old and came from an attempt at making it work at all.

I was the one who created the PR to restrict Apple Silicon devices to M-sized MAT_MUL :-)

I tested it with llama-bench a couple of days ago, and enabling any other size reduces performance slightly for token generation and by a very large percentage for prompt processing.

For token generation, given that the simpler mat_mul shaders (which are more or less straightforward translations of the Metal shaders) are giving almost the same numbers as the regular ones, I'm inclined to think there's some penalty introduced by MoltenVK. It could be the way it manages the buffers, or perhaps some semantics lost during the SPIR-V->MSL translation, or something else.

In any case, the current performance of the Vulkan backend is so close to MSL's that the difference is barely noticeable (if noticeable at all) when actually doing local inference, so I'd say we're good on that front.

Labels: ggml (changes relating to the ggml tensor library for machine learning), Kompute (https://github.com/KomputeProject/kompute/), testing (everything test related), Vulkan (issues specific to the Vulkan backend)

6 participants