
Misc. bug: Vulkan premature out of memory exception on AMD Instinct MI60 #11598

Open · dazipe opened this issue Feb 2, 2025 · 5 comments

@dazipe commented Feb 2, 2025

Name and Version

llama-cli --version
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
version: 4615 (bfcce4d)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu

Operating systems

Ubuntu 24.04.

Which llama.cpp modules do you know to be affected?

llama-server

Command line

llama-server -m ~/llamamodels/Qwen2-7B-Instruct/Qwen2.5-7B-Instruct-1M-Q8_0.gguf -c 72000 -ngl 99

Problem description & steps to reproduce

Hello,

The AMD Instinct MI60 cards have 32 GB of VRAM. With ROCm I can use the whole 32 GB, but with Vulkan it seems that one llama-server instance can access only 16 GB.
I tested this with the Qwen 2.5 7B 1M model (which supports a context length of up to 1 million tokens) and I cannot start it with a context of more than 71K.
At the same time, however, I can start two instances with a 71K context length on the same card.

For example, two of these could be started at the same time:
llama-server -m ~/llamamodels/Qwen2-7B-Instruct/Qwen2.5-7B-Instruct-1M-Q8_0.gguf -c 71000 -ngl 99

However, if I try to start just one instance with a 72K context, I get the following error:

llama_init_from_model: KV self size  = 3937.50 MiB, K (f16): 1968.75 MiB, V (f16): 1968.75 MiB
llama_init_from_model: Vulkan_Host  output buffer size =     0.58 MiB
ggml_vulkan: Device memory allocation of size 4305588224 failed.
ggml_vulkan: Requested buffer size exceeds device memory allocation limit: ErrorOutOfDeviceMemory
ggml_gallocr_reserve_n: failed to allocate Vulkan0 buffer of size 4305588224
ggml_vulkan: Device memory allocation of size 4305588224 failed.
ggml_vulkan: Requested buffer size exceeds device memory allocation limit: ErrorOutOfDeviceMemory
ggml_gallocr_reserve_n: failed to allocate Vulkan0 buffer of size 4305588224
llama_init_from_model: failed to allocate compute buffers
common_init_from_params: failed to create context with model '/root/llamamodels/Qwen2-7B-Instruct/Qwen2.5-7B-Instruct-1M-Q8_0.gguf'
srv    load_model: failed to load model, '/root/llamamodels/Qwen2-7B-Instruct/Qwen2.5-7B-Instruct-1M-Q8_0.gguf'

I did try disabling the size check in ggml-vulkan.cpp and was able to increase the context length to 220K while using only 86% of the VRAM.
It appeared to work, but the output turned to gibberish once the context length exceeded 71K.

I tried different Vulkan versions, but the error persists.

@jeffbolznv (Collaborator)

Vulkan doesn't currently support more than 4GB in a single buffer, so if this large context size causes such an allocation then it's expected to fail.
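For context, the failing allocation in the log above is 4305588224 bytes ≈ 4106 MiB, i.e. about 10 MiB past the 4 GiB (4294967296-byte) ceiling. Here is a minimal standalone C++ sketch (not llama.cpp code; it assumes Vulkan 1.1 headers and loader are installed) that prints the per-device limits involved:

#include <vulkan/vulkan.h>
#include <cstdio>
#include <vector>

int main() {
    // Create a bare Vulkan 1.1 instance (needed for vkGetPhysicalDeviceProperties2).
    VkApplicationInfo app = { VK_STRUCTURE_TYPE_APPLICATION_INFO };
    app.apiVersion = VK_API_VERSION_1_1;
    VkInstanceCreateInfo ici = { VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO };
    ici.pApplicationInfo = &app;
    VkInstance instance;
    if (vkCreateInstance(&ici, nullptr, &instance) != VK_SUCCESS) return 1;

    uint32_t count = 0;
    vkEnumeratePhysicalDevices(instance, &count, nullptr);
    std::vector<VkPhysicalDevice> devices(count);
    vkEnumeratePhysicalDevices(instance, &count, devices.data());

    for (VkPhysicalDevice dev : devices) {
        // maxMemoryAllocationSize caps a single vkAllocateMemory call;
        // maxStorageBufferRange caps how much of a buffer one binding can address.
        VkPhysicalDeviceMaintenance3Properties maint3 = { VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_MAINTENANCE_3_PROPERTIES };
        VkPhysicalDeviceProperties2 props2 = { VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2 };
        props2.pNext = &maint3;
        vkGetPhysicalDeviceProperties2(dev, &props2);
        printf("%s: maxMemoryAllocationSize=%llu maxStorageBufferRange=%u\n",
               props2.properties.deviceName,
               (unsigned long long)maint3.maxMemoryAllocationSize,
               props2.properties.limits.maxStorageBufferRange);
    }
    vkDestroyInstance(instance, nullptr);
    return 0;
}

If the reported limits are around 4 GiB, no single tensor or compute buffer larger than that can be placed in one binding, no matter how much total VRAM the card has.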

@ThiloteE (Contributor) commented Feb 2, 2025

Refs KhronosGroup/Vulkan-Docs#1016

@3Simplex commented Feb 2, 2025

My buffer is only getting 2 GB on Windows with my RX 6900 XT. It should be capable of 4 GB.

@dazipe (Author) commented Feb 2, 2025

Thank you for your replies.
Maybe there are some ways to use multiple 4 GB buffers?
Any other suggestions on how to make optimal use of these cards would be appreciated.
My first choice was the ROCm HIP build of llama.cpp, but it suffers from heavy VRAM overprovisioning: it loads the model and KV cache into VRAM, but memory usage then keeps growing as the prompt is processed.
So while the Vulkan build handled a 71K context length using only 50% of the VRAM, the HIP build had exhausted all 32 GB after processing a 35K context.

@jeffbolznv (Collaborator)

My buffer is only getting 2 GB on Windows with my RX 6900 XT. It should be capable of 4 GB.

You could try setting the env var GGML_VK_FORCE_MAX_ALLOCATION_SIZE to something higher, but it may not work if the driver isn't claiming support.
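For example, on Linux:

GGML_VK_FORCE_MAX_ALLOCATION_SIZE=4294967296 llama-server -m model.gguf -ngl 99

or on Windows cmd, run set GGML_VK_FORCE_MAX_ALLOCATION_SIZE=4294967296 before launching. As far as I can tell from ggml-vulkan.cpp the value is a raw byte count; model.gguf is a placeholder path, and whether the driver actually honors the larger size is a separate question.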

Maybe there are some ways to use multiple 4 GB buffers?

I don't think so. Part of the problem is that Vulkan compilers can assume bindings are under 4 GB and use 32-bit addressing math, so even if you made a huge sparse buffer or something, it probably wouldn't work. Unless there's an algorithmic change to split the KV cache?
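To illustrate the 32-bit addressing hazard with made-up numbers (a hypothetical sketch, not llama.cpp or shader code; index math in a shader behaves like the uint32_t line here):

#include <cstdint>
#include <cstdio>

int main() {
    // Hypothetical KV-cache indexing: byte offset = row * bytes_per_row.
    uint32_t row = 70000;        // e.g. token position (illustrative)
    uint32_t row_bytes = 65536;  // e.g. bytes per cache row (illustrative)

    uint32_t off32 = row * row_bytes;            // wraps modulo 2^32
    uint64_t off64 = (uint64_t)row * row_bytes;  // true byte offset

    printf("32-bit offset: %u\n", off32);                        // 292552704 -- wrong
    printf("64-bit offset: %llu\n", (unsigned long long)off64);  // 4587520000
    return 0;
}

Once the true offset passes 4 GiB, 32-bit math silently points at the wrong data, which would be consistent with the gibberish dazipe saw after disabling the size check.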

Another option might be to use a quantized format to decrease the size of the KV cache (e.g. -ctk q8_0 -ctv q8_0 -fa), but this requires flash attention, which is currently not accelerated on AMD GPUs.
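For reference, the combined flags on the original command would look like this (hypothetical invocation, untested here; without accelerated flash attention on AMD it may be slow):

llama-server -m ~/llamamodels/Qwen2-7B-Instruct/Qwen2.5-7B-Instruct-1M-Q8_0.gguf -c 72000 -ngl 99 -fa -ctk q8_0 -ctv q8_0

Since q8_0 stores roughly 8.5 bits per value versus 16 for f16, the 3937.50 MiB KV cache in the log above should drop to roughly 2090 MiB, and flash attention avoids materializing the large intermediate attention buffers behind the failed 4.3 GB compute allocation.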
