
Misc. bug: Vulkan premature out of memory exception on AMD Instinct MI60 #11598

Open · dazipe opened this issue Feb 2, 2025 · 5 comments

@dazipe commented Feb 2, 2025

Name and Version

llama-cli --version
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
version: 4615 (bfcce4d)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu

Operating systems

Ubuntu 24.04.

Which llama.cpp modules do you know to be affected?

llama-server

Command line

llama-server -m ~/llamamodels/Qwen2-7B-Instruct/Qwen2.5-7B-Instruct-1M-Q8_0.gguf -c 72000 -ngl 99

Problem description & steps to reproduce

Hello,

The AMD Instinct MI60 cards have 32 GB of VRAM. With ROCm I can use the whole 32 GB, but with Vulkan it seems that one llama-server instance can access only 16 GB.
I tested this with the Qwen 2.5 7B 1M model (which supports a context length of up to 1 million tokens) and I cannot start it with a context of more than 71K.
At the same time, however, I can start two instances with a 71K context length on the same card.

For example, two of these could be started at the same time:
llama-server -m ~/llamamodels/Qwen2-7B-Instruct/Qwen2.5-7B-Instruct-1M-Q8_0.gguf -c 71000 -ngl 99

However, if I try to start just one instance with a 72K context, I get the following error:

llama_init_from_model: KV self size  = 3937.50 MiB, K (f16): 1968.75 MiB, V (f16): 1968.75 MiB
llama_init_from_model: Vulkan_Host  output buffer size =     0.58 MiB
ggml_vulkan: Device memory allocation of size 4305588224 failed.
ggml_vulkan: Requested buffer size exceeds device memory allocation limit: ErrorOutOfDeviceMemory
ggml_gallocr_reserve_n: failed to allocate Vulkan0 buffer of size 4305588224
ggml_vulkan: Device memory allocation of size 4305588224 failed.
ggml_vulkan: Requested buffer size exceeds device memory allocation limit: ErrorOutOfDeviceMemory
ggml_gallocr_reserve_n: failed to allocate Vulkan0 buffer of size 4305588224
llama_init_from_model: failed to allocate compute buffers
common_init_from_params: failed to create context with model '/root/llamamodels/Qwen2-7B-Instruct/Qwen2.5-7B-Instruct-1M-Q8_0.gguf'
srv    load_model: failed to load model, '/root/llamamodels/Qwen2-7B-Instruct/Qwen2.5-7B-Instruct-1M-Q8_0.gguf'

I did try disabling the size check in ggml-vulkan.cpp and was able to increase the context length to 220K while using only 86% of the VRAM.
It appeared to work, but the output turned to gibberish once the context length exceeded 71K.

I tried different Vulkan versions, but the error persists.

@jeffbolznv (Collaborator)

Vulkan doesn't currently support more than 4GB in a single buffer, so if this large context size causes such an allocation then it's expected to fail.
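For context, the failing allocation in the log above is 4305588224 bytes ≈ 4106 MiB, i.e. about 10 MiB past the 4 GiB (4294967296-byte) ceiling. Here is a minimal standalone C++ sketch (not llama.cpp code; it assumes Vulkan 1.1 headers and loader are installed) that prints the per-device limits involved:

#include <vulkan/vulkan.h>
#include <cstdio>
#include <vector>

int main() {
    // Create a bare Vulkan 1.1 instance (needed for vkGetPhysicalDeviceProperties2).
    VkApplicationInfo app = { VK_STRUCTURE_TYPE_APPLICATION_INFO };
    app.apiVersion = VK_API_VERSION_1_1;
    VkInstanceCreateInfo ici = { VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO };
    ici.pApplicationInfo = &app;
    VkInstance instance;
    if (vkCreateInstance(&ici, nullptr, &instance) != VK_SUCCESS) return 1;

    uint32_t count = 0;
    vkEnumeratePhysicalDevices(instance, &count, nullptr);
    std::vector<VkPhysicalDevice> devices(count);
    vkEnumeratePhysicalDevices(instance, &count, devices.data());

    for (VkPhysicalDevice dev : devices) {
        // maxMemoryAllocationSize caps a single vkAllocateMemory call;
        // maxStorageBufferRange caps how much of a buffer one binding can address.
        VkPhysicalDeviceMaintenance3Properties maint3 = { VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_MAINTENANCE_3_PROPERTIES };
        VkPhysicalDeviceProperties2 props2 = { VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2 };
        props2.pNext = &maint3;
        vkGetPhysicalDeviceProperties2(dev, &props2);
        printf("%s: maxMemoryAllocationSize=%llu maxStorageBufferRange=%u\n",
               props2.properties.deviceName,
               (unsigned long long)maint3.maxMemoryAllocationSize,
               props2.properties.limits.maxStorageBufferRange);
    }
    vkDestroyInstance(instance, nullptr);
    return 0;
}

If the reported limits are around 4 GiB, no single tensor or compute buffer larger than that can be placed in one binding, no matter how much total VRAM the card has.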

@ThiloteE (Contributor) commented Feb 2, 2025

Refs KhronosGroup/Vulkan-Docs#1016

@3Simplex commented Feb 2, 2025

My buffer is only getting 2 GB on Windows with my RX 6900 XT. It should be capable of 4 GB.

@dazipe (Author) commented Feb 2, 2025

Thank you for your replies.
Maybe there are some ways to use multiple 4 GB buffers?
Any other suggestions on how to make optimal use of these cards would be appreciated.
My first choice was the ROCm HIP build of llama.cpp, but it suffers from heavy VRAM overprovisioning: it loads the model and KV cache into VRAM, but memory usage then keeps growing as the prompt is processed.
So while the Vulkan build handled a 71K context length using only 50% of the VRAM, the HIP build had exhausted all 32 GB after processing a 35K context.

@jeffbolznv (Collaborator)

My buffer is only getting 2 GB on Windows with my RX 6900 XT. It should be capable of 4 GB.

You could try setting the env var GGML_VK_FORCE_MAX_ALLOCATION_SIZE to something higher, but it may not work if the driver isn't claiming support.
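For example, on Linux:

GGML_VK_FORCE_MAX_ALLOCATION_SIZE=4294967296 llama-server -m model.gguf -ngl 99

or on Windows cmd, run set GGML_VK_FORCE_MAX_ALLOCATION_SIZE=4294967296 before launching. As far as I can tell from ggml-vulkan.cpp the value is a raw byte count; model.gguf is a placeholder path, and whether the driver actually honors the larger size is a separate question.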

Maybe there are some ways to use multiple 4 GB buffers?

I don't think so. Part of the problem is that Vulkan compilers can assume bindings are under 4 GB and use 32-bit addressing math, so even if you made a huge sparse buffer or something, it probably wouldn't work. Unless there's an algorithmic change to split the KV cache?
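To illustrate the 32-bit addressing hazard with made-up numbers (a hypothetical sketch, not llama.cpp or shader code; index math in a shader behaves like the uint32_t line here):

#include <cstdint>
#include <cstdio>

int main() {
    // Hypothetical KV-cache indexing: byte offset = row * bytes_per_row.
    uint32_t row = 70000;        // e.g. token position (illustrative)
    uint32_t row_bytes = 65536;  // e.g. bytes per cache row (illustrative)

    uint32_t off32 = row * row_bytes;            // wraps modulo 2^32
    uint64_t off64 = (uint64_t)row * row_bytes;  // true byte offset

    printf("32-bit offset: %u\n", off32);                        // 292552704 -- wrong
    printf("64-bit offset: %llu\n", (unsigned long long)off64);  // 4587520000
    return 0;
}

Once the true offset passes 4 GiB, 32-bit math silently points at the wrong data, which would be consistent with the gibberish dazipe saw after disabling the size check.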

Another option might be to use a quantized format to decrease the size of the KV cache (e.g. -ctk q8_0 -ctv q8_0 -fa), but this requires flash attention, which is currently not accelerated on AMD GPUs.
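For reference, the combined flags on the original command would look like this (hypothetical invocation, untested here; without accelerated flash attention on AMD it may be slow):

llama-server -m ~/llamamodels/Qwen2-7B-Instruct/Qwen2.5-7B-Instruct-1M-Q8_0.gguf -c 72000 -ngl 99 -fa -ctk q8_0 -ctv q8_0

Since q8_0 stores roughly 8.5 bits per value versus 16 for f16, the 3937.50 MiB KV cache in the log above should drop to roughly 2090 MiB, and flash attention avoids materializing the large intermediate attention buffers behind the failed 4.3 GB compute allocation.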
