Misc. bug: Vulkan premature out of memory exception on AMD Instinct MI60 #11598
Comments
Vulkan doesn't currently support more than 4 GB in a single buffer, so if this large context size causes such an allocation then it's expected to fail.
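For reference, one way to see what the driver actually reports is vulkaninfo (from vulkan-tools); the relevant limits are maxStorageBufferRange and maxMemoryAllocationSize (field names from the Vulkan spec; exact output formatting varies by driver):
vulkaninfo | grep -iE "maxStorageBufferRange|maxMemoryAllocationSize"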
My buffer is only getting 2 GB on Windows with my RX 6900 XT. It should be capable of 4 GB.
Thank you for your replies.
You could try setting the env var GGML_VK_FORCE_MAX_ALLOCATION_SIZE to something higher, but it may not work if the driver isn't claiming support.
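As a minimal sketch (assuming the variable takes a raw byte count; 6442450944 here is 6 GiB), reusing the command from the report below:
GGML_VK_FORCE_MAX_ALLOCATION_SIZE=6442450944 llama-server -m ~/llamamodels/Qwen2-7B-Instruct/Qwen2.5-7B-Instruct-1M-Q8_0.gguf -c 72000 -ngl 99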
I don't think so. Part of the problem is that Vulkan compilers can assume bindings are <4 GB and use 32-bit addressing math. So even if you made a huge sparse buffer or something it probably wouldn't work. Unless there's an algorithmic change to split the kv cache? Another option might be to use a quantized format to decrease the size of the kv cache (e.g. …)
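As an illustrative sketch of the quantized-KV-cache route, assuming this build exposes the --cache-type-k / --cache-type-v options (quantizing the V cache typically also requires flash attention, -fa):
llama-server -m ~/llamamodels/Qwen2-7B-Instruct/Qwen2.5-7B-Instruct-1M-Q8_0.gguf -c 72000 -ngl 99 -fa --cache-type-k q8_0 --cache-type-v q8_0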
Name and Version
llama-cli --version
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
version: 4615 (bfcce4d)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
Operating systems
Ubuntu 24.04.
Which llama.cpp modules do you know to be affected?
llama-server
Command line
llama-server -m ~/llamamodels/Qwen2-7B-Instruct/Qwen2.5-7B-Instruct-1M-Q8_0.gguf -c 72000 -ngl 99
Problem description & steps to reproduce
Hello,
The AMD Instinct MI60 cards have 32 GB of VRAM. With ROCm I can use the whole 32 GB, but with Vulkan it seems that one llama-server instance can access only 16 GB.
I tested it with the Qwen 2.5 7B 1M model (which supports a context length of up to 1 million tokens) and I cannot start it with a context of more than 71K.
At the same time, I can start two instances with a 71K context length on the same card.
For example, two of these could be started at the same time:
llama-server -m ~/llamamodels/Qwen2-7B-Instruct/Qwen2.5-7B-Instruct-1M-Q8_0.gguf -c 71000 -ngl 99
However, if I try to start just one instance with a 72K context, I get the following error:
llama_init_from_model: KV self size = 3937.50 MiB, K (f16): 1968.75 MiB, V (f16): 1968.75 MiB
llama_init_from_model: Vulkan_Host output buffer size = 0.58 MiB
ggml_vulkan: Device memory allocation of size 4305588224 failed.
ggml_vulkan: Requested buffer size exceeds device memory allocation limit: ErrorOutOfDeviceMemory
ggml_gallocr_reserve_n: failed to allocate Vulkan0 buffer of size 4305588224
ggml_vulkan: Device memory allocation of size 4305588224 failed.
ggml_vulkan: Requested buffer size exceeds device memory allocation limit: ErrorOutOfDeviceMemory
ggml_gallocr_reserve_n: failed to allocate Vulkan0 buffer of size 4305588224
llama_init_from_model: failed to allocate compute buffers
common_init_from_params: failed to create context with model '/root/llamamodels/Qwen2-7B-Instruct/Qwen2.5-7B-Instruct-1M-Q8_0.gguf'
srv load_model: failed to load model, '/root/llamamodels/Qwen2-7B-Instruct/Qwen2.5-7B-Instruct-1M-Q8_0.gguf'
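For reference, 4305588224 bytes ≈ 4.01 GiB, i.e. just over the 4 GiB (2^32 bytes) single-buffer limit mentioned in the comments above, which lines up with a 71K context working and a 72K context failing.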
I did try disabling this check in ggml-vulkan.cpp and was able to increase the context length to 220K while using only 86% of the VRAM.
But while it appeared to work, I started receiving gibberish once the context length exceeded 71K.
I tried different Vulkan versions, but the error remains.