
cuda : fix defrag with quantized KV #9319

Merged 1 commit into master on Sep 5, 2024

Conversation

slaren
Collaborator

@slaren slaren commented Sep 5, 2024

There were several issues with KV cache defragmentation when using a quantized KV cache:

  • It requires ggml_cpy from a quantized type to the same quantized type, which was not supported in the CUDA backend
  • ggml_backend_sched cannot fall back to the CPU backend when the destination tensor is pre-allocated, and this condition was not correctly detected
  • Attempting the fallback anyway caused a buffer overflow in the graph leafs array, resulting in a crash

This fixes the issues in ggml_backend_sched and adds support to the CUDA backend for ggml_cpy when the source and destination types are the same and both tensors are contiguous (using cudaMemcpyAsync).

Other backends may also be affected.

Fixes #9314

@github-actions github-actions bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels Sep 5, 2024
@slaren slaren merged commit 4db0478 into master Sep 5, 2024
53 checks passed
@slaren slaren deleted the sl/fix-cuda-defrag branch September 5, 2024 09:13
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Sep 5, 2024
cuda : fix defrag with quantized KV (ggerganov#9319)
MaggotHATE added a commit to MaggotHATE/Llama_chat that referenced this pull request Sep 11, 2024
* Important: this guards an assert in ggml-backend.c introduced in ggerganov/llama.cpp#9319, be aware
* Merged recent Seed commit
* Added a small .txt guide on code that needs to be added to make clblast work on current llama.cpp
* minor display styling
dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request Oct 29, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024
Successfully merging this pull request may close these issues: Bug: llama-server crash when defragmenting (llama_kv_cache_defrag_internal)