Allow setting PagedAttention KV cache allocation from context size #640
Conversation
Code Metrics Report
===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 C Header                2           35           28            0            7
 Dockerfile              1           34           25            0            9
 Happy                   1          442          369            0           73
 JSON                   11          102          101            0            1
 Python                 41         1586         1368           46          172
 TOML                   19          564          498           11           55
-------------------------------------------------------------------------------
 Jupyter Notebooks       2            0            0            0            0
 |- Markdown             2           77           32           31           14
 |- Python               2          196          169            1           26
 (Total)                            273          201           32           40
-------------------------------------------------------------------------------
 Markdown               24         1822            0         1372          450
 |- BASH                 5          101           98            0            3
 |- JSON                 1           12           12            0            0
 |- Python               5           92           82            0           10
 |- Rust                 6          402          359           19           24
 |- TOML                 2           75           63            0           12
 (Total)                           2504          614         1391          499
-------------------------------------------------------------------------------
 Rust                  168        54649        49605          978         4066
 |- Markdown            90          838           13          775           50
 (Total)                          55487        49618         1753         4116
===============================================================================
 Total                 270        59234        51994         2407         4833
===============================================================================
@oldgithubman @mcm007 this implements allocating the KV cache for a specific number of tokens, when using CUDA.
I will add this in a future PR!
@EricLBuehler Is there any way you can make it attempt to allocate GPU layers + KV cache before allocating CPU layers? This would save a lot of time while testing what fits.
As a longer-term goal: an option to first allocate the KV cache as requested by the user, then automatically allocate as many layers as possible to the GPU (perhaps with a buffer) before allocating the rest to CPU. This would eliminate the need for testing entirely.
Also, have you investigated mmap? (A separate issue, I know.)
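The requested allocation order can be sketched as a simple greedy split: reserve the user's KV cache first, then fit whole layers on the GPU until the remaining budget runs out. Everything here is a hypothetical illustration of the idea, not existing mistral.rs behavior.

```rust
// Hedged sketch of the requested policy: KV cache first, then greedily place
// layers on the GPU; the remainder falls back to CPU. Names are hypothetical.
fn split_layers(
    vram_budget: usize,    // total usable VRAM in bytes
    kv_cache_bytes: usize, // user-requested KV cache reservation
    layer_bytes: usize,    // weight size of one transformer layer
    num_layers: usize,
) -> (usize, usize) {
    // Reserve the KV cache up front, before any layer weights.
    let remaining = vram_budget.saturating_sub(kv_cache_bytes);
    // Fit as many whole layers as possible into what is left.
    let gpu_layers = (remaining / layer_bytes).min(num_layers);
    (gpu_layers, num_layers - gpu_layers) // (GPU layers, CPU layers)
}

fn main() {
    // e.g. 8 GiB VRAM, 2 GiB KV cache, 400 MiB per layer, 32 layers.
    let (gpu, cpu) = split_layers(8 << 30, 2 << 30, 400 << 20, 32);
    println!("{gpu} layers on GPU, {cpu} on CPU");
}
```

A real implementation would also subtract a safety buffer (activations, fragmentation) from `vram_budget`, as the comment above suggests.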
Refs #622.