
Allow setting PagedAttention KV cache allocation from context size #640

Merged: 4 commits into master, Jul 28, 2024

Conversation

EricLBuehler
Owner

@EricLBuehler EricLBuehler commented Jul 28, 2024

Refs #622.


Code Metrics Report
  ===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 C Header                2           35           28            0            7
 Dockerfile              1           34           25            0            9
 Happy                   1          442          369            0           73
 JSON                   11          102          101            0            1
 Python                 41         1586         1368           46          172
 TOML                   19          564          498           11           55
-------------------------------------------------------------------------------
 Jupyter Notebooks       2            0            0            0            0
 |- Markdown             2           77           32           31           14
 |- Python               2          196          169            1           26
 (Total)                            273          201           32           40
-------------------------------------------------------------------------------
 Markdown               24         1822            0         1372          450
 |- BASH                 5          101           98            0            3
 |- JSON                 1           12           12            0            0
 |- Python               5           92           82            0           10
 |- Rust                 6          402          359           19           24
 |- TOML                 2           75           63            0           12
 (Total)                           2504          614         1391          499
-------------------------------------------------------------------------------
 Rust                  168        54649        49605          978         4066
 |- Markdown            90          838           13          775           50
 (Total)                          55487        49618         1753         4116
===============================================================================
 Total                 270        59234        51994         2407         4833
===============================================================================
  

@EricLBuehler
Owner Author

@oldgithubman @mcm007 this implements allocating the KV cache for a specific number of tokens when using CUDA.

@oldgithubman:

The problem with a dynamically-allocated cache (assuming no limiter functionality) is that if I'm targeting a certain performance level (for example, staying within RAM), I don't want the cache to grow beyond RAM; even using all my RAM is bad. So I think a limiter is still useful. Again, it could just be a matter of adapting to a new workflow. Being able to prevent the cache from using swap is very important, though, since swapping adds wear and tear on expensive NVMe drives.

I will add this in a future PR!
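To make the memory impact of such a reservation concrete, here is a back-of-the-envelope sketch of the bytes a KV cache needs for a given number of tokens, and how PagedAttention would carve that into fixed-size blocks. This is an illustrative calculation only; the function and parameter names are assumptions, not the mistral.rs API.

```rust
// Illustrative sketch (not mistral.rs code): bytes needed to hold a KV cache
// for `context_len` tokens.
fn kv_cache_bytes(
    context_len: usize,  // number of tokens to reserve space for
    num_layers: usize,   // transformer layers
    num_kv_heads: usize, // KV heads (may be fewer than attention heads with GQA)
    head_dim: usize,     // dimension per head
    dtype_size: usize,   // e.g. 2 for f16/bf16
) -> usize {
    // Factor of 2: one tensor for keys and one for values, per layer.
    2 * context_len * num_layers * num_kv_heads * head_dim * dtype_size
}

// PagedAttention manages the cache in fixed-size blocks of `block_size`
// tokens; a reservation for `context_len` tokens rounds up to whole blocks.
fn num_blocks(context_len: usize, block_size: usize) -> usize {
    (context_len + block_size - 1) / block_size
}

fn main() {
    // Example shape: 32 layers, 8 KV heads, head_dim 128, f16, 4096 tokens.
    let bytes = kv_cache_bytes(4096, 32, 8, 128, 2);
    println!("KV cache: {} MiB", bytes / (1024 * 1024));
    println!("Blocks (block_size 16): {}", num_blocks(4096, 16));
}
```

With these example numbers the reservation comes to 512 MiB, which is why sizing it from the context length rather than a raw GPU-memory fraction is convenient.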

@EricLBuehler EricLBuehler merged commit 38fb942 into master Jul 28, 2024
15 checks passed
@EricLBuehler EricLBuehler deleted the pa_context_size branch July 28, 2024 08:14
@oldgithubman

oldgithubman commented Jul 29, 2024

@EricLBuehler Is there any way you can make it attempt to allocate GPU layers + KV cache before allocating CPU layers? This would save a lot of time when testing what fits.

As a longer-term goal: an option to first allocate the KV cache as requested by the user, then automatically allocate as many layers as possible to the GPU (perhaps with a buffer) before allocating the rest to the CPU. This would eliminate the need for testing entirely.
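The proposed workflow above can be sketched as a simple greedy plan: reserve the requested KV cache plus a safety buffer first, then place layers on the GPU until the remaining VRAM runs out. This is a hypothetical illustration of the idea, not existing mistral.rs behavior; all names and the uniform-layer-size assumption are mine.

```rust
// Hypothetical sketch of the proposed allocation order: KV cache first,
// then as many layers as fit on the GPU, remainder on the CPU.
// Assumes all layers are the same size, which real models only approximate.
fn plan_gpu_layers(
    vram_bytes: usize,     // total free VRAM
    kv_cache_bytes: usize, // user-requested KV cache reservation
    layer_bytes: usize,    // weight size of one layer
    num_layers: usize,     // total layers in the model
    buffer_bytes: usize,   // headroom for activations / fragmentation
) -> (usize, usize) {
    // Reserve the cache and buffer off the top, saturating at zero
    // if they alone exceed VRAM.
    let available = vram_bytes.saturating_sub(kv_cache_bytes + buffer_bytes);
    // Greedily fill the remainder with layers.
    let gpu_layers = (available / layer_bytes).min(num_layers);
    (gpu_layers, num_layers - gpu_layers) // (GPU layers, CPU layers)
}

fn main() {
    // Example: 16 GB VRAM, 4 GB KV cache, 400 MB/layer, 32 layers, 2 GB buffer.
    let (gpu, cpu) = plan_gpu_layers(16_000, 4_000, 400, 32, 2_000);
    println!("GPU layers: {gpu}, CPU layers: {cpu}");
}
```

Because the KV cache is reserved before any layers are placed, the plan can never silently shrink the cache to fit more layers, which matches the limiter behavior discussed earlier in the thread.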

Also, have you investigated mmap? (a separate issue, I know)
