
Allow setting PagedAttention KV cache allocation from context size #640

Merged: 4 commits into master, Jul 28, 2024

Conversation

EricLBuehler
Owner

@EricLBuehler EricLBuehler commented Jul 28, 2024

Refs #622.


Code Metrics Report
  ===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 C Header                2           35           28            0            7
 Dockerfile              1           34           25            0            9
 Happy                   1          442          369            0           73
 JSON                   11          102          101            0            1
 Python                 41         1586         1368           46          172
 TOML                   19          564          498           11           55
-------------------------------------------------------------------------------
 Jupyter Notebooks       2            0            0            0            0
 |- Markdown             2           77           32           31           14
 |- Python               2          196          169            1           26
 (Total)                            273          201           32           40
-------------------------------------------------------------------------------
 Markdown               24         1822            0         1372          450
 |- BASH                 5          101           98            0            3
 |- JSON                 1           12           12            0            0
 |- Python               5           92           82            0           10
 |- Rust                 6          402          359           19           24
 |- TOML                 2           75           63            0           12
 (Total)                           2504          614         1391          499
-------------------------------------------------------------------------------
 Rust                  168        54649        49605          978         4066
 |- Markdown            90          838           13          775           50
 (Total)                          55487        49618         1753         4116
===============================================================================
 Total                 270        59234        51994         2407         4833
===============================================================================
  

@EricLBuehler
Owner Author

@oldgithubman @mcm007 this implements allocating the KV cache for a specific number of tokens when using CUDA.

@oldgithubman:

The problem with a dynamically-allocated cache (assuming no limiter functionality) is that if I'm targeting a certain performance level (for example, staying within RAM), I don't want the cache to grow beyond RAM; even using all my RAM is bad. So I think a limiter is still useful. Again, it could just be a matter of adapting to a new workflow. Being able to prevent the cache from using swap is very important, though, since swapping adds wear and tear on expensive NVMe drives.

I will add this in a future PR!
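To make the memory impact of such a reservation concrete, here is a back-of-the-envelope sketch of the bytes a KV cache needs for a given number of tokens, and how PagedAttention would carve that into fixed-size blocks. This is an illustrative calculation only; the function and parameter names are assumptions, not the mistral.rs API.

```rust
// Illustrative sketch (not mistral.rs code): bytes needed to hold a KV cache
// for `context_len` tokens.
fn kv_cache_bytes(
    context_len: usize,  // number of tokens to reserve space for
    num_layers: usize,   // transformer layers
    num_kv_heads: usize, // KV heads (may be fewer than attention heads with GQA)
    head_dim: usize,     // dimension per head
    dtype_size: usize,   // e.g. 2 for f16/bf16
) -> usize {
    // Factor of 2: one tensor for keys and one for values, per layer.
    2 * context_len * num_layers * num_kv_heads * head_dim * dtype_size
}

// PagedAttention manages the cache in fixed-size blocks of `block_size`
// tokens; a reservation for `context_len` tokens rounds up to whole blocks.
fn num_blocks(context_len: usize, block_size: usize) -> usize {
    (context_len + block_size - 1) / block_size
}

fn main() {
    // Example shape: 32 layers, 8 KV heads, head_dim 128, f16, 4096 tokens.
    let bytes = kv_cache_bytes(4096, 32, 8, 128, 2);
    println!("KV cache: {} MiB", bytes / (1024 * 1024));
    println!("Blocks (block_size 16): {}", num_blocks(4096, 16));
}
```

With these example numbers the reservation comes to 512 MiB, which is why sizing it from the context length rather than a raw GPU-memory fraction is convenient.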

@EricLBuehler EricLBuehler merged commit 38fb942 into master Jul 28, 2024
15 checks passed
@EricLBuehler EricLBuehler deleted the pa_context_size branch July 28, 2024 08:14
@oldgithubman

oldgithubman commented Jul 29, 2024

@EricLBuehler Is there any way you can make it attempt to allocate GPU layers + KV cache before allocating CPU layers? This would save a lot of time when testing what fits.

As a longer-term goal: an option to first allocate the KV cache as requested by the user, then automatically allocate as many layers as possible to the GPU (perhaps with a buffer) before allocating the rest to the CPU. This would eliminate the need for testing entirely.
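The proposed workflow above can be sketched as a simple greedy plan: reserve the requested KV cache plus a safety buffer first, then place layers on the GPU until the remaining VRAM runs out. This is a hypothetical illustration of the idea, not existing mistral.rs behavior; all names and the uniform-layer-size assumption are mine.

```rust
// Hypothetical sketch of the proposed allocation order: KV cache first,
// then as many layers as fit on the GPU, remainder on the CPU.
// Assumes all layers are the same size, which real models only approximate.
fn plan_gpu_layers(
    vram_bytes: usize,     // total free VRAM
    kv_cache_bytes: usize, // user-requested KV cache reservation
    layer_bytes: usize,    // weight size of one layer
    num_layers: usize,     // total layers in the model
    buffer_bytes: usize,   // headroom for activations / fragmentation
) -> (usize, usize) {
    // Reserve the cache and buffer off the top, saturating at zero
    // if they alone exceed VRAM.
    let available = vram_bytes.saturating_sub(kv_cache_bytes + buffer_bytes);
    // Greedily fill the remainder with layers.
    let gpu_layers = (available / layer_bytes).min(num_layers);
    (gpu_layers, num_layers - gpu_layers) // (GPU layers, CPU layers)
}

fn main() {
    // Example: 16 GB VRAM, 4 GB KV cache, 400 MB/layer, 32 layers, 2 GB buffer.
    let (gpu, cpu) = plan_gpu_layers(16_000, 4_000, 400, 32, 2_000);
    println!("GPU layers: {gpu}, CPU layers: {cpu}");
}
```

Because the KV cache is reserved before any layers are placed, the plan can never silently shrink the cache to fit more layers, which matches the limiter behavior discussed earlier in the thread.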

Also, have you investigated mmap? (a separate issue, I know)
