Feature Request: API for store/load raw KV-cache for only a selected sequence of tokens #9427
Nekotekina started this conversation in Ideas
-
OK, maybe this can be implemented with some help from llama_kv_cache_seq_cp. I'm not completely sure I understand it correctly, though.
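For reasoning about what llama_kv_cache_seq_cp does: per llama.h, it assigns the cells of one sequence in a position range to another sequence without allocating or copying KV data. Below is a self-contained toy model of that behavior, with plain C++ containers standing in for the real llama.cpp cache structures; the half-open [p0, p1) range convention is my reading of the implementation and should be treated as an assumption.

```cpp
#include <cassert>
#include <set>
#include <utility>

// Toy model of the KV cache: each cell is a (seq_id, pos) pair.
// This is a hypothetical stand-in for the real llama.cpp structures.
using kv_cells = std::set<std::pair<int, int>>;

// Models llama_kv_cache_seq_cp(ctx, src, dst, p0, p1): every cell of
// sequence `src` whose position lies in [p0, p1) also becomes visible
// to sequence `dst`. No KV data is duplicated; cells are just re-tagged.
void seq_cp(kv_cells &cells, int src, int dst, int p0, int p1) {
    kv_cells added;
    for (const auto &[seq, pos] : cells)
        if (seq == src && pos >= p0 && pos < p1)
            added.insert({dst, pos});
    cells.insert(added.begin(), added.end());
}
```

Under this model, a chunk could be "isolated" by copying its position range into a scratch sequence id and then serializing only that sequence, which is roughly what the feature request asks the API to support directly.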
-
Motivation: I want to split the context into chunks and store each chunk separately, both for later reloading and for research purposes. A chunk can be seen as one round of interaction between the user and the assistant, as an example. In my project I do a lot of "context shifting" by evicting "middle" chunks; the total number of tokens greatly exceeds the context size that is practically available with my limited GPU resources (around 8192 tokens). Serializing the whole state each time with the existing API would be extremely inefficient. There are a few possible uses for a precise KV-cache control API, for example:
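The "context shifting" described above, evicting a middle chunk and closing the positional gap, corresponds to llama_kv_cache_seq_rm followed by llama_kv_cache_seq_add with a negative delta in the llama.cpp API of that era. Here is a minimal self-contained sketch of the bookkeeping involved, with plain C++ standing in for the real cache; the exact call sequence named in the comment is my assumption about how this is usually done, not something stated in the thread.

```cpp
#include <cassert>
#include <vector>

// Toy KV cache for a single sequence: just the positions of cached tokens.
// Hypothetical stand-in for the real llama.cpp structures.
using positions = std::vector<int>;

// Evict the chunk [p0, p1) and shift later positions left by (p1 - p0),
// modeling llama_kv_cache_seq_rm(ctx, seq, p0, p1) followed by
// llama_kv_cache_seq_add(ctx, seq, p1, -1, -(p1 - p0)).
void evict_chunk(positions &pos, int p0, int p1) {
    positions out;
    for (int p : pos) {
        if (p < p0) {
            out.push_back(p);               // before the chunk: keep as-is
        } else if (p >= p1) {
            out.push_back(p - (p1 - p0));   // after the chunk: shift left
        }
        // positions inside [p0, p1) are evicted
    }
    pos = out;
}
```

The missing piece the request points at is that the evicted chunk's KV data is simply lost here; a store/load API for a selected token range would let it be saved before eviction and restored later without re-decoding those tokens.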