Feature Request: API for store/load raw KV-cache for only a selected sequence of tokens #9427
Nekotekina started this conversation in Ideas
-
OK, maybe this can be implemented with some help from llama_kv_cache_seq_cp. I'm not completely sure I understand it correctly, though.
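For reasoning about what llama_kv_cache_seq_cp does: per llama.h, it assigns the cells of one sequence in a position range to another sequence without allocating or copying KV data. Below is a self-contained toy model of that behavior, with plain C++ containers standing in for the real llama.cpp cache structures; the half-open [p0, p1) range convention is my reading of the implementation and should be treated as an assumption.

```cpp
#include <cassert>
#include <set>
#include <utility>

// Toy model of the KV cache: each cell is a (seq_id, pos) pair.
// This is a hypothetical stand-in for the real llama.cpp structures.
using kv_cells = std::set<std::pair<int, int>>;

// Models llama_kv_cache_seq_cp(ctx, src, dst, p0, p1): every cell of
// sequence `src` whose position lies in [p0, p1) also becomes visible
// to sequence `dst`. No KV data is duplicated; cells are just re-tagged.
void seq_cp(kv_cells &cells, int src, int dst, int p0, int p1) {
    kv_cells added;
    for (const auto &[seq, pos] : cells)
        if (seq == src && pos >= p0 && pos < p1)
            added.insert({dst, pos});
    cells.insert(added.begin(), added.end());
}
```

Under this model, a chunk could be "isolated" by copying its position range into a scratch sequence id and then serializing only that sequence, which is roughly what the feature request asks the API to support directly.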
-
Motivation: I want to split the context into chunks and store each chunk separately, both for later reloading and for research purposes. A chunk can be seen as one round of interaction between the user and the assistant, as an example. In my project I do a lot of "context shifting" by evicting "middle" chunks; the total number of tokens greatly exceeds the context size that is practically available with my limited GPU resources (around 8192 tokens). Serializing the whole state each time with the existing API would be extremely inefficient. There are a few possible uses for a precise KV-cache control API, for example:
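The "context shifting" described above, evicting a middle chunk and closing the positional gap, corresponds to llama_kv_cache_seq_rm followed by llama_kv_cache_seq_add with a negative delta in the llama.cpp API of that era. Here is a minimal self-contained sketch of the bookkeeping involved, with plain C++ standing in for the real cache; the exact call sequence named in the comment is my assumption about how this is usually done, not something stated in the thread.

```cpp
#include <cassert>
#include <vector>

// Toy KV cache for a single sequence: just the positions of cached tokens.
// Hypothetical stand-in for the real llama.cpp structures.
using positions = std::vector<int>;

// Evict the chunk [p0, p1) and shift later positions left by (p1 - p0),
// modeling llama_kv_cache_seq_rm(ctx, seq, p0, p1) followed by
// llama_kv_cache_seq_add(ctx, seq, p1, -1, -(p1 - p0)).
void evict_chunk(positions &pos, int p0, int p1) {
    positions out;
    for (int p : pos) {
        if (p < p0) {
            out.push_back(p);               // before the chunk: keep as-is
        } else if (p >= p1) {
            out.push_back(p - (p1 - p0));   // after the chunk: shift left
        }
        // positions inside [p0, p1) are evicted
    }
    pos = out;
}
```

The missing piece the request points at is that the evicted chunk's KV data is simply lost here; a store/load API for a selected token range would let it be saved before eviction and restored later without re-decoding those tokens.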