
server : reuse context chunks #9866

Merged
merged 1 commit into master on Oct 13, 2024

Conversation

@ggerganov (Owner) commented Oct 12, 2024

ref #5793

Overview

Using a positive --cache-reuse argument with llama-server will attempt to reuse KV cache chunks with size equal to or larger than the specified value. The KV cache of reused chunks will be shifted to the respective new positions (see llama_kv_cache_seq_add()) and processing for these tokens will be skipped. Only chunks without control/special tokens will be reused. Here is an illustration (a sketch of the underlying KV-cache calls is shown after the illustration):

# here each letter generally corresponds to a different token
# same letters represent groups of tokens that are the same in both requests, but are located in different positions

# prompt 0 (cached)
aaaaabbbbbcccccccdddddeeeeeexffggggghhhhhhhxxxxxxxxx

# prompt 1
aaaaaccccccceeeeeeffhhhhhhhyyyyyyyy

Upon submitting prompt 1 for processing, after prompt 0 has been processed and cached:

  • --cache-reuse 0: only the aaaaa prefix will be reused
  • --cache-reuse 1: the entire aaaaaccccccceeeeeeffhhhhhhh will be reused
  • --cache-reuse 3: only the aaaaaccccccceeeeee part will be reused

The cache reuse will be done only for requests with "cache_prompt": true.
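
For reference, here is a minimal sketch of what reusing a single matching chunk could look like in terms of the public KV-cache API. The helper, its variable names, and the surrounding chunk-matching loop are illustrative assumptions, not the actual server implementation:

#include "llama.h"

// Hypothetical helper: shift one chunk of the cached sequence that also
// appears in the new prompt to its new (earlier) position, so that its
// tokens do not have to be recomputed.
//   head_c  - chunk start position in the cached prompt
//   head_p  - chunk start position in the new prompt (head_p <= head_c)
//   n_match - chunk length in tokens
static int32_t reuse_chunk(llama_context * ctx, llama_seq_id seq_id,
                           llama_pos head_c, llama_pos head_p,
                           int32_t n_match, int32_t n_cache_reuse) {
    if (n_match < n_cache_reuse) {
        return 0; // chunk too small - its tokens will be recomputed instead
    }

    // remove the cached tokens lying between the new and the old chunk position
    llama_kv_cache_seq_rm (ctx, seq_id, head_p, head_c);

    // shift the matching chunk from its old position to its new position
    llama_kv_cache_seq_add(ctx, seq_id, head_c, head_c + n_match, head_p - head_c);

    return n_match; // number of prompt tokens whose processing can be skipped
}

In the illustration above, with --cache-reuse 3 this check would pass for the ccccccc and eeeeee chunks but fail for ff, which matches the reused aaaaaccccccceeeeee prefix.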

Example

# start a server with cache reusing enabled
./llama-server -m ${model.gguf} --port 8012 --cache-reuse 512

# long request with the word "hello" repeated 512 times
chunk=$(printf 'hello %.0s' {1..512})
curl \
    --request POST --url http://127.0.0.1:8012/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "Some prefix. Reuse: '"${chunk}"'", "n_predict": 1, "cache_prompt": true, "temperature": 0.0}' | jq

# ... computes 519 tokens ...

# submit new request with the prefix removed. note the leading space before "Reuse"
curl \
    --request POST --url http://127.0.0.1:8012/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": " Reuse: '"${chunk}"'", "n_predict": 1, "cache_prompt": true, "temperature": 0.0}' | jq

# ... reuses 516 tokens and computes just 1 token ...

@wooooyeahhhh commented:

Does this work similarly to Koboldcpp's context shift?

@ngxson (Collaborator) commented Oct 12, 2024

Does this work similarly to Koboldcpp's context shift?

If I understand correctly from this post, then yes, it does.

I previously made a similar feature request here: #5793, which becomes possible thanks to the current PR.

@ggerganov (Owner, Author) commented:

Yes, it's the same idea as proposed in #5793. I've been experimenting today with context reuse for code completion and results seem promising.

@ngxson (Collaborator) commented Oct 12, 2024

Btw @ggerganov, I recall that a while ago there was a discussion about storing token IDs in the KV cache. I'm wondering if it would be complicated to add an API like llama_kv_get_tokens(int seq_id) and use it instead of having to synchronize between the actual KV cache and slot.cache_tokens. What do you think?

@ggerganov marked this pull request as ready for review on October 13, 2024 at 10:24
@ggerganov (Owner, Author) commented:

We should extend the API to support that. Maybe llama_token id = llama_kv_cache_seq_get_token(ctx, seq_id, pos);
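
To make the proposal concrete, here is a purely hypothetical sketch. Neither llama_kv_get_tokens() nor llama_kv_cache_seq_get_token() exists in llama.h; the declaration below only mirrors the signature suggested in this thread, with assumed parameter types:

#include <vector>
#include "llama.h"

// Proposed (non-existent) getter from the discussion above; the exact
// parameter types are an assumption.
llama_token llama_kv_cache_seq_get_token(llama_context * ctx, llama_seq_id seq_id, llama_pos pos);

// With such a getter, a server slot could rebuild its cached token list
// directly from the KV cache instead of maintaining slot.cache_tokens.
static std::vector<llama_token> get_cached_tokens(llama_context * ctx, llama_seq_id seq_id, int32_t n_past) {
    std::vector<llama_token> tokens(n_past);
    for (llama_pos pos = 0; pos < n_past; ++pos) {
        tokens[pos] = llama_kv_cache_seq_get_token(ctx, seq_id, pos);
    }
    return tokens;
}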

@ggerganov mentioned this pull request on Oct 13, 2024
@ggerganov merged commit c7181bd into master on Oct 13, 2024 (58 checks passed)
@ggerganov deleted the gg/server-reuse-context branch on October 13, 2024 at 15:52
drollings pushed a commit to drollings/llama.cpp that referenced this pull request Oct 18, 2024
dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request Oct 29, 2024
@ngxson (Collaborator) commented Nov 1, 2024

I have a small question regarding the illustration in the description:

--cache-reuse 3: only the aaaaaccccccceeeeee part will be reused

AFAIU we only skip the ff part because its length is less than 3. But in this case, why is the next hhhhhhh part also skipped?

@ggerganov (Owner, Author) commented:

It's skipped mainly to simplify the batch construction:

With the current implementation, we stop reusing chunks at the first token that cannot be reused. This way, when we create the llama_batch for the new prompt, we start from n_past and add all remaining tokens with increasing positions:

n_past:   f
n_past+1: f
n_past+2: h
n_past+3: h
...
n_past+2+H+Y: y

The alternative that you suggest is to also reuse the h chunk. In that case the new batch would have to look like this:

pos_f:    f
pos_f+1:  f
pos_y:    y
pos_y+1:  y
...
pos_y+Y:  y

There would no longer be the concept of n_past; instead, we would have to maintain more complicated information about the token positions.

I'm very interested in trying this approach to see if it is viable, but the extra complexity at this point would be too much. Maybe in the future.
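
For comparison, the simpler "stop at the first non-reusable token" batching described above can be sketched roughly as follows (assumed names, not the actual server code):

#include "llama.h"

// Build a single batch covering everything after n_past, with consecutive
// positions - the case where only a prefix (plus shifted chunks) is reused.
//   prompt_tokens - the tokenized new prompt, n_prompt tokens long
static llama_batch build_remaining_batch(const llama_token * prompt_tokens, int32_t n_prompt,
                                         int32_t n_past, llama_seq_id seq_id) {
    llama_batch batch = llama_batch_init(n_prompt - n_past, 0, 1);

    for (int32_t i = n_past; i < n_prompt; ++i) {
        const int32_t j = i - n_past;
        batch.token   [j]    = prompt_tokens[i];
        batch.pos     [j]    = i;                   // positions simply continue from n_past
        batch.n_seq_id[j]    = 1;
        batch.seq_id  [j][0] = seq_id;
        batch.logits  [j]    = (i == n_prompt - 1); // logits only for the last token
    }
    batch.n_tokens = n_prompt - n_past;

    return batch; // caller frees with llama_batch_free()
}

If the hhhhhhh chunk were also reused, the remaining tokens would no longer get contiguous positions, and this simple loop would not be enough.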

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024