[User] Implement Streaming LLM - Make the inference more efficient #3440

errorsandwarnings · 2023-10-02T17:38:36Z

Prerequisites

Context length limit is an issue on all LLMs. The following repository and associated paper is demonstrating that keeping the 4 initial tokens will enable a infinite context length on most common LLMs without sacrificing performance or efficiency.

Code : https://github.com/mit-han-lab/streaming-llm

Paper reference inside the repo which demonstrates the attention-sink effect of LLMs and how to take advantage of it.

Current Behavior

There is a limit on context length defined mostly by pre-training. Other approaches like rope or sliding window have their pros and cons, none of them can get to a higher context length than this apporach.

ggerganov · 2023-10-02T18:19:17Z

Yup, I've already read the paper and the good news is that after #3228 this technique is trivial to support in llama.cpp.
It's technically already implemented by setting n_keep == 4 in main

Dampfinchen · 2023-10-02T18:31:04Z

They also demonstrate how memory efficient this technique is (without it, it runs OOM very early). Does that basically mean infinite context with steady and low VRAM usage? That would be revolutionary and render vectorDBs obsolete.

nathanodle · 2023-10-02T19:50:48Z

They also demonstrate how memory efficient this technique is (without it, it runs OOM very early). Does that basically mean infinite context with steady and low VRAM usage? That would be revolutionary and render vectorDBs obsolete.

This isn't infinite context the way you're reading it. It uses a novel modification to attention so that you can generate "infinite" output

staviq · 2023-10-02T19:59:01Z

If somebody can confirm if I'm understanding that paper right, I'd be grateful:

They are proposing a solution for infinite text length, not infinite context length, right ? And their observation is that the consistency of "internal state" depends heavily on first tokens, so naively keeping initial tokens and implementing sliding context window on the rest, allows the LLM to keep its sanity intact, and first tokens are "important" because appending tokens from one side, makes the first token "exist" for N iterations, second token for N-1 iterations etc, so the first tokens are "seen" by all subsequent iterations but not the other way around ?

errorsandwarnings · 2023-10-02T21:06:34Z

They also demonstrate how memory efficient this technique is (without it, it runs OOM very early). Does that basically mean infinite context with steady and low VRAM usage? That would be revolutionary and render vectorDBs obsolete.

This isn't infinite context the way you're reading it. It uses a novel modification to attention so that you can generate "infinite" output

Yes, it took me a while to understand and their newly posted FAQ clarifies it. I have edited the title. I believe it is a good technique to be part of this project.

Dampfinchen · 2023-10-03T19:38:36Z

True. This one however...

https://github.com/tomaarsen/attention_sinks

is potentially relevant. It's about the input sequence length, so the context this time. Look at that steady memory usage!

KerfuffleV2 · 2023-10-03T19:51:55Z

It's about the input sequence length, so the context this time. Look at that steady memory usage!

The key point is the input sequence length can be very long but the context the model is considering stays constant. So you could feed it a book, have it write a book worth of content but it won't "remember" or take into account what was in the sequence 4,096 tokens ago or whatever.

errorsandwarnings · 2023-10-03T20:16:20Z

The VRAM usage is impressive.

phillip-kravtsov · 2023-10-04T04:11:18Z

I think we need n_keep=4 but also n_discard=1 here to correctly implement StreamingLLM: https://github.com/ggerganov/llama.cpp/blob/master/examples/main/main.cpp#L504.

Assuming that the llama.cpp shifting of the k-cache matches the behavior described in the paper, where the keys are rotated according to their position in the cache rather than the text (which I believe they are but not certain)

ggerganov · 2023-10-04T06:26:51Z

I think we need n_keep=4 but also n_discard=1 here to correctly implement StreamingLLM:

That's correct, although I don't think one really needs to shift each new token. The benefit would be marginal.

SlyEcho · 2023-10-04T12:00:21Z

I think the biggest impact this would give is to remove the expensive reevaluation that is currently being done but the effect is the same as --n-keep 4.

This is the RoPE handling part: modify_llama.py. I need time to understand it...

ggerganov · 2023-10-04T12:16:30Z

There is no longer expensive re-evaluation in main and server since #3228 was merged. Instead of re-evaluating, we are now "shifting" the KV cache which is a relatively cheap operation and in some sense is equivalent to the approach proposed in the paper. I would say we even have an advantage because we have the option to set n_discard == 8 for example which would make RoPE recalculation 1 every 8 tokens instead of on each token as it is done in StreamingLLM

Tostino · 2023-10-05T13:41:05Z

Just wondering, does this give us the option to choose where the sliding window begins? e.g. I have a prompt template as seen here:

<#meta#>
- Date: 2023-10-05
- Task: chat
<#system#>
You are a conversational AI having a turn based chat with a user.
<#chat#>
<#user#>
Message 1
<#bot#>
Response 1
<#user#>
Message 2
<#user_context#>
Some Context
<#bot#>
Response 2
<#user#>
Message 3
<#bot#>
Response 3

Could I anchor this portion:

<#meta#>
- Date: 2023-10-05
- Task: chat
<#system#>
You are a conversational AI having a turn based chat with a user.
<#chat#>

And have the kv cache shift only over the chat portion?

Or am I just misunderstanding things?

ggerganov · 2023-10-05T14:04:28Z

Yes - count the tokens in that portion and set n_keep equal to that number

Tostino · 2023-10-05T16:20:26Z

And is n_keep configurable during inference time? One of the features I was planning on, was integrating an ensemble LLM which can modify the prompt template at specific points during inference. E.g. to change the current task, or change the system prompt to align with the current problem that is being worked on in the response, and then resume inference for example. So the number of tokens in that window may change.

ggerganov · 2023-10-05T16:35:28Z

You can modify it - the API is very flexible. Though to achieve your goal, it would take more than just updating n_keep. One way is to have separate context for each prompt that you evaluate once at the start. Or you can have one context and different sequence ids for the different prompts.

github-actions · 2024-04-03T01:15:25Z

This issue was closed because it has been inactive for 14 days since being marked as stale.

MoonRide303 · 2024-04-03T08:00:06Z

@ggerganov Is it possible to use StreamingLLM / Windowed Attention with Attention Sinks method as defined in paper via main, currently? If so - how? Something like --keep 4 or --keep 8 will be enough? I don't see --n_keep or --n-discard in main options list. Results in paper looked really impressive - basically infite chats with low PPL (they've tested it up to 4 million token), without extra computational / memory costs - shouldn't it be default setting for compatible models? Or I am missing something, and there are some downsides not mentioned in the paper?

ggerganov · 2024-04-04T07:10:02Z

The discussion above is still valid, you can use the --keep argument to control the sink size:

llama.cpp/common/common.cpp

Line 1386 in 4399f13

    
           printf("  --keep N              number of tokens to keep from the initial prompt (default: %d, -1 = all)\n", params.n_keep);

The n_discard is currently not exposed for modification through the CLI, but it is easy to adjust in the code:

llama.cpp/examples/main/main.cpp

Line 554 in 4399f13

const int n_discard = n_left/2;

As discussed earlier, it is likely required to set n_discard == 1 in order to match the implementation from the paper.

Results in paper looked really impressive .. Or I am missing something, and there are some downsides not mentioned in the paper?

You might be missing that even though the PPL is low, it does no mean that the entire information from the processed context is "visible". The model will still "see" just the last n_ctx tokens. More info:

hypothesis: llama: implement YaRN RoPE scaling #2268 (comment)
test: examples : add passkey test #3856
conclusion: llama: implement YaRN RoPE scaling #2268 (comment)

errorsandwarnings changed the title ~~[User] Implement Steaming LLM - Let's remove limit on context length~~ [User] Implement Steaming LLM - Make the inference more efficient Oct 2, 2023

SlyEcho changed the title ~~[User] Implement Steaming LLM - Make the inference more efficient~~ [User] Implement Streaming LLM - Make the inference more efficient Oct 4, 2023

Liuxyly mentioned this issue Oct 15, 2023

Implement Streaming LLM ollama/ollama#792

Closed

HiroseKoichi mentioned this issue Oct 25, 2023

[FEATURE_REQUEST] Trimming the chat history in chunks to speed up prompt ingestion for llama.cpp SillyTavern/SillyTavern#1278

Closed

twoletters mentioned this issue Nov 29, 2023

Add n_keep parameter to LLama constructor to enable Streaming-LLM abetlen/llama-cpp-python#954

Open

github-actions bot added the stale label Mar 19, 2024

MoonRide303 mentioned this issue Mar 29, 2024

ContextShift sometimes degrades output LostRuins/koboldcpp#550

Closed

4 tasks

github-actions bot closed this as completed Apr 3, 2024

GameOverFlowChart mentioned this issue May 3, 2024

n_keep parameter Vali-98/ChatterUI#19

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[User] Implement Streaming LLM - Make the inference more efficient #3440

[User] Implement Streaming LLM - Make the inference more efficient #3440

errorsandwarnings commented Oct 2, 2023

ggerganov commented Oct 2, 2023

Dampfinchen commented Oct 2, 2023 •

edited

Loading

nathanodle commented Oct 2, 2023

staviq commented Oct 2, 2023

errorsandwarnings commented Oct 2, 2023

Dampfinchen commented Oct 3, 2023 •

edited

Loading

KerfuffleV2 commented Oct 3, 2023

errorsandwarnings commented Oct 3, 2023

phillip-kravtsov commented Oct 4, 2023

ggerganov commented Oct 4, 2023

SlyEcho commented Oct 4, 2023

ggerganov commented Oct 4, 2023 •

edited

Loading

Tostino commented Oct 5, 2023

ggerganov commented Oct 5, 2023

Tostino commented Oct 5, 2023

ggerganov commented Oct 5, 2023

github-actions bot commented Apr 3, 2024

MoonRide303 commented Apr 3, 2024

ggerganov commented Apr 4, 2024

[User] Implement Streaming LLM - Make the inference more efficient #3440

[User] Implement Streaming LLM - Make the inference more efficient #3440

Comments

errorsandwarnings commented Oct 2, 2023

Prerequisites

Current Behavior

ggerganov commented Oct 2, 2023

Dampfinchen commented Oct 2, 2023 • edited Loading

nathanodle commented Oct 2, 2023

staviq commented Oct 2, 2023

errorsandwarnings commented Oct 2, 2023

Dampfinchen commented Oct 3, 2023 • edited Loading

KerfuffleV2 commented Oct 3, 2023

errorsandwarnings commented Oct 3, 2023

phillip-kravtsov commented Oct 4, 2023

ggerganov commented Oct 4, 2023

SlyEcho commented Oct 4, 2023

ggerganov commented Oct 4, 2023 • edited Loading

Tostino commented Oct 5, 2023

ggerganov commented Oct 5, 2023

Tostino commented Oct 5, 2023

ggerganov commented Oct 5, 2023

github-actions bot commented Apr 3, 2024

MoonRide303 commented Apr 3, 2024

ggerganov commented Apr 4, 2024

Dampfinchen commented Oct 2, 2023 •

edited

Loading

Dampfinchen commented Oct 3, 2023 •

edited

Loading

ggerganov commented Oct 4, 2023 •

edited

Loading