Infinite loop of "context shift" #3969
Comments
I can confirm this issue: I'm sporadically getting the same here with some models, especially when using grammars. However, it also seems to happen without grammars, with plain text only. I can hit it programmatically when using grammars with a bunch of requests running in sequence.
I constantly see this error using the …
Seeing this today with Mistral 7B, on or off GPU, latest code.
Same issue here with llama-2-70b-chat.
Another confirmation: this time with deepseek-coder-6.7b-instruct.Q5_K_M.gguf.
Another reproducer seems to be TinyLlama too: mudler/LocalAI#1447 (comment)
Hmm, I have experienced this issue as well in the past.
I can reproduce the problem when using the parallel request feature of the server with … After setting the processing slots to …
I could also reproduce it with a server using one single slot, when the model generated content that exceeded the context size, which may happen (rarely) if no stop symbol is generated. An easy way to avoid it seems to be limiting the maximum number of generated tokens by setting the "n_predict" parameter in the request (which is not used or mentioned in the above examples).
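For illustration, a minimal TypeScript sketch of such a request (it assumes the llama.cpp server's /completion endpoint and its n_predict field; the URL and the 256-token cap are arbitrary choices, not values from this thread):

```typescript
// Hedged sketch: cap the number of generated tokens per request so that
// prompt + generation stays within the slot's context window.
async function completeWithCap(prompt: string): Promise<string> {
  const res = await fetch("http://127.0.0.1:8080/completion", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      prompt,
      n_predict: 256, // upper bound on generated tokens for this request
    }),
  });
  const data = await res.json();
  return data.content; // generated text, assuming the default non-streaming response
}
```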
Can someone with a repro check if the following patch resolves the issue:

```diff
diff --git a/examples/server/server.cpp b/examples/server/server.cpp
index 79eacf82..2d97f8ab 100644
--- a/examples/server/server.cpp
+++ b/examples/server/server.cpp
@@ -1680,7 +1680,7 @@ struct llama_server_context
             {
                 // Shift context
                 const int n_left = slot.n_past - slot.params.n_keep - 1;
-                const int n_discard = n_left / 2;
+                const int n_discard = std::min(n_left, 32);
                 LOG_TEE("slot %d: context shift - n_keep = %d, n_left = %d, n_discard = %d\n", slot.id, slot.params.n_keep, n_left, n_discard);
                 llama_kv_cache_seq_rm (ctx, slot.id, slot.params.n_keep + 1 , slot.params.n_keep + n_discard + 1);
```
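For reference, a small TypeScript sketch of the arithmetic this patch touches, using the numbers from the log line in the original report (n_keep = 4092, n_left = 2); it only illustrates why a tiny n_left makes the shift fire again after almost every generated token, and is not the actual server code:

```typescript
// Current behaviour: discard half of the shiftable tokens.
function nDiscardHalf(nPast: number, nKeep: number): number {
  const nLeft = nPast - nKeep - 1; // tokens that are allowed to be discarded
  return Math.floor(nLeft / 2);
}

// Behaviour proposed by the patch: discard at most 32 tokens.
function nDiscardCapped(nPast: number, nKeep: number): number {
  const nLeft = nPast - nKeep - 1;
  return Math.min(nLeft, 32);
}

// With n_keep = 4092 and n_past = 4095 (so n_left = 2), only 1–2 tokens are
// freed per shift, so another shift is needed almost immediately.
console.log(nDiscardHalf(4095, 4092));   // 1
console.log(nDiscardCapped(4095, 4092)); // 2
```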
Sorry, but the patch has not resolved the issue for me. Server log: …
This is my client code for the server, using TypeScript, very simple:

```typescript
import { OpenAI, ChatOpenAI } from "@langchain/openai";
import { HumanMessage, SystemMessage } from "@langchain/core/messages";

async function main4() {
  const model = new ChatOpenAI({
    openAIApiKey: "YOUR-API-KEY", // In Node.js defaults to process.env.OPENAI_API_KEY
    configuration: {
      // baseURL: "http://localhost:5001/v1",
      baseURL: "http://127.0.0.1:8080/v1", // llamafile
    },
    temperature: 0.9,
  });
  const res = await model.invoke([new HumanMessage("xin chào?")]);
  console.log({ res });
}

main4();
```

I see the same behaviour using llamafile 0.6 with … and koboldcpp-rocm with …, since both are built on top of llama.cpp. I guess some parameter causes this issue, not the content or the model itself. Do you have any clue? If the bug comes from the TypeScript client, it must be some issue with the payload or config. Maybe I can change a parameter to test? @ggerganov this is the OpenAI example: …
This is my result with llamafile, and I get infinite generation.

Server log: …

Using the llamafile server with …, the llamafile result:

```
{
"timestamp": 1705547220,
"level": "VERBOSE",
"function": "process_token",
"line": 1123,
"message": "next token",
"token": 2659,
"token_text": "User",
"has_next_token": true,
"n_remain": 389,
"num_tokens_predicted": 11,
"stopped_eos": false,
"stopped_word": false,
"stopped_limit": false,
"stopping_word": ""
}
{
"timestamp": 1705547220,
"level": "VERBOSE",
"function": "operator()",
"line": 2902,
"message": "data stream",
"to_send": "data: {\"content\":\"\",\"multimodal\":false,\"slot_id\":0,\"stop\":false}\n\n"
}
{
"timestamp": 1705547220,
"level": "VERBOSE",
"function": "process_token",
"line": 1123,
"message": "next token",
"token": 29901,
"token_text": ":",
"has_next_token": false,
"n_remain": 389,
"num_tokens_predicted": 12,
"stopped_eos": false,
"stopped_word": true,
"stopped_limit": false,
"stopping_word": "User:"
}
{
"timestamp": 1705547220,
"level": "VERBOSE",
"function": "operator()",
"line": 2902,
"message": "data stream",
"to_send": "data: {\"content\":\"\",\"multimodal\":false,\"slot_id\":0,\"stop\":false}\n\n"
}
print_timings: prompt eval time =  108.71 ms /  56 tokens ( 1.94 ms per token, 515.14 tokens per second)
print_timings:        eval time =  209.11 ms /  12 runs   (17.43 ms per token,  57.39 tokens per second)
print_timings:       total time =  317.82 ms
slot 0 released (69 tokens in cache)
```

The server result is the one with the infinite generation; the only difference is that to_send has data.
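Regarding "maybe I can change a parameter to test": one client-side parameter worth trying is a token cap. A hedged sketch (it assumes @langchain/openai's maxTokens option, which is sent as the OpenAI-compatible max_tokens field; whether the server maps that onto n_predict should be verified, and the value 256 is illustrative):

```typescript
import { ChatOpenAI } from "@langchain/openai";
import { HumanMessage } from "@langchain/core/messages";

// Same client as above, but with an explicit cap on generated tokens so a
// request cannot run far past the context window if no stop word is produced.
async function main5() {
  const model = new ChatOpenAI({
    openAIApiKey: "YOUR-API-KEY",
    configuration: {
      baseURL: "http://127.0.0.1:8080/v1", // llamafile / llama.cpp server
    },
    temperature: 0.9,
    maxTokens: 256, // sent as OpenAI `max_tokens`
  });
  const res = await model.invoke([new HumanMessage("xin chào?")]);
  console.log({ res });
}

main5();
```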
Same problem here, running openchat-3.5-1210 Q8_0 with 4 slots, Mac M1.
For everyone else who has this issue: can you test with another model, like …? P.S.: I still have this issue; it looks like it happens at random.
The same infinite loop with NeuralBeagle and LocalAI 2.7.0.
This bug only appears if a request slot exceeds its available context size. We simply worked around this problem by using a model with a context size that fits our use cases. We ran into this bug quite often because we did not understand the implications of using … So the bug is still there and will (sometimes) be triggered by exceeding the available context size of a request slot. This can be reproduced "reliably" by loading a model with …
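To make the "available context size of a request slot" point concrete, here is a small TypeScript sketch of the arithmetic; it assumes the server divides --ctx-size evenly across the --parallel slots, and all numbers are illustrative:

```typescript
// Per-slot context when the total context is split across parallel slots.
function contextPerSlot(ctxSize: number, parallelSlots: number): number {
  return Math.floor(ctxSize / parallelSlots);
}

// e.g. --ctx-size 2048 with --parallel 4 leaves only 512 tokens per slot.
const perSlot = contextPerSlot(2048, 4); // 512
const promptTokens = 400;                // illustrative prompt length
const roomForGeneration = perSlot - promptTokens; // 112 tokens before the slot overflows
console.log({ perSlot, roomForGeneration });
// Any request whose prompt plus (uncapped) generation exceeds `perSlot`
// will push the slot into repeated context shifts.
```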
An infinite context loop might as well trigger an infinite loop of context shifting if the model hallucinates and does not stop answering. This has the unpleasant effect that the prediction never terminates, which is especially the case with small models, which tend to hallucinate. Works around #1333 by removing context shifting. See also upstream issue: ggerganov/llama.cpp#3969
It is really easy to trigger this bug now: just set a very small context size (I did it here by running phi-2 and specifying a context size of 10) with a prompt that does not follow what the model was fine-tuned on; that will likely put the model in a condition where it hallucinates and keeps going forever.
@diegottt this is going to be worked around in LocalAI in the next releases (by disabling context shifting entirely).
@ggerganov as a workaround, it's possible to hard cap the maximum number of tokens to be generated with #5549 and stop the infinite loop.

Prompt: … Logs: …
@tihanyi could you please confirm?
The user can set … I am closing the issue, and I have documented it in a wrong_usage.feature scenario, but maybe the default … Feel free to reopen if I missed something here. Note: I did not test the …
* server: tests: init scenarios
  - health and slots endpoints
  - completion endpoint
  - OAI compatible chat completion requests w/ and without streaming
  - completion multi users scenario
  - multi users scenario on OAI compatible endpoint with streaming
  - multi users with total number of tokens to predict exceeds the KV Cache size
  - server wrong usage scenario, like in Infinite loop of "context shift" #3969
  - slots shifting
  - continuous batching
  - embeddings endpoint
  - multi users embedding endpoint: Segmentation fault #5655
  - OpenAI-compatible embeddings API
  - tokenize endpoint
  - CORS and api key scenario
* server: CI GitHub workflow

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
@phymbert That would not fix the problem, because the bug is caused by overflowing the context window of a model, which holds the prompt tokens plus the predicted tokens.
Noted. It would be nice if you could add a scenario to the server test framework.
Same issue here with qwen1_5-1_8b-chat-q4_0.gguf, blossom-v3-baichuan2-7b.Q4_K_M.gguf and other models on a Xiaomi 14.
I guess it's not a model issue: with the same model, Vulkan hangs but ROCm still works. It looks like a GPU/device issue.
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Expected Behavior
llama.cpp (server) processes inputs
Current Behavior
When chatting with the LLM through server (and api_like_OAI.py) it works for a bit, but then, seemingly when --ctx-size is exceeded, it gets into an infinite loop of context_shifts. I have mostly seen:

```
slot 0: context shift - n_keep = 4092, n_left = 2, n_discard = 1
```

but am currently looking at:

```
slot 0: context shift - n_keep = 2224, n_left = 3894, n_discard = 1947
```

It just keeps repeating this at near-full GPU usage without ever continuing. I have to restart server.

Environment and Context
I've seen this happen both on the Windows (llama-b1492-bin-win-cublas-cu12.2.0-x64.zip) host as well as on WSL2 (tag b1492, make LLAMA_CUBLAS=1), with:

```
server -t 16 -m deepseek-coder-33b-instruct.Q4_K_S.gguf -c 6120 --timeout 30 -ngl 65
```
Note that these are the several-times-corrected GGUFs from TheBloke, and the latest at the time of writing (there was a tokenizer issue before). md5sum:

```
19a1079a27fd5a6925a34076de8fbf74  deepseek-coder-33b-instruct.Q4_K_S.gguf
```
From WSL2:

```
Linux Jorrit 5.10.43.3-microsoft-standard-WSL2 #1 SMP Wed Jun 16 23:47:55 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
```
Failure Information (for bugs)
Please help provide information about the failure / bug.
Steps to Reproduce
Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.
1. server -t 16 -m deepseek-coder-33b-instruct.Q4_K_S.gguf -c 6120 --timeout 30 -ngl 65
2. python api_like_OAI.py --chat-prompt "You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer.\n" --user-name "\n### Instruction:\n" --ai-name "\n### Response:\n" --system-name "\n"
3. Chat until the context shift loop starts.

Failure Logs