Infinite loop of "context shift" #3969

Closed
Chainfire opened this issue Nov 6, 2023 · 23 comments

Comments

@Chainfire

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

llama.cpp (server) processes inputs

Current Behavior

When chatting with the LLM through server (and api_like_OAI.py) it works for a bit, but then, seemingly when --ctx-size is exceeded, it gets into an infinite loop of context shifts:

I have mostly seen:

slot 0: context shift - n_keep = 4092, n_left = 2, n_discard = 1

but am currently looking at:

slot 0: context shift - n_keep = 2224, n_left = 3894, n_discard = 1947

It just keeps repeating this at near-full GPU usage without ever continuing. I have to restart the server.

Environment and Context

I've seen this happen both on the Windows (llama-b1492-bin-win-cublas-cu12.2.0-x64.zip) host as well as on WSL2 (tag b1492, make LLAMA_CUBLAS=1), with:

server -t 16 -m deepseek-coder-33b-instruct.Q4_K_S.gguf -c 6120 --timeout 30 -ngl 65

Note that these are the several-times-corrected GGUFs from TheBloke, the latest at the time of writing (there was a tokenizer issue before). md5sum 19a1079a27fd5a6925a34076de8fbf74 deepseek-coder-33b-instruct.Q4_K_S.gguf

  • Physical (or virtual) hardware you are using, e.g. for Linux:

From WSL2:

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   48 bits physical, 48 bits virtual
CPU(s):                          32
On-line CPU(s) list:             0-31
Thread(s) per core:              2
Core(s) per socket:              16
Socket(s):                       1
Vendor ID:                       AuthenticAMD
CPU family:                      23
Model:                           8
Model name:                      AMD Ryzen Threadripper 2950X 16-Core Processor
Stepping:                        2
CPU MHz:                         3493.482
BogoMIPS:                        6986.96
Virtualization:                  AMD-V
Hypervisor vendor:               Microsoft
Virtualization type:             full
L1d cache:                       512 KiB
L1i cache:                       1 MiB
L2 cache:                        8 MiB
L3 cache:                        32 MiB
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full AMD retpoline, IBPB conditional, STIBP disabled, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr virt_ssbd arat npt nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload
  • Operating System, e.g. for Linux:

Linux Jorrit 5.10.43.3-microsoft-standard-WSL2 #1 SMP Wed Jun 16 23:47:55 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

  • SDK version, e.g. for Linux:
Python 3.10.13
GNU Make 4.2.1
g++ (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0

Failure Information (for bugs)

Please help provide information about the failure / bug.

Steps to Reproduce

Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.

  1. server -t 16 -m deepseek-coder-33b-instruct.Q4_K_S.gguf -c 6120 --timeout 30 -ngl 65
  2. python api_like_OAI.py --chat-prompt "You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer.\n" --user-name "\n### Instruction:\n" --ai-name "\n### Response:\n" --system-name "\n"
  3. Talk to the API and exceed the context size (I use Aider's test benchmark, which is tricky to get working, but instructions are available if interested)
  4. Infinite context shift loop

Failure Logs

ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9
{"timestamp":1699281996,"level":"INFO","function":"main","line":2267,"message":"build info","build":1492,"commit":"2833a6f"}
{"timestamp":1699281996,"level":"INFO","function":"main","line":2274,"message":"system info","n_threads":16,"n_threads_batch":-1,"total_threads":32,"system_info":"AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | "}
llama_model_loader: loaded meta data with 22 key-value pairs and 561 tensors from s:\WizardCoder34B\deepseek-coder-33b-instruct.Q4_K_S.gguf (version GGUF V3 (latest))

( ... llama_model_loader ... )

llama_model_loader: - kv   0:                       general.architecture str
llama_model_loader: - kv   1:                               general.name str
llama_model_loader: - kv   2:                       llama.context_length u32
llama_model_loader: - kv   3:                     llama.embedding_length u32
llama_model_loader: - kv   4:                          llama.block_count u32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32
llama_model_loader: - kv   7:                 llama.attention.head_count u32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32
llama_model_loader: - kv  10:                       llama.rope.freq_base f32
llama_model_loader: - kv  11:                    llama.rope.scale_linear f32
llama_model_loader: - kv  12:                          general.file_type u32
llama_model_loader: - kv  13:                       tokenizer.ggml.model str
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr
llama_model_loader: - kv  15:                      tokenizer.ggml.scores arr
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32
llama_model_loader: - kv  21:               general.quantization_version u32
llama_model_loader: - type  f32:  125 tensors
llama_model_loader: - type q4_K:  427 tensors
llama_model_loader: - type q5_K:    8 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: mismatch in special tokens definition ( 243/32256 vs 237/32256 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 32256
llm_load_print_meta: n_merges         = 31757
llm_load_print_meta: n_ctx_train      = 16384
llm_load_print_meta: n_embd           = 7168
llm_load_print_meta: n_head           = 56
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 62
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 7
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 19200
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 100000.0
llm_load_print_meta: freq_scale_train = 0.25
llm_load_print_meta: n_yarn_orig_ctx  = 16384
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = mostly Q4_K - Small
llm_load_print_meta: model params     = 33.34 B
llm_load_print_meta: model size       = 17.59 GiB (4.53 BPW)
llm_load_print_meta: general.name   = deepseek-ai_deepseek-coder-33b-instruct
llm_load_print_meta: BOS token = 32013 '<|begin▁of▁sentence|>'
llm_load_print_meta: EOS token = 32021 '<|EOT|>'
llm_load_print_meta: PAD token = 32014 '<|end▁of▁sentence|>'
llm_load_print_meta: LF token  = 30 '?'
llm_load_tensors: ggml ctx size =    0.21 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  =  124.24 MB
llm_load_tensors: offloading 62 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 65/65 layers to GPU
llm_load_tensors: VRAM used: 17891.45 MB
...................................................................................................
llama_new_context_with_model: n_ctx      = 6120
llama_new_context_with_model: freq_base  = 100000.0
llama_new_context_with_model: freq_scale = 0.25
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 1482.19 MB
llama_new_context_with_model: kv self size  = 1482.19 MB
llama_build_graph: non-view tensors processed: 1430/1430
llama_new_context_with_model: compute buffer total size = 729.96 MB
llama_new_context_with_model: VRAM scratch buffer: 723.33 MB
llama_new_context_with_model: total VRAM used: 20096.97 MB (model: 17891.45 MB, context: 2205.52 MB)
Available slots:
 -> Slot 0 - max context: 6120

llama server listening at http://0.0.0.0:8080

( ... lots of API calls ... )

print_timings: prompt eval time =     514.27 ms /   521 tokens (    0.99 ms per token,  1013.09 tokens per second)
print_timings:        eval time =    9365.17 ms /   250 runs   (   37.46 ms per token,    26.69 tokens per second)
print_timings:       total time =    9879.43 ms
slot 0 released (1119 tokens in cache)
{"timestamp":1699284174,"level":"INFO","function":"log_server_request","line":2217,"message":"request","remote_addr":"172.22.146.2","remote_port":57682,"status":200,"method":"POST","path":"/completion","params":{}}
slot 0 is processing [task id: 204]
slot 0 : in cache: 347 tokens | to process: 934 tokens
slot 0 : kv cache rm - [347, end)

print_timings: prompt eval time =     845.49 ms /   934 tokens (    0.91 ms per token,  1104.68 tokens per second)
print_timings:        eval time =   13463.77 ms /   352 runs   (   38.25 ms per token,    26.14 tokens per second)
print_timings:       total time =   14309.26 ms
slot 0 released (1634 tokens in cache)
{"timestamp":1699284188,"level":"INFO","function":"log_server_request","line":2217,"message":"request","remote_addr":"172.22.146.2","remote_port":57686,"status":200,"method":"POST","path":"/completion","params":{}}
slot 0 is processing [task id: 205]
slot 0 : in cache: 336 tokens | to process: 1888 tokens
slot 0 : kv cache rm - [336, end)
slot 0: context shift - n_keep = 2224, n_left = 3894, n_discard = 1947
slot 0: context shift - n_keep = 2224, n_left = 3894, n_discard = 1947
slot 0: context shift - n_keep = 2224, n_left = 3894, n_discard = 1947
slot 0: context shift - n_keep = 2224, n_left = 3894, n_discard = 1947
slot 0: context shift - n_keep = 2224, n_left = 3894, n_discard = 1947
slot 0: context shift - n_keep = 2224, n_left = 3894, n_discard = 1947
slot unavailable
{"timestamp":1699284790,"level":"INFO","function":"log_server_request","line":2217,"message":"request","remote_addr":"172.22.146.2","remote_port":57694,"status":404,"method":"POST","path":"/completion","params":{}}
slot unavailable
{"timestamp":1699284790,"level":"INFO","function":"log_server_request","line":2217,"message":"request","remote_addr":"172.22.146.2","remote_port":57698,"status":404,"method":"POST","path":"/completion","params":{}}
slot unavailable
{"timestamp":1699284791,"level":"INFO","function":"log_server_request","line":2217,"message":"request","remote_addr":"172.22.146.2","remote_port":57702,"status":404,"method":"POST","path":"/completion","params":{}}
slot unavailable
{"timestamp":1699284797,"level":"INFO","function":"log_server_request","line":2217,"message":"request","remote_addr":"172.22.146.2","remote_port":57706,"status":404,"method":"POST","path":"/completion","params":{}}
slot unavailable
{"timestamp":1699284803,"level":"INFO","function":"log_server_request","line":2217,"message":"request","remote_addr":"172.22.146.2","remote_port":57710,"status":404,"method":"POST","path":"/completion","params":{}}
slot unavailable
{"timestamp":1699284833,"level":"INFO","function":"log_server_request","line":2217,"message":"request","remote_addr":"172.22.146.2","remote_port":57714,"status":404,"method":"POST","path":"/completion","params":{}}
slot 0: context shift - n_keep = 2224, n_left = 3894, n_discard = 1947
slot unavailable
{"timestamp":1699284864,"level":"INFO","function":"log_server_request","line":2217,"message":"request","remote_addr":"172.22.146.2","remote_port":57718,"status":404,"method":"POST","path":"/completion","params":{}}
slot unavailable
{"timestamp":1699284900,"level":"INFO","function":"log_server_request","line":2217,"message":"request","remote_addr":"172.22.146.2","remote_port":57722,"status":404,"method":"POST","path":"/completion","params":{}}
slot 0: context shift - n_keep = 2224, n_left = 3894, n_discard = 1947
slot unavailable
{"timestamp":1699284975,"level":"INFO","function":"log_server_request","line":2217,"message":"request","remote_addr":"172.22.146.2","remote_port":57726,"status":404,"method":"POST","path":"/completion","params":{}}
slot 0: context shift - n_keep = 2224, n_left = 3894, n_discard = 1947
slot 0: context shift - n_keep = 2224, n_left = 3894, n_discard = 1947
(repeats forever)
@mudler
Contributor

mudler commented Nov 25, 2023

I can confirm this issue: I'm sporadically getting the same here with some models, especially when using grammars. However, it also seems to happen without grammars, with plain text only.

I can hit it programmatically if I use grammars with a bunch of requests running in sequence.

@e-ago

e-ago commented Nov 29, 2023

I constantly see this error using the phind-codellama-34b-v2.Q5_K_M.gguf model.
Is there a workaround? Or should we just wait for the fix?

@SteveC

SteveC commented Dec 6, 2023

Seeing this today with Mistral 7B, on or off GPU, latest code.

@chrism-qmul

Same issue here with llama-2-70b-chat

@greenfoo

Another confirmation: this time with deepseek-coder-6.7b-instruct.Q5_K_M.gguf

@mudler
Contributor

mudler commented Dec 16, 2023

Another reproducer seems to be TinyLlama: mudler/LocalAI#1447 (comment)

@l4b4r4b4b4

l4b4r4b4b4 commented Jan 2, 2024

Hmm, I have experienced this issue as well in the past.
I have the feeling it is connected to the context containing certain special characters that perplex the respective model.

/EDIT
Setting timeouts both on the call and on the server seems to prevent all slots from getting jammed and the server from completely stalling when doing repeated or concurrent calls.
/EDIT 2
One thing I just came across is that I had introduced too many newline characters (\n) in the instruction prompt (using Sauerkraut-Mixtral-Instruct 8_0 GGUF). The model especially does not like consecutive \n\n characters.

@countzero

I can reproduce the problem when using the parallel request feature of the server with 10 parallel processing slots:

1:02PM DBG GRPC(dolphin-2_2-yi-34b.Q4_K_M.gguf-127.0.0.1:36141): stderr slot 9: context shift - n_keep = 0, n_left = 407, n_discard = 203
1:02PM DBG GRPC(dolphin-2_2-yi-34b.Q4_K_M.gguf-127.0.0.1:36141): stderr slot 9: context shift - n_keep = 0, n_left = 407, n_discard = 203
1:02PM DBG GRPC(dolphin-2_2-yi-34b.Q4_K_M.gguf-127.0.0.1:36141): stderr slot 9: context shift - n_keep = 0, n_left = 407, n_discard = 203
1:02PM DBG GRPC(dolphin-2_2-yi-34b.Q4_K_M.gguf-127.0.0.1:36141): stderr slot 9: context shift - n_keep = 0, n_left = 407, n_discard = 203

After setting the number of processing slots to 1, the bug no longer seems to be present.

@tihanyi

tihanyi commented Jan 13, 2024

I could also reproduce it with a server using a single slot, when the model generated content that exceeded the context size, which may happen (rarely) if no stop symbol is generated. But there seems to be an easy way to avoid it: define the maximum number of tokens to generate in the request using the "n_predict" parameter (which is not used or mentioned in the above examples).
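
For illustration, a minimal Python sketch of that workaround, assuming a llama.cpp server listening on localhost:8080; the prompt text and the 512-token cap are placeholders, not values from this issue:

import requests

# Cap generation via n_predict so a missing stop symbol cannot push the
# slot past its context size; keep prompt + n_predict below the slot's context.
payload = {
    "prompt": "Translate to French: decisions on the acceptance of new work items",
    "n_predict": 512,   # placeholder cap
    "stop": ["\n"],
    "temperature": 0,
}

r = requests.post("http://localhost:8080/completion", json=payload, timeout=120)
# the /completion endpoint returns the generated text in the "content" field
print(r.json().get("content"))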

@ggerganov
Owner

Can someone with a repro check if the following patch resolves the issue:

diff --git a/examples/server/server.cpp b/examples/server/server.cpp
index 79eacf82..2d97f8ab 100644
--- a/examples/server/server.cpp
+++ b/examples/server/server.cpp
@@ -1680,7 +1680,7 @@ struct llama_server_context
             {
                 // Shift context
                 const int n_left    = slot.n_past - slot.params.n_keep - 1;
-                const int n_discard = n_left / 2;
+                const int n_discard = std::min(n_left, 32);
 
                 LOG_TEE("slot %d: context shift - n_keep = %d, n_left = %d, n_discard = %d\n", slot.id, slot.params.n_keep, n_left, n_discard);
                 llama_kv_cache_seq_rm   (ctx, slot.id, slot.params.n_keep + 1            , slot.params.n_keep + n_discard + 1);
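
For reference, a toy reconstruction of the pre-patch shift arithmetic, checked against the numbers logged at the top of this issue (an illustration only, not the server code):

# Toy model of the pre-patch context-shift arithmetic (n_discard = n_left / 2),
# not the actual server implementation.
def context_shift(n_past, n_keep):
    n_left = n_past - n_keep - 1
    n_discard = n_left // 2
    return n_left, n_discard

# Values reported in the logs earlier in this issue:
print(context_shift(4095, 4092))  # -> (2, 1):      n_keep = 4092, n_left = 2
print(context_shift(6119, 2224))  # -> (3894, 1947): n_keep = 2224, n_left = 3894

The identical, repeating log lines suggest the shift keeps being re-triggered while generation never stops, which matches the n_predict discussion further down in this thread.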

@tihanyi

tihanyi commented Jan 13, 2024

Sorry, but the patch has not resolved the issue for me.
Here is a simple example of how to reproduce it:
#server:
./server -m llama-2-7b.Q5_K_S.gguf --n-gpu-layers 33 --ctx-size 2048 --parallel 1
#client:
curl --request POST --url http://localhost:8080/completion --header "Content-Type: application/json" --data '{ "prompt": "from English into French.\tEnglish: decisions on the acceptance of new work items that are needed for the fulfilment of the standardisation request; and => French: les décisions relatives à l’acceptation de nouvelles tâches qui sont nécessaires à l’exécution de la demande de normalisation; et\tEnglish: decisions on the acceptance of new work items that are needed for the fulfilment of the standardisation request; => French: ","stop": ["\n"], "temperature": 0, "seed": 42, "slot_id": 0}'

server log:
....
{"timestamp":1705177449,"level":"INFO","function":"main","line":3224,"message":"HTTP server listening","port":"8080","hostname":"127.0.0.1"}
all slots are idle and system prompt is empty, clear the KV cache
slot 0 is processing [task id: 0]
slot 0 : kv cache rm - [0, end)
slot 0: context shift - n_keep = 0, n_left = 2046, n_discard = 32
and this line is repeating infinitely...

@hiepxanh

This is my client code calling the server, using TypeScript; very simple:

import { OpenAI, ChatOpenAI } from "@langchain/openai";
import { HumanMessage, SystemMessage } from "@langchain/core/messages";

async function main4() {
    const model = new ChatOpenAI({
        openAIApiKey: "YOUR-API-KEY", // In Node.js defaults to process.env.OPENAI_API_KEY
        configuration: {
            // baseURL: "http://localhost:5001/v1",
            baseURL: "http://127.0.0.1:8080/v1", // llamafile
        },
        temperature: 0.9,
    });
    const res = await model.invoke([new HumanMessage("xin chào?")]);
    console.log({ res });
}

main4();

Using llamafile 0.6 with tinyBLAS, it works if the request comes from the server UI on localhost. But it instantly gets stuck in this loop if the request comes from the TS server.

Using koboldcpp-rocm with CLBlast, it works with no issue. No infinite generation! Very weird.

Since it is built on top of llama.cpp, I guess some kind of parameter causes this issue, not the content or the model itself. Do you have any clue? If the bug comes from the TS client, it must be some issue with the payload or config. Maybe I can change a parameter to test? @ggerganov

This is the OpenAI example:

const response = await model.call("Tell me a joke.", {
 callbacks: [
   {
     handleLLMNewToken(token: string) {
       console.log({ token });
     },
   },
 ],
});
console.log(response);
/*
{ token: '\n' }
{ token: '\n' }
{ token: 'Q' }
{ token: ':' }
{ token: ' Why' }
{ token: ' did' }
{ token: ' the' }
{ token: ' chicken' }

This is my result with llamafile, where I get infinite generation:


 const response = await model.invoke("Tell me a joke.", {
        callbacks: [
            {
                handleLLMNewToken(token: string) {
                    console.log({ token });
                },
            },
        ],
    });
    console.log(response);

// then result is:
{ token: '' }
{ token: 'Why' }
{ token: ' don' }
{ token: "'" }
{ token: 't' }
{ token: ' scient' }
{ token: 'ists' }
{ token: ' trust' }
{ token: ' atoms' }
{ token: '?' }
{ token: '\n' }
{ token: 'B' }
{ token: 'ecause' }
{ token: ' they' }
{ token: ' make' }
{ token: ' up' }
{ token: ' everything' }
{ token: '.' }
{ token: ' ' }
{ token: '\n' }
{ token: '\n' }
{ token: ' ' }
{ token: '\n' }
{ token: '\n' }
{ token: ' An' }
{ token: 'yon' }
{ token: 'e' }
{ token: ' else' }
{ token: ' is' }
{ token: ' as' }
{ token: 'sis' }
{ token: 'ting' }
{ token: ' this' }
{ token: ' user' }
{ token: ' today' }
{ token: '?' }
{ token: ' ' }
{ token: '\n' }
{ token: '\n' }
{ token: ' ' }

(continues infinitely)

server log:

{"timestamp":1705548276,"level":"INFO","function":"log_server_request","line":2741,"message":"request","remote_addr":"","remote_port":-1,"status":200,"method":"POST","path":"/v1/chat/completions","params":{}}
slot 0 released (395 tokens in cache)
slot 0 is processing [task id: 4]
slot 0 : kv cache rm - [0, end)
slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255
slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255
slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255

Using the llamafile server with --verbose to see what is happening:

The llamafile result

{
    "timestamp": 1705547220,
    "level": "VERBOSE",
    "function": "process_token",
    "line": 1123,
    "message": "next token",
    "token": 2659,
    "token_text": "User",
    "has_next_token": true,
    "n_remain": 389,
    "num_tokens_predicted": 11,
    "stopped_eos": false,
    "stopped_word": false,
    "stopped_limit": false,
    "stopping_word": ""
}
{
    "timestamp": 1705547220,
    "level": "VERBOSE",
    "function": "operator()",
    "line": 2902,
    "message": "data stream",
    "to_send": "data: {\"content\":\"\",\"multimodal\":false,\"slot_id\":0,\"stop\":false}\n\n"
}
{
    "timestamp": 1705547220,
    "level": "VERBOSE",
    "function": "process_token",
    "line": 1123,
    "message": "next token",
    "token": 29901,
    "token_text": ":",
    "has_next_token": false,
    "n_remain": 389,
    "num_tokens_predicted": 12,
    "stopped_eos": false,
    "stopped_word": true,
    "stopped_limit": false,
    "stopping_word": "User:"
}
{
    "timestamp": 1705547220,
    "level": "VERBOSE",
    "function": "operator()",
    "line": 2902,
    "message": "data stream",
    "to_send": "data: {\"content\":\"\",\"multimodal\":false,\"slot_id\":0,\"stop\":false}\n\n"
}

print_timings: prompt eval time =     108.71 ms /    56 tokens (    1.94 ms per token,
515.14 tokens per second)
print_timings:        eval time =     209.11 ms /    12 runs   (   17.43 ms per token,
57.39 tokens per second)
print_timings:       total time =     317.82 ms
slot 0 released (69 tokens in cache)

The server result with the infinite generation:

{
        "timestamp": 1705548561,
        "level": "VERBOSE",
        "function": "process_token",
        "line": 1123,
        "message": "next token",
        "token": 32225,
        "token_text": " nhất",
        "has_next_token": true,
        "n_remain": -1,
        "num_tokens_predicted": 11,
        "stopped_eos": false,
        "stopped_word": false,
        "stopped_limit": false,
        "stopping_word": ""
    }
{
        "timestamp": 1705548561,
        "level": "VERBOSE",
        "function": "operator()",
        "line": 3001,
        "message": "data stream",
        "to_send": "data: {\"choices\":[{\"delta\":{\"content\":\" nhất\"},\"finish_reason\":null,\"index\":0}],\"created\":1705548561,\"id\":\"chatcmpl-JOrfRbw7MP12UUAp5QUANpsJgFE2dyur\",\"model\":\"gpt-3.5-turbo\",\"object\":\"chat.completion.chunk\"}\n\n"
    }
{
        "timestamp": 1705548561,
        "level": "VERBOSE",
        "function": "process_token",
        "line": 1123,
        "message": "next token",
        "token": 29901,
        "token_text": ":",
        "has_next_token": true,
        "n_remain": -1,
        "num_tokens_predicted": 12,
        "stopped_eos": false,
        "stopped_word": false,
        "stopped_limit": false,
        "stopping_word": ""
    }
{
        "timestamp": 1705548561,
        "level": "VERBOSE",
        "function": "operator()",
        "line": 3001,
        "message": "data stream",
        "to_send": "data: {\"choices\":[{\"delta\":{\"content\":\":\"},\"finish_reason\":null,\"index\":0}],\"created\":1705548561,\"id\":\"chatcmpl-IQds0zxRfOLB26PZNFzI7n4Jp8PiEvyt\",\"model\":\"gpt-3.5-turbo\",\"object\":\"chat.completion.chunk\"}\n\n"
    }
{
        "timestamp": 1705548561,
        "level": "VERBOSE",
        "function": "process_token",
        "line": 1123,
        "message": "next token",
        "token": 34413,
        "token_text": " rửa",
        "has_next_token": true,
        "n_remain": -1,
        "num_tokens_predicted": 13,
        "stopped_eos": false,
        "stopped_word": false,
        "stopped_limit": false,
        "stopping_word": ""
    }

llamafile-result.json

server-result.json

The only difference is that to_send has data:

 "to_send": "data: {\"choices\":[{\"delta\":{\"content\":\":\"},\"finish_reason\":null,\"index\":0}],\"created\":1705548561,\"id\":\"chatcmpl-IQds0zxRfOLB26PZNFzI7n4Jp8PiEvyt\",\"model\":\"gpt-3.5-turbo\",\"object\":\"chat.completion.chunk\"}\n\n"
    }

@riddlegit

riddlegit commented Jan 18, 2024

Same problem here, running openchat-3.5-1210 Q8_0 with 4 slots on a Mac M1.

@hiepxanh

hiepxanh commented Jan 20, 2024

For everyone else having this issue: can you test with another model, such as TheBloke/dolphin-2_6-phi-2.Q8_0.gguf?
After changing to another model, I don't see this issue happen.

P.S.: I still have this issue; it looks like it happens randomly.

@diegottt

diegottt commented Feb 1, 2024

The same infinite loop with NeuralBeagle and LocalAI 2.7.0.

@countzero

This bug only appears if a request slot exceeds its available context size. We simply worked around this problem by using a model with a context size that fits our use cases.

We ran into this bug quite often, because we did not understand the implications of using --parallel, --cont-batching and --ctx-size correctly. This explanation by @ggerganov helped a lot: #4130 (comment)

So the bug is still there and will (sometimes) be triggered by exceeding the available context size of a request slot. This can be reproduced "reliably" by loading a model with --ctx-size=2048, --parallel=10 and --cont-batching so that each request slot only has a context size of 204 tokens. Then requesting the server with multiple prompts > 204 tokens will trigger the infinite loop of "context shift" bug.
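
To make the arithmetic behind that 204-token figure explicit, a small sketch (the prompt length below is a made-up example):

# With continuous batching, --ctx-size is divided across the parallel slots.
ctx_size = 2048
parallel = 10
ctx_per_slot = ctx_size // parallel  # 204 tokens per request slot

prompt_tokens = 300  # hypothetical prompt longer than one slot's context
if prompt_tokens > ctx_per_slot:
    print("prompt alone exceeds the slot's context -> endless 'context shift'")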

mudler added a commit to mudler/LocalAI that referenced this issue Feb 13, 2024
An infinite context loop might also trigger an infinite loop of context
shifting if the model hallucinates and does not stop answering.
This has the unpleasant effect that the prediction never terminates,
which is especially the case for small models, which tend to hallucinate.

Works around #1333 by removing
context-shifting.

See also upstream issue: ggerganov/llama.cpp#3969
@mudler
Contributor

mudler commented Feb 13, 2024

This bug only appears if a request slot exceeds its available context size. We simply worked around this problem by using a model with a context size that fits our use cases.

We ran into this bug quite often, because we did not understand the implications of using --parallel, --cont-batching and --ctx-size correctly. This explanation by @ggerganov helped a lot: #4130 (comment)

So the bug is still there and will (sometimes) be triggered by exceeding the available context size of a request slot. This can be reproduced "reliably" by loading a model with --ctx-size=2048, --parallel=10 and --cont-batching so that each request slot only has a context size of 204 tokens. Then requesting the server with multiple prompts > 204 tokens will trigger the infinite loop of "context shift" bug.

It is really easy to trigger this bug now: just set a very small context size (I did it here by running phi-2 and specifying a context size of 10) with a prompt that does not follow what the model was fine-tuned on; that will likely put the model in a condition to hallucinate and keep going forever.

The same infinite loop with neauralbeagle and localai 2.7.0

@diegottt this is going to be worked around in LocalAI in the next releases (by disabling context shifting entirely)

@phymbert
Collaborator

phymbert commented Feb 18, 2024

Sorry, but the patch has not resolved the issue for me. Here is a simple example how to generate: #server: ./server -m llama-2-7b.Q5_K_S.gguf --n-gpu-layers 33 --ctx-size 2048 --parallel 1 #client: curl --request POST --url http://localhost:8080/completion --header "Content-Type: application/json" --data '{ "prompt": "from English into French.\tEnglish: decisions on the acceptance of new work items that are needed for the fulfilment of the standardisation request; and => French: les décisions relatives à l’acceptation de nouvelles tâches qui sont nécessaires à l’exécution de la demande de normalisation; et\tEnglish: decisions on the acceptance of new work items that are needed for the fulfilment of the standardisation request; => French: ","stop": ["\n"], "temperature": 0, "seed": 42, "slot_id": 0}'

server log: .... {"timestamp":1705177449,"level":"INFO","function":"main","line":3224,"message":"HTTP server listening","port":"8080","hostname":"127.0.0.1"} all slots are idle and system prompt is empty, clear the KV cache slot 0 is processing [task id: 0] slot 0 : kv cache rm - [0, end) slot 0: context shift - n_keep = 0, n_left = 2046, n_discard = 32 and this line is repeating infinitely...

@ggerganov as a workaround, it's possible to hard cap the maximum tokens to be generated with #5549 and stop the infinite loop:

./server -m llama-2-7b.Q5_K_S.gguf --n-gpu-layers 33 --ctx-size 2048 --parallel 1 --n-predict 2048

Prompt:
curl --request POST --url http://localhost:8080/completion --header "Content-Type: application/json" --data '{ "prompt": "from English into French.\tEnglish: decisions on the acceptance of new work items that are needed for the fulfilment of the standardisation request; and => French: les décisions relatives à l’acceptation de nouvelles tâches qui sont nécessaires à l’exécution de la demande de normalisation; et\tEnglish: decisions on the acceptance of new work items that are needed for the fulfilment of the standardisation request; => French: ","stop": ["\n"], "temperature": 0, "seed": 42, "slot_id": 0}'

Logs:

slot 0 is processing [task id: 0]
slot 0 : kv cache rm - [0, end)
..
slot 0: context shift - n_keep = 0, n_left = 2046, n_discard = 1023
..
print_timings: prompt eval time =     990.36 ms /   101 tokens (    9.81 ms per token,   101.98 tokens per second)
print_timings:        eval time = 1714829.03 ms /  2048 runs   (  837.32 ms per token,     1.19 tokens per second)
print_timings:       total time = 1715819.39 ms
slot 0 released (1126 tokens in cache)

@tihanyi could you please confirm?

@phymbert
Collaborator

phymbert commented Feb 23, 2024

The user can set the --n-predict option to cap the number of tokens any completion request can generate, or pass n_predict/max_tokens in the request body. Otherwise an infinite-loop scenario can occur if the model hallucinates and does not stop answering.

I am closing the issue. I have documented this in a wrong_usage.feature scenario, but maybe the default --n-predict should be set to --ctx-size.

Feel free to reopen if I miss something here.

Note: I did not test the --timeout option behavior on infinite generation.

phymbert added a commit that referenced this issue Feb 24, 2024
* server: tests: init scenarios
 - health and slots endpoints
 - completion endpoint
 - OAI compatible chat completion requests w/ and without streaming
 - completion multi users scenario
 - multi users scenario on OAI compatible endpoint with streaming
 - multi users with total number of tokens to predict exceeds the KV Cache size
 - server wrong usage scenario, like in Infinite loop of "context shift" #3969
 - slots shifting
 - continuous batching
 - embeddings endpoint
 - multi users embedding endpoint: Segmentation fault #5655
 - OpenAI-compatible embeddings API
 - tokenize endpoint
 - CORS and api key scenario

* server: CI GitHub workflow


---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
@countzero

[...] maybe the default --n-predict must be set to --ctx-size.

@phymbert That would not fix the problem, because the bug is caused by overflowing the context window of the model, which has to hold the prompt tokens plus the predicted tokens.
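
A rough illustration of that point with made-up numbers: even if n_predict were capped at the context size, the window still has to hold the prompt as well.

ctx_size = 2048          # --ctx-size
prompt_tokens = 1500     # hypothetical long prompt
n_predict = ctx_size     # proposed default cap

# The KV cache must hold prompt + generated tokens, so this can still overflow.
if prompt_tokens + n_predict > ctx_size:
    print("overflow: capping n_predict alone ignores the prompt length")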

@phymbert
Collaborator

[...] maybe the default --n-predict must be set to --ctx-size.

@phymbert That would not fix the problem because the bug is caused by overflowing the context window of a model which holds the prompt tokens plus the predicted tokens.

Noted. It would be nice if you could add a scenario to the server test framework.

jordankanter pushed a commit to jordankanter/llama.cpp that referenced this issue Mar 13, 2024
@zhouwg
Contributor

zhouwg commented Mar 28, 2024

Same issue here with qwen1_5-1_8b-chat-q4_0.gguf, blossom-v3-baichuan2-7b.Q4_K_M.gguf, and other models on a Xiaomi 14.

@hiepxanh

I guess it's not a model issue: with the same model, Vulkan hangs while ROCm still works. It looks like a GPU/backend issue.

hodlen pushed a commit to hodlen/llama.cpp that referenced this issue Apr 1, 2024