Infinite loop of "context shift" #3969

Closed
Chainfire opened this issue Nov 6, 2023 · 23 comments

Comments

@Chainfire

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

llama.cpp (server) processes inputs

Current Behavior

When chatting with the LLM through server (and api_like_OAI.py) it works for a bit, but then, seemingly when --ctx-size is exceeded, it gets into an infinite loop of context shifts:

I have mostly seen:

slot 0: context shift - n_keep = 4092, n_left = 2, n_discard = 1

but am currently looking at:

slot 0: context shift - n_keep = 2224, n_left = 3894, n_discard = 1947

It just keeps repeating this at near-full GPU usage without ever continuing. I have to restart the server.

Environment and Context

I've seen this happen both on the Windows (llama-b1492-bin-win-cublas-cu12.2.0-x64.zip) host as well as on WSL2 (tag b1492, make LLAMA_CUBLAS=1), with:

server -t 16 -m deepseek-coder-33b-instruct.Q4_K_S.gguf -c 6120 --timeout 30 -ngl 65

Note that these are the several-times-corrected GGUFs from TheBloke, the latest at the time of writing (there was a tokenizer issue before). md5sum 19a1079a27fd5a6925a34076de8fbf74 deepseek-coder-33b-instruct.Q4_K_S.gguf

  • Physical (or virtual) hardware you are using, e.g. for Linux:

From WSL2:

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   48 bits physical, 48 bits virtual
CPU(s):                          32
On-line CPU(s) list:             0-31
Thread(s) per core:              2
Core(s) per socket:              16
Socket(s):                       1
Vendor ID:                       AuthenticAMD
CPU family:                      23
Model:                           8
Model name:                      AMD Ryzen Threadripper 2950X 16-Core Processor
Stepping:                        2
CPU MHz:                         3493.482
BogoMIPS:                        6986.96
Virtualization:                  AMD-V
Hypervisor vendor:               Microsoft
Virtualization type:             full
L1d cache:                       512 KiB
L1i cache:                       1 MiB
L2 cache:                        8 MiB
L3 cache:                        32 MiB
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full AMD retpoline, IBPB conditional, STIBP disabled, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr virt_ssbd arat npt nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload
  • Operating System, e.g. for Linux:

Linux Jorrit 5.10.43.3-microsoft-standard-WSL2 #1 SMP Wed Jun 16 23:47:55 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

  • SDK version, e.g. for Linux:
Python 3.10.13
GNU Make 4.2.1
g++ (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0

Failure Information (for bugs)

Please help provide information about the failure / bug.

Steps to Reproduce

Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.

  1. server -t 16 -m deepseek-coder-33b-instruct.Q4_K_S.gguf -c 6120 --timeout 30 -ngl 65
  2. python api_like_OAI.py --chat-prompt "You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer.\n" --user-name "\n### Instruction:\n" --ai-name "\n### Response:\n" --system-name "\n"
  3. Talk to the API and exceed the context size (I use Aider's test benchmark, which is tricky to get working, but instructions are available if interested)
  4. Infinite context shift loop

Failure Logs

ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9
{"timestamp":1699281996,"level":"INFO","function":"main","line":2267,"message":"build info","build":1492,"commit":"2833a6f"}
{"timestamp":1699281996,"level":"INFO","function":"main","line":2274,"message":"system info","n_threads":16,"n_threads_batch":-1,"total_threads":32,"system_info":"AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | "}
llama_model_loader: loaded meta data with 22 key-value pairs and 561 tensors from s:\WizardCoder34B\deepseek-coder-33b-instruct.Q4_K_S.gguf (version GGUF V3 (latest))

( ... llama_model_loader ... )

llama_model_loader: - kv   0:                       general.architecture str
llama_model_loader: - kv   1:                               general.name str
llama_model_loader: - kv   2:                       llama.context_length u32
llama_model_loader: - kv   3:                     llama.embedding_length u32
llama_model_loader: - kv   4:                          llama.block_count u32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32
llama_model_loader: - kv   7:                 llama.attention.head_count u32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32
llama_model_loader: - kv  10:                       llama.rope.freq_base f32
llama_model_loader: - kv  11:                    llama.rope.scale_linear f32
llama_model_loader: - kv  12:                          general.file_type u32
llama_model_loader: - kv  13:                       tokenizer.ggml.model str
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr
llama_model_loader: - kv  15:                      tokenizer.ggml.scores arr
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32
llama_model_loader: - kv  21:               general.quantization_version u32
llama_model_loader: - type  f32:  125 tensors
llama_model_loader: - type q4_K:  427 tensors
llama_model_loader: - type q5_K:    8 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: mismatch in special tokens definition ( 243/32256 vs 237/32256 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 32256
llm_load_print_meta: n_merges         = 31757
llm_load_print_meta: n_ctx_train      = 16384
llm_load_print_meta: n_embd           = 7168
llm_load_print_meta: n_head           = 56
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 62
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 7
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 19200
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 100000.0
llm_load_print_meta: freq_scale_train = 0.25
llm_load_print_meta: n_yarn_orig_ctx  = 16384
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = mostly Q4_K - Small
llm_load_print_meta: model params     = 33.34 B
llm_load_print_meta: model size       = 17.59 GiB (4.53 BPW)
llm_load_print_meta: general.name   = deepseek-ai_deepseek-coder-33b-instruct
llm_load_print_meta: BOS token = 32013 '<|begin▁of▁sentence|>'
llm_load_print_meta: EOS token = 32021 '<|EOT|>'
llm_load_print_meta: PAD token = 32014 '<|end▁of▁sentence|>'
llm_load_print_meta: LF token  = 30 '?'
llm_load_tensors: ggml ctx size =    0.21 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  =  124.24 MB
llm_load_tensors: offloading 62 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 65/65 layers to GPU
llm_load_tensors: VRAM used: 17891.45 MB
...................................................................................................
llama_new_context_with_model: n_ctx      = 6120
llama_new_context_with_model: freq_base  = 100000.0
llama_new_context_with_model: freq_scale = 0.25
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 1482.19 MB
llama_new_context_with_model: kv self size  = 1482.19 MB
llama_build_graph: non-view tensors processed: 1430/1430
llama_new_context_with_model: compute buffer total size = 729.96 MB
llama_new_context_with_model: VRAM scratch buffer: 723.33 MB
llama_new_context_with_model: total VRAM used: 20096.97 MB (model: 17891.45 MB, context: 2205.52 MB)
Available slots:
 -> Slot 0 - max context: 6120

llama server listening at http://0.0.0.0:8080

( ... lots of API calls ... )

print_timings: prompt eval time =     514.27 ms /   521 tokens (    0.99 ms per token,  1013.09 tokens per second)
print_timings:        eval time =    9365.17 ms /   250 runs   (   37.46 ms per token,    26.69 tokens per second)
print_timings:       total time =    9879.43 ms
slot 0 released (1119 tokens in cache)
{"timestamp":1699284174,"level":"INFO","function":"log_server_request","line":2217,"message":"request","remote_addr":"172.22.146.2","remote_port":57682,"status":200,"method":"POST","path":"/completion","params":{}}
slot 0 is processing [task id: 204]
slot 0 : in cache: 347 tokens | to process: 934 tokens
slot 0 : kv cache rm - [347, end)

print_timings: prompt eval time =     845.49 ms /   934 tokens (    0.91 ms per token,  1104.68 tokens per second)
print_timings:        eval time =   13463.77 ms /   352 runs   (   38.25 ms per token,    26.14 tokens per second)
print_timings:       total time =   14309.26 ms
slot 0 released (1634 tokens in cache)
{"timestamp":1699284188,"level":"INFO","function":"log_server_request","line":2217,"message":"request","remote_addr":"172.22.146.2","remote_port":57686,"status":200,"method":"POST","path":"/completion","params":{}}
slot 0 is processing [task id: 205]
slot 0 : in cache: 336 tokens | to process: 1888 tokens
slot 0 : kv cache rm - [336, end)
slot 0: context shift - n_keep = 2224, n_left = 3894, n_discard = 1947
slot 0: context shift - n_keep = 2224, n_left = 3894, n_discard = 1947
slot 0: context shift - n_keep = 2224, n_left = 3894, n_discard = 1947
slot 0: context shift - n_keep = 2224, n_left = 3894, n_discard = 1947
slot 0: context shift - n_keep = 2224, n_left = 3894, n_discard = 1947
slot 0: context shift - n_keep = 2224, n_left = 3894, n_discard = 1947
slot unavailable
{"timestamp":1699284790,"level":"INFO","function":"log_server_request","line":2217,"message":"request","remote_addr":"172.22.146.2","remote_port":57694,"status":404,"method":"POST","path":"/completion","params":{}}
slot unavailable
{"timestamp":1699284790,"level":"INFO","function":"log_server_request","line":2217,"message":"request","remote_addr":"172.22.146.2","remote_port":57698,"status":404,"method":"POST","path":"/completion","params":{}}
slot unavailable
{"timestamp":1699284791,"level":"INFO","function":"log_server_request","line":2217,"message":"request","remote_addr":"172.22.146.2","remote_port":57702,"status":404,"method":"POST","path":"/completion","params":{}}
slot unavailable
{"timestamp":1699284797,"level":"INFO","function":"log_server_request","line":2217,"message":"request","remote_addr":"172.22.146.2","remote_port":57706,"status":404,"method":"POST","path":"/completion","params":{}}
slot unavailable
{"timestamp":1699284803,"level":"INFO","function":"log_server_request","line":2217,"message":"request","remote_addr":"172.22.146.2","remote_port":57710,"status":404,"method":"POST","path":"/completion","params":{}}
slot unavailable
{"timestamp":1699284833,"level":"INFO","function":"log_server_request","line":2217,"message":"request","remote_addr":"172.22.146.2","remote_port":57714,"status":404,"method":"POST","path":"/completion","params":{}}
slot 0: context shift - n_keep = 2224, n_left = 3894, n_discard = 1947
slot unavailable
{"timestamp":1699284864,"level":"INFO","function":"log_server_request","line":2217,"message":"request","remote_addr":"172.22.146.2","remote_port":57718,"status":404,"method":"POST","path":"/completion","params":{}}
slot unavailable
{"timestamp":1699284900,"level":"INFO","function":"log_server_request","line":2217,"message":"request","remote_addr":"172.22.146.2","remote_port":57722,"status":404,"method":"POST","path":"/completion","params":{}}
slot 0: context shift - n_keep = 2224, n_left = 3894, n_discard = 1947
slot unavailable
{"timestamp":1699284975,"level":"INFO","function":"log_server_request","line":2217,"message":"request","remote_addr":"172.22.146.2","remote_port":57726,"status":404,"method":"POST","path":"/completion","params":{}}
slot 0: context shift - n_keep = 2224, n_left = 3894, n_discard = 1947
slot 0: context shift - n_keep = 2224, n_left = 3894, n_discard = 1947
(repeats forever)
@mudler
Contributor

mudler commented Nov 25, 2023

I can confirm this issue: I'm sporadically getting the same here with some models, especially when using grammars. However, it also seems to happen without grammars, with plain text only.

I can hit it programmatically if I use grammars with a bunch of requests running in sequence.

@e-ago

e-ago commented Nov 29, 2023

I constantly see this error using the phind-codellama-34b-v2.Q5_K_M.gguf model.
Is there a workaround? Or should we just wait for the fix?

@SteveC

SteveC commented Dec 6, 2023

Seeing this today with Mistral 7B, on or off GPU, latest code.

@chrism-qmul

Same issue here with llama-2-70b-chat

@greenfoo

Another confirmation: this time with deepseek-coder-6.7b-instruct.Q5_K_M.gguf

@mudler
Contributor

mudler commented Dec 16, 2023

Another reproducer seems to be TinyLlama: mudler/LocalAI#1447 (comment)

@l4b4r4b4b4

l4b4r4b4b4 commented Jan 2, 2024

Hmm, I have experienced this issue as well in the past.
I have the feeling it is connected to the context containing certain special characters that perplex the respective model.

/EDIT
Setting timeouts both on the call and on the server seems to prevent all slots from getting jammed and the server from completely stalling when doing repeated or concurrent calls.
/EDIT 2
One thing I just came across is that I had introduced too many newline characters (\n) in the instruction prompt (using Sauerkraut-Mixtral-Instruct 8_0 GGUF). The model especially does not like consecutive \n\n characters.

@countzero

I can reproduce the problem when using the parallel request feature of the server with 10 parallel processing slots:

1:02PM DBG GRPC(dolphin-2_2-yi-34b.Q4_K_M.gguf-127.0.0.1:36141): stderr slot 9: context shift - n_keep = 0, n_left = 407, n_discard = 203
1:02PM DBG GRPC(dolphin-2_2-yi-34b.Q4_K_M.gguf-127.0.0.1:36141): stderr slot 9: context shift - n_keep = 0, n_left = 407, n_discard = 203
1:02PM DBG GRPC(dolphin-2_2-yi-34b.Q4_K_M.gguf-127.0.0.1:36141): stderr slot 9: context shift - n_keep = 0, n_left = 407, n_discard = 203
1:02PM DBG GRPC(dolphin-2_2-yi-34b.Q4_K_M.gguf-127.0.0.1:36141): stderr slot 9: context shift - n_keep = 0, n_left = 407, n_discard = 203

After setting the number of processing slots to 1, the bug no longer seems to be present.

@tihanyi

tihanyi commented Jan 13, 2024

I could also reproduce it with a server using a single slot, when the model generated content that exceeded the context size, which may happen (rarely) if no stop symbol is generated. But there seems to be an easy way to avoid it: define the maximum number of tokens to generate in the request using the "n_predict" parameter (which is not used or mentioned in the above examples).
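
For illustration, a minimal Python sketch of that workaround, assuming a llama.cpp server listening on localhost:8080; the prompt text and the 512-token cap are placeholders, not values from this issue:

import requests

# Cap generation via n_predict so a missing stop symbol cannot push the
# slot past its context size; keep prompt + n_predict below the slot's context.
payload = {
    "prompt": "Translate to French: decisions on the acceptance of new work items",
    "n_predict": 512,   # placeholder cap
    "stop": ["\n"],
    "temperature": 0,
}

r = requests.post("http://localhost:8080/completion", json=payload, timeout=120)
# the /completion endpoint returns the generated text in the "content" field
print(r.json().get("content"))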

@ggerganov
Owner

Can someone with a repro check if the following patch resolves the issue:

diff --git a/examples/server/server.cpp b/examples/server/server.cpp
index 79eacf82..2d97f8ab 100644
--- a/examples/server/server.cpp
+++ b/examples/server/server.cpp
@@ -1680,7 +1680,7 @@ struct llama_server_context
             {
                 // Shift context
                 const int n_left    = slot.n_past - slot.params.n_keep - 1;
-                const int n_discard = n_left / 2;
+                const int n_discard = std::min(n_left, 32);
 
                 LOG_TEE("slot %d: context shift - n_keep = %d, n_left = %d, n_discard = %d\n", slot.id, slot.params.n_keep, n_left, n_discard);
                 llama_kv_cache_seq_rm   (ctx, slot.id, slot.params.n_keep + 1            , slot.params.n_keep + n_discard + 1);
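
For reference, a toy reconstruction of the pre-patch shift arithmetic, checked against the numbers logged at the top of this issue (an illustration only, not the server code):

# Toy model of the pre-patch context-shift arithmetic (n_discard = n_left / 2),
# not the actual server implementation.
def context_shift(n_past, n_keep):
    n_left = n_past - n_keep - 1
    n_discard = n_left // 2
    return n_left, n_discard

# Values reported in the logs earlier in this issue:
print(context_shift(4095, 4092))  # -> (2, 1):      n_keep = 4092, n_left = 2
print(context_shift(6119, 2224))  # -> (3894, 1947): n_keep = 2224, n_left = 3894

The identical, repeating log lines suggest the shift keeps being re-triggered while generation never stops, which matches the n_predict discussion further down in this thread.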

@tihanyi

tihanyi commented Jan 13, 2024

Sorry, but the patch has not resolved the issue for me.
Here is a simple example of how to reproduce it:
#server:
./server -m llama-2-7b.Q5_K_S.gguf --n-gpu-layers 33 --ctx-size 2048 --parallel 1
#client:
curl --request POST --url http://localhost:8080/completion --header "Content-Type: application/json" --data '{ "prompt": "from English into French.\tEnglish: decisions on the acceptance of new work items that are needed for the fulfilment of the standardisation request; and => French: les décisions relatives à l’acceptation de nouvelles tâches qui sont nécessaires à l’exécution de la demande de normalisation; et\tEnglish: decisions on the acceptance of new work items that are needed for the fulfilment of the standardisation request; => French: ","stop": ["\n"], "temperature": 0, "seed": 42, "slot_id": 0}'

server log:
....
{"timestamp":1705177449,"level":"INFO","function":"main","line":3224,"message":"HTTP server listening","port":"8080","hostname":"127.0.0.1"}
all slots are idle and system prompt is empty, clear the KV cache
slot 0 is processing [task id: 0]
slot 0 : kv cache rm - [0, end)
slot 0: context shift - n_keep = 0, n_left = 2046, n_discard = 32
and this line is repeating infinitely...

@hiepxanh

This is my client code calling the server, using TypeScript; very simple:

import { OpenAI, ChatOpenAI } from "@langchain/openai";
import { HumanMessage, SystemMessage } from "@langchain/core/messages";

async function main4() {
    const model = new ChatOpenAI({
        openAIApiKey: "YOUR-API-KEY", // In Node.js defaults to process.env.OPENAI_API_KEY
        configuration: {
            // baseURL: "http://localhost:5001/v1",
            baseURL: "http://127.0.0.1:8080/v1", // llamafile
        },
        temperature: 0.9,
    });
    const res = await model.invoke([new HumanMessage("xin chào?")]);
    console.log({ res });
}

main4();

Using llamafile 0.6 with tinyBLAS, it works if the request comes from the server UI on localhost. But it instantly gets stuck in this loop if the request comes from the TS server.

Using koboldcpp-rocm with CLBlast, it works with no issue. No infinite generation! Very weird.

Since it is built on top of llama.cpp, I guess some kind of parameter causes this issue, not the content or the model itself. Do you have any clue? If the bug comes from the TS client, it must be some issue with the payload or config. Maybe I can change a parameter to test? @ggerganov

This is the OpenAI example:

const response = await model.call("Tell me a joke.", {
 callbacks: [
   {
     handleLLMNewToken(token: string) {
       console.log({ token });
     },
   },
 ],
});
console.log(response);
/*
{ token: '\n' }
{ token: '\n' }
{ token: 'Q' }
{ token: ':' }
{ token: ' Why' }
{ token: ' did' }
{ token: ' the' }
{ token: ' chicken' }

This is my result with llamafile, where I get infinite generation:


 const response = await model.invoke("Tell me a joke.", {
        callbacks: [
            {
                handleLLMNewToken(token: string) {
                    console.log({ token });
                },
            },
        ],
    });
    console.log(response);

// then result is:
{ token: '' }
{ token: 'Why' }
{ token: ' don' }
{ token: "'" }
{ token: 't' }
{ token: ' scient' }
{ token: 'ists' }
{ token: ' trust' }
{ token: ' atoms' }
{ token: '?' }
{ token: '\n' }
{ token: 'B' }
{ token: 'ecause' }
{ token: ' they' }
{ token: ' make' }
{ token: ' up' }
{ token: ' everything' }
{ token: '.' }
{ token: ' ' }
{ token: '\n' }
{ token: '\n' }
{ token: ' ' }
{ token: '\n' }
{ token: '\n' }
{ token: ' An' }
{ token: 'yon' }
{ token: 'e' }
{ token: ' else' }
{ token: ' is' }
{ token: ' as' }
{ token: 'sis' }
{ token: 'ting' }
{ token: ' this' }
{ token: ' user' }
{ token: ' today' }
{ token: '?' }
{ token: ' ' }
{ token: '\n' }
{ token: '\n' }
{ token: ' ' }

(continues infinitely)

server log:

{"timestamp":1705548276,"level":"INFO","function":"log_server_request","line":2741,"message":"request","remote_addr":"","remote_port":-1,"status":200,"method":"POST","path":"/v1/chat/completions","params":{}}
slot 0 released (395 tokens in cache)
slot 0 is processing [task id: 4]
slot 0 : kv cache rm - [0, end)
slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255
slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255
slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255

Using the llamafile server with --verbose to see what is happening:

The llamafile result

{
    "timestamp": 1705547220,
    "level": "VERBOSE",
    "function": "process_token",
    "line": 1123,
    "message": "next token",
    "token": 2659,
    "token_text": "User",
    "has_next_token": true,
    "n_remain": 389,
    "num_tokens_predicted": 11,
    "stopped_eos": false,
    "stopped_word": false,
    "stopped_limit": false,
    "stopping_word": ""
}
{
    "timestamp": 1705547220,
    "level": "VERBOSE",
    "function": "operator()",
    "line": 2902,
    "message": "data stream",
    "to_send": "data: {\"content\":\"\",\"multimodal\":false,\"slot_id\":0,\"stop\":false}\n\n"
}
{
    "timestamp": 1705547220,
    "level": "VERBOSE",
    "function": "process_token",
    "line": 1123,
    "message": "next token",
    "token": 29901,
    "token_text": ":",
    "has_next_token": false,
    "n_remain": 389,
    "num_tokens_predicted": 12,
    "stopped_eos": false,
    "stopped_word": true,
    "stopped_limit": false,
    "stopping_word": "User:"
}
{
    "timestamp": 1705547220,
    "level": "VERBOSE",
    "function": "operator()",
    "line": 2902,
    "message": "data stream",
    "to_send": "data: {\"content\":\"\",\"multimodal\":false,\"slot_id\":0,\"stop\":false}\n\n"
}

print_timings: prompt eval time =     108.71 ms /    56 tokens (    1.94 ms per token,
515.14 tokens per second)
print_timings:        eval time =     209.11 ms /    12 runs   (   17.43 ms per token,
57.39 tokens per second)
print_timings:       total time =     317.82 ms
slot 0 released (69 tokens in cache)

The server result with the infinite generation:

{
        "timestamp": 1705548561,
        "level": "VERBOSE",
        "function": "process_token",
        "line": 1123,
        "message": "next token",
        "token": 32225,
        "token_text": " nhất",
        "has_next_token": true,
        "n_remain": -1,
        "num_tokens_predicted": 11,
        "stopped_eos": false,
        "stopped_word": false,
        "stopped_limit": false,
        "stopping_word": ""
    }
{
        "timestamp": 1705548561,
        "level": "VERBOSE",
        "function": "operator()",
        "line": 3001,
        "message": "data stream",
        "to_send": "data: {\"choices\":[{\"delta\":{\"content\":\" nhất\"},\"finish_reason\":null,\"index\":0}],\"created\":1705548561,\"id\":\"chatcmpl-JOrfRbw7MP12UUAp5QUANpsJgFE2dyur\",\"model\":\"gpt-3.5-turbo\",\"object\":\"chat.completion.chunk\"}\n\n"
    }
{
        "timestamp": 1705548561,
        "level": "VERBOSE",
        "function": "process_token",
        "line": 1123,
        "message": "next token",
        "token": 29901,
        "token_text": ":",
        "has_next_token": true,
        "n_remain": -1,
        "num_tokens_predicted": 12,
        "stopped_eos": false,
        "stopped_word": false,
        "stopped_limit": false,
        "stopping_word": ""
    }
{
        "timestamp": 1705548561,
        "level": "VERBOSE",
        "function": "operator()",
        "line": 3001,
        "message": "data stream",
        "to_send": "data: {\"choices\":[{\"delta\":{\"content\":\":\"},\"finish_reason\":null,\"index\":0}],\"created\":1705548561,\"id\":\"chatcmpl-IQds0zxRfOLB26PZNFzI7n4Jp8PiEvyt\",\"model\":\"gpt-3.5-turbo\",\"object\":\"chat.completion.chunk\"}\n\n"
    }
{
        "timestamp": 1705548561,
        "level": "VERBOSE",
        "function": "process_token",
        "line": 1123,
        "message": "next token",
        "token": 34413,
        "token_text": " rửa",
        "has_next_token": true,
        "n_remain": -1,
        "num_tokens_predicted": 13,
        "stopped_eos": false,
        "stopped_word": false,
        "stopped_limit": false,
        "stopping_word": ""
    }

llamafile-result.json

server-result.json

The only difference is that to_send has data:

 "to_send": "data: {\"choices\":[{\"delta\":{\"content\":\":\"},\"finish_reason\":null,\"index\":0}],\"created\":1705548561,\"id\":\"chatcmpl-IQds0zxRfOLB26PZNFzI7n4Jp8PiEvyt\",\"model\":\"gpt-3.5-turbo\",\"object\":\"chat.completion.chunk\"}\n\n"
    }

@riddlegit

riddlegit commented Jan 18, 2024

Same problem here, running openchat-3.5-1210 Q8_0 with 4 slots on a Mac M1.

@hiepxanh

hiepxanh commented Jan 20, 2024

For everyone else having this issue: can you test with another model, such as TheBloke/dolphin-2_6-phi-2.Q8_0.gguf?
After changing to another model, I don't see this issue happen.

P.S.: I still have this issue; it looks like it happens randomly.

@diegottt

diegottt commented Feb 1, 2024

The same infinite loop with NeuralBeagle and LocalAI 2.7.0.

@countzero

This bug only appears if a request slot exceeds its available context size. We simply worked around this problem by using a model with a context size that fits our use cases.

We ran into this bug quite often, because we did not understand the implications of using --parallel, --cont-batching and --ctx-size correctly. This explanation by @ggerganov helped a lot: #4130 (comment)

So the bug is still there and will (sometimes) be triggered by exceeding the available context size of a request slot. This can be reproduced "reliably" by loading a model with --ctx-size=2048, --parallel=10 and --cont-batching so that each request slot only has a context size of 204 tokens. Then requesting the server with multiple prompts > 204 tokens will trigger the infinite loop of "context shift" bug.
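
To make the arithmetic behind that 204-token figure explicit, a small sketch (the prompt length below is a made-up example):

# With continuous batching, --ctx-size is divided across the parallel slots.
ctx_size = 2048
parallel = 10
ctx_per_slot = ctx_size // parallel  # 204 tokens per request slot

prompt_tokens = 300  # hypothetical prompt longer than one slot's context
if prompt_tokens > ctx_per_slot:
    print("prompt alone exceeds the slot's context -> endless 'context shift'")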

mudler added a commit to mudler/LocalAI that referenced this issue Feb 13, 2024
An infinite context loop might also trigger an infinite loop of context
shifting if the model hallucinates and does not stop answering.
This has the unpleasant effect that the prediction never terminates,
which is especially the case for small models, which tend to hallucinate.

Works around #1333 by removing
context-shifting.

See also upstream issue: ggerganov/llama.cpp#3969
@mudler
Contributor

mudler commented Feb 13, 2024

This bug only appears if a request slot exceeds its available context size. We simply worked around this problem by using a model with a context size that fits our use cases.

We ran into this bug quite often, because we did not understand the implications of using --parallel, --cont-batching and --ctx-size correctly. This explanation by @ggerganov helped a lot: #4130 (comment)

So the bug is still there and will (sometimes) be triggered by exceeding the available context size of a request slot. This can be reproduced "reliably" by loading a model with --ctx-size=2048, --parallel=10 and --cont-batching so that each request slot only has a context size of 204 tokens. Then requesting the server with multiple prompts > 204 tokens will trigger the infinite loop of "context shift" bug.

It is really easy to trigger this bug now: just set a very small context size (I did it here by running phi-2 and specifying a context size of 10) with a prompt that does not follow what the model was fine-tuned on; that will likely put the model in a condition to hallucinate and keep going forever.

The same infinite loop with neauralbeagle and localai 2.7.0

@diegottt this is going to be worked around in LocalAI in the next releases (by disabling context shifting entirely)

@phymbert
Collaborator

phymbert commented Feb 18, 2024

Sorry, but the patch has not resolved the issue for me. Here is a simple example how to generate: #server: ./server -m llama-2-7b.Q5_K_S.gguf --n-gpu-layers 33 --ctx-size 2048 --parallel 1 #client: curl --request POST --url http://localhost:8080/completion --header "Content-Type: application/json" --data '{ "prompt": "from English into French.\tEnglish: decisions on the acceptance of new work items that are needed for the fulfilment of the standardisation request; and => French: les décisions relatives à l’acceptation de nouvelles tâches qui sont nécessaires à l’exécution de la demande de normalisation; et\tEnglish: decisions on the acceptance of new work items that are needed for the fulfilment of the standardisation request; => French: ","stop": ["\n"], "temperature": 0, "seed": 42, "slot_id": 0}'

server log: .... {"timestamp":1705177449,"level":"INFO","function":"main","line":3224,"message":"HTTP server listening","port":"8080","hostname":"127.0.0.1"} all slots are idle and system prompt is empty, clear the KV cache slot 0 is processing [task id: 0] slot 0 : kv cache rm - [0, end) slot 0: context shift - n_keep = 0, n_left = 2046, n_discard = 32 and this line is repeating infinitely...

@ggerganov as a workaround, it's possible to hard cap the maximum tokens to be generated with #5549 and stop the infinite loop:

./server -m llama-2-7b.Q5_K_S.gguf --n-gpu-layers 33 --ctx-size 2048 --parallel 1 --n-predict 2048

Prompt:
curl --request POST --url http://localhost:8080/completion --header "Content-Type: application/json" --data '{ "prompt": "from English into French.\tEnglish: decisions on the acceptance of new work items that are needed for the fulfilment of the standardisation request; and => French: les décisions relatives à l’acceptation de nouvelles tâches qui sont nécessaires à l’exécution de la demande de normalisation; et\tEnglish: decisions on the acceptance of new work items that are needed for the fulfilment of the standardisation request; => French: ","stop": ["\n"], "temperature": 0, "seed": 42, "slot_id": 0}'

Logs:

slot 0 is processing [task id: 0]
slot 0 : kv cache rm - [0, end)
..
slot 0: context shift - n_keep = 0, n_left = 2046, n_discard = 1023
..
print_timings: prompt eval time =     990.36 ms /   101 tokens (    9.81 ms per token,   101.98 tokens per second)
print_timings:        eval time = 1714829.03 ms /  2048 runs   (  837.32 ms per token,     1.19 tokens per second)
print_timings:       total time = 1715819.39 ms
slot 0 released (1126 tokens in cache)

@tihanyi could you please confirm?

@phymbert
Collaborator

phymbert commented Feb 23, 2024

The user can set the --n-predict option to cap the number of tokens any completion request can generate, or pass n_predict/max_tokens in the request body. Otherwise an infinite-loop scenario can occur if the model hallucinates and does not stop answering.

I am closing the issue. I have documented this in a wrong_usage.feature scenario, but maybe the default --n-predict should be set to --ctx-size.

Feel free to reopen if I miss something here.

Note: I did not test the --timeout option behavior on infinite generation.

phymbert added a commit that referenced this issue Feb 24, 2024
* server: tests: init scenarios
 - health and slots endpoints
 - completion endpoint
 - OAI compatible chat completion requests w/ and without streaming
 - completion multi users scenario
 - multi users scenario on OAI compatible endpoint with streaming
 - multi users with total number of tokens to predict exceeds the KV Cache size
 - server wrong usage scenario, like in Infinite loop of "context shift" #3969
 - slots shifting
 - continuous batching
 - embeddings endpoint
 - multi users embedding endpoint: Segmentation fault #5655
 - OpenAI-compatible embeddings API
 - tokenize endpoint
 - CORS and api key scenario

* server: CI GitHub workflow


---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
@countzero

[...] maybe the default --n-predict must be set to --ctx-size.

@phymbert That would not fix the problem, because the bug is caused by overflowing the context window of the model, which has to hold the prompt tokens plus the predicted tokens.
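
A rough illustration of that point with made-up numbers: even if n_predict were capped at the context size, the window still has to hold the prompt as well.

ctx_size = 2048          # --ctx-size
prompt_tokens = 1500     # hypothetical long prompt
n_predict = ctx_size     # proposed default cap

# The KV cache must hold prompt + generated tokens, so this can still overflow.
if prompt_tokens + n_predict > ctx_size:
    print("overflow: capping n_predict alone ignores the prompt length")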

@phymbert
Collaborator

[...] maybe the default --n-predict must be set to --ctx-size.

@phymbert That would not fix the problem because the bug is caused by overflowing the context window of a model which holds the prompt tokens plus the predicted tokens.

Noted. It would be nice if you could add a scenario to the server test framework.

jordankanter pushed a commit to jordankanter/llama.cpp that referenced this issue Mar 13, 2024
@zhouwg
Contributor

zhouwg commented Mar 28, 2024

Same issue here with qwen1_5-1_8b-chat-q4_0.gguf, blossom-v3-baichuan2-7b.Q4_K_M.gguf, and other models on a Xiaomi 14.

@hiepxanh

I guess it's not a model issue: with the same model, Vulkan hangs while ROCm still works. It looks like a GPU/backend issue.

hodlen pushed a commit to hodlen/llama.cpp that referenced this issue Apr 1, 2024