Bug: Phi3 output <|end|> randomly #8291

Closed
RunningLeon opened this issue Jul 4, 2024 · 11 comments
Labels
bug-unconfirmed, low severity (used to report low severity bugs in llama.cpp, e.g. cosmetic issues, non-critical UI glitches)

Comments

Contributor

RunningLeon commented Jul 4, 2024

What happened?

Outputs from microsoft/Phi-3-mini-128k-instruct randomly contain <|end|>.
Here is an example:

<s>Transcript of a dialog, where the User interacts with an Assistant named Bob. Bob is helpful, kind, honest, good at writing, and never fails to answer the User's requests immediately and with precision.

User: Hello, Bob.
Bob: Hello. How may I help you today?
User: Please tell me the largest city in Europe.
Bob: Sure. The largest city in Europe is Moscow, the capital of Russia.
User:who are you
Bob: I am Bob, here to assist you with any questions or information you need. How may I assist you further?
User:do you speak Chinese
Bob: I am equipped to communicate in various languages, including Chinese. How may I assist you further?
User:can you answer me in Chinese?
Bob: 当然,我可以用中文回答您的问题。请告诉我您需要哪方面的帮助。
User:who is mozedong
Bob: There seems to be some confusion. There's a popular YouTuber known as "MrBeast" or "Mysteryguy". If you're referring to someone else, could you provide a bit more context?

User:who is Biden
Bob: Joe Biden is an American politician who serves as the 46th president of the United States since January 20, 2021. Before his presidency, Biden served as the 47th vice president of the United States from 2009 to 2017 under President Barack Obama.

User:can Biden win?
Bob: As a language model AI developed by Microsoft, I don't have the ability to predict or forecast political outcomes. For such information, it would be best to refer to the most recent and reliable political analysis.<|end|>
ok
Bob: You're welcome! If you have any other questions or need further clarification, feel free to ask.<|end|>
summarize our conversation
Bob: Absolutely. We discussed a range of topics, from geographical facts such as the largest city in Europe being Moscow, to language queries, including Chinese and English. We also touched on political figures such as Joe Biden and addressed hypothetical questions.<|end|>


Name and Version

convert

python3 convert-hf-to-gguf.py microsoft/Phi-3-mini-128k-instruct --outfile ./Phi-3-mini-128k-instruct-fp16.gguf --outtype f16 --model-name Phi-3-mini-128k-instruct-fp16

run

./build/bin/main -m ./Phi-3-mini-128k-instruct-fp16.gguf -n 256 -ngl 999 --color -i -r "User:" -f prompts/chat-with-bob.txt

version

version: 1 (917dc8c)
built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

x86_64-linux

Relevant log output

Log start
main: build = 1 (917dc8c)
main: built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0 for x86_64-linux-gnu
main: seed  = 1720080938
llama_model_loader: loaded meta data with 24 key-value pairs and 195 tensors from ./Phi-3-mini-128k-instruct-fp16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi3
llama_model_loader: - kv   1:                               general.name str              = Phi3
llama_model_loader: - kv   2:                        phi3.context_length u32              = 131072
llama_model_loader: - kv   3:                      phi3.embedding_length u32              = 3072
llama_model_loader: - kv   4:                   phi3.feed_forward_length u32              = 8192
llama_model_loader: - kv   5:                           phi3.block_count u32              = 32
llama_model_loader: - kv   6:                  phi3.attention.head_count u32              = 32
llama_model_loader: - kv   7:               phi3.attention.head_count_kv u32              = 32
llama_model_loader: - kv   8:      phi3.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv   9:                  phi3.rope.dimension_count u32              = 96
llama_model_loader: - kv  10:                          general.file_type u32              = 1
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32064]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32064]   = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32064]   = [3, 3, 4, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 32000
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 32000
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  22:                    tokenizer.chat_template str              = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv  23:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type  f16:  130 tensors
llm_load_vocab: special tokens definition check successful ( 323/32064 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = phi3
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32064
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 96
llm_load_print_meta: n_embd_head_k    = 96
llm_load_print_meta: n_embd_head_v    = 96
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 3072
llm_load_print_meta: n_embd_v_gqa     = 3072
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 3B
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 3.82 B
llm_load_print_meta: model size       = 7.12 GiB (16.00 BPW) 
llm_load_print_meta: general.name     = Phi3
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 32000 '<|endoftext|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 32000 '<|endoftext|>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: EOT token        = 32007 '<|end|>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
  Device 1: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
llm_load_tensors: ggml ctx size =    0.33 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =   187.88 MiB
llm_load_tensors:      CUDA0 buffer size =  6048.66 MiB
llm_load_tensors:      CUDA1 buffer size =  1051.98 MiB
.....................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   168.00 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =    24.00 MiB
llama_new_context_with_model: KV self size  =  192.00 MiB, K (f16):   96.00 MiB, V (f16):   96.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.12 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
llama_new_context_with_model:      CUDA0 compute buffer size =   104.01 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =   110.02 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    10.02 MiB
llama_new_context_with_model: graph nodes  = 1286
llama_new_context_with_model: graph splits = 3

system_info: n_threads = 128 / 128 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
main: interactive mode on.
Reverse prompt: 'User:'
sampling: 
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 512, n_batch = 2048, n_predict = 256, n_keep = 1


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.
RunningLeon added the bug-unconfirmed and low severity labels on Jul 4, 2024
Contributor Author

RunningLeon commented Jul 4, 2024

This happens on internlm2 too, as mentioned in the Hugging Face discussion.

@RunningLeon
Contributor Author

It seems gemma-7b-it also has this issue, randomly.

script

./build/bin/main -m ./gemma-7b-it.gguf -n 256 -ngl 999 --color -i -r "User:" -f prompts/chat-with-bob.txt

log

main: build = 1 (917dc8c)
main: built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0 for x86_64-linux-gnu
main: seed  = 1720086337
llama_model_loader: loaded meta data with 30 key-value pairs and 254 tensors from ./gemma-7b-it.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma
llama_model_loader: - kv   1:                               general.name str              = gemma-7b-it
llama_model_loader: - kv   2:                       gemma.context_length u32              = 8192
llama_model_loader: - kv   3:                     gemma.embedding_length u32              = 3072
llama_model_loader: - kv   4:                          gemma.block_count u32              = 28
llama_model_loader: - kv   5:                  gemma.feed_forward_length u32              = 24576
llama_model_loader: - kv   6:                 gemma.attention.head_count u32              = 16
llama_model_loader: - kv   7:              gemma.attention.head_count_kv u32              = 16
llama_model_loader: - kv   8:     gemma.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv   9:                 gemma.attention.key_length u32              = 256
llama_model_loader: - kv  10:               gemma.attention.value_length u32              = 256
llama_model_loader: - kv  11:                          general.file_type u32              = 1
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,256000]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  15:                      tokenizer.ggml.scores arr[f32,256000]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  19:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  21:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  22:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  23:                    tokenizer.chat_template str              = {% if messages[0]['role'] == 'system'...
llama_model_loader: - kv  24:             tokenizer.ggml.prefix_token_id u32              = 67
llama_model_loader: - kv  25:             tokenizer.ggml.suffix_token_id u32              = 69
llama_model_loader: - kv  26:             tokenizer.ggml.middle_token_id u32              = 68
llama_model_loader: - kv  27:                tokenizer.ggml.eot_token_id u32              = 107
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {% if messages[0]['role'] == 'system'...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   57 tensors
llama_model_loader: - type  f16:  197 tensors
llm_load_vocab: mismatch in special tokens definition ( 416/256000 vs 260/256000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = gemma
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 256000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_head           = 16
llm_load_print_meta: n_head_kv        = 16
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_rot            = 192
llm_load_print_meta: n_embd_head_k    = 256
llm_load_print_meta: n_embd_head_v    = 256
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 24576
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 8.54 B
llm_load_print_meta: model size       = 15.90 GiB (16.00 BPW) 
llm_load_print_meta: general.name     = gemma-7b-it
llm_load_print_meta: BOS token        = 2 '<bos>'
llm_load_print_meta: EOS token        = 1 '<eos>'
llm_load_print_meta: UNK token        = 3 '<unk>'
llm_load_print_meta: PAD token        = 0 '<pad>'
llm_load_print_meta: LF token         = 227 '<0x0A>'
llm_load_print_meta: PRE token        = 67 '<unused60>'
llm_load_print_meta: SUF token        = 69 '<unused62>'
llm_load_print_meta: MID token        = 68 '<unused61>'
llm_load_print_meta: EOT token        = 107 '<end_of_turn>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
  Device 1: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
llm_load_tensors: ggml ctx size =    0.39 MiB
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors:        CPU buffer size =  1500.00 MiB
llm_load_tensors:      CUDA0 buffer size =  7920.35 MiB
llm_load_tensors:      CUDA1 buffer size =  8364.32 MiB
.....................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   120.00 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =   104.00 MiB
llama_new_context_with_model: KV self size  =  224.00 MiB, K (f16):  112.00 MiB, V (f16):  112.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.98 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
llama_new_context_with_model:      CUDA0 compute buffer size =   136.01 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =   534.02 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    10.02 MiB
llama_new_context_with_model: graph nodes  = 931
llama_new_context_with_model: graph splits = 3

system_info: n_threads = 128 / 128 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
main: interactive mode on.
Reverse prompt: 'User:'
sampling: 
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 512, n_batch = 2048, n_predict = 256, n_keep = 1

outputs

== Running in interactive mode. ==

  • Press Ctrl+C to interject at any time.
  • Press Return to return control to LLaMa.
  • To return control without starting a new line, end your input with '/'.
  • If you want to submit another line, end your input with '\'.

Transcript of a dialog, where the User interacts with an Assistant named Bob. Bob is helpful, kind, honest, good at writing, and never fails to answer the User's requests immediately and with precision.

User: Hello, Bob.
Bob: Hello. How may I help you today?
User: Please tell me the largest city in Europe.
Bob: Sure. The largest city in Europe is Moscow, the capital of Russia.
User:who are you
Bob: I am Bob, your friendly assistant. I'm here to help you with any questions or requests you may have.
User:do you speak Chinese
Bob: I am not capable of speaking languages, but I can provide you with information about various languages. Would you like me to tell you more about that?
User:
Can you write me a short story about a cat named Luna?
Bob: Of course. Here is a story about Luna, a mischievous cat who loves to play with balls of yarn.

End of Transcript

In this transcript, the user interacts with Bob in a friendly and conversational way. Bob is always willing to help and provides accurate and concise information. He is also good at writing and is able to write a short story about a cat named Luna.

The user's tone is positive and friendly, and Bob's tone is equally friendly and helpful. The conversation is well-structured and flows smoothly. It is clear that the user and Bob are enjoying their interaction.<eos>
then do you understand Chinese
Bob: I am not capable of speaking languages, but I can provide you with information about various languages. Would you like me to tell you more about that?

This snippet shows that
Bob is not able to speak languages, but he can provide information about various languages. He offers to tell the user more about languages, but the user does not want to hear about that.<eos>

Contributor

dspasyuk commented Jul 4, 2024

@ngxson @ggerganov This is the same issue I have been seeing since version 3077 with Llama-3-Instruct. The issue can be reproduced by running this command:

../llama.cpp/llama-cli --model ../../models/Meta-Llama-3-8B-Instruct_Q5_K_S.gguf --n-gpu-layers 35 -cnv --multiline-input --chat-template llama3

And feeding it a list of questions like the following, several times:

Answer the following questions:

The day before two days after the day before tomorrow is Saturday. What day is it today?
What is the square root of 169?
Solve the equation 3y = 6y + 11 and find y.
There are two ducks in front of a duck, two ducks behind a duck, and a duck in the middle. How many ducks are there?
How many days does it take to travel from New York City to London by plane, assuming non-stop flights and average speeds?
What are the products of the chemical reaction between salicylic acid and acetic anhydride?
If five cats can catch five mice in five minutes, how long will it take one cat to catch one mouse?
Create a JS program that prints the first 100 Fibonacci numbers.

The model randomly stops generating output and will not resume a proper dialog until llama.cpp is restarted.

More on this issue here: #8253 (comment)

Collaborator

ngxson commented Jul 4, 2024

@RunningLeon This is not a bug. You're using the model the wrong way. Chat models must be used with the proper conversation mode:

./llama-cli -m ./Phi-3-mini-128k-instruct-fp16.gguf -ngl 999 -cnv

Also, you're using an old version of llama.cpp (the main binary has been removed).

@dspasyuk Unless you can confirm that the original model doesn't have this behavior, we cannot confirm whether it is a bug in llama.cpp. I'm sure the original model is not trained to answer that many questions in one turn.

ngxson closed this as not planned (won't fix, can't repro, duplicate, stale) on Jul 4, 2024
Contributor

dspasyuk commented Jul 4, 2024

@ngxson I do not need to prove anything; just run llama-cli with the standard Llama-3-Instruct model from the Meta repo (or any GGUF repo) in conversation mode with the commands I supplied, and you will see this bug pop up. I can reproduce this behavior on 3 different PCs with 3 different Linux distros. This bug has been here since version 3077: output randomly stops, and then the model either refuses to answer questions in full or outputs only a fraction of an answer.

Collaborator

ngxson commented Jul 4, 2024

Can you check whether the original model does the same? (i.e. run it with Python transformers)

I can reproduce this behavior on 3 different PCs with 3 different Linux distros.

Then can you post the main.log? Results may also differ between CPU and GPU. It's best to isolate the problem instead of just saying "it doesn't work".

And again, if the problem persists everywhere, then maybe that's the behavior of the original model, not llama.cpp.

This bug has been here since version 3077

Are there any logs from a version before 3077?
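
For reference, the "run it with Python transformers" check suggested above could look something like the following sketch. The model id, prompt, and generation settings here are illustrative assumptions, not taken from the thread:

# Sketch: check whether the original HF model emits a literal <|end|> mid-turn.
# Model id, prompt, and settings are assumptions for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-128k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True
)

messages = [{"role": "user", "content": "Who is Biden?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256, do_sample=False)
# Decode without skipping special tokens so a stray <|end|> stays visible.
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=False))

If the transformers output never shows <|end|> in the middle of a turn while the GGUF run does, that would point at the conversion or at the handling of the EOT token rather than at the model itself.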

Collaborator

ngxson commented Jul 4, 2024

@dspasyuk I think either your model is broken, or GPU support has some problem. On the latest master branch it answers all the questions for me (running on CPU):

$ make llama-cli -j && ./llama-cli -m ../models/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf -cnv -c 4096 -p "You are a helpful assistant" --multiline-input

> Answer the following questions:

The day before two days after the day before tomorrow is Saturday. What day is it today?
What is the square root of 169?
Solve the equation 3y = 6y + 11 and find y.
There are two ducks in front of a duck, two ducks behind a duck, and a duck in the middle. How many ducks are there?
How many days does it take to travel from New York City to London by plane, assuming non-stop flights and average speeds?
What are the products of the chemical reaction between salicylic acid and acetic anhydride?
If five cats can catch five mice in five minutes, how long will it take one cat to catch one mouse?
Create a JS program that prints the first 100 Fibonacci numbers. 
I'd be happy to help you with that!

1. The day before two days after the day before tomorrow is Saturday.

To solve this, let's break it down:

* The day before tomorrow is Friday.
* Two days after Friday is Sunday.
* The day before Sunday is Saturday.

So, today is Friday.

2. The square root of 169 is 13, since 13 × 13 = 169.

3. Solve the equation 3y = 6y + 11.

First, subtract 6y from both sides to get:

3y - 6y = 11

This simplifies to:

-y = 11

Next, multiply both sides by -1 to get:

y = -11

So, the value of y is -11.

4. There are two ducks in front of a duck, two ducks behind a duck, and a duck in the middle.

Let's count the ducks:

* In front: 2 ducks
* Behind: 2 ducks
* Middle: 1 duck

In total, there are 2 + 2 + 1 = 5 ducks.

5. The flight duration from New York City to London depends on several factors, such as the airline, flight class, and weather conditions. However, on average, a non-stop flight from New York City (JFK or LGA) to London (LHR) takes around 7-8 hours.

6. The chemical reaction between salicylic acid (C7H6O3) and acetic anhydride (C8H8O3) is:

C7H6O3 + C8H8O3 → C6H4(OCH3)CO2CH3 + H2O

This reaction produces aspirin (C6H4(OCH3)CO2CH3) and water (H2O).

7. If five cats can catch five mice in five minutes, how long will it take one cat to catch one mouse?

Since five cats can catch five mice in five minutes, one cat can catch one mouse in the same time:

5 cats × 5 minutes = 25 minutes

Divide the total time by the number of cats to get the time it takes for one cat:

25 minutes ÷ 5 = 5 minutes

So, it will take one cat 5 minutes to catch one mouse.

8. Here's a JavaScript program that prints the first 100 Fibonacci numbers:
function fibonacci(n) {
  let fib = [0, 1];
  for (let i = 2; i < n; i++) {
    fib.push(fib[i-1] + fib[i-2]);
  }
  return fib;
}

const fibNumbers = fibonacci(100);
console.log(fibNumbers);

This program uses a recursive function to generate the Fibonacci sequence up to the 100th number, and then logs the results to the console.

>

Contributor

dspasyuk commented Jul 4, 2024

@ngxson Like I said, I can reproduce this bug on 3 separate systems (P4, A100, A4500 GPUs) with models converted from the Meta repo or taken from other repos. The issue is random, but if you run this questionnaire or just chat for about 5000 tokens you will see it. Keep pasting the questions and generating output and it eventually happens; only sometimes does it happen on the first run.

Collaborator

ngxson commented Jul 4, 2024

@dspasyuk If that's a problem, then I would like to fix it.

But in this case we can't even tell whether the original model is trained to behave that way or not.

I won't reply until this is confirmed. It is very time-consuming for me to answer issues without proper logging and debugging.

Contributor

dspasyuk commented Jul 4, 2024

@ngxson You are correct. The issues with sudden stopping that I have seen in past weeks are gone in the new version and in yesterday's version.

This works with no problem on multiple GPUs and CPUs for over 24k generated tokens:

../llama.cpp/llama-cli --model ../../models/Meta-Llama-3-8B-Instruct_Q4_K_S.gguf --n-gpu-layers 25 -cnv -b 2048 --ctx_size 0 --temp 0.5 --top_k 10 --multiline-input --chat-template llama3 --logdir ./


S1M0N38 commented Jul 11, 2024

I encountered the same issue when forcing a JSON schema on the /chat/completions endpoint of llama-server.

llama-server --version

version: 3368 (dd07a123)
built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.4.0

Here are the steps to reproduce it

  1. Spawn llama-server with phi3
llama-server \
    --hf-repo "bartowski/Phi-3-mini-4k-instruct-GGUF" \
    --hf-file "Phi-3-mini-4k-instruct-Q6_K.gguf" \
    --alias "phi3:3.8b-mini-4k-instruct-q6_K" \
    --flash-attn \
    --port 11434
  2. Send the following HTTP POST request
curl -X POST "http://localhost:11434/v1/chat/completions" \
     -H "Content-Type: application/json" \
     -H "Authorization: Bearer ThisIsDummyKey" \
     -d '{
          "model": "phi3:3.8b-mini-4k-instruct-q6_K",
          "seed": 0,
          "temperature": 0.0,
          "response_format": {
            "type": "json_object",
            "schema": {
              "properties": {"result": {"type": "boolean"}},
              "required": ["result"],
              "type": "object"
            }
          },
          "messages": [
            {
              "role": "system",
              "content": "Respond with random value for result. Responde with JSON"
            },
            {
              "role": "user",
              "content": "Hello!"
            }
          ],
          "stream": false
        }'

And the following (formatted) JSON is returned

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "{\"result\": true}\n                    <|end|>",
        "role": "assistant"
      }
    }
  ],
  "created": 1720736537,
  "model": "phi3:3.8b-mini-4k-instruct-q6_K",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 27,
    "prompt_tokens": 21,
    "total_tokens": 48
  },
  "id": "chatcmpl-usBcYvGujKu8REc0iXIV3jaAIGEqRCSm"
}
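
For anyone scripting the reproduction, here is a rough Python equivalent of the curl call above (a sketch only, using the requests library; the port, model alias, and payload simply mirror the command shown in the steps):

# Sketch: same /v1/chat/completions request as the curl reproduction above.
import requests

payload = {
    "model": "phi3:3.8b-mini-4k-instruct-q6_K",
    "seed": 0,
    "temperature": 0.0,
    "response_format": {
        "type": "json_object",
        "schema": {
            "properties": {"result": {"type": "boolean"}},
            "required": ["result"],
            "type": "object",
        },
    },
    "messages": [
        {"role": "system", "content": "Respond with random value for result. Respond with JSON"},
        {"role": "user", "content": "Hello!"},
    ],
    "stream": False,
}

resp = requests.post(
    "http://localhost:11434/v1/chat/completions",
    headers={"Authorization": "Bearer ThisIsDummyKey"},
    json=payload,
)
content = resp.json()["choices"][0]["message"]["content"]
print(repr(content))  # a trailing "<|end|>" in the content reproduces the bug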

Edit

As suggested by unsubscribe on Hugging Face, this problem may be related to the conversion to GGUF format. To verify this assumption, I spun up llama-server hosting the phi3:3.8b-mini-4k-instruct-q6_K provided by Ollama:

llama-server \
    --model /Users/simo/Developer/local-AI/ollama/models/blobs/sha256-da21ddd7865f62117733071fa62a3f92dadfde82d9c55804701aacc7cf72aab9 \
    --alias "phi3:3.8b-mini-4k-instruct-q6_K" \
    --flash-attn \
    --port 11434

(I got the blob filename by inspecting the Modelfile with ollama show --modelfile phi3:3.8b-mini-4k-instruct-q6_K)

Performing the same HTTP request then results in the following correct output:

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "{\"result\": true}",
        "role": "assistant"
      }
    }
  ],
  "created": 1720767212,
  "model": "phi3:3.8b-mini-4k-instruct-q6_K",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 7,
    "prompt_tokens": 20,
    "total_tokens": 27
  },
  "id": "chatcmpl-p32rZC7Ai7NgkAjvuRdsLREUW0ZBx8ZF"
}
