main: build = 1336 (9ca79d5) - Load mistral-7b-openorca.Q8_0.gguf - after first prompt "hello" llama crashing - windows build - some time ago was ok - 30 builds before? #3516

Closed
mirek190 opened this issue Oct 6, 2023 · 7 comments

Comments

@mirek190

mirek190 commented Oct 6, 2023

main.exe --model models\new3\mistral-7b-openorca.Q8_0.gguf --mlock --color --threads 16 --keep -1 --batch_size 512 --n_predict -1 --top_k 40 --top_p 0.9 --temp 0.96 --repeat_penalty 1.1 --ctx_size 32768 --interactive --instruct --reverse-prompt "<|im_end|>" -ngl 48 --simple-io  --in-prefix "<|im_start|>user " --in-suffix "<|im_end|> " -p "<|im_start|>system You are MistralOrca, a large language model trained by Alignment Lab AI. Write out your reasoning step-by-step to be sure you get the right answers!<|im_end|> "

Log start
main: build = 1336 (9ca79d5)
main: built with MSVC 19.35.32217.1 for x64
main: seed  = 1696634023
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6
llama_model_loader: loaded meta data

llm_load_print_meta: format           = GGUF V2 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32002
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = mostly Q8_0
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 7.17 GiB (8.50 BPW)
llm_load_print_meta: general.name   = open-orca_mistral-7b-openorca
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 32000 '<dummy32000>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.09 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  =  132.91 MB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 35/35 layers to GPU
llm_load_tensors: VRAM used: 7205.84 MB
...................................................................................................
llama_new_context_with_model: n_ctx      = 32768
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 4096.00 MB
llama_new_context_with_model: kv self size  = 4096.00 MB
llama_new_context_with_model: compute buffer total size = 2141.88 MB
llama_new_context_with_model: VRAM scratch buffer: 2136.00 MB
llama_new_context_with_model: total VRAM used: 13437.84 MB (model: 7205.84 MB, context: 6232.00 MB)

system_info: n_threads = 16 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 |
main: interactive mode on.
Reverse prompt: '<|im_end|>'
Reverse prompt: '### Instruction:

'
Input prefix: '<|im_start|>user '
Input suffix: '<|im_end|> '
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.900000, typical_p = 1.000000, temp = 0.960000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 32768, n_batch = 512, n_predict = -1, n_keep = 54


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

 <|im_start|>system You are MistralOrca, a large language model trained by Alignment Lab AI. Write out your reasoning step-by-step to be sure you get the right answers!<|im_end|>
> <|im_start|>user hello
<|im_end|>  Hello! I'm MistralOrca, a large language model developed by Alignment Lab AI. I'm here to help you with any questions or tasks you may have.GGML_ASSERT: D:\a\llama.cpp\llama.cpp\llama.cpp:8203: false
PS E:\LLAMA\llama.cpp>   
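
A side note on the VRAM figures in the log above: the 4096.00 MB "kv self size" is simply the cost of a full 32768-token f16 KV cache for this model, which is why total VRAM climbs to roughly 13.4 GB. A quick back-of-the-envelope check (bash arithmetic; n_embd_gqa = n_embd * n_head_kv / n_head = 4096 * 8 / 32 = 1024):

# 2 tensors (K and V) * n_layer * n_ctx * n_embd_gqa * 2 bytes per f16 value
echo $(( 2 * 32 * 32768 * 1024 * 2 / 1024 / 1024 ))   # prints 4096 (MB)
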
@cwillu

cwillu commented Oct 6, 2023

Believe this is #3454

See also: #3455

@mirek190
Author

mirek190 commented Oct 6, 2023

I used TheBloke's newest build of mistral-7b-openorca.Q8_0.gguf and it still crashes the same way ...

llm_load_print_meta: format           = GGUF V2 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32002
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = mostly Q8_0
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 7.17 GiB (8.50 BPW)
llm_load_print_meta: general.name   = open-orca_mistral-7b-openorca
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 32000 '<dummy32000>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.09 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  =  132.91 MB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 35/35 layers to GPU
llm_load_tensors: VRAM used: 7205.84 MB
...................................................................................................
llama_new_context_with_model: n_ctx      = 32768
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 4096.00 MB
llama_new_context_with_model: kv self size  = 4096.00 MB
llama_new_context_with_model: compute buffer total size = 2141.88 MB
llama_new_context_with_model: VRAM scratch buffer: 2136.00 MB
llama_new_context_with_model: total VRAM used: 13437.84 MB (model: 7205.84 MB, context: 6232.00 MB)

system_info: n_threads = 16 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 |
main: interactive mode on.
Reverse prompt: '<|im_end|>'
Reverse prompt: '### Instruction:

'
Input prefix: '<|im_start|>user '
Input suffix: '<|im_end|> '
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.900000, typical_p = 1.000000, temp = 0.960000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 32768, n_batch = 512, n_predict = -1, n_keep = 54


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

 <|im_start|>system You are MistralOrca, a large language model trained by Alignment Lab AI. Write out your reasoning step-by-step to be sure you get the right answers!<|im_end|>
> <|im_start|>user hello
<|im_end|>  Hello! I am MistralOrca, a large language model developed by Alignment Lab AI. How may I assist you today?GGML_ASSERT: D:\a\llama.cpp\llama.cpp\llama.cpp:8203: false
PS E:\LLAMA\llama.cpp>

Older llama.cpp builds didn't have this problem.

@staviq
Collaborator

staviq commented Oct 6, 2023

Just a genuinely friendly tip: no matter the project, repo, website, program, or app (not just here), never write "latest" or "newest" in a bug report. Tomorrow, or even later the same day, it will no longer be the latest; updates can happen several times a day. You force people to guess your version manually, by matching the date/time of your post against commit timestamps or hunting for text in the logs they can extrapolate the version from. Not including a build or version almost always guarantees a bug report will be ignored.

main prints version and build info, which you omitted, so please provide that information.
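
For reference, assuming a local git checkout, both parts of that line can be recovered directly from git (llama.cpp derives its build number from the commit count, as far as I know):

git rev-parse --short HEAD   # commit hash, the "(9ca79d5)" part
git rev-list --count HEAD    # build number, the "1336" part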

@staviq
Collaborator

staviq commented Oct 6, 2023

@cwillu

Believe this is #3454

See also: #3455

You got fooled by the sneaky link in the PR name, didn't you? :)

@mirek190
Author

mirek190 commented Oct 7, 2023

main: build = 1336 (9ca79d5)

Also added it to the first comment.

@mirek190 changed the title from "newest llama - Load mistral-7b-openorca.Q8_0.gguf - after first prompt "hello" llama crashing - windows build - some time ago was ok - 30 builds before?" to "main: build = 1336 (9ca79d5) - Load mistral-7b-openorca.Q8_0.gguf - after first prompt "hello" llama crashing - windows build - some time ago was ok - 30 builds before?" on Oct 7, 2023
@staviq
Collaborator

staviq commented Oct 7, 2023

main: build = 1336 (9ca79d5)

Also added it to the first comment.

Thank you.

You need to update llama.cpp; this was already fixed.
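
If it helps, a typical way to update and rebuild on Windows (a sketch assuming a CMake cuBLAS build, which is what the log suggests):

git pull
cmake -B build -DLLAMA_CUBLAS=ON
cmake --build build --config Release

The rebuilt main.exe should then land under build\bin\Release\.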

@mirek190
Author

mirek190 commented Oct 7, 2023

Confirmed: newer builds, for instance main: build = 1342 (f1782c6), work fine.

@mirek190 closed this as completed on Oct 7, 2023