
[bug] mistral-7b-openorca crashes main.exe after BPE update. #3454

Closed · MaggotHATE opened this issue Oct 3, 2023 · 11 comments

@MaggotHATE (Contributor)

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

mistral-7b-openorca.Q4_K_S.gguf works correctly, as it did before the BPE update.

Current Behavior

mistral-7b-openorca.Q4_K_S.gguf crashes main.exe after entering (and processing?) the prompt.

Additionally, I've merged that commit into my own chat project (a slightly rewritten main example); it generates, but crashes at the end of generation (an EOS issue?).

  • Physical (or virtual) hardware you are using:

i5 3470 (AVX only).

  • Operating System:

Windows 8.1

  • Environment:

Compiled with w64devkit-fortran-1.20.0
Additionally, I've tested it and got the same crash with main.exe from the b1311 AVX release.

Failure Information (for bugs)

The crash message points at llama.cpp, line 7716, GGML_ASSERT(false);

Failure Logs

[1696334675] Log start
[1696334675] Cmd: main -t 3 -m F:/GGML/test/models/mistral_7b_openorca_Q4_K_S.gguf -p "system: complete the given task with precision, adding methodical explanations. user:" --temp 0.9 --repeat_penalty 1.133 --top-p 0.7 -r user: --interactive-first
[1696334675] main: build = 0 (unknown)
[1696334675] main: built with cc (GCC) 13.1.0 for x86_64-w64-mingw32
[1696334675] main: seed  = 1696334675
[1696334675] main: llama backend init
[1696334675] main: load the model and apply lora adapter, if any
[1696334676] warming up the model with an empty run
[1696334677] n_ctx: 512
[1696334677] 
[1696334677] system_info: n_threads = 3 / 4 | AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
[1696334677] add_bos: 1
[1696334677] tokenize the prompt
[1696334677] prompt: "system: complete the given task with precision, adding methodical explanations. user:"
[1696334677] tokens: [ '':1, ' system':1587, ':':28747, ' complete':4160, ' the':272, ' given':2078, ' task':3638, ' with':395, ' precision':16021, ',':28725, ' adding':8833, ' method':2038, 'ical':745, ' explan':10928, 'ations':697, '.':28723, ' user':2188, ':':28747 ]
[1696334677] recalculate the cached logits (check): embd_inp.empty() false, n_matching_session_tokens 0, embd_inp.size() 18, session_tokens.size() 0, embd_inp.size() 18
[1696334677] inp_pfx: [ '':1, ' ':28705, '':13, '':13, '###':27332, ' Inst':3133, 'ruction':3112, ':':28747, '':13, '':13 ]
[1696334677] inp_sfx: [ ' ':28705, '':13, '':13, '###':27332, ' Response':12107, ':':28747, '':13, '':13 ]
[1696334677] main: interactive mode on.
[1696334677] Reverse prompt: 'user:'
[1696334677] sampling: repeat_last_n = 64, repeat_penalty = 1.133000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.700000, typical_p = 1.000000, temp = 0.900000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
[1696334677] generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0
[1696334677] 

[1696334677] == Running in interactive mode. ==
[1696334677]  - Press Ctrl+C to interject at any time.
[1696334677]  - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

[1696334677] embd_inp.size(): 18, n_consumed: 0
[1696334677] found antiprompt: ▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅ system: complete the given task with precision, adding methodical explanations. user:
[1696334677] eval: [ '':1, ' system':1587, ':':28747, ' complete':4160, ' the':272, ' given':2078, ' task':3638, ' with':395, ' precision':16021, ',':28725, ' adding':8833, ' method':2038, 'ical':745, ' explan':10928, 'ations':697, '.':28723, ' user':2188, ':':28747 ]
[1696334681] n_past = 18
[1696334681] embd_inp.size(): 18, n_consumed: 18
[1696334681] found antiprompt: ▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅ system: complete the given task with precision, adding methodical explanations. user:
[1696334681] waiting for user input
[1696334689] buffer: 'Write a joke about llamas.
'
[1696334689] input tokens: [ ' Write':12018, ' a':264, ' joke':13015, ' about':684, ' llam':17620, 'as':293, '.':28723, '':13 ]
[1696334689] n_remain: -9
[1696334689] embd_inp.size(): 26, n_consumed: 18
[1696334689] eval: [ ' Write':12018, ' a':264, ' joke':13015, ' about':684, ' llam':17620, 'as':293, '.':28723, '':13 ]
[1696334691] n_past = 26
[1696334691] top 10 candidates:
@goerch (Collaborator) commented Oct 3, 2023

mistral-7b-openorca.Q4_K_S.gguf crashes main.exe after entering (and processing?) the prompt.

I'm not surprised if the model is using a GPT-2-based tokenizer. How do we convert mistral-7b-openorca (I haven't found a specific conversion script in the repository)?

The crash message points at llama.cpp, line 7716, GGML_ASSERT(false);

OK, so the model seems to use a sentencepiece tokenizer, and the function tries to handle a token which is neither NORMAL, UNKNOWN, CONTROL, nor BYTE. Does the vocabulary contain USER_DEFINED or UNUSED tokens?
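
For readers following along, here is a minimal Python sketch of that dispatch. It only mirrors the C++ logic in llama.cpp for illustration; the function name and placeholder strings are assumptions, while the gguf.TokenType values are the ones the conversion scripts write.

import gguf

# Mirror of the token-to-piece dispatch: NORMAL, UNKNOWN, CONTROL and BYTE are
# handled, anything else (USER_DEFINED, UNUSED, ...) hits the assertion.
def token_to_piece_sketch(text: str, tok_type: int) -> str:
    if tok_type == gguf.TokenType.NORMAL:
        return text                          # ordinary vocabulary entry
    if tok_type == gguf.TokenType.UNKNOWN:
        return "\u2585"                      # '▅' placeholder for <unk>
    if tok_type == gguf.TokenType.CONTROL:
        return ""                            # control tokens render as nothing
    if tok_type == gguf.TokenType.BYTE:
        return chr(int(text[3:-1], 16))      # e.g. "<0x0A>" -> "\n"
    raise AssertionError(f"unhandled token type {tok_type}")  # GGML_ASSERT(false) in the C++ code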

@MaggotHATE (Contributor, Author)

How do we convert mistral-7b-openorca (I haven't found a specific conversion script in the repository)?

I used TheBloke's converted version, if that helps.

@staviq (Collaborator) commented Oct 3, 2023

@goerch I can reproduce. Anything you would like me to check? I believe this model adds <|im_start|> and <|im_end|> tokens.

Edit: from the model page: https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca/raw/main/added_tokens.json

{
  "</s>": 2,
  "<s>": 1,
  "<unk>": 0,
  "<|im_end|>": 32000,
  "<|im_start|>": 32001
}
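
A quick way to see why only the last two entries matter for conversion is to compare the IDs against the base SentencePiece vocabulary size. A minimal sketch, assuming the file has been downloaded locally and that Mistral's base vocab size is 32000:

import json

BASE_VOCAB_SIZE = 32000  # assumption: size of Mistral's base tokenizer.model vocab

with open("added_tokens.json", encoding="utf-8") as f:
    added = json.load(f)

for text, token_id in sorted(added.items(), key=lambda kv: kv[1]):
    where = "appended by the converter" if token_id >= BASE_VOCAB_SIZE else "already in the base vocab"
    print(f"{token_id:>5}  {text!r:<15} {where}")
# 0/1/2 (<unk>, <s>, </s>) already exist in the base vocab; 32000 (<|im_end|>)
# and 32001 (<|im_start|>) are the genuinely new ChatML tokens.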

@goerch (Collaborator) commented Oct 3, 2023

@goerch I can reproduce. Anything you would like me to check? I believe this model adds <|im_start|> and <|im_end|> tokens.

Edit: from the model page: https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca/raw/main/added_tokens.json

{
  "</s>": 2,
  "<s>": 1,
  "<unk>": 0,
  "<|im_end|>": 32000,
  "<|im_start|>": 32001
}

Thanks. Something like convert.py probably adds these as USER_DEFINED, but unfortunately we don't know the real sentencepiece token type, and I have neither an understanding of the semantics of USER_DEFINED nor a test case for it.

To avoid further damage, I'm inclined to disable these assertions in token_to_piece, which would mean all unsupported token types behave like CONTROL tokens.

Edit: what I don't like about our current logic is that even <unk>, <s> and </s> would end up with token type USER_DEFINED, IIUC.
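
To make the proposed behaviour concrete, a sketch only (an assumption about the direction of the fix, not the actual llama.cpp change), continuing the Python mirror from the earlier comment:

import gguf

# Same dispatch as in the earlier sketch, but with the assertion dropped: any
# unsupported type (USER_DEFINED, UNUSED, ...) is rendered like a CONTROL
# token, i.e. as an empty piece, instead of crashing.
def token_to_piece_relaxed(text: str, tok_type: int) -> str:
    if tok_type == gguf.TokenType.NORMAL:
        return text
    if tok_type == gguf.TokenType.UNKNOWN:
        return "\u2585"                      # '▅' placeholder for <unk>
    if tok_type == gguf.TokenType.BYTE:
        return chr(int(text[3:-1], 16))      # e.g. "<0x0A>" -> "\n"
    return ""                                # CONTROL and every unsupported type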

goerch added a commit to goerch/llama.cpp that referenced this issue Oct 3, 2023
@goerch (Collaborator) commented Oct 3, 2023

Anything you would like me to check?

It would be great if you could check #3455.

@staviq (Collaborator) commented Oct 3, 2023

Anything you would like me to check?

It would be great if you could check #3455.

So your fix works; however, naively changing USER_DEFINED to CONTROL in

yield text.encode("utf-8"), score, gguf.TokenType.USER_DEFINED

works too, and produces a model compatible with the current version without modifications (paging @TheBloke).
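
For context, a simplified stand-in for that added-token path; the real convert.py loop, variable names, and score differ, and only the token-type change staviq tested is the point here:

import gguf

# Hypothetical helper, not the actual convert.py code: yields added-token
# entries with CONTROL instead of USER_DEFINED as their type.
def added_token_entries(added_tokens: dict[str, int], score: float = -1000.0):
    # score = -1000.0 is an assumption for this sketch
    for text, _token_id in sorted(added_tokens.items(), key=lambda kv: kv[1]):
        # before: gguf.TokenType.USER_DEFINED  (crashes the pre-#3455 token_to_piece)
        # after:  gguf.TokenType.CONTROL       (works with or without the PR)
        yield text.encode("utf-8"), score, gguf.TokenType.CONTROL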

@TheBloke (Contributor) commented Oct 3, 2023

So do I need to re-make OpenOrca Mistral GGUF? For the FOURTH time? 🤣 (they kept updating the JSON files with tokenizer changes, so I ended up making them three times yesterday)

Or are you asking me to test if this PR works with the existing GGUFs?

@staviq (Collaborator) commented Oct 3, 2023

Or are you asking me to test if this PR works with the existing GGUFs?

(Edit: the PR is #3455)

I already tested it, and it does.

This PR should make already-converted models work, but the change in convert.py produces a model which works with or without this PR.

In case people start reporting broken conversions, the solution is either to wait for this PR to get merged, or to redo the conversion with the modified convert.py.

So I guess the choice is yours: whether you want people to aim their pitchforks at you or at llama.cpp :)

@slaren (Collaborator) commented Oct 3, 2023

Well, once support for SWA (sliding window attention) is added, Mistral models will probably need to be converted again to add it to the metadata.

@goerch (Collaborator) commented Oct 4, 2023

so I ended up making them three times yesterday

Using convert.py? Thanks!

goerch added a commit that referenced this issue Oct 7, 2023
Fix: `sentencepiece` tokenizers with added tokens failed with an incorrect assertion
yusiwen pushed a commit to yusiwen/llama.cpp that referenced this issue Oct 7, 2023
Fix: `sentencepiece` tokenizers with added tokens failed with an incorrect assertion
joelkuiper added a commit to vortext/llama.cpp that referenced this issue Oct 12, 2023
…example

* 'master' of github.com:ggerganov/llama.cpp:
  py : change version of numpy requirement to 1.24.4 (ggerganov#3515)
  quantize : fail fast on write errors (ggerganov#3521)
  metal : support default.metallib load & reuse code for swift package (ggerganov#3522)
  llm : support Adept Persimmon 8B (ggerganov#3410)
  Fix for ggerganov#3454 (ggerganov#3455)
  readme : update models, cuda + ppl instructions (ggerganov#3510)
  server : docs fix default values and add n_probs (ggerganov#3506)
@ggerganov (Owner)

I believe the issue is resolved now
