
llama.cpp: loading model ......terminate called after throwing an instance of 'std::runtime_error' #303

Closed
mikeyang01 opened this issue May 31, 2023 · 6 comments
Labels
bug Something isn't working

Comments

@mikeyang01
Contributor

mikeyang01 commented May 31, 2023

langchain 0.0.184

The error happens on:
llama-cpp-python version: 0.1.53–0.1.56

Error detail:

llama.cpp: loading model from /root/models/ggml-vic7b-q4_0.bin
terminate called after throwing an instance of 'std::runtime_error'
  what():  unexpectedly reached end of file
Aborted (core dumped)

Works correctly on:
llama-cpp-python version: 0.1.52

Correct output:

llama.cpp: loading model from /root/models/ggml-vic7b-q4_0.bin
llama_model_load_internal: format     = ggjt v2 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  72.75 KB
llama_model_load_internal: mem required  = 5809.34 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  = 1024.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 

source code

from langchain.llms import LlamaCpp
from langchain import PromptTemplate, LLMChain
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

# Callbacks support token-wise streaming
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
# Verbose is required to pass to the callback manager

# Make sure the model path is correct for your system!
llm_cpp = LlamaCpp(
    model_path="/root/models/ggml-vic7b-q4_0.bin",
    callback_manager=callback_manager,
    verbose=True,
    n_ctx=2048,
)

My investigation:
Maybe this is related to a llama.cpp quantization issue? ggerganov/llama.cpp#1569
Any ideas why this happens?
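
As a sanity check (not part of the original report), you can peek at the file header to see which container version the model was quantized with. A minimal sketch, assuming the GGJT layout used by llama.cpp at the time (a little-endian magic word 0x67676a74, "ggjt", followed by a version word):

import struct

def ggml_header(path):
    # Read the first 8 bytes: magic word plus (for ggjt files) a version word.
    with open(path, "rb") as f:
        magic, version = struct.unpack("<II", f.read(8))
    if magic == 0x67676A74:  # "ggjt"
        return f"ggjt v{version}"
    return f"unknown or older magic 0x{magic:08x}"

print(ggml_header("/root/models/ggml-vic7b-q4_0.bin"))

If the linked quantization change is the culprit, the version reported here would differ from what the llama.cpp bundled in 0.1.53+ expects.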

@gjmulder added the bug label May 31, 2023
@christianwengert

christianwengert commented Jun 1, 2023

I have a similar problem:

langchain==0.0.187
llama-cpp-python==0.1.57 # and also 0.1.56 but not 0.1.55

and

llm = LlamaCpp(model_path=model_path,
               temperature=0.8,
               n_threads=8,
               n_ctx=n_ctx,
               n_batch=512,
               max_tokens=1024)

raises

    llm = LlamaCpp(model_path=model_path,
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pydantic/main.py", line 341, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for InterruptableLlamaCpp
__root__
  Could not load Llama model from path: /Users/XXXXXXX/Downloads/Wizard-Vicuna-7B-Uncensored.ggmlv3.q5_0.bin. Received error cannot resize an array that references or is referenced
by another array in this way.
Use the np.resize function or refcheck=False (type=value_error)

This happens in llama_cpp/llama.py on line 225

self._candidates_data.resize(3, self._n_vocab)

@christianwengert

Funnily enough, this happens only in debug mode. It can be solved using

self._candidates_data.resize(3, self._n_vocab, refcheck=False)
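
For context, a minimal standalone repro of that ValueError (assumed plain-numpy behaviour, independent of llama-cpp-python; the explicit view below stands in for whatever extra reference the debugger holds in debug mode):

import numpy as np

data = np.zeros((2, 5), dtype=np.float32)
view = data[0]              # another array now references data's buffer

try:
    data.resize(3, 5)       # refcheck=True by default -> ValueError
except ValueError as err:
    print("resize failed:", err)

# Either skip the reference check (existing views may then point at stale
# memory, which is why the check is on by default) ...
data.resize(3, 5, refcheck=False)

# ... or build a resized copy instead of resizing in place.
resized_copy = np.resize(data, (4, 5))
print(data.shape, resized_copy.shape)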

christianwengert added a commit to christianwengert/llama-server that referenced this issue Jun 1, 2023
christianwengert added a commit to christianwengert/llama-server that referenced this issue Jun 1, 2023
@matthiasgeihs

Having the same problem as mentioned in the issue description ("unexpectedly reached end of file").

Any solution to this yet?

@matthiasgeihs

OK, this indeed seems to be related to the breaking change with respect to quantization: ggerganov/llama.cpp#1405 :(

I guess there is nothing we can do except either use an old version or re-quantize our models.
Is there maybe a way to convert the old format to the new one?

@gjmulder
Contributor

gjmulder commented Jun 3, 2023

Hopefully things have standardized on ggmlv3 upstream for a while. If you have the fp16 .bin version of the model, you can use the ./quantize utility in llama.cpp to requantize it, as sketched below.
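
Something along these lines (paths are placeholders and the exact arguments may differ between llama.cpp versions, so check ./quantize --help in your build):

./quantize ./models/ggml-model-f16.bin ./models/ggml-model-q4_0.bin q4_0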

Alternatively, I wrote a script that provides a menu of models from 🤗 and allows you to download them directly. Without any args it defaults to (currently) a menu of 51 q5_1 quantized models kindly published by @TheBloke, most of which should be ggmlv3. There's an automatic version check after you download the model to confirm it is in fact ggmlv3 (can be overridden with the -v arg). You can also explicitly substring match on a filename to get a specific quantization level (e.g. -f q4_1):

docker/open_llama$ python ./hug_model.py --help
usage: hug_model.py [-h] [-v VERSION] [-a AUTHOR] [-t TAG] [-s SEARCH] [-f FILENAME]

Process some parameters.

options:
  -h, --help            show this help message and exit
  -v VERSION, --version VERSION
                        hexadecimal version number of ggml file
  -a AUTHOR, --author AUTHOR
                        HuggingFace author filter
  -t TAG, --tag TAG     HuggingFace tag filter
  -s SEARCH, --search SEARCH
                        HuggingFace search filter
  -f FILENAME, --filename FILENAME
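
For example, to filter to @TheBloke's models and match a specific quantization in the filename (flags as listed above; the resulting menu will vary):

docker/open_llama$ python ./hug_model.py -a TheBloke -f q4_1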

@gjmulder
Contributor

Can we close this?
