
llama.cpp: loading model ......terminate called after throwing an instance of 'std::runtime_error' #303

Closed
mikeyang01 opened this issue May 31, 2023 · 6 comments
Labels
bug Something isn't working

Comments

@mikeyang01
Contributor

mikeyang01 commented May 31, 2023

langchain 0.0.184

The error happens on:
llama-cpp-python version: 0.1.53–0.1.56

Error detail:

llama.cpp: loading model from /root/models/ggml-vic7b-q4_0.bin
terminate called after throwing an instance of 'std::runtime_error'
  what():  unexpectedly reached end of file
Aborted (core dumped)

Works correctly on:
llama-cpp-python version: 0.1.52

Correct output:

llama.cpp: loading model from /root/models/ggml-vic7b-q4_0.bin
llama_model_load_internal: format     = ggjt v2 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  72.75 KB
llama_model_load_internal: mem required  = 5809.34 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  = 1024.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 

source code

from langchain.llms import LlamaCpp
from langchain import PromptTemplate, LLMChain
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

# Callbacks support token-wise streaming
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
# Verbose is required to pass to the callback manager

# Make sure the model path is correct for your system!
llm_cpp = LlamaCpp(
    model_path="/root/models/ggml-vic7b-q4_0.bin",
    callback_manager=callback_manager,
    verbose=True,
    n_ctx=2048,
)

My investigation:
Maybe this is related to a llama.cpp quantization issue? ggerganov/llama.cpp#1569
Any ideas why this happens?
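
As a sanity check (not part of the original report), you can peek at the file header to see which container version the model was quantized with. A minimal sketch, assuming the GGJT layout used by llama.cpp at the time (a little-endian magic word 0x67676a74, "ggjt", followed by a version word):

import struct

def ggml_header(path):
    # Read the first 8 bytes: magic word plus (for ggjt files) a version word.
    with open(path, "rb") as f:
        magic, version = struct.unpack("<II", f.read(8))
    if magic == 0x67676A74:  # "ggjt"
        return f"ggjt v{version}"
    return f"unknown or older magic 0x{magic:08x}"

print(ggml_header("/root/models/ggml-vic7b-q4_0.bin"))

If the linked quantization change is the culprit, the version reported here would differ from what the llama.cpp bundled in 0.1.53+ expects.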

@gjmulder added the bug label May 31, 2023
@christianwengert

christianwengert commented Jun 1, 2023

I have a similar problem:

langchain==0.0.187
llama-cpp-python==0.1.57 # and also 0.1.56 but not 0.1.55

and

llm = LlamaCpp(model_path=model_path,
               temperature=0.8,
               n_threads=8,
               n_ctx=n_ctx,
               n_batch=512,
               max_tokens=1024)

raises

    llm = LlamaCpp(model_path=model_path,
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pydantic/main.py", line 341, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for InterruptableLlamaCpp
__root__
  Could not load Llama model from path: /Users/XXXXXXX/Downloads/Wizard-Vicuna-7B-Uncensored.ggmlv3.q5_0.bin. Received error cannot resize an array that references or is referenced
by another array in this way.
Use the np.resize function or refcheck=False (type=value_error)

This happens in llama_cpp/llama.py on line 225

self._candidates_data.resize(3, self._n_vocab)

@christianwengert

Funnily enough, this happens only in debug mode. It can be solved using

self._candidates_data.resize(3, self._n_vocab, refcheck=False)
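
For context, a minimal standalone repro of that ValueError (assumed plain-numpy behaviour, independent of llama-cpp-python; the explicit view below stands in for whatever extra reference the debugger holds in debug mode):

import numpy as np

data = np.zeros((2, 5), dtype=np.float32)
view = data[0]              # another array now references data's buffer

try:
    data.resize(3, 5)       # refcheck=True by default -> ValueError
except ValueError as err:
    print("resize failed:", err)

# Either skip the reference check (existing views may then point at stale
# memory, which is why the check is on by default) ...
data.resize(3, 5, refcheck=False)

# ... or build a resized copy instead of resizing in place.
resized_copy = np.resize(data, (4, 5))
print(data.shape, resized_copy.shape)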

christianwengert added a commit to christianwengert/llama-server that referenced this issue Jun 1, 2023
christianwengert added a commit to christianwengert/llama-server that referenced this issue Jun 1, 2023
@matthiasgeihs

Having the same problem as mentioned in the issue description ("unexpectedly reached end of file").

Any solution to this yet?

@matthiasgeihs

OK, this indeed seems to be related to the breaking change with respect to quantization: ggerganov/llama.cpp#1405 :(

I guess there is nothing we can do except either use an old version or re-quantize our models.
Is there maybe a way to convert the old format to the new one?

@gjmulder
Contributor

gjmulder commented Jun 3, 2023

Hopefully things have standardized on ggmlv3 upstream for a while. If you have the fp16 .bin version of the model, you can use the ./quantize utility in llama.cpp to requantize it, as sketched below.
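
Something along these lines (paths are placeholders and the exact arguments may differ between llama.cpp versions, so check ./quantize --help in your build):

./quantize ./models/ggml-model-f16.bin ./models/ggml-model-q4_0.bin q4_0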

Alternatively, I wrote a script that provides a menu of models from 🤗 and allows you to download them directly. Without any args it defaults to (currently) a menu of 51 q5_1 quantized models kindly published by @TheBloke, most of which should be ggmlv3. There's an automatic version check after you download the model to confirm it is in fact ggmlv3 (can be overridden with the -v arg). You can also explicitly substring match on a filename to get a specific quantization level (e.g. -f q4_1):

docker/open_llama$ python ./hug_model.py --help
usage: hug_model.py [-h] [-v VERSION] [-a AUTHOR] [-t TAG] [-s SEARCH] [-f FILENAME]

Process some parameters.

options:
  -h, --help            show this help message and exit
  -v VERSION, --version VERSION
                        hexadecimal version number of ggml file
  -a AUTHOR, --author AUTHOR
                        HuggingFace author filter
  -t TAG, --tag TAG     HuggingFace tag filter
  -s SEARCH, --search SEARCH
                        HuggingFace search filter
  -f FILENAME, --filename FILENAME
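
For example, to filter to @TheBloke's models and match a specific quantization in the filename (flags as listed above; the resulting menu will vary):

docker/open_llama$ python ./hug_model.py -a TheBloke -f q4_1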

@gjmulder
Contributor

Can we close this?
