Starcoder / Quantized Issues #1
Hi, I think it is due to the breaking change introduced in the quantization formats of the GGML library in ggerganov/ggml#154 yesterday. Can you please try doing the quantization from the ggml submodule of this repo and let me know if it works:

git clone --recurse-submodules https://github.com/marella/ctransformers
cd ctransformers/models/ggml
cmake -S . -B build
cmake --build build
./build/bin/starcoder-quantize # specify path to model and quantization type

If I pull the latest changes, I think old models will stop working with this library. So I'm thinking of waiting for some time for people to convert and provide models in the new format before pulling the changes.
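Once the re-quantized file is produced, a quick way to sanity-check it from Python might look like the sketch below; the file path and generation length are placeholders, not values from this thread:

from ctransformers import AutoModelForCausalLM

# Load a locally re-quantized StarCoder GGML file; model_type tells the
# library which architecture the plain file path corresponds to.
llm = AutoModelForCausalLM.from_pretrained(
    '/path/to/starcoder-ggml-q4_1.bin',  # hypothetical path to the re-quantized model
    model_type='starcoder',
)

print(llm('def fibo(', max_new_tokens=64))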
Awesome! Thank you @marella, this is definitely the issue. I wish there were clearer ways to version the various quantizations -- I'm new to the ggml toolkit, so I didn't realize how breaking changes to the quantization would manifest. I'll also add that the ability to pull directly from Hugging Face makes this super great, thank you!
Hi @bluecoconut @marella, could you provide an example? I've been trying to execute the model but have had no luck; none of these prompts work:

Before adding the type I was doing:

And getting:

I don't have a specific problem using the module from transformers, but on a Mac M1 Pro with 64 GB of memory inference can take more than 10 minutes; is this correct?
Hi @bgonzalezfractal, here is an example command:

./build/bin/starcoder -m ./starcoder-ggml/starcoder-ggml-q4_1.bin -p "def fibo(" --top_k 0 --top_p 0.95 --temp 0.2

The Apple M1 processor doesn't support AVX2/AVX instructions, so it will be slower. You can try increasing the number of threads with llm(..., threads=8). You can use this command to get the CPU core count:

grep -m 1 'cpu cores' /proc/cpuinfo

Also, I just updated the build file in this repo. Can you please pull the latest changes or clone this repo and try building the library from source:

git clone --recurse-submodules https://github.com/marella/ctransformers
cd ctransformers
cmake -S . -B build
cmake --build build

The compiled binary for the library will be located at build/lib/ and can be passed to the loader:

llm = AutoModelForCausalLM.from_pretrained(..., lib='/path/to/ctransformers/build/lib/libctransformers.dylib')
llm(..., threads=8)

Can you please try this and let me know if you are seeing any improvement in performance?
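Putting the custom library path and the thread count together in Python might look like the following sketch; the model and library paths are placeholders, and using os.cpu_count() is just one way to pick a thread count:

import os

from ctransformers import AutoModelForCausalLM

# Point ctransformers at a locally built shared library and a local GGML model.
llm = AutoModelForCausalLM.from_pretrained(
    '/path/to/starcoder-ggml-q4_1.bin',  # hypothetical model path
    model_type='starcoder',
    lib='/path/to/ctransformers/build/lib/libctransformers.dylib',
)

# Use the detected core count (falling back to 8) for generation threads.
print(llm('def fibo(', threads=os.cpu_count() or 8, max_new_tokens=64))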
@bluecoconut Just an FYI: there is a new breaking change in quantization formats added to llama.cpp in ggerganov/llama.cpp#1508 yesterday. Initially I was planning to update to the latest version over the weekend, but now I will have to wait for the new breaking changes to be added to the ggml repo.
@marella I was able to get the model to run with the command line, but I have had no success using the library: I get "segmentation fault python" using transformers. For the build I used:

Since the CMake files were placed in the models folders. When building ggml from source at tag v0.1.2, text generation works fine:
@bgonzalezfractal Recently I updated the GGML library, which has breaking changes to the quantization formats, so old models have to be re-quantized. Let's continue the discussion here.

@bluecoconut This is released in the latest version 0.2.0. It includes the latest quantization changes and the recent fix for StarCoder (ggerganov/ggml#176). Since it includes breaking changes, old models have to be re-quantized. It also supports LLaMA and MPT models now.
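For reference, a minimal sketch of how the different model types might be selected after re-quantizing; the file paths below are illustrative placeholders rather than files from this thread:

from ctransformers import AutoModelForCausalLM

# The same loader is used for every architecture; model_type selects the backend.
starcoder = AutoModelForCausalLM.from_pretrained('/path/to/starcoder-q4_1.bin', model_type='starcoder')
llama = AutoModelForCausalLM.from_pretrained('/path/to/llama-q4_0.bin', model_type='llama')
mpt = AutoModelForCausalLM.from_pretrained('/path/to/mpt-q4_0.bin', model_type='mpt')

print(starcoder('def fibo(', max_new_tokens=32))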
@marella Can you confirm these steps would work:
I don't think the starcoder.cpp repo has updated to the latest GGML version, so it might not work. The steps in the GGML repo should work with the latest version of ctransformers: https://github.com/ggerganov/ggml/tree/master/examples/starcoder#quick-start
@marella, I quantized starcoder again and it works at 183.36 ms per token:

Any luck?
Hey! Thanks for this library, I really appreciate the API and simplicity you are bringing to this; it's exactly what I was looking for in trying to integrate ggml models into Python (specifically into my library lambdaprompt).

One issue: it seems like there's something going wrong with starcoder quantized models.
The full model seems to work great, and I'm getting the same outputs.
What works (full model weights):
./build/bin/starcoder -m /workspaces/research/models/starcoder/starcoder-ggml.bin -p "def fibo(" --top_k 0 --top_p 0.95 --temp 0.2
as equivalent to:
Both seem to give equivalent results!
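The Python side of that comparison isn't preserved above, so here is a sketch of what an equivalent ctransformers call might look like, assuming the generation parameters map directly onto the CLI flags (this is a hypothetical reconstruction, not the author's original code):

from ctransformers import AutoModelForCausalLM

# Hypothetical reconstruction: load the same full-precision GGML file used by the CLI example.
llm = AutoModelForCausalLM.from_pretrained(
    '/workspaces/research/models/starcoder/starcoder-ggml.bin',
    model_type='starcoder',
)

# top_k=0, top_p=0.95 and temperature=0.2 mirror the flags passed to the starcoder binary.
print(llm('def fibo(', top_k=0, top_p=0.95, temperature=0.2, max_new_tokens=64))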
What fails (quantized model weights):
However, when I change to the quantized model (to reproduce the same as this):
./build/bin/starcoder -m /workspaces/research/models/starcoder/starcoder-ggml-q4_1.bin -p "def fibo(" --top_k 0 --top_p 0.95 --temp 0.2
I get a "core dumped" ggml error.