
Unable to run models with the Q4_0_4_4, Q4_0_4_8 and Q4_0_8_8 formats at ARM device. #1117

Closed
gustrd opened this issue Sep 6, 2024 · 5 comments

Comments

@gustrd

gustrd commented Sep 6, 2024

Describe the Issue
Upstream we have the new feature of ARM-optimized models (Q4_0_4_4, Q4_0_4_8 and Q4_0_8_8). I tried to run every one of them on my Snapdragon 8G1, but I was unable to run any of them with koboldcpp.

Additional Information:
Checking upstream, I saw the new documentation (ggerganov#9321), which shows that some flags must be set at compile time. Can you please explain how to compile koboldcpp with those flags so I can try again?

To support `Q4_0_4_4`, you must build with `GGML_NO_LLAMAFILE=1` (`make`) or `-DGGML_LLAMAFILE=OFF` (`cmake`).
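For reference, this is roughly how those flags are passed when building upstream llama.cpp itself (a sketch only; build directory and targets may differ between versions):

```sh
# Makefile-based build: disable llamafile via the documented variable
make GGML_NO_LLAMAFILE=1

# Or with CMake: turn the llamafile option off, then build
cmake -B build -DGGML_LLAMAFILE=OFF
cmake --build build --config Release
```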

gustrd changed the title from "Unable to run models with theQ4_0_4_4, Q4_0_4_8 and Q4_0_8_8 formatsformats at ARM device." to "Unable to run models with the Q4_0_4_4, Q4_0_4_8 and Q4_0_8_8 formats at ARM device." on Sep 6, 2024
@LostRuins
Owner

At the moment there is no flag to disable llamafile; I will add one. For now, you need to remove all matches of `-DGGML_USE_LLAMAFILE` from the Makefile, and then rebuild.
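A rough sketch of that edit (this assumes the flag only appears as a compile definition in koboldcpp's Makefile; review the change before rebuilding):

```sh
# Strip the llamafile define from the Makefile (GNU sed syntax), then rebuild
sed -i 's/-DGGML_USE_LLAMAFILE//g' Makefile
make clean && make
```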

@Abhrant

Abhrant commented Oct 3, 2024

Can we not just delete the llama.cpp folder, clone it again, and run `make` again?

@gustrd
Author

gustrd commented Oct 3, 2024

With the latest version I was able to run Q4_0_4_4 just by compiling from source. Thanks!

gustrd closed this as completed on Oct 3, 2024
@Abhrant

Abhrant commented Oct 3, 2024

@gustrd, which quantization exactly is Q4_0_4_4? What quantization config do you have to specify to run it? And how fast is it compared to other quantizations on ARM?

@gustrd
Author

gustrd commented Oct 4, 2024

@Abhrant, I'm not a specialist on this, but AFAIK Q4_0_4_4 is a special type of Q4_0 that takes advantage of ARM optimizations present on some newer devices.

Q4_0_4_8 uses i8mm and Q4_0_8_8 uses SVE, which are even newer technologies.

I could only test Q4_0_4_4, and it gave a great prompt-processing increase and a minor generation-speed increase.

With a Snapdragon 8G1 I'm getting around 35 t/s prompt processing and 9 t/s generation for a 3B model.
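In case it helps, this is a sketch of how such a file is typically produced with upstream llama.cpp's `llama-quantize` tool (the file names here are placeholders, and the exact binary name may vary between builds):

```sh
# Re-quantize an f16 GGUF into the ARM-optimized Q4_0_4_4 layout
./llama-quantize model-f16.gguf model-Q4_0_4_4.gguf Q4_0_4_4
```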
