
Benchmarks for llama quantised models with gguf #844

Closed
okpatil4u opened this issue Sep 14, 2023 · 7 comments
@okpatil4u

okpatil4u commented Sep 14, 2023

M2 Ultra 26 Cores 64 GB

With Candle

| Model | Question | Threads | CPU utilization | Time taken | Tokens/sec |
|---|---|---|---|---|---|
| Codellama-7b.Q5_K_M.gguf | Quick sort implementation in c | 1 | 98% | 37.003 s | 3.1 |
| | | 2 | 174% | 21.572 s | 5.4 |
| | | 4 | 287% | 13.771 s | 8.75 |
| | | 8 | 454% | 11.198 s | 11.01 |
| | | 20 | 1159% | 12.58 s | 9.65 |
| Codellama-34b.Q5_K_M.gguf | Quick sort implementation in c | 1 | 99% | 2:56.08 | 0.65 |
| | | 2 | 181% | 1:33.53 | 1.26 |
| | | 4 | 313% | 54.522 s | 2.28 |
| | | 8 | 509% | 38.97 s | 3.42 |
| | | 20 | 966% | 37.286 s | 3.62 |

vs Llama.cpp

| Model | Question | GPU | Cores used | CPU utilization | Eval tokens/sec |
|---|---|---|---|---|---|
| codellama-7b-python.Q5_K_M.gguf | Explain how embeddings work in large language models | | 20 | 1790% CPU | 18.58 |
| | | ngl 1 | | 23% CPU | 64.25 |
| wizardcoder-python-34b-v1.0.Q5_K_.gguf | Explain how embeddings work in large language models | | 20 | 1793% CPU | 4.32 |
| | | ngl 1 | | 10% CPU | 18.31 |

Any pointers on how one can improve performance with candle?

Also, I am trying to implement speculative sampling with candle. Do you think the implementation is feasible?

@LaurentMazare
Collaborator

Are these properly excluding the initial prompt evaluation time on both sides?
The 3.62 token/s (candle) vs 4.32 token/s (llama.cpp) doesn't actually look that bad. You could try enabling the --tracing mode to dig a bit into where the time is spent.
The 9.65 token/s vs 18.58 token/s is a lot worse, so that one is probably worth investigating if you can.
Also, just to be sure, on the candle side you're using --features accelerate, right?

GPU support isn't available yet for Mac, so all the candle numbers are CPU only (plus maybe the Neural Engine with Accelerate).
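
For reference, here is a minimal sketch of how chrome tracing can be wired into a Rust binary with the `tracing`, `tracing-subscriber`, and `tracing-chrome` crates. This is the general pattern behind a `--tracing`-style flag, not necessarily the exact setup used in the candle examples:

```rust
// Assumed dependencies: tracing = "0.1", tracing-chrome = "0.7", tracing-subscriber = "0.3"
use tracing_chrome::ChromeLayerBuilder;
use tracing_subscriber::prelude::*;

fn main() {
    // Writes a trace-<timestamp>.json file that can be opened in chrome://tracing
    // or Perfetto. Keep the guard alive until the run finishes so the trace is
    // flushed to disk.
    let (chrome_layer, _guard) = ChromeLayerBuilder::new().build();
    tracing_subscriber::registry().with(chrome_layer).init();

    // Spans emitted by instrumented code (e.g. per-op spans in the generation
    // loop) now show up on the timeline.
    let span = tracing::info_span!("generate");
    let _enter = span.enter();
    // ... run token generation here ...
}
```

Opening the generated trace file shows per-span timings, which helps pin down whether the gap comes from the matmuls or from somewhere else in the loop.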

@okpatil4u
Author

okpatil4u commented Sep 14, 2023 via email

@LaurentMazare
Collaborator

I imagine that it might be possible but not easy; we're likely to add GPU support before that.
Also, just to mention that the goal of the quantized example is not really to provide a full-featured llama.cpp equivalent, but rather to be an example of how to use quantized models. So we would prefer not to add too much complexity there, and would certainly be happy if new projects are created to build a more feature-complete and performant version.

@okpatil4u
Author

okpatil4u commented Sep 14, 2023 via email

@LaurentMazare
Collaborator

Well, I'm not very familiar with the details, but I don't see a reason why it wouldn't. Running with multiple elements in a batch should be well supported, and I think it's the only thing required on the candle side? Best is probably to give it a try and see what happens :)
Just let us know if you run into any issues or need anything more inside candle and we can certainly have a look!
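
Since batched evaluation is the main requirement mentioned above, here is a rough sketch in plain Rust of the speculative-sampling accept/reject loop, with the draft and target models abstracted as closures over next-token distributions. The names (`speculative_step`, `draft_probs`, `target_probs`, `sample`) are illustrative, not candle APIs, and a real implementation would score all draft positions with the target model in one batched forward pass and manage the KV caches:

```rust
// Assumed dependency: rand = "0.8"
use rand::Rng;

/// One round of speculative sampling: the draft model proposes `k` tokens,
/// the target model scores each position, and proposal `x` is accepted with
/// probability min(1, p_target(x) / p_draft(x)). On the first rejection a
/// replacement token is drawn from the normalized residual max(0, q - p).
fn speculative_step<R: Rng>(
    rng: &mut R,
    k: usize,
    draft_probs: impl Fn(&[u32]) -> Vec<f32>,
    target_probs: impl Fn(&[u32]) -> Vec<f32>,
    context: &mut Vec<u32>,
) -> Vec<u32> {
    // 1. Draft k tokens autoregressively with the small model.
    let mut proposals = Vec::with_capacity(k);
    let mut draft_dists = Vec::with_capacity(k);
    for _ in 0..k {
        let p = draft_probs(context);
        let tok = sample(rng, &p);
        context.push(tok);
        proposals.push(tok);
        draft_dists.push(p);
    }
    // Rewind; accepted tokens are re-appended below.
    context.truncate(context.len() - k);

    // 2. Verify the proposals with the target model. In a real implementation
    //    all k positions would be scored in a single batched forward pass.
    let mut out = Vec::new();
    for (i, &tok) in proposals.iter().enumerate() {
        let q = target_probs(context);
        let p = &draft_dists[i];
        let accept = (q[tok as usize] / p[tok as usize].max(1e-9)).min(1.0);
        if rng.gen::<f32>() < accept {
            out.push(tok);
            context.push(tok);
        } else {
            // 3. Rejected: sample from the corrected distribution max(0, q - p).
            let residual: Vec<f32> =
                q.iter().zip(p.iter()).map(|(&qi, &pi)| (qi - pi).max(0.0)).collect();
            let tok = sample(rng, &residual);
            out.push(tok);
            context.push(tok);
            return out;
        }
    }
    // 4. Every proposal accepted: take one bonus token from the target model.
    let q = target_probs(context);
    let tok = sample(rng, &q);
    out.push(tok);
    context.push(tok);
    out
}

/// Draw an index proportionally to the (unnormalized) weights.
fn sample<R: Rng>(rng: &mut R, weights: &[f32]) -> u32 {
    let total: f32 = weights.iter().sum();
    let mut x = rng.gen::<f32>() * total;
    for (i, &w) in weights.iter().enumerate() {
        x -= w;
        if x <= 0.0 {
            return i as u32;
        }
    }
    (weights.len() - 1) as u32
}
```

The speed-up comes from the target model verifying several draft tokens per forward pass, while the output distribution remains that of the target model.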

@LLukas22
Contributor

@okpatil4u The qmatmul implementation is currently far from optimal and could probably be improved with better thread management and allocation. Feel free to look into it.
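
To make the thread-management point concrete, here is a hypothetical sketch of a row-parallel matrix-vector product using rayon. It is not the actual candle qmatmul kernel (which operates on quantized blocks); it just illustrates splitting independent output rows across threads while writing into a preallocated output buffer:

```rust
// Assumed dependency: rayon = "1"
use rayon::prelude::*;

/// y = W * x, where W is `rows x cols` stored row-major.
/// Each output row is independent, so the work can be split across threads
/// by parallelizing over the output buffer.
fn matvec_parallel(w: &[f32], x: &[f32], y: &mut [f32], rows: usize, cols: usize) {
    assert_eq!(w.len(), rows * cols);
    assert_eq!(x.len(), cols);
    assert_eq!(y.len(), rows);

    y.par_iter_mut().enumerate().for_each(|(r, out)| {
        let row = &w[r * cols..(r + 1) * cols];
        // Plain dot product; a real quantized kernel would dequantize the row
        // block-by-block and use SIMD instead of scalar ops.
        *out = row.iter().zip(x.iter()).map(|(a, b)| a * b).sum();
    });
}

fn main() {
    let (rows, cols) = (4, 3);
    let w: Vec<f32> = (0..rows * cols).map(|i| i as f32).collect();
    let x = vec![1.0f32; cols];
    let mut y = vec![0.0f32; rows];
    matvec_parallel(&w, &x, &mut y, rows, cols);
    println!("{y:?}"); // [3.0, 12.0, 21.0, 30.0]
}
```

Parallelizing over whole rows keeps scheduling overhead low and avoids per-element allocation; reusing a preallocated output buffer is the kind of allocation improvement being discussed.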

@okpatil4u
Author

okpatil4u commented Sep 14, 2023 via email
