
Speculative sampling #675

Closed
Tracked by #487
andriyanthon opened this issue Sep 7, 2023 · 9 comments · Fixed by #1120
Labels
enhancement New feature or request

Comments

@andriyanthon

llama.cpp added a feature for speculative inference:
ggerganov/llama.cpp#2926
but when running llama_cpp.server, it says it does not recognize the new parameters.

There are two new parameters:

  1. -md (model_draft) - the path to the draft model.
  2. -draft (n_draft) - how many tokens to draft each time

Can this new feature please be supported?
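
For anyone unfamiliar with the technique: a small draft model cheaply proposes a short run of tokens, and the large target model then verifies them, keeping the longest agreeing prefix, so several tokens can be accepted per target pass. Below is a minimal greedy sketch of that loop, assuming two hypothetical callables draft_next and target_next that each return the next token id for a given context (an illustration of the idea only, not the llama.cpp implementation):

```python
# Minimal greedy sketch of speculative decoding. draft_next / target_next are
# hypothetical callables returning the next token id for a list of tokens;
# this illustrates the idea, it is not the llama.cpp implementation.

def speculative_decode(prompt, draft_next, target_next, n_draft=8, n_new=64):
    out = list(prompt)
    stop_at = len(prompt) + n_new
    while len(out) < stop_at:
        # 1. The cheap draft model proposes up to n_draft tokens.
        draft = []
        for _ in range(n_draft):
            draft.append(draft_next(out + draft))

        # 2. The target model checks the proposals, keeps the longest
        #    agreeing prefix, and emits its own token at the first mismatch.
        #    (A real implementation scores all drafted positions in a single
        #    batched target forward pass; that batching is the speedup.)
        accepted = []
        for tok in draft:
            expected = target_next(out + accepted)
            if expected != tok:
                accepted.append(expected)
                break
            accepted.append(tok)
        out.extend(accepted)
    return out[:stop_at]
```

The two requested flags map onto the two knobs above: -md picks which model plays the draft role, and --draft sets how many tokens it proposes per step.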

@abetlen abetlen changed the title Please add the speculative inference function to llama_cpp.server Speculative sampling Sep 8, 2023
@abetlen abetlen added the enhancement New feature or request label Sep 8, 2023
@abetlen
Owner

abetlen commented Sep 8, 2023

@andriyanthon good idea, I'll look into this. I think a similar API to Hugging Face's Assisted Generation would work well.
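
For reference, Transformers' Assisted Generation simply takes the draft model as an assistant_model argument to generate(); something along these lines, where the model names are only placeholders for a compatible target/draft pair and this is shown for comparison, not as llama-cpp-python's API:

```python
# Hugging Face Transformers assisted generation, for comparison only;
# the model names are placeholders, not a recommendation.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")
target = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b-hf")
draft = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v0.3")

inputs = tokenizer("// Quick-sort implementation in C\n#include", return_tensors="pt")
outputs = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```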

@Chainfire

+1. Would probably double performance in my setup.

@galatolofederico

+1. It would be very useful.

@abetlen abetlen pinned this issue Sep 14, 2023
@gssci

gssci commented Sep 30, 2023

Any updates on this?

@abetlen abetlen mentioned this issue Sep 29, 2023
@LynxPDA

LynxPDA commented Oct 2, 2023

+1

I also used the -ngld parameter, which sets how many layers of the draft model to offload to VRAM.

On hardware:

  • Ryzen 7950X
  • DDR5 5800
  • RTX 3060 12Gb

With phind-codellama-34b-v2.Q4_K_M.gguf as the target model, the speedup for me was:

  • Only CPU - 1.88x (3.3 --> 6.2 t/s)
  • GPU+CPU - 1.78x (5.4 --> 9.6 t/s)

An example of the full CLI in llama.cpp and the results for me are below:

./speculative -m ./models/phind-codellama-34b-v2.Q4_K_M.gguf -md ./models/codellama-7b-instruct.Q3_K_M.gguf  -p "// Quick-sort implementation in C (4 spaces indentation + detailed comments) and sample usage:\n\n#include" -e -t 16 -n 256 -c 2048 -s 8 --draft 15 -b 512 -ngld 35 -ngl 15

And the results:

encoded   25 tokens in    1.490 seconds, speed:   16.780 t/s
decoded  280 tokens in   29.131 seconds, speed:    9.612 t/s

n_draft   = 27
n_predict = 280
n_drafted = 334
n_accept  = 245
accept    = 73.353%

draft:

llama_print_timings: load time =  538.74 ms
llama_print_timings: sample time =   522.77 ms / 1 runs (  522.77 ms per token, 1.91 tokens per second)
llama_print_timings: prompt eval time =   171.93 ms / 25 tokens  ( 6.88 ms per token, 145.41 tokens per second)
llama_print_timings: eval time =  7165.94 ms / 360 runs ( 19.91 ms per token, 50.24 tokens per second)
llama_print_timings:  total time = 30621.07 ms

target:

llama_print_timings: load time =  1015.12 ms
llama_print_timings: sample time =    91.54 ms / 280 runs ( 0.33 ms per token 3058.81 tokens per second)
llama_print_timings: prompt eval time = 20972.04 ms / 386 tokens  ( 54.33 ms per token, 18.41 tokens per second)
llama_print_timings: eval time =  1603.66 ms / 7 runs  ( 229.09 ms per token 4.37 tokens per second)
llama_print_timings: total time = 31163.91 ms

@abetlen abetlen unpinned this issue Nov 6, 2023
@abetlen abetlen pinned this issue Nov 10, 2023
@oobabooga
Contributor

I have run some speculative decoding tests with the following models on my RTX 3090:

  • Target: wizardlm-70b-v1.0.Q4_K_S.gguf
  • Draft: tinyllama-1.1b-chat-v0.3.Q4_K_M.gguf

With speculative, I get 3.41 tokens/second, while without it I get 2.08 tokens/second. That's a +64% increase.

This is the command that I used:

./speculative \
  -m ../models/wizardlm-70b-v1.0.Q4_K_S.gguf \
  -md ../models/tinyllama-1.1b-chat-v0.3.Q4_K_M.gguf \
  -e \
  -t 6 \
  -tb 12 \
  -n 256 \
  -c 4096 \
  --draft 15 \
  -ngld 128 \
  -ngl 42 \
  -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\n\nUSER: Give me an example of Python script.\nASSISTANT:"

Having this feature available in llama-cpp-python would be amazing.

@rangehow

rangehow commented Jan 8, 2024

This feature looks so cool :) Looking forward to it!

@abetlen
Owner

abetlen commented Jan 23, 2024

#1120 is almost ready. I need to do some more testing and perf benchmarks, but it works now with prompt lookup decoding.
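
In case the term is new to anyone: prompt lookup decoding needs no second model at all; the draft tokens are copied from an earlier position in the context where the current trailing n-gram already appeared, and the target model then verifies them like any other speculative draft. A small self-contained sketch of the drafting step (illustrative only, not the code from #1120):

```python
# Sketch of the prompt lookup drafting step: guess the continuation by
# finding an earlier occurrence of the trailing n-gram in the context.
# Illustrative only; not the implementation in #1120.

def prompt_lookup_draft(tokens, ngram_size=2, num_pred_tokens=10):
    """Return up to num_pred_tokens draft tokens copied from the context."""
    if len(tokens) <= ngram_size:
        return []
    tail = tokens[-ngram_size:]
    # Search backwards for an earlier occurrence of the trailing n-gram,
    # excluding the trailing occurrence itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == tail:
            return tokens[start + ngram_size:start + ngram_size + num_pred_tokens]
    return []

# Example: the n-gram [3, 4] was seen earlier, so [5, 6, 7] is proposed.
print(prompt_lookup_draft([1, 2, 3, 4, 5, 6, 7, 9, 3, 4], ngram_size=2, num_pred_tokens=3))
```

The drafted tokens then go through the same verification step as model-based speculative decoding, so in principle other drafting strategies could be plugged into the same hook.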

@Andy1314Chen

> #1120 is almost ready. I need to do some more testing and perf benchmarks, but it works now with prompt lookup decoding.

This feature looks so cool! How can we make it support more speculative decoding methods, not just prompt lookup decoding?
