
Speculative sampling #675

Closed
Tracked by #487
andriyanthon opened this issue Sep 7, 2023 · 9 comments · Fixed by #1120
Labels
enhancement New feature or request

Comments

@andriyanthon

llama.cpp added a feature for speculative inference:
ggerganov/llama.cpp#2926
but when running llama_cpp.server, it says it does not recognize the new parameters.

There are two new parameters:

  1. -md (model_draft) - the path to the draft model.
  2. -draft (n_draft) - how many tokens to draft each time

Can this new feature please be supported?
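
For anyone unfamiliar with the technique: a small draft model cheaply proposes a short run of tokens, and the large target model then verifies them, keeping the longest agreeing prefix, so several tokens can be accepted per target pass. Below is a minimal greedy sketch of that loop, assuming two hypothetical callables draft_next and target_next that each return the next token id for a given context (an illustration of the idea only, not the llama.cpp implementation):

```python
# Minimal greedy sketch of speculative decoding. draft_next / target_next are
# hypothetical callables returning the next token id for a list of tokens;
# this illustrates the idea, it is not the llama.cpp implementation.

def speculative_decode(prompt, draft_next, target_next, n_draft=8, n_new=64):
    out = list(prompt)
    stop_at = len(prompt) + n_new
    while len(out) < stop_at:
        # 1. The cheap draft model proposes up to n_draft tokens.
        draft = []
        for _ in range(n_draft):
            draft.append(draft_next(out + draft))

        # 2. The target model checks the proposals, keeps the longest
        #    agreeing prefix, and emits its own token at the first mismatch.
        #    (A real implementation scores all drafted positions in a single
        #    batched target forward pass; that batching is the speedup.)
        accepted = []
        for tok in draft:
            expected = target_next(out + accepted)
            if expected != tok:
                accepted.append(expected)
                break
            accepted.append(tok)
        out.extend(accepted)
    return out[:stop_at]
```

The two requested flags map onto the two knobs above: -md picks which model plays the draft role, and --draft sets how many tokens it proposes per step.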

@abetlen abetlen changed the title Please add the speculative inference function to llama_cpp.server Speculative sampling Sep 8, 2023
@abetlen abetlen added the enhancement New feature or request label Sep 8, 2023
@abetlen
Owner

abetlen commented Sep 8, 2023

@andriyanthon good idea, I'll look into this. I think a similar API to Hugging Face's Assisted Generation would work well.
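
For reference, Transformers' Assisted Generation simply takes the draft model as an assistant_model argument to generate(); something along these lines, where the model names are only placeholders for a compatible target/draft pair and this is shown for comparison, not as llama-cpp-python's API:

```python
# Hugging Face Transformers assisted generation, for comparison only;
# the model names are placeholders, not a recommendation.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")
target = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b-hf")
draft = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v0.3")

inputs = tokenizer("// Quick-sort implementation in C\n#include", return_tensors="pt")
outputs = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```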

@Chainfire

+1. Would probably double performance in my setup.

@galatolofederico

+1. It would be very useful.

@abetlen abetlen pinned this issue Sep 14, 2023
@gssci

gssci commented Sep 30, 2023

Any updates on this?

@abetlen abetlen mentioned this issue Sep 29, 2023
@LynxPDA

LynxPDA commented Oct 2, 2023

+1

I also used the -ngld parameter, which sets how many layers of the draft model to offload to VRAM.

On hardware:

  • Ryzen 7950X
  • DDR5 5800
  • RTX 3060 12Gb

With phind-codellama-34b-v2.Q4_K_M.gguf as the target model, the speedup for me was:

  • Only CPU - 1.88x (3.3 --> 6.2 t/s)
  • GPU+CPU - 1.78x (5.4 --> 9.6 t/s)

An example of the full CLI in llama.cpp and the results for me are below:

./speculative -m ./models/phind-codellama-34b-v2.Q4_K_M.gguf -md ./models/codellama-7b-instruct.Q3_K_M.gguf  -p "// Quick-sort implementation in C (4 spaces indentation + detailed comments) and sample usage:\n\n#include" -e -t 16 -n 256 -c 2048 -s 8 --draft 15 -b 512 -ngld 35 -ngl 15

And the results:

encoded   25 tokens in    1.490 seconds, speed:   16.780 t/s
decoded  280 tokens in   29.131 seconds, speed:    9.612 t/s

n_draft   = 27
n_predict = 280
n_drafted = 334
n_accept  = 245
accept    = 73.353%

draft:

llama_print_timings: load time =  538.74 ms
llama_print_timings: sample time =   522.77 ms / 1 runs (  522.77 ms per token, 1.91 tokens per second)
llama_print_timings: prompt eval time =   171.93 ms / 25 tokens  ( 6.88 ms per token, 145.41 tokens per second)
llama_print_timings: eval time =  7165.94 ms / 360 runs ( 19.91 ms per token, 50.24 tokens per second)
llama_print_timings:  total time = 30621.07 ms

target:

llama_print_timings: load time =  1015.12 ms
llama_print_timings: sample time =    91.54 ms / 280 runs ( 0.33 ms per token 3058.81 tokens per second)
llama_print_timings: prompt eval time = 20972.04 ms / 386 tokens  ( 54.33 ms per token, 18.41 tokens per second)
llama_print_timings: eval time =  1603.66 ms / 7 runs  ( 229.09 ms per token 4.37 tokens per second)
llama_print_timings: total time = 31163.91 ms

@abetlen abetlen unpinned this issue Nov 6, 2023
@abetlen abetlen pinned this issue Nov 10, 2023
@oobabooga
Contributor

I have run some speculative decoding tests with the following models on my RTX 3090:

  • Target: wizardlm-70b-v1.0.Q4_K_S.gguf
  • Draft: tinyllama-1.1b-chat-v0.3.Q4_K_M.gguf

With speculative, I get 3.41 tokens/second, while without it I get 2.08 tokens/second. That's a +64% increase.

This is the command that I used:

./speculative \
  -m ../models/wizardlm-70b-v1.0.Q4_K_S.gguf \
  -md ../models/tinyllama-1.1b-chat-v0.3.Q4_K_M.gguf \
  -e \
  -t 6 \
  -tb 12 \
  -n 256 \
  -c 4096 \
  --draft 15 \
  -ngld 128 \
  -ngl 42 \
  -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\n\nUSER: Give me an example of Python script.\nASSISTANT:"

Having this feature available in llama-cpp-python would be amazing.

@rangehow

rangehow commented Jan 8, 2024

This feature looks so cool :) Looking forward to it!

@abetlen
Owner

abetlen commented Jan 23, 2024

#1120 is almost ready. I need to do some more testing and perf benchmarks, but it works now with prompt lookup decoding.
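
In case the term is new to anyone: prompt lookup decoding needs no second model at all; the draft tokens are copied from an earlier position in the context where the current trailing n-gram already appeared, and the target model then verifies them like any other speculative draft. A small self-contained sketch of the drafting step (illustrative only, not the code from #1120):

```python
# Sketch of the prompt lookup drafting step: guess the continuation by
# finding an earlier occurrence of the trailing n-gram in the context.
# Illustrative only; not the implementation in #1120.

def prompt_lookup_draft(tokens, ngram_size=2, num_pred_tokens=10):
    """Return up to num_pred_tokens draft tokens copied from the context."""
    if len(tokens) <= ngram_size:
        return []
    tail = tokens[-ngram_size:]
    # Search backwards for an earlier occurrence of the trailing n-gram,
    # excluding the trailing occurrence itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == tail:
            return tokens[start + ngram_size:start + ngram_size + num_pred_tokens]
    return []

# Example: the n-gram [3, 4] was seen earlier, so [5, 6, 7] is proposed.
print(prompt_lookup_draft([1, 2, 3, 4, 5, 6, 7, 9, 3, 4], ngram_size=2, num_pred_tokens=3))
```

The drafted tokens then go through the same verification step as model-based speculative decoding, so in principle other drafting strategies could be plugged into the same hook.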

@Andy1314Chen

> #1120 is almost ready. I need to do some more testing and perf benchmarks, but it works now with prompt lookup decoding.

This feature looks so cool! How can we make it support more speculative decoding methods, not just prompt lookup decoding?
