
Speculative Decoding? #4286

Closed
akumaburn opened this issue Dec 1, 2023 · 3 comments
Labels: enhancement (New feature or request), stale

Comments

akumaburn commented Dec 1, 2023

I am writing to propose the integration of speculative decoding into the llama.cpp project. Given the growing need for efficient and fast inference in large language models (LLMs), incorporating speculative decoding could significantly enhance the performance of llama.cpp in terms of speed and computational resource utilization.

Current State:
llama.cpp already offers features such as multiple integer quantization levels and GPU backends, with optimizations for both Apple silicon and x86 architectures. However, the inference process, especially for larger models, remains computationally demanding and time-consuming.

Proposal:
Implement speculative decoding in llama.cpp. This technique, which allows for the generation of multiple tokens from each transformer call, can greatly accelerate the decoding process. Given that llama.cpp is used for running the LLaMA model, this enhancement could make it more efficient in real-world applications where quick response times are crucial.
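
Roughly, the draft-and-verify loop looks like the sketch below (Python pseudocode, not tied to any llama.cpp API; `draft_model` and `target_model` are hypothetical stand-ins for a small draft model and the large target model, and greedy acceptance is used for brevity where the published algorithm uses rejection sampling to preserve the target model's output distribution):

```python
# Minimal sketch of speculative (draft-and-verify) decoding.
# Assumptions: `draft_model(ctx)` returns next-token logits for the small model,
# and `target_model(tokens, draft)` returns len(draft) + 1 logit vectors from a
# single forward pass -- one for each drafted position plus the position after it.

def argmax(logits):
    # index of the highest-scoring token
    return max(range(len(logits)), key=logits.__getitem__)

def speculative_decode(target_model, draft_model, tokens, n_draft=4, n_new=64):
    generated = 0
    while generated < n_new:
        # 1. Draft: the small model proposes n_draft tokens greedily.
        draft, ctx = [], list(tokens)
        for _ in range(n_draft):
            tok = argmax(draft_model(ctx))
            draft.append(tok)
            ctx.append(tok)

        # 2. Verify: one target-model call scores every drafted position at once;
        #    target_logits[i] is the target's distribution after tokens + draft[:i].
        target_logits = target_model(tokens, draft)

        # 3. Accept the longest prefix on which the target agrees with the draft.
        n_accept = 0
        for i, tok in enumerate(draft):
            if argmax(target_logits[i]) != tok:
                break
            n_accept += 1
        tokens.extend(draft[:n_accept])

        # The target's own prediction at the first disagreement (or one past the
        # draft) is always kept, so every iteration emits at least one token.
        tokens.append(argmax(target_logits[n_accept]))
        generated += n_accept + 1
    return tokens
```

When the draft model guesses well, one expensive target pass emits several tokens; when it guesses poorly, the only cost is the comparatively cheap wasted draft work.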

Benefits:

  • Speed: By enabling faster generation of multiple tokens, inference times could be significantly reduced.
  • Efficiency: Improved utilization of GPU hardware, especially beneficial for scenarios where batch sizes vary.
  • Broader Applicability: Makes llama.cpp more suitable for real-time applications or environments with limited computational resources.

Implementation Considerations:

  • Study the optimal speculation length based on the batch sizes commonly used with llama.cpp (a rough estimate of this trade-off is sketched after this list).
  • Ensure compatibility with existing features like integer quantization levels and GPU backend support.
  • Maintain the performance standards on various platforms, including Apple silicon and x86 architectures.
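
As a back-of-the-envelope for the speculation-length question above, assume (purely for illustration) that each drafted token is accepted independently with a fixed probability. The expected number of tokens emitted per target-model pass then shows diminishing returns as the draft length grows:

```python
# Rough estimate of tokens emitted per (expensive) target-model pass, assuming an
# independent per-token acceptance probability. Real acceptance rates depend on how
# well the draft model tracks the target model on the actual workload.

def expected_tokens_per_target_call(accept_rate: float, n_draft: int) -> float:
    if accept_rate >= 1.0:
        return n_draft + 1
    # Finite geometric series: 1 + a + a^2 + ... + a^n_draft.
    return (1 - accept_rate ** (n_draft + 1)) / (1 - accept_rate)

for n_draft in (2, 4, 8, 16):
    print(n_draft, round(expected_tokens_per_target_call(0.8, n_draft), 2))
# -> 2.44, 3.36, 4.33, 4.89 tokens per pass at an 80% acceptance rate
```

This only frames the trade-off; the useful speculation length in llama.cpp would still have to be measured against real models and batch sizes.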

I believe this feature would be a valuable addition to llama.cpp, enhancing its utility and performance. Thank you for considering this request.

References:
https://medium.com/@TitanML/in-the-fast-lane-speculative-decoding-10x-larger-model-no-extra-cost-f33ea39d065a

akumaburn added the enhancement (New feature or request) label on Dec 1, 2023
BarfingLemurs (Contributor) commented:

Yes, it's already supported:

#2926
#3624

Calandiel commented Dec 4, 2023

> Yes, it's already supported:
>
> #2926 #3624

Only in a single executable, AFAIK. What about, say, the server?

github-actions bot commented Apr 3, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed on Apr 3, 2024