
Speculative Decoding? #4286

Closed
akumaburn opened this issue Dec 1, 2023 · 3 comments
Labels: enhancement (New feature or request), stale

Comments

akumaburn commented Dec 1, 2023

I am writing to propose the integration of speculative decoding into the llama.cpp project. Given the growing need for efficient and fast inference in large language models (LLMs), incorporating speculative decoding could significantly enhance the performance of llama.cpp in terms of speed and computational resource utilization.

Current State:
llama.cpp already offers features such as multiple integer quantization levels and GPU backends, with optimizations for both Apple silicon and x86 architectures. However, the inference process, especially for larger models, remains computationally demanding and time-consuming.

Proposal:
Implement speculative decoding in llama.cpp. This technique, which allows for the generation of multiple tokens from each transformer call, can greatly accelerate the decoding process. Given that llama.cpp is used for running the LLaMA model, this enhancement could make it more efficient in real-world applications where quick response times are crucial.
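
Roughly, the draft-and-verify loop looks like the sketch below (Python pseudocode, not tied to any llama.cpp API; `draft_model` and `target_model` are hypothetical stand-ins for a small draft model and the large target model, and greedy acceptance is used for brevity where the published algorithm uses rejection sampling to preserve the target model's output distribution):

```python
# Minimal sketch of speculative (draft-and-verify) decoding.
# Assumptions: `draft_model(ctx)` returns next-token logits for the small model,
# and `target_model(tokens, draft)` returns len(draft) + 1 logit vectors from a
# single forward pass -- one for each drafted position plus the position after it.

def argmax(logits):
    # index of the highest-scoring token
    return max(range(len(logits)), key=logits.__getitem__)

def speculative_decode(target_model, draft_model, tokens, n_draft=4, n_new=64):
    generated = 0
    while generated < n_new:
        # 1. Draft: the small model proposes n_draft tokens greedily.
        draft, ctx = [], list(tokens)
        for _ in range(n_draft):
            tok = argmax(draft_model(ctx))
            draft.append(tok)
            ctx.append(tok)

        # 2. Verify: one target-model call scores every drafted position at once;
        #    target_logits[i] is the target's distribution after tokens + draft[:i].
        target_logits = target_model(tokens, draft)

        # 3. Accept the longest prefix on which the target agrees with the draft.
        n_accept = 0
        for i, tok in enumerate(draft):
            if argmax(target_logits[i]) != tok:
                break
            n_accept += 1
        tokens.extend(draft[:n_accept])

        # The target's own prediction at the first disagreement (or one past the
        # draft) is always kept, so every iteration emits at least one token.
        tokens.append(argmax(target_logits[n_accept]))
        generated += n_accept + 1
    return tokens
```

When the draft model guesses well, one expensive target pass emits several tokens; when it guesses poorly, the only cost is the comparatively cheap wasted draft work.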

Benefits:

  • Speed: By enabling faster generation of multiple tokens, inference times could be significantly reduced.
  • Efficiency: Improved utilization of GPU hardware, especially beneficial for scenarios where batch sizes vary.
  • Broader Applicability: Makes llama.cpp more suitable for real-time applications or environments with limited computational resources.

Implementation Considerations:

  • Study the optimal speculation length based on the batch sizes commonly used with llama.cpp (a rough estimate of this trade-off is sketched after this list).
  • Ensure compatibility with existing features like integer quantization levels and GPU backend support.
  • Maintain the performance standards on various platforms, including Apple silicon and x86 architectures.
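
As a back-of-the-envelope for the speculation-length question above, assume (purely for illustration) that each drafted token is accepted independently with a fixed probability. The expected number of tokens emitted per target-model pass then shows diminishing returns as the draft length grows:

```python
# Rough estimate of tokens emitted per (expensive) target-model pass, assuming an
# independent per-token acceptance probability. Real acceptance rates depend on how
# well the draft model tracks the target model on the actual workload.

def expected_tokens_per_target_call(accept_rate: float, n_draft: int) -> float:
    if accept_rate >= 1.0:
        return n_draft + 1
    # Finite geometric series: 1 + a + a^2 + ... + a^n_draft.
    return (1 - accept_rate ** (n_draft + 1)) / (1 - accept_rate)

for n_draft in (2, 4, 8, 16):
    print(n_draft, round(expected_tokens_per_target_call(0.8, n_draft), 2))
# -> 2.44, 3.36, 4.33, 4.89 tokens per pass at an 80% acceptance rate
```

This only frames the trade-off; the useful speculation length in llama.cpp would still have to be measured against real models and batch sizes.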

I believe this feature would be a valuable addition to llama.cpp, enhancing its utility and performance. Thank you for considering this request.

References:
https://medium.com/@TitanML/in-the-fast-lane-speculative-decoding-10x-larger-model-no-extra-cost-f33ea39d065a

akumaburn added the enhancement (New feature or request) label on Dec 1, 2023
BarfingLemurs (Contributor) commented:

Yes, it's already supported:

#2926
#3624

Calandiel commented Dec 4, 2023

> Yes, it's already supported:
>
> #2926 #3624

Only in a single executable, AFAIK. What about, say, the server?

github-actions bot commented Apr 3, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed on Apr 3, 2024