[RFC]: Classifier-Free Guidance #5825
Comments
Hi @Vermeille. Great work. For implementation in vLLM, this can be done at a similar layer to Speculative Decoding:
The primary benefit of this design is that you can manage two block tables in the existing LLMEngine and scheduler without any modification. This is done in speculative decoding (with a draft model) by splitting the KV-cache space evenly into two equally-sized regions [1]; the same block table then works for both models. You can prototype this relatively straightforwardly; the only major missing piece is that you will need one of the Workers to not load weights (e.g. weight loading is shared with the other worker). Alternatively, you can use a single worker and modify block tables with a constant offset so that you have independent KV caches (see the sketch below).

A secondary benefit of this design is hardware agnosticism: your implementation can work with non-NVIDIA, non-AMD hardware backends.

[1] vllm/vllm/spec_decode/spec_decode_worker.py, lines 255 to 274 in 70c232f
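To make the single-worker alternative concrete, here is a minimal sketch of the constant-offset idea. The helper names (`split_kv_blocks`, `shifted_block_table`) are hypothetical and not part of the vLLM API; they only illustrate how the positive and negative streams could get independent KV-cache regions out of one physical allocation.

```python
# Hypothetical sketch of the "single worker + constant offset" idea:
# the physical KV-cache blocks are split into two equal halves, and the
# negative-prompt stream reuses the positive stream's block table
# shifted by a constant offset, so the two streams never collide.

def split_kv_blocks(num_gpu_blocks: int) -> tuple[range, range]:
    """Split the physical KV blocks into two equal regions."""
    half = num_gpu_blocks // 2
    return range(0, half), range(half, num_gpu_blocks)

def shifted_block_table(block_table: list[int], offset: int) -> list[int]:
    """Map a block table allocated in the first region into the second one."""
    return [block_id + offset for block_id in block_table]

# Example: 8 physical blocks; the positive prompt owns blocks [0, 2, 3],
# so the negative prompt's KV cache lives at [4, 6, 7].
cond_region, uncond_region = split_kv_blocks(8)
cond_table = [0, 2, 3]
uncond_table = shifted_block_table(cond_table, offset=len(cond_region))
assert all(b in uncond_region for b in uncond_table)
```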
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!
Motivation.
I am one of the authors of the paper Stay On Topic with Classifier-Free Guidance ( https://openreview.net/forum?id=RiM3cl9MdK&noteId=s1BXLL1YZD ), which has been nominated as an ICML'24 Spotlight paper. CFG is a sampling technique that allows LLMs to follow the prompt more closely, at the cost of two forward passes per token and two KV caches. CFG brings non-trivial improvements across standard benchmarks.
I would be extremely interested in having CFG implemented in vLLM. If possible, I would like a bit of guidance on the vLLM code base.
Proposed Change.
CFG contrasts the next-token logits between two different prompts (a "positive prompt" a, and a "negative prompt" or "unconditional prompt" b).
Here is the pseudo-algorithm:
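A minimal sketch of the decoding loop, assuming a Hugging Face-style causal LM interface; the names `model`, `cond_ids`, `uncond_ids`, and `guidance_scale` are illustrative, and the paper's exact formulation should be preferred over this simplification:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def cfg_generate(model, cond_ids, uncond_ids, guidance_scale=1.5, max_new_tokens=64):
    """Greedy decoding with Classifier-Free Guidance.

    Two forward passes per step, each with its own KV cache:
    one on the conditional (positive) prompt, one on the
    unconditional (negative) prompt.
    """
    cond_cache, uncond_cache = None, None
    cond_inp, uncond_inp = cond_ids, uncond_ids
    generated = []

    for _ in range(max_new_tokens):
        # Forward pass on the positive prompt (+ tokens generated so far).
        cond_out = model(cond_inp, past_key_values=cond_cache, use_cache=True)
        cond_cache = cond_out.past_key_values
        # Forward pass on the negative / unconditional prompt.
        uncond_out = model(uncond_inp, past_key_values=uncond_cache, use_cache=True)
        uncond_cache = uncond_out.past_key_values

        cond_logp = F.log_softmax(cond_out.logits[:, -1, :], dim=-1)
        uncond_logp = F.log_softmax(uncond_out.logits[:, -1, :], dim=-1)

        # CFG: move away from the unconditional distribution, toward the
        # conditional one, by a factor of guidance_scale in log-prob space.
        cfg_logp = uncond_logp + guidance_scale * (cond_logp - uncond_logp)

        next_token = cfg_logp.argmax(dim=-1, keepdim=True)
        generated.append(next_token)

        # The chosen token is appended to *both* streams.
        cond_inp = next_token
        uncond_inp = next_token

    return torch.cat(generated, dim=-1)
```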
As you can see, this needs two concurrent KV caches for an efficient implementation. I tried looking at how Speculative Decoding is implemented, but it is quite complex, more than CFG needs.
Feedback Period.
No response
CC List.
No response
Any Other Things.
I am willing to implement it myself given enough guidance, as this looks like a non-trivial thing to implement. I think something similar to (or reusing bits of) Speculative Decoding could work, but that code is non-trivial.