[RFC]: Classifier-Free Guidance #5825
Comments
Hi @Vermeille. Great work. For implementation in vLLM, this can be done at a similar layer to Speculative Decoding:
The primary benefit of this design is that you can manage two block tables in the existing LLMEngine and scheduler without any modification. This is done in speculative decoding (with a draft model) by splitting the KV-cache space evenly into two equally-sized regions [1]; the same block table then works for both models. You can prototype this relatively straightforwardly; the only major missing piece is that you will need one of the Workers to not load weights (e.g. weight loading is shared with the other worker). Alternatively, you can use a single worker and modify block tables with a constant offset so that you have independent KV caches (see the sketch below).

A secondary benefit of this design is hardware agnosticism: your implementation can work with non-NVIDIA, non-AMD hardware backends.

[1] vllm/vllm/spec_decode/spec_decode_worker.py, lines 255 to 274 in 70c232f
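To make the single-worker alternative concrete, here is a minimal sketch of the constant-offset idea. The helper names (`split_kv_blocks`, `shifted_block_table`) are hypothetical and not part of the vLLM API; they only illustrate how the positive and negative streams could get independent KV-cache regions out of one physical allocation.

```python
# Hypothetical sketch of the "single worker + constant offset" idea:
# the physical KV-cache blocks are split into two equal halves, and the
# negative-prompt stream reuses the positive stream's block table
# shifted by a constant offset, so the two streams never collide.

def split_kv_blocks(num_gpu_blocks: int) -> tuple[range, range]:
    """Split the physical KV blocks into two equal regions."""
    half = num_gpu_blocks // 2
    return range(0, half), range(half, num_gpu_blocks)

def shifted_block_table(block_table: list[int], offset: int) -> list[int]:
    """Map a block table allocated in the first region into the second one."""
    return [block_id + offset for block_id in block_table]

# Example: 8 physical blocks; the positive prompt owns blocks [0, 2, 3],
# so the negative prompt's KV cache lives at [4, 6, 7].
cond_region, uncond_region = split_kv_blocks(8)
cond_table = [0, 2, 3]
uncond_table = shifted_block_table(cond_table, offset=len(cond_region))
assert all(b in uncond_region for b in uncond_table)
```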
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!
Motivation.
I am one of the authors of the paper Stay On Topic with Classifier-Free Guidance ( https://openreview.net/forum?id=RiM3cl9MdK&noteId=s1BXLL1YZD ), which has been nominated as an ICML'24 Spotlight paper. CFG is a sampling technique that allows LLMs to follow the prompt more closely, at the cost of two forward passes per token and two KV caches. CFG brings non-trivial improvements across standard benchmarks.
I would be extremely interested in having CFG implemented in vLLM. If possible, I would like a bit of guidance on the vLLM code base.
Proposed Change.
CFG contrasts the next-token logits between two different prompts (a "positive prompt" a, and a "negative prompt" or "unconditional prompt" b).
Here is the pseudo-algorithm:
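A minimal sketch of the decoding loop, assuming a Hugging Face-style causal LM interface; the names `model`, `cond_ids`, `uncond_ids`, and `guidance_scale` are illustrative, and the paper's exact formulation should be preferred over this simplification:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def cfg_generate(model, cond_ids, uncond_ids, guidance_scale=1.5, max_new_tokens=64):
    """Greedy decoding with Classifier-Free Guidance.

    Two forward passes per step, each with its own KV cache:
    one on the conditional (positive) prompt, one on the
    unconditional (negative) prompt.
    """
    cond_cache, uncond_cache = None, None
    cond_inp, uncond_inp = cond_ids, uncond_ids
    generated = []

    for _ in range(max_new_tokens):
        # Forward pass on the positive prompt (+ tokens generated so far).
        cond_out = model(cond_inp, past_key_values=cond_cache, use_cache=True)
        cond_cache = cond_out.past_key_values
        # Forward pass on the negative / unconditional prompt.
        uncond_out = model(uncond_inp, past_key_values=uncond_cache, use_cache=True)
        uncond_cache = uncond_out.past_key_values

        cond_logp = F.log_softmax(cond_out.logits[:, -1, :], dim=-1)
        uncond_logp = F.log_softmax(uncond_out.logits[:, -1, :], dim=-1)

        # CFG: move away from the unconditional distribution, toward the
        # conditional one, by a factor of guidance_scale in log-prob space.
        cfg_logp = uncond_logp + guidance_scale * (cond_logp - uncond_logp)

        next_token = cfg_logp.argmax(dim=-1, keepdim=True)
        generated.append(next_token)

        # The chosen token is appended to *both* streams.
        cond_inp = next_token
        uncond_inp = next_token

    return torch.cat(generated, dim=-1)
```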
As you can see, this needs two concurrent KV caches for an efficient implementation. I tried looking at how Speculative Decoding is implemented, but it is quite complex, more than CFG needs.
Feedback Period.
No response
CC List.
No response
Any Other Things.
I am willing to implement it myself given enough guidance, as this looks like a non-trivial thing to implement. I think something similar to (or reusing bits of) Speculative Decoding could work, but that code is non-trivial.