-
-
Notifications
You must be signed in to change notification settings - Fork 5.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[V1] Implement vLLM V1 [1/N] #9289
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
🚀 |
👀 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@WoosukKwon thanks for the hard work on this, looks like you made good progress. Left some comments/clarifications.
vllm/entrypoints/llm.py
Outdated
# FIXME: | ||
engine_args.max_num_seqs = max(engine_args.max_num_seqs, 2048) | ||
engine_args.enable_chunked_prefill = False | ||
self.llm_engine = LLMEngineV1.from_engine_args( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this where we want to switch between vllm_v1 and the old vllm?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider adding an if
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I introduced a new env variable VLLM_USE_V1
, which is 0 by default. By setting this env variable, users can use the V1 code path.
vllm_v1/worker/gpu_model_runner.py
Outdated
|
||
# Calculate the slot mapping. | ||
block_numbers = self.persistent_batch.block_table_cpu_tensor.flatten()[ | ||
token_indices // self.block_size] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How having M inside token indices (to separate requests) affects the block_numbers we get here? Isn't this results in a "jump"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done most of the reviews. Has no brain left for GPUModelRunner
. Will look into that more tomorrow.
vllm/entrypoints/llm.py
Outdated
# FIXME: | ||
engine_args.max_num_seqs = max(engine_args.max_num_seqs, 2048) | ||
engine_args.enable_chunked_prefill = False | ||
self.llm_engine = LLMEngineV1.from_engine_args( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider adding an if
here?
vllm_v1/worker/gpu_worker.py
Outdated
def _get_cache_block_size( | ||
cache_config: CacheConfig, | ||
model_config: ModelConfig, | ||
parallel_config: ParallelConfig, | ||
) -> int: | ||
head_size = model_config.get_head_size() | ||
num_heads = model_config.get_num_kv_heads(parallel_config) | ||
num_attention_layers = model_config.get_num_attention_layers( | ||
parallel_config) | ||
|
||
key_cache_block = cache_config.block_size * num_heads * head_size | ||
value_cache_block = key_cache_block | ||
total = num_attention_layers * (key_cache_block + value_cache_block) | ||
if cache_config.cache_dtype == "auto": | ||
dtype = model_config.dtype | ||
else: | ||
dtype = STR_DTYPE_TO_TORCH_DTYPE[cache_config.cache_dtype] | ||
dtype_size = get_dtype_size(dtype) | ||
return dtype_size * total |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably not in this PR/re-arch, but eventually should we move this to the model code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm yes? I actually didn't care much because the code is small and didn't bring any complexity.
vllm_v1/worker/gpu_model_runner.py
Outdated
self.top_p = torch.empty((max_num_reqs, ), | ||
dtype=torch.float32, | ||
device=device) | ||
self.top_p_cpu_tensor = torch.empty((max_num_reqs, ), | ||
dtype=torch.float32, | ||
device="cpu", | ||
pin_memory=pin_memory) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have an abstraction for logic around self.x
, self.x_cpu_tensor
, self.x_cpu
, self.x_reqs
for different x
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please elaborate more?
vllm_v1/worker/gpu_model_runner.py
Outdated
scheduler_output: "SchedulerOutput", | ||
) -> ModelRunnerOutput: | ||
self._update_states(scheduler_output) | ||
inputs = self._prepare_inputs(scheduler_output) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need scheduler_output
to prepare the inputs if we cache all the request states in the model runner?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The scheduler output contains 1) the scheduling decision (req id -> num_tokens), and 2) all the data for new requests, and 3) new block ids for in-flight requests.
vllm_v1/worker/gpu_model_runner.py
Outdated
# NOTE: CPU-GPU synchronization happens here. | ||
sampled_token_ids = sampler_output.sampled_token_ids.cpu() | ||
sampled_token_ids_list = sampled_token_ids.tolist() | ||
# TODO: Optimize. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you be a bit more specific on what to optimize?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added more comments.
vllm/entrypoints/llm.py
Outdated
from vllm_v1.engine.llm_engine import LLMEngine as LLMEngineV1 | ||
from vllm_v1.outputs import RequestOutput as RequestOutputV1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the interface is compatible, would the following be easier?
if USE_V1:
from vllm_v1.engine.llm_engine import LLMEngine
from vllm_v1.outputs import RequestOutput
else:
from vllm.engine.llm_engine import LLMEngine
from vllm.outputs import RequestOutput
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I introduced the VLLM_USE_V1
env variable and added a similar if statement. PTAL.
vllm_v1/core/kv_cache_manager.py
Outdated
def get_computed_blocks(self, request: Request) -> List[int]: | ||
if not self.enable_caching: | ||
# No prefix caching. | ||
return [] | ||
# TODO(woosuk): Implement hash-based caching. | ||
return [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing to think about before implementing hash-based caching: Where to calculate the hash?
In block manager v1, the hash was calculated in the sequence (aka Request); while in block manager v2, the hash is calculated in the block manager. Calculating hash in Request makes sure the hash will be calculated only once during the Request life cycle, but calculating hash in block manager makes more sense because the hash should attach to cache blocks instead of sequences.
cc @rickyyx who is working on prefix-caching aware scheduler in v0.
vllm_v1/core/kv_cache_manager.py
Outdated
num_blocks = cdiv(request.num_computed_tokens + num_tokens, | ||
self.block_size) | ||
req_block_ids = self.req_to_block_ids[request.request_id] | ||
num_new_blocks = num_blocks - len(req_block_ids) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be incrementally calculated? The only missing information here to determine how many new blocks we need is the number of empty slots of the last block, and block manager should have this information, so maybe we could do something like the following
req_block_ids = self.req_to_block_ids[request.request_id]
empty_slots = req_block_ids[-1].empty_slots
if num_tokens <= empty_slots:
# No new block is needed.
return []
num_new_blocks = (num_tokens - empty_slots) // self.block_size
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, could you explain why do you prefer that over this implementation? I personally found the current implementation more concise and intuitive (if I didn't miss anything).
vllm_v1/worker/gpu_model_runner.py
Outdated
self.max_num_tokens = scheduler_config.max_num_batched_tokens | ||
|
||
# Lazy initialization | ||
self.model: nn.Module # Set after load_model |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question. Is it the new feature supported by later Python version? I got an error with Python 3.9:
>>> class A:
... def __init__(self):
... self.data: str
...
>>> a = A()
>>> a.data
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'A' object has no attribute 'data'
vllm_v1/worker/gpu_model_runner.py
Outdated
if removed_req_indices: | ||
self.persistent_batch.condense(removed_req_indices) | ||
|
||
def _prepare_inputs(self, scheduler_output: "SchedulerOutput"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is unfortunately already complicate. I can't imagine how complicate it will be once other features (e.g., LoRA, multi-modal) are added...it seems also hard to extend these features after this function such as
model_inputs = self._prepare_inputs(...)
model_inputs = slef._prepare_lora_inputs(model_inputs)
@zhuohan123 Can you please take another look? |
Signed-off-by: charlifu <charlifu@amd.com>
Signed-off-by: Vinay Damodaran <vrdn@hey.com>
Signed-off-by: Alvant <alvasian@yandex.ru>
Signed-off-by: Erkin Sagiroglu <erkin@infra-aipipeline-1-at1-prox-prod-a.ipa.corp.telnyx.com>
Signed-off-by: Amit Garg <mitgarg17495@gmail.com>
# OPTIMIZATION: Cache the request output and update it incrementally. | ||
# This is used to avoid creating a new RequestOutput object every step. | ||
# Request id -> RequestOutput | ||
self.request_outputs: Dict[str, RequestOutput] = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One (very late note) --> this caching may cause a bug. With AsyncLLMEngine, we will put these RequestOutput
objects into the per request output queues which the OpenAI server then uses to make the objects sent back to the client. If the LLMEngine gets ahead of the AsyncLLMEngine, we will mutate the object before the OpenAI server has a chance to make its output.
Signed-off-by: qishuai <ferdinandzhong@gmail.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: Sumit Dubey <sumit.dubey2@ibm.com>
Signed-off-by: Maxime Fournioux <55544262+mfournioux@users.noreply.github.com>
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
No description provided.