EAGLE2: general part [2] #2129
Conversation
@@ -0,0 +1,36 @@
import sglang as sgl
- Move this to sglang/test/srt/test_eagle.py
- Make it a unit test with import unittest and assert the output matches the one w/o speculative decoding.
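A minimal sketch of the test layout the reviewer asks for, assuming the file lands at sglang/test/srt/test_eagle.py. The two generate functions below are stubs standing in for sgl.Engine runs with and without speculative_algorithm="EAGLE"; the real test would launch both engines and compare their generations.

```python
import unittest

# Stubs standing in for the two engine configurations; in the real test
# these would call sgl.Engine(...) with and without EAGLE enabled.
def generate_with_eagle(prompt: str) -> str:
    return prompt + " Paris."  # placeholder result

def generate_without_spec_decoding(prompt: str) -> str:
    return prompt + " Paris."  # placeholder result

class TestEAGLEEngine(unittest.TestCase):
    def test_output_matches_baseline(self):
        # Speculative decoding must not change the sampled output
        # (greedy decoding assumed, so outputs are deterministic).
        prompt = "The capital of France is"
        self.assertEqual(
            generate_with_eagle(prompt),
            generate_without_spec_decoding(prompt),
        )
```

Run with `python -m unittest` once the stubs are replaced by real engine calls.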
eagle_topk=4,
num_draft_tokens=16,
speculative_algorithm="EAGLE",
mem_fraction_static=0.70,
Can we avoid changing mem_fraction_static?
hidden_states: Optional[torch.Tensor] = None
# backup of next_token_logits when using cuda graph
# id(next_token_logits_bak) == id(next_token_logits)
next_token_logits_bak: Optional[torch.Tensor] = None
Why is this needed? Can you do a backup in the place that calls this function?
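A sketch of what the reviewer suggests: instead of storing next_token_logits_bak on the output object, the caller backs up the buffer right before a CUDA graph replay can overwrite it. Plain lists stand in for torch tensors here, and all names are illustrative, not sglang's real API.

```python
def replay_cuda_graph(logits_output):
    # Stand-in for a CUDA graph replay that reuses (overwrites) the
    # next_token_logits buffer in place.
    buf = logits_output["next_token_logits"]
    for i in range(len(buf)):
        buf[i] = 0.0

def sample_with_backup(logits_output):
    # Back up at the call site instead of inside the output object.
    backup = list(logits_output["next_token_logits"])
    replay_cuda_graph(logits_output)  # may clobber the original buffer
    return backup

out = {"next_token_logits": [0.1, 0.9]}
saved = sample_with_backup(out)
```

The backup then lives only in the scope that needs it, so the output dataclass stays free of CUDA-graph bookkeeping.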
encoder_lens: torch.Tensor = None,
spec_info: "SpecInput" = None,
is_draft_runner: bool = False,
forward_batch: ForwardBatch = None,
We do not want to pass in big objects such as spec_info and forward_batch. They contain too many fields, making it hard to reason about which tensors are needed for the cuda graph. CUDA graphs are error-prone, so we want to make the dependencies very clear.
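A hypothetical sketch of the signature style the review asks for: the capture helper lists every tensor it depends on explicitly instead of accepting container objects. All parameter names here are illustrative, not sglang's actual API.

```python
import inspect

# Error-prone: the real dependency set is hidden inside big objects.
def capture_decode_graph_opaque(spec_info, forward_batch):
    ...

# Preferred: every buffer the replayed graph reads or writes is visible
# in the signature, so stale-tensor bugs are easy to spot in review.
def capture_decode_graph_explicit(
    input_ids,                  # torch.Tensor: token ids for the step
    positions,                  # torch.Tensor: position ids
    encoder_lens=None,          # Optional[torch.Tensor]
    num_draft_tokens: int = 0,  # scalar the captured graph shape depends on
):
    # Capture logic would go here; only the listed tensors may be
    # referenced, so the graph's inputs are fully enumerated above.
    return (input_ids, positions, encoder_lens, num_draft_tokens)
```

The explicit form also makes it mechanical to check that every tensor captured into the graph is a persistent buffer.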
# should not be set by cli; it is only a placeholder
# which is set and used in model_runner
draft_runner_cache_size: int = None
Do not add it here if it will be set later. Use global_server_args_dict instead.
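A sketch of the suggested pattern: a value computed at runtime rather than parsed from the CLI is published via the shared global_server_args_dict (which sglang already has) instead of becoming a ServerArgs field. The dict below and the sizing formula are illustrative stand-ins, not the real update site.

```python
# Stand-in for sglang's shared runtime-config dict.
global_server_args_dict = {}

def init_draft_runner(total_cache_tokens: int, num_draft_tokens: int) -> None:
    # model_runner computes the draft cache size and publishes it for
    # other components to read, keeping ServerArgs CLI-only.
    # The formula here is a hypothetical example.
    global_server_args_dict["draft_runner_cache_size"] = (
        total_cache_tokens // (num_draft_tokens + 1)
    )

init_draft_runner(1024, 15)
```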
@@ -555,15 +571,6 @@ def handle_generate_request(
    req.origin_input_ids_unpadded, req.image_inputs
)

if len(req.origin_input_ids) > self.max_req_input_len:
do not delete this.
server_args=server_args,
nccl_port=port_args.nccl_port,
target_worker=self.tp_worker,
dp_rank=dp_rank,
You can pass draft_runner_cache_size here instead of using server_args.
)
if self.server_args.speculative_algorithm.is_not_none():
    logits_output, next_token_ids, model_worker_batch = (
        self.draft_worker.forward_batch_speculative_generate(batch)
Can you let forward_batch_speculative_generate take model_worker_batch as input? For anything you need, please copy it to model_worker_batch. We need this style for the overlap scheduler.
next_token_ids = None
else:
    next_token_ids = self.model_runner.sample(logits_output, model_worker_batch)
model_worker_batch.spec_info = forward_batch.spec_info
This assignment should not be here. The information flow should be ScheduleBatch -> ModelWorkerBatch -> ForwardBatch. We will need this style to make everything work in overlap mode.
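The one-way flow the reviewer describes can be sketched as below. The fields are hypothetical stand-ins for the real sglang dataclasses; the point is that spec_info is copied forward at each conversion, so no stage ever reaches back into an earlier batch object.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class ForwardBatch:
    input_ids: list
    spec_info: Optional[Any] = None

@dataclass
class ModelWorkerBatch:
    input_ids: list
    spec_info: Optional[Any] = None

    def to_forward_batch(self) -> ForwardBatch:
        # Forward-only conversion: ForwardBatch is built solely from
        # ModelWorkerBatch, never assigned to afterwards.
        return ForwardBatch(input_ids=self.input_ids, spec_info=self.spec_info)

@dataclass
class ScheduleBatch:
    reqs: list
    spec_info: Optional[Any] = None

    def get_model_worker_batch(self) -> ModelWorkerBatch:
        # Copy everything the worker needs, including spec_info, so the
        # worker thread never touches ScheduleBatch (required for the
        # overlap scheduler).
        return ModelWorkerBatch(input_ids=list(self.reqs),
                                spec_info=self.spec_info)

sb = ScheduleBatch(reqs=[1, 2], spec_info="eagle_draft")
fb = sb.get_model_worker_batch().to_forward_batch()
```

Because each conversion copies state forward, the scheduler thread and the worker thread can run on different batch objects concurrently.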
# Speculative Verify stage
SPEC_VERIFY = auto()
# Speculative draft Extend stage, which runs after the verify stage
SPEC_EXTEND = auto()
To make it clear whether the draft or the target model runs each stage, rename:
SPEC_VERIFY -> TARGET_VERIFY
SPEC_EXTEND -> DRAFT_EXTEND
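The renaming could look like the sketch below (the surrounding enum members are assumed from context, not copied from the PR):

```python
from enum import Enum, auto

class ForwardMode(Enum):
    EXTEND = auto()
    DECODE = auto()
    # Target model verifies the draft token tree.
    TARGET_VERIFY = auto()
    # Draft model extends after the verify stage.
    DRAFT_EXTEND = auto()
```

With the model prefix in the name, call sites immediately show which of the two models a stage belongs to.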
This PR is a part of #1498. The original PR was split into smaller PRs to facilitate review. This PR should be merged after #2128.