Speculative decoding with EAGLE2 #1498
Conversation
Hello, does this code support speculative decoding for multiple requests?
Hi @yukavio Is there any recent progress or plan for this? Do you plan to support deepseek-v2?
Yes, I will support it.
I have implemented the draft and verify stages and tested them on a single request. I am trying to migrate my code to the main branch because the main branch has some significant changes to the controller and worker that are very important for my implementation. My plan:
Thanks, yukavio. I have some suggestions for this PR: 1. To further support more models, I think we should pop the eagle head from draft_extend_input_queue so we don't modify the original llama model file. 2. I don't understand why we need so many SpecInfoPipline queues; spec only happens in the decoding stage, so at the least we shouldn't need draft_extend_input_queue.
I also have another question: in this PR, model_runner.py initializes the KV cache twice in different TP workers, which results in GPU OOM. Could we merge the draft and target KV caches to increase GPU utilization?
I have migrated the code to another branch: https://github.com/yukavio/sglang/tree/new_spec_infer and I will update this PR with it later. In the new implementation, I chose to run the draft worker and the target model worker in one process instead of using many queues in SpecInfoPipline to communicate between the draft and target processes. For memory management, I've fixed this bug in the new branch to ensure it won't raise an error during testing, but it may not be very efficient, and I will improve it after I have finished the remaining work in the plan.
@yukavio Recently, SGLang has undergone some refactoring work. You need to merge the latest main to resolve the corresponding conflicts. Thanks!
@yukavio Hi, when is this PR expected to be merged? I've trained a draft model and am eager to try it out.
OK, I am fixing some bugs in batch inference now. I will update the code to the main branch after fixing them. Personally, I think the updated code can be used as the first version. The community could review this version of the implementation.
If all goes well, I will finish the first version of development this week. When it gets merged into the main branch depends on community review and opinions.
@yukavio Is CLI startup not supported currently? I encountered this error:
File "/opt/tiger/sglang/python/sglang/srt/server_args.py", line 613, in <dictcomp>
    return cls(**{attr: getattr(args, attr) for attr in attrs})
AttributeError: 'Namespace' object has no attribute 'draft_runner_cache_size'
Add a memo: cutex should be added to the dependency list of SGLang after review.
@yukavio Can you resolve the conflicts? |
Fixed. Could you please help me trigger the CI?
CI has failed due to a timeout.
I did an initial code style review. I will follow up with a more careful logic review. We want to merge this soon.
Some guidelines:
- Split the big PR into smaller PRs. Merge small unreachable code first. (e.g., introduce two forward modes).
- Minimize the code changes to the common scheduler and move most things into self-contained EAGLE-specific files.
Sorry for the frequent changes on the main branch. We did a lot of refactoring to enable the overlap scheduler by default. That refactor has been finished (7d671e4) so we do not expect big changes after that. This PR is now our top priority.
kernels = cutex.SourceModule(
    """
    //cuda
    __global__ void build_tree(Tensor<long, 2> parent_list, Tensor<long, 2> selected_index, Tensor<int, 1> verified_seq_len,
Is it possible to do this with Triton?
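For reference, a minimal sketch of how a tree-mask kernel could be expressed in Triton instead of cutex/CUDA. This is a simplified illustration under assumptions, not the PR's build_tree kernel: it assumes a parent_list of shape [bs, topk] with -1 marking the root and builds a dense [bs, topk, topk] visibility mask, while the real kernel also consumes selected_index and verified_seq_len.

import torch
import triton
import triton.language as tl

@triton.jit
def build_tree_mask_kernel(parent_ptr, mask_ptr, topk: tl.constexpr):
    # One program per (batch, draft token): walk the parent chain and mark ancestors as visible.
    b = tl.program_id(0)
    i = tl.program_id(1)
    row = (b * topk + i) * topk
    tl.store(mask_ptr + row + i, 1)  # every token attends to itself
    cur = i
    for _ in range(topk):
        parent = tl.load(parent_ptr + b * topk + cur)
        keep = parent >= 0
        # once the root (-1) is reached, this just rewrites the diagonal entry, a no-op
        tl.store(mask_ptr + row + tl.where(keep, parent, i), 1)
        cur = tl.where(keep, parent, cur)

def build_tree_mask(parent_list: torch.Tensor) -> torch.Tensor:
    bs, topk = parent_list.shape
    mask = torch.zeros((bs, topk, topk), dtype=torch.int32, device=parent_list.device)
    build_tree_mask_kernel[(bs, topk)](parent_list.int(), mask, topk=topk)
    return mask.bool()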
positions = forward_batch.positions
if positions is None:
    positions = clamp_position(forward_batch.seq_lens)
self.positions[:raw_num_token].copy_(positions)
@yukavio Do we need to assign spec_info.custom_mask to self.cuda_graph_custom_mask before replay? It looks like self.cuda_graph_custom_mask is not used during cuda graph replay().
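For context, a minimal sketch of what this comment seems to suggest, assuming the captured graph reads its mask from the persistent self.cuda_graph_custom_mask buffer (the replay call and surrounding code are assumptions, not the PR's actual code):

# Hypothetical: before replaying the captured graph for a verify batch,
# copy the current tree mask into the buffer the graph was captured with.
mask = spec_info.custom_mask
self.cuda_graph_custom_mask[: mask.numel()].copy_(mask)
graph.replay()  # placeholder for the actual CUDA-graph replay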
logits_output.next_token_logits = logits_output.next_token_logits_bak[
    accept_index
]
Suggested change: replace
logits_output.next_token_logits = logits_output.next_token_logits_bak[
    accept_index
]
with
self.next_token_logits_back = next_token_logits_back
logits_output.next_token_logits = logits_output.next_token_logits[
    accept_index
]
encoder_lens: torch.Tensor = None,
spec_info: "SpecInput" = None,
is_draft_runner: bool = False,
forward_batch: ForwardBatch = None,
Style. If one argument can be None, we should use `Optional[xxx]` instead of `xxx`.
Suggested change: replace
encoder_lens: torch.Tensor = None,
spec_info: "SpecInput" = None,
is_draft_runner: bool = False,
forward_batch: ForwardBatch = None,
with
encoder_lens: Optional[torch.Tensor] = None,
spec_info: Optional["SpecInput"] = None,
is_draft_runner: bool = False,
forward_batch: Optional[ForwardBatch] = None,
req_pool_indices: torch.Tensor,
seq_lens: torch.Tensor,
seq_lens_sum: int,
encoder_lens: Optional[torch.Tensor] = None,
encoder_lens=None,
forward_batch=None,
Correct the type annotation; use Optional.
@@ -130,8 +135,37 @@ def init_forward_metadata(self, forward_batch: ForwardBatch):
    forward_batch.seq_lens_sum,
    decode_wrappers=None,
    encoder_lens=forward_batch.encoder_lens,
    forward_batch=forward_batch,
I do not think we need this for `indices_updater_decode`.
@@ -164,52 +199,102 @@ def init_cuda_graph_state(self, max_bs: int):
        cuda_graph_kv_indices.clone() for _ in range(self.num_wrappers - 1)
    ]

    self.cuda_graph_custom_mask = torch.zeros(
        (max_bs * (self.max_context_len + 7) // 8),
Why is rounding needed here?
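One plausible answer, stated as an assumption rather than taken from the PR: if the custom mask is bit-packed (one bit per position, eight bits per byte), the buffer must be rounded up to a whole number of bytes, which is what the + 7 achieves:

# ceil division to whole bytes for a bit-packed mask (assumption about the buffer's layout)
packed_bytes = (max_context_len + 7) // 8   # e.g. 100 mask bits -> 13 bytes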
paged_kv_indices_buffer=self.cuda_graph_kv_indices[i],
paged_kv_last_page_len_buffer=self.kv_last_page_len[:bs],
# speculative decoding verify stage
if spec_info is not None and not is_draft_runner:
We can use `forward_mode.is_target_verify()` as a condition.
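A minimal sketch of the suggested condition, assuming the forward mode is available at this point (the surrounding names are placeholders):

# Hypothetical: branch on the forward mode rather than on spec_info / is_draft_runner.
if forward_batch.forward_mode.is_target_verify():
    # build the verify-stage (prefill-style) wrappers here
    ...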
req_pool_indices: torch.Tensor,
seq_lens: torch.Tensor,
encoder_lens: torch.Tensor = None,
spec_info: SpecInput = None,
is_draft_runner: bool = False,
`is_draft_runner` seems not useful.
# speculative decoding verify stage
if spec_info is not None and not is_draft_runner:
    for i in range(self.num_wrappers):
        decode_wrappers.append(
Call it `prefill_wrappers`.
encoder_lens=None,
forward_batch=None,
Keep the type annotations.
hidden_states: Optional[torch.Tensor] = None
# backup of next_token_logits when use cuda graph
# id(next_token_logits_bak) == id(next_token_logits)
next_token_logits_bak: Optional[torch.Tensor] = None
We do not need this.
@@ -96,7 +96,7 @@ def __init__(self, server_args, port_args) -> None:
else:
    # This port is checked free in PortArgs.init_new.
    # We hold it first so that the next dp worker gets a different port
    sockets.append(bind_port(tmp_port_args.nccl_port))
    sockets.append(bind_port(tmp_port_args.nccl_port[0]))
Use another variable instead of making it a list.
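A minimal sketch of the suggested alternative, assuming PortArgs keeps nccl_port as a scalar and gains a second field for the draft worker (the name draft_nccl_port is hypothetical):

# Hold two scalar port fields instead of turning nccl_port into a list.
sockets.append(bind_port(tmp_port_args.nccl_port))
sockets.append(bind_port(tmp_port_args.draft_nccl_port))  # hypothetical extra port for the draft worker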
Motivation
Accelerate the model inference by speculative inference (EAGLE2).
Modifications
It will be provided soon.
Checklist