
EAGLE2: general part [2] #2129

Closed · wants to merge 2 commits
Conversation

@yukavio (Collaborator) commented Nov 22, 2024

This PR is a part of #1498. The original PR was split into smaller PRs to facilitate review. This PR should be merged after #2128.

@@ -0,0 +1,36 @@
import sglang as sgl
Contributor:

  1. Move this to sglang/test/srt/test_eagle.py
  2. Make it a unit test with import unittest and assert that the output matches the one without speculative decoding.
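A minimal sketch of the kind of test the reviewer asks for. The generate helper below is a placeholder stub, not sglang API; in the real test it would launch sgl.Engine twice, once with speculative_algorithm="EAGLE" and once without.

```python
import unittest


def generate(prompt: str, speculative: bool) -> str:
    # Placeholder for the real engine call; in the actual test this would
    # run the model with and without speculative decoding enabled.
    return prompt.upper()


class TestEagle(unittest.TestCase):
    def test_output_matches_baseline(self):
        prompt = "The capital of France is"
        baseline = generate(prompt, speculative=False)
        spec = generate(prompt, speculative=True)
        # Speculative decoding must not change the generated text
        # (greedy decoding makes the comparison deterministic).
        self.assertEqual(baseline, spec)
```

Once wired to the real engine, this would run via python -m unittest sglang/test/srt/test_eagle.py.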

eagle_topk=4,
num_draft_tokens=16,
speculative_algorithm="EAGLE",
mem_fraction_static=0.70,
Contributor:

Can we not change the mem_fraction_static?

hidden_states: Optional[torch.Tensor] = None
# backup of next_token_logits when use cuda graph
# id(next_token_logits_bak) == id(next_token_logits)
next_token_logits_bak: Optional[torch.Tensor] = None
Contributor:

Why is this needed? Can you do a backup in the place that calls this function?

encoder_lens: torch.Tensor = None,
spec_info: "SpecInput" = None,
is_draft_runner: bool = False,
forward_batch: ForwardBatch = None,
Contributor:

We do not want to pass in big objects such as spec_info and forward_batch. They contain too many fields, making it hard to reason about which tensors are needed for the CUDA graph.
CUDA graph capture is error-prone, so we want to make the dependencies very clear.
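A sketch of the refactor being requested, with hypothetical names (capture_one_batch_size, SpecInfo fields) standing in for the real sglang code: pass only the tensors the captured graph reads, unpacked at the call site, instead of the whole container object.

```python
# Hypothetical sketch: the capture function declares exactly which values
# it reads, so it is clear what must stay at a fixed address for the graph.
def capture_one_batch_size(hidden_states, topk_index, draft_token_num):
    # Toy computation standing in for the captured forward pass.
    return [h * draft_token_num for h in hidden_states], topk_index


class SpecInfo:
    # Minimal stand-in for the big object the reviewer wants kept out
    # of the capture signature.
    def __init__(self, hidden_states, topk_index, draft_token_num):
        self.hidden_states = hidden_states
        self.topk_index = topk_index
        self.draft_token_num = draft_token_num


# Before: capture(spec_info=spec_info, forward_batch=forward_batch)
# After: unpack the needed fields explicitly at the call site.
spec = SpecInfo(hidden_states=[1.0, 2.0], topk_index=[0, 1], draft_token_num=4)
out, idx = capture_one_batch_size(spec.hidden_states, spec.topk_index, spec.draft_token_num)
```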

Comment on lines +148 to +150
# should not be set from the CLI; it is only a placeholder
# that will be set and used in model_runner
draft_runner_cache_size: int = None
Contributor:

Do not add it here if it will be set later. Use global_server_args_dict instead.
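An illustrative sketch of the pattern being suggested, assuming a module-level dict like sglang's global_server_args_dict: values derived at runtime live in the shared dict rather than as placeholder fields on the CLI args object. The init_model_runner helper and its parameters are hypothetical.

```python
# Runtime-derived settings are published here instead of being placeholder
# fields on the CLI-facing ServerArgs dataclass.
global_server_args_dict = {}


def init_model_runner(num_pages: int, page_size: int):
    # Computed at startup inside the model runner, then shared globally
    # with any component that needs it.
    global_server_args_dict["draft_runner_cache_size"] = num_pages * page_size


init_model_runner(num_pages=128, page_size=16)
```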

@@ -555,15 +571,6 @@ def handle_generate_request(
req.origin_input_ids_unpadded, req.image_inputs
)

if len(req.origin_input_ids) > self.max_req_input_len:
Contributor:

do not delete this.

server_args=server_args,
nccl_port=port_args.nccl_port,
target_worker=self.tp_worker,
dp_rank=dp_rank,
Contributor:

you can pass draft_runner_cache_size here instead of using server_args

)
if self.server_args.speculative_algorithm.is_not_none():
logits_output, next_token_ids, model_worker_batch = (
self.draft_worker.forward_batch_speculative_generate(batch)
Contributor:

Can you let forward_batch_speculative_generate take model_worker_batch as input?
For anything you need, please copy it into model_worker_batch. We need this style for the overlap scheduler.

next_token_ids = None
else:
next_token_ids = self.model_runner.sample(logits_output, model_worker_batch)
model_worker_batch.spec_info = forward_batch.spec_info
Contributor:

This assignment should not be there.
The information flow should be ScheduleBatch -> ModelWorkerBatch -> ForwardBatch.
We will need this style to make all things work in overlap mode.
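A minimal sketch of the one-way flow the reviewer describes. The class names match sglang's real batch types, but the bodies here are toy stand-ins: spec_info is copied forward at each conversion, and nothing is ever assigned back upstream.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelWorkerBatch:
    spec_info: Optional[dict] = None


@dataclass
class ScheduleBatch:
    spec_info: Optional[dict] = None

    def get_model_worker_batch(self) -> ModelWorkerBatch:
        # Copy forward: ScheduleBatch -> ModelWorkerBatch.
        return ModelWorkerBatch(spec_info=self.spec_info)


@dataclass
class ForwardBatch:
    spec_info: Optional[dict] = None

    @classmethod
    def init_new(cls, mwb: ModelWorkerBatch) -> "ForwardBatch":
        # Copy forward: ModelWorkerBatch -> ForwardBatch.
        return cls(spec_info=mwb.spec_info)


sb = ScheduleBatch(spec_info={"topk": 4})
fb = ForwardBatch.init_new(sb.get_model_worker_batch())
```

With this discipline the overlap scheduler can hand ModelWorkerBatch to a worker thread without the worker mutating scheduler state.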

Comment on lines +81 to +84
# Speculative verify stage
SPEC_VERIFY = auto()
# Speculative draft extend stage, which runs after the verify stage
SPEC_EXTEND = auto()
Contributor:

To make it clear whether it is the draft or the target model:

SPEC_VERIFY -> TARGET_VERIFY
SPEC_EXTEND -> DRAFT_EXTEND
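The renamed members might look like the sketch below. Only the two renames come from the review; the surrounding EXTEND and DECODE members are illustrative placeholders for the rest of the enum.

```python
from enum import Enum, auto


class ForwardMode(Enum):
    EXTEND = auto()
    DECODE = auto()
    # The target model verifies the draft tokens (was SPEC_VERIFY).
    TARGET_VERIFY = auto()
    # The draft model extends after the verify stage (was SPEC_EXTEND).
    DRAFT_EXTEND = auto()
```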

@merrymercy merrymercy closed this Nov 23, 2024