EAGLE2: general part [2] #2129
Conversation
@@ -0,0 +1,36 @@
import sglang as sgl
- Move this to sglang/test/srt/test_eagle.py
- Make it a unit test with import unittest and assert the output matches the one w/o speculative decoding.
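A minimal sketch of the test layout the reviewer asks for, assuming the file lands at sglang/test/srt/test_eagle.py. The two generate functions below are stubs standing in for sgl.Engine runs with and without speculative_algorithm="EAGLE"; the real test would launch both engines and compare their generations.

```python
import unittest

# Stubs standing in for the two engine configurations; in the real test
# these would call sgl.Engine(...) with and without EAGLE enabled.
def generate_with_eagle(prompt: str) -> str:
    return prompt + " Paris."  # placeholder result

def generate_without_spec_decoding(prompt: str) -> str:
    return prompt + " Paris."  # placeholder result

class TestEAGLEEngine(unittest.TestCase):
    def test_output_matches_baseline(self):
        # Speculative decoding must not change the sampled output
        # (greedy decoding assumed, so outputs are deterministic).
        prompt = "The capital of France is"
        self.assertEqual(
            generate_with_eagle(prompt),
            generate_without_spec_decoding(prompt),
        )
```

Run with `python -m unittest` once the stubs are replaced by real engine calls.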
eagle_topk=4,
num_draft_tokens=16,
speculative_algorithm="EAGLE",
mem_fraction_static=0.70,
Can we avoid changing mem_fraction_static?
hidden_states: Optional[torch.Tensor] = None
# backup of next_token_logits when using cuda graph
# id(next_token_logits_bak) == id(next_token_logits)
next_token_logits_bak: Optional[torch.Tensor] = None
Why is this needed? Can you do a backup in the place that calls this function?
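A sketch of what the reviewer suggests: instead of storing next_token_logits_bak on the output object, the caller backs up the buffer right before a CUDA graph replay can overwrite it. Plain lists stand in for torch tensors here, and all names are illustrative, not sglang's real API.

```python
def replay_cuda_graph(logits_output):
    # Stand-in for a CUDA graph replay that reuses (overwrites) the
    # next_token_logits buffer in place.
    buf = logits_output["next_token_logits"]
    for i in range(len(buf)):
        buf[i] = 0.0

def sample_with_backup(logits_output):
    # Back up at the call site instead of inside the output object.
    backup = list(logits_output["next_token_logits"])
    replay_cuda_graph(logits_output)  # may clobber the original buffer
    return backup

out = {"next_token_logits": [0.1, 0.9]}
saved = sample_with_backup(out)
```

The backup then lives only in the scope that needs it, so the output dataclass stays free of CUDA-graph bookkeeping.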
encoder_lens: torch.Tensor = None,
spec_info: "SpecInput" = None,
is_draft_runner: bool = False,
forward_batch: ForwardBatch = None,
We do not want to pass in big objects such as spec_info and forward_batch. They contain too many fields, making it hard to reason about which tensors are needed for the cuda graph. CUDA graphs are error-prone, so we want to make the dependencies very clear.
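A hypothetical sketch of the signature style the review asks for: the capture helper lists every tensor it depends on explicitly instead of accepting container objects. All parameter names here are illustrative, not sglang's actual API.

```python
import inspect

# Error-prone: the real dependency set is hidden inside big objects.
def capture_decode_graph_opaque(spec_info, forward_batch):
    ...

# Preferred: every buffer the replayed graph reads or writes is visible
# in the signature, so stale-tensor bugs are easy to spot in review.
def capture_decode_graph_explicit(
    input_ids,                  # torch.Tensor: token ids for the step
    positions,                  # torch.Tensor: position ids
    encoder_lens=None,          # Optional[torch.Tensor]
    num_draft_tokens: int = 0,  # scalar the captured graph shape depends on
):
    # Capture logic would go here; only the listed tensors may be
    # referenced, so the graph's inputs are fully enumerated above.
    return (input_ids, positions, encoder_lens, num_draft_tokens)
```

The explicit form also makes it mechanical to check that every tensor captured into the graph is a persistent buffer.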
# should not be set by cli; it is only a placeholder
# which is set and used in model_runner
draft_runner_cache_size: int = None
Do not add it here if it will be set later. Use global_server_args_dict instead.
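A sketch of the suggested pattern: a value computed at runtime rather than parsed from the CLI is published via the shared global_server_args_dict (which sglang already has) instead of becoming a ServerArgs field. The dict below and the sizing formula are illustrative stand-ins, not the real update site.

```python
# Stand-in for sglang's shared runtime-config dict.
global_server_args_dict = {}

def init_draft_runner(total_cache_tokens: int, num_draft_tokens: int) -> None:
    # model_runner computes the draft cache size and publishes it for
    # other components to read, keeping ServerArgs CLI-only.
    # The formula here is a hypothetical example.
    global_server_args_dict["draft_runner_cache_size"] = (
        total_cache_tokens // (num_draft_tokens + 1)
    )

init_draft_runner(1024, 15)
```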
@@ -555,15 +571,6 @@ def handle_generate_request(
    req.origin_input_ids_unpadded, req.image_inputs
)

if len(req.origin_input_ids) > self.max_req_input_len:
do not delete this.
server_args=server_args,
nccl_port=port_args.nccl_port,
target_worker=self.tp_worker,
dp_rank=dp_rank,
You can pass draft_runner_cache_size here instead of using server_args.
)
if self.server_args.speculative_algorithm.is_not_none():
    logits_output, next_token_ids, model_worker_batch = (
        self.draft_worker.forward_batch_speculative_generate(batch)
Can you let forward_batch_speculative_generate take model_worker_batch as input? For anything you need, please copy it to model_worker_batch. We need this style for the overlap scheduler.
next_token_ids = None
else:
    next_token_ids = self.model_runner.sample(logits_output, model_worker_batch)
model_worker_batch.spec_info = forward_batch.spec_info
This assignment should not be here. The information flow should be ScheduleBatch -> ModelWorkerBatch -> ForwardBatch. We will need this style to make everything work in overlap mode.
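The one-way flow the reviewer describes can be sketched as below. The fields are hypothetical stand-ins for the real sglang dataclasses; the point is that spec_info is copied forward at each conversion, so no stage ever reaches back into an earlier batch object.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class ForwardBatch:
    input_ids: list
    spec_info: Optional[Any] = None

@dataclass
class ModelWorkerBatch:
    input_ids: list
    spec_info: Optional[Any] = None

    def to_forward_batch(self) -> ForwardBatch:
        # Forward-only conversion: ForwardBatch is built solely from
        # ModelWorkerBatch, never assigned to afterwards.
        return ForwardBatch(input_ids=self.input_ids, spec_info=self.spec_info)

@dataclass
class ScheduleBatch:
    reqs: list
    spec_info: Optional[Any] = None

    def get_model_worker_batch(self) -> ModelWorkerBatch:
        # Copy everything the worker needs, including spec_info, so the
        # worker thread never touches ScheduleBatch (required for the
        # overlap scheduler).
        return ModelWorkerBatch(input_ids=list(self.reqs),
                                spec_info=self.spec_info)

sb = ScheduleBatch(reqs=[1, 2], spec_info="eagle_draft")
fb = sb.get_model_worker_batch().to_forward_batch()
```

Because each conversion copies state forward, the scheduler thread and the worker thread can run on different batch objects concurrently.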
# Speculative Verify stage
SPEC_VERIFY = auto()
# Speculative draft Extend stage, which runs after the verify stage
SPEC_EXTEND = auto()
To make it clear whether the draft or the target model runs each stage, rename:
SPEC_VERIFY -> TARGET_VERIFY
SPEC_EXTEND -> DRAFT_EXTEND
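The renaming could look like the sketch below (the surrounding enum members are assumed from context, not copied from the PR):

```python
from enum import Enum, auto

class ForwardMode(Enum):
    EXTEND = auto()
    DECODE = auto()
    # Target model verifies the draft token tree.
    TARGET_VERIFY = auto()
    # Draft model extends after the verify stage.
    DRAFT_EXTEND = auto()
```

With the model prefix in the name, call sites immediately show which of the two models a stage belongs to.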
This PR is a part of #1498. The original PR was split into smaller PRs to facilitate review. This PR should be merged after #2128.