Speculative decoding with EAGLE2 #1498
Conversation
Hello, does this code support speculative decoding for multiple requests?
Hi @yukavio Is there any recent progress or plan for this? Do you plan to support deepseek-v2?
Yes, I will support it.
I have implemented the draft and verify stages and tested them on a single request. I am trying to migrate my code to the main branch because the main branch has some significant changes to the controller and worker that are very important for my implementation. My plan:
Thanks, yukavio. I have some suggestions for this PR: 1. To further support more models, I think we should pop the eagle head from draft_extend_input_queue so we don't modify the original llama model file. 2. I don't understand why we need so many SpecInfoPipline queues; spec only happens in the decoding stage, so at the least we shouldn't need draft_extend_input_queue.
I also have another question: in this PR, model_runner.py initializes the KV cache twice in different TP workers, which results in GPU OOM. Could we merge the draft and target KV caches to increase GPU utilization?
I have migrated the code to another branch: https://github.com/yukavio/sglang/tree/new_spec_infer and I will update this PR with it later. In the new implementation, I chose to run the draft worker and the target model worker in one process instead of using many queues in SpecInfoPipline to communicate between the draft and target processes. For memory management, I've fixed this bug in the new branch to ensure it won't raise an error during testing, but it may not be very efficient, and I will improve it after I have finished the remaining work in the plan.
@yukavio Recently, SGLang has undergone some refactoring work. You need to merge the latest main to resolve the corresponding conflicts. Thanks!
@yukavio Hi, when is this PR expected to be merged? I've trained a draft model and am eager to try it out.
OK, I am fixing some bugs in batch inference now. I will update the code to the main branch after fixing them. Personally, I think the updated code can be used as the first version. The community could review this version of the implementation.
If all goes well, I will finish the first version of development this week. When it gets merged into the main branch depends on community review and opinions.
@yukavio Is CLI startup not supported currently? I encountered this error:
File "/opt/tiger/sglang/python/sglang/srt/server_args.py", line 613, in <dictcomp>
    return cls(**{attr: getattr(args, attr) for attr in attrs})
AttributeError: 'Namespace' object has no attribute 'draft_runner_cache_size'
Add a memo: cutex should be added to the dependency list of SGLang after review.
@yukavio Can you resolve the conflicts? |
Fixed. Could you please help me trigger the CI?
CI has failed due to a timeout.
I did an initial code style review. I will follow up with a more careful logic review. We want to merge this soon.
Some guidelines:
- Split the big PR into smaller PRs. Merge small unreachable code first. (e.g., introduce two forward modes).
- Minimize the code changes to the common scheduler and move most things into self-contained EAGLE-specific files.
Sorry for the frequent changes on the main branch. We did a lot of refactoring to enable the overlap scheduler by default. That refactor has been finished (7d671e4) so we do not expect big changes after that. This PR is now our top priority.
kernels = cutex.SourceModule(
    """
    //cuda
    __global__ void build_tree(Tensor<long, 2> parent_list, Tensor<long, 2> selected_index, Tensor<int, 1> verified_seq_len,
Is it possible to do this with Triton?
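For reference, a minimal sketch of how a tree-mask kernel could be expressed in Triton instead of cutex/CUDA. This is a simplified illustration under assumptions, not the PR's build_tree kernel: it assumes a parent_list of shape [bs, topk] with -1 marking the root and builds a dense [bs, topk, topk] visibility mask, while the real kernel also consumes selected_index and verified_seq_len.

import torch
import triton
import triton.language as tl

@triton.jit
def build_tree_mask_kernel(parent_ptr, mask_ptr, topk: tl.constexpr):
    # One program per (batch, draft token): walk the parent chain and mark ancestors as visible.
    b = tl.program_id(0)
    i = tl.program_id(1)
    row = (b * topk + i) * topk
    tl.store(mask_ptr + row + i, 1)  # every token attends to itself
    cur = i
    for _ in range(topk):
        parent = tl.load(parent_ptr + b * topk + cur)
        keep = parent >= 0
        # once the root (-1) is reached, this just rewrites the diagonal entry, a no-op
        tl.store(mask_ptr + row + tl.where(keep, parent, i), 1)
        cur = tl.where(keep, parent, cur)

def build_tree_mask(parent_list: torch.Tensor) -> torch.Tensor:
    bs, topk = parent_list.shape
    mask = torch.zeros((bs, topk, topk), dtype=torch.int32, device=parent_list.device)
    build_tree_mask_kernel[(bs, topk)](parent_list.int(), mask, topk=topk)
    return mask.bool()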
positions = forward_batch.positions
if positions is None:
    positions = clamp_position(forward_batch.seq_lens)
self.positions[:raw_num_token].copy_(positions)
@yukavio Do we need to assign spec_info.custom_mask to self.cuda_graph_custom_mask before replay? It looks like self.cuda_graph_custom_mask is not used during cuda graph replay().
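For context, a minimal sketch of what this comment seems to suggest, assuming the captured graph reads its mask from the persistent self.cuda_graph_custom_mask buffer (the replay call and surrounding code are assumptions, not the PR's actual code):

# Hypothetical: before replaying the captured graph for a verify batch,
# copy the current tree mask into the buffer the graph was captured with.
mask = spec_info.custom_mask
self.cuda_graph_custom_mask[: mask.numel()].copy_(mask)
graph.replay()  # placeholder for the actual CUDA-graph replay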
logits_output.next_token_logits = logits_output.next_token_logits_bak[
    accept_index
]
Suggested change: replace
logits_output.next_token_logits = logits_output.next_token_logits_bak[
    accept_index
]
with
self.next_token_logits_back = next_token_logits_back
logits_output.next_token_logits = logits_output.next_token_logits[
    accept_index
]
encoder_lens: torch.Tensor = None,
spec_info: "SpecInput" = None,
is_draft_runner: bool = False,
forward_batch: ForwardBatch = None,
Style. If one argument can be None, we should use `Optional[xxx]` instead of `xxx`.
Suggested change: replace
encoder_lens: torch.Tensor = None,
spec_info: "SpecInput" = None,
is_draft_runner: bool = False,
forward_batch: ForwardBatch = None,
with
encoder_lens: Optional[torch.Tensor] = None,
spec_info: Optional["SpecInput"] = None,
is_draft_runner: bool = False,
forward_batch: Optional[ForwardBatch] = None,
req_pool_indices: torch.Tensor,
seq_lens: torch.Tensor,
seq_lens_sum: int,
encoder_lens: Optional[torch.Tensor] = None,
encoder_lens=None,
forward_batch=None,
Correct the type annotation; use Optional.
@@ -130,8 +135,37 @@ def init_forward_metadata(self, forward_batch: ForwardBatch):
    forward_batch.seq_lens_sum,
    decode_wrappers=None,
    encoder_lens=forward_batch.encoder_lens,
    forward_batch=forward_batch,
I do not think we need this for `indices_updater_decode`.
@@ -164,52 +199,102 @@ def init_cuda_graph_state(self, max_bs: int):
        cuda_graph_kv_indices.clone() for _ in range(self.num_wrappers - 1)
    ]

    self.cuda_graph_custom_mask = torch.zeros(
        (max_bs * (self.max_context_len + 7) // 8),
Why is rounding needed here?
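One plausible answer, stated as an assumption rather than taken from the PR: if the custom mask is bit-packed (one bit per position, eight bits per byte), the buffer must be rounded up to a whole number of bytes, which is what the + 7 achieves:

# ceil division to whole bytes for a bit-packed mask (assumption about the buffer's layout)
packed_bytes = (max_context_len + 7) // 8   # e.g. 100 mask bits -> 13 bytes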
paged_kv_indices_buffer=self.cuda_graph_kv_indices[i],
paged_kv_last_page_len_buffer=self.kv_last_page_len[:bs],
# speculative decoding verify stage
if spec_info is not None and not is_draft_runner:
We can use `forward_mode.is_target_verify()` as a condition.
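A minimal sketch of the suggested condition, assuming the forward mode is available at this point (the surrounding names are placeholders):

# Hypothetical: branch on the forward mode rather than on spec_info / is_draft_runner.
if forward_batch.forward_mode.is_target_verify():
    # build the verify-stage (prefill-style) wrappers here
    ...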
req_pool_indices: torch.Tensor,
seq_lens: torch.Tensor,
encoder_lens: torch.Tensor = None,
spec_info: SpecInput = None,
is_draft_runner: bool = False,
`is_draft_runner` seems not useful.
# speculative decoding verify stage
if spec_info is not None and not is_draft_runner:
    for i in range(self.num_wrappers):
        decode_wrappers.append(
Call it `prefill_wrappers`.
encoder_lens=None,
forward_batch=None,
Keep the type annotations.
hidden_states: Optional[torch.Tensor] = None
# backup of next_token_logits when use cuda graph
# id(next_token_logits_bak) == id(next_token_logits)
next_token_logits_bak: Optional[torch.Tensor] = None
We do not need this.
@@ -96,7 +96,7 @@ def __init__(self, server_args, port_args) -> None:
else:
    # This port is checked free in PortArgs.init_new.
    # We hold it first so that the next dp worker gets a different port
    sockets.append(bind_port(tmp_port_args.nccl_port))
    sockets.append(bind_port(tmp_port_args.nccl_port[0]))
Use another variable instead of making it a list.
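A minimal sketch of the suggested alternative, assuming PortArgs keeps nccl_port as a scalar and gains a second field for the draft worker (the name draft_nccl_port is hypothetical):

# Hold two scalar port fields instead of turning nccl_port into a list.
sockets.append(bind_port(tmp_port_args.nccl_port))
sockets.append(bind_port(tmp_port_args.draft_nccl_port))  # hypothetical extra port for the draft worker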
Motivation
Accelerate the model inference by speculative inference (EAGLE2).
Modifications
It will be provided soon.
Checklist