Development Roadmap (2024 Q4) #1487

Ying1123 · 2024-09-21T22:38:00Z

Here is the development roadmap for 2024 Q4. Contributions and feedback are welcome (join the bi-weekly development meeting). The previous 2024 Q3 roadmap can be found in #634.

Performance

Hide CPU overhead with overlapped scheduler (Faster overlap mode scheduler #1738, Enable overlap by default #2067)
Support speculative decoding
- Eagle Speculative EAGLE2 #2150
- Reference-based. Reference speculative decoding #270
- Medusa head [Feature] plan to support medusa? #859
- Draft model based.
Sparse Attention Support double sparsity #1459
Faster grammar parsing library for constrained decoding [Performance] Support both xgrammar and outlines for constrained decoding #1752
Multi-layer radix cache (GPU/CPU/Disk) @xiezhq-hermann
Improve the performance of mixed chunked prefill. See a draft: Rewrite mixed chunked prefill #1383
Integrate CuDNN paged attention kernels

Parallelism

Support sequence parallelism [Feature] Add initial support for sequence parallelism #1436. Related paper
Support pipeline parallelism.
Support expert parallelism + data parallelism for DeepSeek/MoE models. @ispobock
- Data parallelism Support DP MLA #1970
- Expert parallelism [Feature] Expert parallelism support #1435
Implement a better cache-aware load balancer for data parallelism. [router] cache-aware load-balancing router v1 #2114 [Feature] Cache-aware Data Parallel Router #1732 @ByronHsu @yichuan520030910320
Overlap communication in tensor parallelism. @ZhuohaoL
Support disaggregated serving to separate prefill and decoding.
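
For the cache-aware load balancer item above, the general idea can be sketched as follows: send a request to the worker that has already seen the longest matching token prefix, and fall back to the least-loaded worker when no worker matches well. This is a minimal illustration and not the router from #2114 or #1732. `Worker`, `CacheAwareRouter`, and `min_match_ratio` are hypothetical names, and a real implementation would likely use a radix tree rather than a linear scan over past requests.

```python
from dataclasses import dataclass, field


def shared_prefix_len(a, b):
    """Number of leading tokens that two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


@dataclass
class Worker:
    """Hypothetical data-parallel worker that remembers what it was sent."""
    name: str
    history: list = field(default_factory=list)  # token sequences routed here

    def best_match(self, tokens):
        # Longest prefix shared with any earlier request (0 if none).
        return max((shared_prefix_len(tokens, h) for h in self.history), default=0)


class CacheAwareRouter:
    """Hypothetical router: prefer cache hits, otherwise balance load."""

    def __init__(self, workers, min_match_ratio=0.5):
        self.workers = workers
        self.min_match_ratio = min_match_ratio  # assumed threshold, not from the source

    def route(self, tokens):
        best = max(self.workers, key=lambda w: w.best_match(tokens))
        # Weak or no prefix match: pick the worker with the fewest requests so far.
        if not tokens or best.best_match(tokens) < self.min_match_ratio * len(tokens):
            best = min(self.workers, key=lambda w: len(w.history))
        best.history.append(list(tokens))
        return best.name


router = CacheAwareRouter([Worker("w0"), Worker("w1")])
first = router.route([1, 2, 3, 4])   # no cache anywhere, goes to least-loaded worker
second = router.route([9, 8, 7, 6])  # unrelated prompt, goes to the other worker
third = router.route([1, 2, 3, 5])   # shares a 3-token prefix with the first request
```

The threshold trades cache hits against balance: a lower value keeps related prompts together more aggressively, at the risk of overloading one worker.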

Hardware Coverage

AMD optimizations. cc @HaiShaw
- CK kernels
- Setup CI (accuracy/performance) for AMD
Intel XPU support.
- [Feature, Hardware] Enable SGLang on XPU GPUs via PyTorch #1480
- Add initial support for intel Gaudi accelerators #2121

Model Coverage

Multi-modal models
- Llama 3.2 Vision Llama3.2 vision model support #1551
- Qwen2-VL Support qwen2 vl model #1546
- GLM 4V Add GLM-4v Multimodal Model support for SGLang #1641
- VILA
- Phi-vision
- FishSpeech audio model support
- Ultravox
Language models
- Mamba models @rahulbatra85 @HaiShaw
- xLSTM
Reward models
- [Feature] Support reward model LxzGordon/URM-LLaMa-3.1-8B #1525
- Gemma2 reward model support #1954

New features

Integrate with LMCache https://github.com/LMCache/LMCache
A padded batch mode to make results more deterministic (see sglang/docs/references/faq.md, line 3 in 8912b76: "The results are not deterministic, even with a temperature of 0")
Performance optimizations for multi-LoRA serving [LoRA, Performance] Add gemm expand triton kernel for multi-LoRA #1728
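
On the padded batch mode item above: one general reason numerical results can change with batch composition is that floating-point addition is not associative, so reducing the same values in a different order can give a slightly different sum. Whether this is the dominant effect in SGLang is an assumption here; the snippet only demonstrates the arithmetic, using a hypothetical helper `sum_in_order`.

```python
def sum_in_order(values):
    """Accumulate left to right, as a simple sequential reduction would."""
    total = 0.0
    for v in values:
        total += v
    return total


values = [0.1, 0.2, 0.3]
forward = sum_in_order(values)                    # (0.1 + 0.2) + 0.3
backward = sum_in_order(list(reversed(values)))   # (0.3 + 0.2) + 0.1

# The two orders disagree in the last bit of the result.
orders_differ = forward != backward
```

Fixing the shapes and the reduction order, which is what a padded batch mode would presumably aim for, removes this source of variation at some cost in wasted compute.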

Quantization

@HaiShaw @zhyncs @ispobock

Torchao integration Add llama implementation with no tensor parallel linears #1561
Turbomind operators integration
More CUTLASS mixed precision gemm integration
KV cache quantization (more formats + scaling factor)

Server API

Support directly taking embedding as inputs. [Feature] Generation Inputs: input_embeds #745
Add APIs for using the inference engine in a single script without launching a separate server. See also examples.
- Provide an offline engine API #1567
Support endpoints other than OpenAI (Anthropic, Mistral) in the language frontend.
Better APIs to support RL trainers, including https://github.com/huggingface/trl and https://github.com/OpenRLHF/OpenRLHF @zhaochenyang20
Support generalized reward API (adding linear layers to any Causal LM to get the reward) https://github.com/OpenRLHF/OpenRLHF @zhaochenyang20
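
As an illustration of the generalized reward API item above ("adding linear layers to any Causal LM to get the reward"), the computation can be sketched as a single linear head applied to the hidden state of the last token. This is an assumption about the intended shape of the feature, not the SGLang API. `reward_from_hidden_states`, `weight`, and `bias` are hypothetical names, and plain Python lists stand in for tensors.

```python
def reward_from_hidden_states(hidden_states, weight, bias=0.0):
    """Scalar reward = weight . h_last + bias.

    hidden_states: one vector per token, as produced by a causal LM.
    weight, bias:  parameters of the added linear layer (hypothetical).
    """
    last = hidden_states[-1]  # the final token summarizes the whole sequence
    if len(last) != len(weight):
        raise ValueError("hidden size and weight size must match")
    return sum(h * w for h, w in zip(last, weight)) + bias


# Two tokens, hidden size 2.
hidden = [[0.0, 1.0], [2.0, 3.0]]
reward = reward_from_hidden_states(hidden, weight=[0.5, -1.0], bias=0.25)
```

Pooling on the last token is one common choice; other reward models pool differently, which is part of why a generalized API is useful.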

Observability

Integrate Grafana / Prometheus
- support prometheus metrics #1853 [WIP] Prometheus Metrics #1461

Others

Notebook-style interactive tutorials. @zhaochenyang20
Compiler mode optimizations for the language (e.g. support sending a full serialized SGL program to the server). @hnyls2002
Memory pool refactor to better support mixing different attention layers (e.g., interleaved window attention). @Ying1123
Make vLLM an optional dependency. @zhyncs @ByronHsu @yizhang2077 [Feature] Make vLLM optional in model code #1673

fengyang95 · 2024-09-22T02:02:41Z

Are there any plans to optimize long context latency?

lumiere-ml · 2024-10-17T02:24:33Z

Hi, can I help with the multi-layer radix cache (GPU/CPU/Disk)? I'm really interested in that.

tanzelin430 · 2024-10-17T11:58:58Z

I am interested in contributing to the P-D split inference architecture, and I have machines available for developing it. If you have any related development plans, please let me know. Thank you @Ying1123 @zhyncs @fengyang95

merrymercy · 2024-10-19T13:58:47Z

@lumiere-ml @tanzelin430 Are you in the slack channel? We can follow up on that.

zhyncs · 2024-10-20T06:01:03Z

@lumiere-ml @tanzelin430 You are welcome to join our Slack channel: https://join.slack.com/t/sgl-fru7574/shared_invite/zt-2ngly9muu-t37XiH87qvD~6rVBTkTEHw

tanzelin430 · 2024-10-20T06:14:54Z

Thanks for the invitation, I am in Slack now. Looking forward to collaborating with you.

lumiere-ml · 2024-10-20T09:01:30Z

Thanks for your invitation!

Edenzzzz · 2024-11-11T03:30:14Z

@lumiere-ml @zhyncs I'm also very interested, could you share which channel you're using to discuss?
Perhaps we can combine radix tree prefix matching with P-D disaggregation similar to Mooncake?

mfdj2002 · 2024-11-21T07:40:18Z

If no one is actively working on supporting pipeline parallelism, I'm down to help

Edenzzzz · 2024-11-25T17:20:24Z

@mfdj2002 I think @CalvinXKY has expressed interest on Slack; you can chat with him there.

merrymercy · 2024-11-26T00:25:30Z

No one is working on pipeline parallelism. Feel free to contribute one.

m0g1cian · 2024-12-03T07:51:23Z

I recently completed a reward model implementation for RMs trained by LlamaFactory. Everything worked well, but I've noticed a relatively small difference in the last hidden states between my SGLang implementation and its counterpart in TRL (resulting in a ROC loss of ~0.3%).

Regardless, I think I can help with the task "Support generalized reward API (adding linear layers to any Causal LM to get the reward)"

kuangdao · 2024-12-04T06:32:48Z

I am interested in sequence parallelism. I would like to know whether it will use the method from "Context Parallelism for Scalable Million-Token Inference". Thanks.

zhaochenyang20 · 2024-12-04T20:38:26Z

Amazing! Could you please send an email with your WeChat or other contact details to zhaochenyang20@gmail.com?

We can also discuss this on our Slack. Please find zhaochenyang20@gmail.com on the SGLang Slack!

@m0g1cian

trh11111 · 2024-12-11T02:50:50Z

I am also very interested in the scenario of PD disaggregation, and I hope to combine radix tree with PD disaggregation for some experiments. I saw that someone mentioned this in October. May I ask how the current development plan is progressing?

zhaochenyang20 · 2024-12-11T03:36:24Z

@trh11111 Yeah. New members have joined our team to work on this, and PD disaggregation is the first priority in our development roadmap for the next quarter.

tanzelin430 · 2024-12-11T09:14:57Z

Hi, I have just finished my graduate recruitment season and am working on my ATC paper. I'll be looking into the development soon.

zhaochenyang20 · 2024-12-11T23:08:57Z

@trh11111 If you are interested in this part, you can reach out to us on Slack.

mpjlu · 2024-12-18T02:40:16Z

How can I join this Slack channel?

zhyncs · 2024-12-20T18:06:47Z

Hi @mpjlu https://join.slack.com/t/sgl-fru7574/shared_invite/zt-2rtikx2pv-DUfPrhx2SaNAq~47YtV1XQ

mpjlu · 2024-12-22T01:04:47Z

Thanks

Ying1123 changed the title ~~[WIP] Development Roadmap (2024 Q4)~~ Development Roadmap (2024 Q4) Sep 22, 2024

zhyncs pinned this issue Sep 22, 2024

zhyncs mentioned this issue Sep 22, 2024

[Feature] Are there plans to implement a prefill-decode split inference architecture? #1080

Closed

ByronHsu mentioned this issue Oct 4, 2024

Provide an offline engine API #1567

Merged

3 tasks

ByronHsu mentioned this issue Oct 15, 2024

Support vLLM-style rope flashinfer-ai/flashinfer#530

Closed

zhaochenyang20 mentioned this issue Oct 20, 2024

Add documentations for Installation #1733

Closed

3 tasks

zhyncs mentioned this issue Nov 1, 2024

Development Roadmap (2024 Q3) #634

Closed

29 tasks

liangzelang mentioned this issue Nov 15, 2024

[Feature] Expert parallelism support #1435

Closed

2 tasks

zhaochenyang20 mentioned this issue Dec 10, 2024

[Feature] Support General Reward Model #2427

Open

3 tasks
