# Release v0.3.6
## Highlights
- Reduce CPU overhead by enabling the overlap scheduler by default: 1.1x higher throughput. (#2105, #2067, #2095)
- Support data parallelism for attention and MLA: 1.5x higher decoding throughput. (#1970, #2061)
- Cache-aware load balancer: 4x higher cache hit rate. (#1934)
- Support xgrammar backend for grammar-guided decoding (#2056)
- Support Prometheus metrics (#1853, #1981)
- Support torch 2.5.1 (#2069) and torch-native tensor parallelism (#1876)
- Support graceful termination (#1838) and watchdog (#1816)
- Support notebook-style documentation (https://sgl-project.github.io/)
- Add an offline benchmark script (#1968)
- Bug, deadlock, NaN, and OOM fixes (#2083, #1850, #1800, #1779, #1789, #1858)
- New models: Phi3-small (#2062), Gemma-2 reward model (#1954), GPT-2 (#1833)
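
A minimal sketch of exercising a few of these highlights through the offline Engine API (#1894, #1968). Engine kwargs mirror the server CLI flags; the model path is illustrative and the kwarg names are inferred from the linked PRs, so verify them against `python -m sglang.launch_server --help` on your installed version.

```python
# Hedged sketch: kwargs below mirror CLI flags from the linked PRs and may
# differ slightly on your installed version.
import sglang as sgl

llm = sgl.Engine(
    model_path="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model
    grammar_backend="xgrammar",  # xgrammar-backed grammar-guided decoding (#2056)
    # enable_dp_attention=True,  # DP attention for MLA models (#1970, #2061)
)

prompts = ["The capital of France is"]
sampling_params = {"temperature": 0, "max_new_tokens": 32}
for prompt, out in zip(prompts, llm.generate(prompts, sampling_params)):
    print(prompt, "->", out["text"])

llm.shutdown()
```

The same workload can be measured end to end with the new offline benchmark script (#1968), e.g. something like `python -m sglang.bench_offline_throughput --model-path <model> --num-prompts 10`.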
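
The cache-aware load balancer ships as a Rust router with a Python binding (#1790, #1891, #1934), released separately as `sglang-router` on PyPI (#2002, #2004). A sketch of the binding follows; the `Router` constructor arguments are assumed from the router's README of this era, so treat the exact names as illustrative.

```python
# Hedged sketch of the cache-aware router binding (pip install sglang-router).
# The Router signature is assumed; argument names may differ across versions.
from sglang_router import Router

router = Router(
    worker_urls=[
        "http://localhost:30001",  # each URL is a separately launched sglang server
        "http://localhost:30002",
    ],
)
router.start()  # proxies requests, routing by approximate radix-tree cache locality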
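
Prometheus metrics (#1853, #1981) are exposed over HTTP once the server is launched with metrics enabled (the flag is `--enable-metrics` per the linked PRs, assuming that name is unchanged on your version). A quick check:

```python
# Hedged sketch: scrape the Prometheus endpoint of a locally running server.
# Assumes the server was started with metrics enabled and listens on :30000.
import requests

text = requests.get("http://localhost:30000/metrics").text
print(text[:500])  # counters and gauges in Prometheus exposition format
```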
## What's Changed
- Fix edge case for truncated by @ByronHsu in #1747
- Fuse more ops & Simplify token mapping by @merrymercy in #1758
- [API] add get memory pool size by @Ying1123 in #1760
- Fix perf regression for set_kv_buffer by @merrymercy in #1765
- [Fix] Fix abort in data parallelism by @merrymercy in #1767
- Fix stop condition for <|eom_id|> by @merrymercy in #1766
- Update docs by @merrymercy in #1768
- Fix missing additional_stop_token_ids by @merrymercy in #1769
- Fix out of memory message. by @hnyls2002 in #1771
- Crash the server on warnings in CI by @merrymercy in #1772
- Fix the perf regression due to additional_stop_token_ids by @merrymercy in #1773
- Fix MockTokenizer in the unit tests by @merrymercy in #1774
- [Bug] Catch any errors caused by parsing json schema by @zolinthecow in #1776
- [Fix] Fix NaN issues by fixing the cuda graph padding values for flashinfer by @merrymercy in #1779
- [Fix] Fix cuda graph padding for triton attention backend by @merrymercy in #1782
- check user-specified model_max_len against the HF-derived max_model_len by @BBuf in #1778
- Re-introduce `get_cuda_graph_seq_len_fill_value` by @merrymercy in #1783
- Enhance the test case for chunked prefill and check memory leak by @merrymercy in #1785
- Fix seq_lens_sum for cuda graph runner in padded cases by @merrymercy in #1789
- Qwen2vl support cuda graph and disable radix cache by @yizhang2077 in #1780
- Fix log parsing in the chunked prefill unit tests by @merrymercy in #1793
- Fix memory leak when doing chunked prefill by @hnyls2002 in #1787
- [Fix] Fix the log parsing in chunked prefill unit tests by @merrymercy in #1794
- Revert "Fix memory leak when doing chunked prefill" by @merrymercy in #1797
- Fix logprob in the overlapped mode by @merrymercy in #1795
- Release v0.3.4.post2 by @merrymercy in #1796
- [Performance] Support both xgrammar and outlines for constrained decoding by @DarkSharpness in #1752
- [Fix] Fix --skip-tokenizer-init by @merrymercy in #1798
- move max_position_embeddings to the last by @hliuca in #1799
- add support for ipynb by @zhaochenyang20 in #1786
- Fix possible ZMQ hanging by @hnyls2002 in #1800
- Set `ZMQ` buffer size heuristic by @hnyls2002 in #1801
- Allow consecutive ports when launching multiple sglang servers by @hnyls2002 in #1802
- fix int conversion for `SGLANG_CPU_COUNT` by @ByronHsu in #1803
- Update ci workflows by @merrymercy in #1804
- Update links by @merrymercy in #1805
- Simplify our docs by moving complicated functions into utils by @zhaochenyang20 in #1807
- Fix docs ci by @zhaochenyang20 in #1808
- Provide an argument to set the maximum batch size for cuda graph by @merrymercy in #1809
- Improve the user control of new_token_ratio by @merrymercy in #1811
- Update hyperparameter_tuning.md by @merrymercy in #1813
- Add a watchdog thread by @merrymercy in #1816
- Fix unit tests by @merrymercy in #1817
- Add OpenAI-compatible API by @zhaochenyang20 in #1810
- Fix Triton decode kernel & ut by @ispobock in #1819
- support token ids in `engine.generate` by @ByronHsu in #1820
- Fix docs deploy ci by @zhaochenyang20 in #1821
- [router] rust-based router by @ByronHsu in #1790
- Fix update_weights deadlock for DP by @ByronHsu in #1825
- fix get_memory_pool_size deadlock for DP by @ByronHsu in #1830
- Support setting `use_thread` in `run_program` for easier debugging by @liuyanyi in #1823
- [3rdparty, document] Add 3rdparty/amd, with profiling and tuning instructions to be added by @HaiShaw in #1822
- stop_str of qwen2-vl template should be a tuple not a str by @yizhang2077 in #1834
- [FP8 KV Cache, Mixtral] Avoid KeyError at loading pre-quantized FP8 m… by @HaiShaw in #1835
- Gpt2 by @DanielC12321 in #1833
- Improve OpenAI API documents by @zhaochenyang20 in #1827
- Update docs by @merrymercy in #1839
- Update README.md by @merrymercy in #1840
- [Production] Drain requests before exit when receive SIGTERM by @Ying1123 in #1838
- [Performance, Hardware] MoE weights padding for AMD MI300X GPUs by @HaiShaw in #1836
- Fix suggest edit by @zhaochenyang20 in #1842
- [Performance, Triton Kernel Args] _decode_grouped_softmax_reducev_fwd… by @HaiShaw in #1845
- Make decode log interval configurable by @ByronHsu in #1847
- Fix mixed chunked prefill by @merrymercy in #1850
- Refactor tokenizer manager by @ByronHsu in #1846
- Simplify documentation by @merrymercy in #1851
- Fix warnings in doc build by @merrymercy in #1852
- delete unused character by @geeker-smallwhite in #1855
- Fix memory leak for chunked prefill 2 by @merrymercy in #1858
- [Build, ROCm] Dockerfile.rocm for Instinct GPUs, with package updates by @HaiShaw in #1861
- Fix retraction + overlap by @hnyls2002 in #1860
- change file tree by @zhaochenyang20 in #1859
- Update vocab embedding deps and add TP switch by @ispobock in #1856
- minor: add human eval by @zhyncs in #1754
- Add vlm document by @zhaochenyang20 in #1866
- minor: update nightly eval by @zhyncs in #1867
- [3rdparty, document] Updated Documentation that covers performance tuning techniques for AMD Instinct GPUs. by @yichiche in #1871
- Improve docs and fix the broken links by @merrymercy in #1875
- Add a FAQ documentation by @merrymercy in #1877
- Update docs title by @merrymercy in #1879
- Update docs and workflow by @merrymercy in #1881
- Fix doc links by @merrymercy in #1882
- Fix incorrect context length for llama3.2-11b by @rchen19 in #1873
- add native api docs by @zhaochenyang20 in #1883
- Update index.rst to improve the order of docs by @merrymercy in #1885
- Native api by @zhaochenyang20 in #1886
- Fix docs by @merrymercy in #1889
- Fix docs ci by @zhaochenyang20 in #1888
- Fix docs by @merrymercy in #1890
- Fix ci and link error by @zhaochenyang20 in #1892
- Add engine api by @zhaochenyang20 in #1894
- turn off log for the offline engine by @zhaochenyang20 in #1895
- Do not use longest prefix matching when #queue-req is large by @merrymercy in #1896
- Simplify tokenizer manager by @merrymercy in #1899
- Allow passing dtype and max_new_tokens to HF reference script by @janimo in #1903
- Simplify tokenizer manager by @merrymercy in #1904
- Unify the model type checking by @merrymercy in #1905
- Escape backwards slash by @inakineitor in #1902
- feat: support truss endpoint for benchmark serving by @zhyncs in #1906
- Let reward model take text inputs instead of message lists by @merrymercy in #1907
- Release v0.3.5 by @merrymercy in #1908
- Fix regex docs by @merrymercy in #1909
- Add Reward API Docs etc by @zhaochenyang20 in #1910
- [Docs, ROCm] update install to cover ROCm with MI GPUs by @HaiShaw in #1915
- [router] Impl radix tree and set up CI by @ByronHsu in #1893
- Update CODEOWNERS by @ByronHsu in #1916
- Change judge to classify & Modify make file by @zhaochenyang20 in #1920
- [Doc] improve relative links and structure by @merrymercy in #1924
- support prometheus metrics by @Lzhang-hub in #1853
- [rust] refactor server and router by @ByronHsu in #1922
- minor: Add basic editorconfig and pre-commit hooks to enforce style for whitespaces by @XuehaiPan in #1926
- Add Rust Router Python Binding by @austin362667 in #1891
- [Docs] fix 404 - Contributor Guide by @HaiShaw in #1942
- fix black in pre-commit by @zhaochenyang20 in #1940
- [Doc] fix docs by @merrymercy in #1949
- [Performance, Triton Kernel Args] extend_attention, optimize kern args to _fwd_kernel by @HaiShaw in #1941
- [ENV, ROCm] update environment settings by @HaiShaw in #1939
- Add a timeout for execute-notebook.yml by @merrymercy in #1951
- Update setup_github_runner.md by @merrymercy in #1952
- Monitoring documentation by @binarycrayon in #1933
- Gemma2 reward model support by @aqweteddy in #1954
- Remove the useless to_srt_kwargs by @merrymercy in #1955
- Adjust the reward model's score module and pooler module order to reduce computation by @aqweteddy in #1956
- [Release, ROCm] release ROCm docker build for AMD MI GPUs by @HaiShaw in #1957
- Add sentence_transformers to CI dependency by @merrymercy in #1958
- [minor] Improve code style and compatibility by @merrymercy in #1961
- Update README.md's Slack invitation link by @zhaochenyang20 in #1962
- Updated Instructions on Profiling SGLang Infer System with AMD GPUs by @leishaoSC in #1966
- Fix metrics by @binarycrayon in #1963
- Initialize model_worker_batch variable by @qeternity in #1973
- Introducing SGLang Guru on Gurubase.io by @kursataktas in #1745
- Update README.md by @merrymercy in #1974
- Update pr-test-rust.yml to add a "finish" step by @merrymercy in #1975
- [Minor] Fix a typo in test_torchao.py by @merrymercy in #1976
- Clean up metrics code by @merrymercy in #1972
- [CI] balance unit tests by @merrymercy in #1977
- Specify `zmq` Version Requirement by @HuanzhiMao in #1982
- Simplify prometheus metrics by @merrymercy in #1981
- fix: update pyzmq version by @zhyncs in #1983
- docs: add shm size for docker run by @zhyncs in #1986
- Fix qwen2vl bugs for #1971 and #1897 by @yizhang2077 in #1984
- [CI] Balance unit tests by @merrymercy in #1988
- Add gen-shared-prefix dataset in bench_serving by @ByronHsu in #1990
- [Performance, Triton] Optimize over mask compute to tl.load in fused_moe_kernel by @HaiShaw in #1980
- [rust] cache-aware DP - approx tree by @ByronHsu in #1934
- docs: add slides link in README by @zhyncs in #1997
- Add engine encode by @james-p-xu in #1995
- setup router python binding ci by @ByronHsu in #1999
- Add Engine::encode example by @james-p-xu in #2000
- Fix rust unit test and pypi token by @ByronHsu in #2001
- release router from py38 to py312 by @ByronHsu in #2002
- Bump router to 0.0.3 by @ByronHsu in #2004
- run rust test on ubuntu instead of 1-gpu-runner by @ByronHsu in #2003
- support internlm2-reward by @RangiLyu in #1994
- fix sglang_router not found by @ByronHsu in #2005
- [Minor] Remove unused imports by @merrymercy in #2006
- Fix a typo in io_struct.py by @merrymercy in #2008
- Fix weight loading for tied word embedding when TP > 1 by @merrymercy in #2009
- cleanup rust folder by @ByronHsu in #2010
- Filter empty prompt in random bench serving by @ispobock in #2011
- support echo=true and logprobs in the OpenAI API (as used by lm-evaluation-harness with logprobs=1) by @BBuf in #1998
- Fix finish reason by @merrymercy in #2013
- fix a bug in v1_embedding_request by @BBuf in #2014
- fix a prompt-length-too-long bug in test_embedding_models by @BBuf in #2015
- support parallel grammar preprocessing by @DarkSharpness in #1996
- Refactor grammar backend by @merrymercy in #2018
- Fix grammar backend for tensor parallelism by @merrymercy in #2020
- Release v0.3.5.post1 by @merrymercy in #2022
- Do not let invalid grammar crash the server by @merrymercy in #2023
- Fix dependency and error message for xgrammar by @merrymercy in #2024
- set content to empty string by @chottolabs in #2026
- chore: enable LTO and optimization in the release profile by @ethe in #2028
- Add download_dir ServerArgs property by @pjyi2147 in #2027
- Github runner instructions for AMD by @HaiShaw in #2031
- Fix torch.compile for MoE by @merrymercy in #2033
- Fix unit tests by @merrymercy in #2034
- Fix outlines version by @merrymercy in #2036
- Expose no_stop_trim and skip_special_tokens in openai api by @merrymercy in #2039
- Offline LLM Engine Benchmark Throughput by @zolinthecow in #1968
- fix: align enable_overlap_scheduler naming between code and docs by @w1ndseeker in #2038
- Fix the default arguments of bench_offline_throughput.py & simplify detokenizer manager by @merrymercy in #2042
- benchmark json schema by @DarkSharpness in #2030
- Fix json benchmark by @merrymercy in #2043
- [Fix] Adjust default chunked prefill size and cuda graph max bs according to GPU memory capacity by @merrymercy in #2044
- Release v0.3.5.post2 by @merrymercy in #2046
- fix a small typo in docs by @BBuf in #2047
- Fix core (MI300X) with --enable-overlap by @HaiShaw in #2048
- Add Tensor Parallel to torch_native_llama by @kwen2501 in #1876
- Add get_amdgpu_memory_capacity() by @HaiShaw in #2049
- Fix weight update for data parallelism by @merrymercy in #2050
- Support DP MLA by @ispobock in #1970
- Fix illegal memory access in overlap mode & Use more fused triton kernels for building meta data by @merrymercy in #2051
- chore: update torch v2.5.1 by @zhyncs in #1849
- Revert "chore: update torch v2.5.1" by @merrymercy in #2063
- Remove monkey_patch_vllm_dummy_weight_loader by @merrymercy in #2064
- Deprecate --disable-flashinfer and --disable-flashinfer-sampling by @merrymercy in #2065
- Support cuda graph for DP attention by @ispobock in #2061
- Rename argument `--disable-nan-detection` to `--enable-nan-detection` by @merrymercy in #2066
- [Performance] Update xgrammar-related constrained decoding by @DarkSharpness in #2056
- add phi-3 small support by @Tushar-ml in #2062
- [Minor] Fix styles for overlap mode by @merrymercy in #2068
- Fix cuda illegal memory access in overlap mode by @merrymercy in #2070
- Tune the threshold for accuracy tests in CI by @merrymercy in #2071
- Crash the CI jobs on model import errors by @merrymercy in #2072
- support setting role as 'tool' by @yukavio in #2075
- feat: update torch 2.5.1 by @zhyncs in #2069
- Rename layer_idx to layer_id for consistency by @janimo in #2078
- Fix chunked prefill with output logprob by @merrymercy in #2083
- Allow passing extra request body to bench_offline_throughput.py by @merrymercy in #2085
- Simplify logits penalizer by @merrymercy in #2086
- Use cuda event wait and synchronization instead of busy waiting by @merrymercy in #2089
- Fix: incorrect top_logprobs in chat completion by @ajwaitz in #2088
- minor: update gsm8k eval by @zhyncs in #2091
- Use native fp8 format on MI300X by @HaiShaw in #2094
- minor: add dataset dump and questions shuffle by @zhyncs in #2093
- Make constrained decoding work for overlap scheduler by @merrymercy in #2095
- Set schedule policy more conservative for DP attention by @ispobock in #2096
- Enable overlap by default by @merrymercy in #2067
- Update nightly-eval.yml by @merrymercy in #2100
- [feat] Add session control by @Ying1123 in #2073
- Allow skipping warmup in bench_offline_throughput.py by @merrymercy in #2103
- Move test_session_id.py to playground by @merrymercy in #2104
- Enable overlap scheduler by default for the triton attention backend by @merrymercy in #2105
- Error out when torchao-config option is not recognized by @jerryzh168 in #2107
- Turn off autotune for scaled mm for fp8 dynamic quant in torchao by @jerryzh168 in #2116
- ROCm: Fix MoE padding for non-FP8 cases by @HaiShaw in #2111
- Add support for Qwen2-VL-based embedding models by @james-p-xu in #2055
- [router] add base_gpu_id server args & merged radix tree python reference by @ByronHsu in #2115
- Fix #2037 - Context length check does not take into account pad tokens for visual models by @jakep-allenai in #2106
- Rename sglang.bench_latency to sglang.bench_one_batch by @merrymercy in #2118
- Benchmark with Pytorch Profiler easily by @bjmsong in #2110
- [minor] Clean up unused imports by @merrymercy in #2122
- minor: update gsm8k threshold by @zhyncs in #2125
- chore: bump v0.3.6 by @zhyncs in #2120
## New Contributors
- @zolinthecow made their first contribution in #1776
- @BBuf made their first contribution in #1778
- @DarkSharpness made their first contribution in #1752
- @hliuca made their first contribution in #1799
- @liuyanyi made their first contribution in #1823
- @DanielC12321 made their first contribution in #1833
- @geeker-smallwhite made their first contribution in #1855
- @yichiche made their first contribution in #1871
- @inakineitor made their first contribution in #1902
- @Lzhang-hub made their first contribution in #1853
- @XuehaiPan made their first contribution in #1926
- @austin362667 made their first contribution in #1891
- @binarycrayon made their first contribution in #1933
- @aqweteddy made their first contribution in #1954
- @leishaoSC made their first contribution in #1966
- @kursataktas made their first contribution in #1745
- @HuanzhiMao made their first contribution in #1982
- @james-p-xu made their first contribution in #1995
- @RangiLyu made their first contribution in #1994
- @chottolabs made their first contribution in #2026
- @ethe made their first contribution in #2028
- @w1ndseeker made their first contribution in #2038
- @kwen2501 made their first contribution in #1876
- @Tushar-ml made their first contribution in #2062
- @yukavio made their first contribution in #2075
- @ajwaitz made their first contribution in #2088
- @jakep-allenai made their first contribution in #2106
- @bjmsong made their first contribution in #2110
**Full Changelog**: v0.3.4.post1...v0.3.6