# Release v0.3.6
## Highlights
- Reduce CPU overhead by enabling the overlap scheduler by default: 1.1x higher throughput. (#2105, #2067, #2095)
- Support data parallelism for attention and MLA: 1.5x higher decoding throughput. (#1970, #2061)
- Cache-aware load balancer: 4x higher cache hit rate. (#1934)
- Support xgrammar backend for grammar-guided decoding (#2056)
- Support Prometheus metrics (#1853, #1981)
- Support torch 2.5.1 (#2069) and torch-native tensor parallelism (#1876)
- Support graceful termination (#1838) and watchdog (#1816)
- Support notebook-style documentation (https://sgl-project.github.io/)
- Add an offline benchmark script (#1968)
- Bug, deadlock, NaN, and OOM fixes (#2083, #1850, #1800, #1779, #1789, #1858)
- New models: Phi3-small (#2062), Gemma-2 reward model (#1954), GPT-2 (#1833)
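
A minimal sketch of exercising a few of these highlights through the offline Engine API (#1894, #1968). Engine kwargs mirror the server CLI flags; the model path is illustrative and the kwarg names are inferred from the linked PRs, so verify them against `python -m sglang.launch_server --help` on your installed version.

```python
# Hedged sketch: kwargs below mirror CLI flags from the linked PRs and may
# differ slightly on your installed version.
import sglang as sgl

llm = sgl.Engine(
    model_path="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model
    grammar_backend="xgrammar",  # xgrammar-backed grammar-guided decoding (#2056)
    # enable_dp_attention=True,  # DP attention for MLA models (#1970, #2061)
)

prompts = ["The capital of France is"]
sampling_params = {"temperature": 0, "max_new_tokens": 32}
for prompt, out in zip(prompts, llm.generate(prompts, sampling_params)):
    print(prompt, "->", out["text"])

llm.shutdown()
```

The same workload can be measured end to end with the new offline benchmark script (#1968), e.g. something like `python -m sglang.bench_offline_throughput --model-path <model> --num-prompts 10`.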
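
The cache-aware load balancer ships as a Rust router with a Python binding (#1790, #1891, #1934), released separately as `sglang-router` on PyPI (#2002, #2004). A sketch of the binding follows; the `Router` constructor arguments are assumed from the router's README of this era, so treat the exact names as illustrative.

```python
# Hedged sketch of the cache-aware router binding (pip install sglang-router).
# The Router signature is assumed; argument names may differ across versions.
from sglang_router import Router

router = Router(
    worker_urls=[
        "http://localhost:30001",  # each URL is a separately launched sglang server
        "http://localhost:30002",
    ],
)
router.start()  # proxies requests, routing by approximate radix-tree cache locality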
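
Prometheus metrics (#1853, #1981) are exposed over HTTP once the server is launched with metrics enabled (the flag is `--enable-metrics` per the linked PRs, assuming that name is unchanged on your version). A quick check:

```python
# Hedged sketch: scrape the Prometheus endpoint of a locally running server.
# Assumes the server was started with metrics enabled and listens on :30000.
import requests

text = requests.get("http://localhost:30000/metrics").text
print(text[:500])  # counters and gauges in Prometheus exposition format
```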
## What's Changed
- Fix edge case for truncated by @ByronHsu in #1747
- Fuse more ops & Simplify token mapping by @merrymercy in #1758
- [API] add get memory pool size by @Ying1123 in #1760
- Fix perf regression for set_kv_buffer by @merrymercy in #1765
- [Fix] Fix abort in data parallelism by @merrymercy in #1767
- Fix stop condition for <|eom_id|> by @merrymercy in #1766
- Update docs by @merrymercy in #1768
- Fix missing additional_stop_token_ids by @merrymercy in #1769
- Fix out of memory message. by @hnyls2002 in #1771
- Crash the server on warnings in CI by @merrymercy in #1772
- Fix the perf regression due to additional_stop_token_ids by @merrymercy in #1773
- Fix MockTokenizer in the unit tests by @merrymercy in #1774
- [Bug] Catch any errors caused by parsing json schema by @zolinthecow in #1776
- [Fix] Fix NaN issues by fixing the cuda graph padding values for flashinfer by @merrymercy in #1779
- [Fix] Fix cuda graph padding for triton attention backend by @merrymercy in #1782
- check user-specified model_max_len against the HF-derived max_model_len by @BBuf in #1778
- Re-introduce `get_cuda_graph_seq_len_fill_value` by @merrymercy in #1783
- Enhance the test case for chunked prefill and check memory leak by @merrymercy in #1785
- Fix seq_lens_sum for cuda graph runner in padded cases by @merrymercy in #1789
- Qwen2vl support cuda graph and disable radix cache by @yizhang2077 in #1780
- Fix log parsing in the chunked prefill unit tests by @merrymercy in #1793
- Fix memory leak when doing chunked prefill by @hnyls2002 in #1787
- [Fix] Fix the log parsing in chunked prefill unit tests by @merrymercy in #1794
- Revert "Fix memory leak when doing chunked prefill" by @merrymercy in #1797
- Fix logprob in the overlapped mode by @merrymercy in #1795
- Release v0.3.4.post2 by @merrymercy in #1796
- [Performance] Support both xgrammar and outlines for constrained decoding by @DarkSharpness in #1752
- [Fix] Fix --skip-tokenizer-init by @merrymercy in #1798
- move max_position_embeddings to the last by @hliuca in #1799
- add support for ipynb by @zhaochenyang20 in #1786
- Fix possible ZMQ hanging by @hnyls2002 in #1800
- Set `ZMQ` buffer size heuristic by @hnyls2002 in #1801
- Allow consecutive ports when launching multiple sglang servers by @hnyls2002 in #1802
- fix int conversion for `SGLANG_CPU_COUNT` by @ByronHsu in #1803
- Update ci workflows by @merrymercy in #1804
- Update links by @merrymercy in #1805
- Simplify our docs by moving complicated functions into utils by @zhaochenyang20 in #1807
- Fix docs ci by @zhaochenyang20 in #1808
- Provide an argument to set the maximum batch size for cuda graph by @merrymercy in #1809
- Improve the user control of new_token_ratio by @merrymercy in #1811
- Update hyperparameter_tuning.md by @merrymercy in #1813
- Add a watchdog thread by @merrymercy in #1816
- Fix unit tests by @merrymercy in #1817
- Add OpenAI-compatible API by @zhaochenyang20 in #1810
- Fix Triton decode kernel & ut by @ispobock in #1819
- support token ids in `engine.generate` by @ByronHsu in #1820
- Fix docs deploy ci by @zhaochenyang20 in #1821
- [router] rust-based router by @ByronHsu in #1790
- Fix update_weights deadlock for DP by @ByronHsu in #1825
- fix get_memory_pool_size deadlock for DP by @ByronHsu in #1830
- Support setting `use_thread` in `run_program` for easier debugging by @liuyanyi in #1823
- [3rdparty, document] Add 3rdparty/amd, with profiling and tuning instructions to be added by @HaiShaw in #1822
- stop_str of qwen2-vl template should be a tuple not a str by @yizhang2077 in #1834
- [FP8 KV Cache, Mixtral] Avoid KeyError at loading pre-quantized FP8 m… by @HaiShaw in #1835
- Gpt2 by @DanielC12321 in #1833
- Improve OpenAI API documents by @zhaochenyang20 in #1827
- Update docs by @merrymercy in #1839
- Update README.md by @merrymercy in #1840
- [Production] Drain requests before exit when receive SIGTERM by @Ying1123 in #1838
- [Performance, Hardware] MoE weights padding for AMD MI300X GPUs by @HaiShaw in #1836
- Fix suggest edit by @zhaochenyang20 in #1842
- [Performance, Triton Kernel Args] _decode_grouped_softmax_reducev_fwd… by @HaiShaw in #1845
- Make decode log interval configurable by @ByronHsu in #1847
- Fix mixed chunked prefill by @merrymercy in #1850
- Refactor tokenizer manager by @ByronHsu in #1846
- Simplify documentation by @merrymercy in #1851
- Fix warnings in doc build by @merrymercy in #1852
- delete unused character by @geeker-smallwhite in #1855
- Fix memory leak for chunked prefill 2 by @merrymercy in #1858
- [Build, ROCm] Dockerfile.rocm for Instinct GPUs, with package updates by @HaiShaw in #1861
- Fix retraction + overlap by @hnyls2002 in #1860
- change file tree by @zhaochenyang20 in #1859
- Update vocab embedding deps and add TP switch by @ispobock in #1856
- minor: add human eval by @zhyncs in #1754
- Add vlm document by @zhaochenyang20 in #1866
- minor: update nightly eval by @zhyncs in #1867
- [3rdparty, document] Updated Documentation that covers performance tuning techniques for AMD Instinct GPUs. by @yichiche in #1871
- Improve docs and fix the broken links by @merrymercy in #1875
- Add a FAQ documentation by @merrymercy in #1877
- Update docs title by @merrymercy in #1879
- Update docs and workflow by @merrymercy in #1881
- Fix doc links by @merrymercy in #1882
- Fix incorrect context length for llama3.2-11b by @rchen19 in #1873
- add native api docs by @zhaochenyang20 in #1883
- Update index.rst to improve the order of docs by @merrymercy in #1885
- Native api by @zhaochenyang20 in #1886
- Fix docs by @merrymercy in #1889
- Fix docs ci by @zhaochenyang20 in #1888
- Fix docs by @merrymercy in #1890
- Fix ci and link error by @zhaochenyang20 in #1892
- Add engine api by @zhaochenyang20 in #1894
- turn off log for the offline engine by @zhaochenyang20 in #1895
- Do not use longest prefix matching when #queue-req is large by @merrymercy in #1896
- Simplify tokenizer manager by @merrymercy in #1899
- Allow passing dtype and max_new_tokens to HF reference script by @janimo in #1903
- Simplify tokenizer manager by @merrymercy in #1904
- Unify the model type checking by @merrymercy in #1905
- Escape backwards slash by @inakineitor in #1902
- feat: support truss endpoint for benchmark serving by @zhyncs in #1906
- Let reward model take text inputs instead of message lists by @merrymercy in #1907
- Release v0.3.5 by @merrymercy in #1908
- Fix regex docs by @merrymercy in #1909
- Add Reward API Docs etc by @zhaochenyang20 in #1910
- [Docs, ROCm] update install to cover ROCm with MI GPUs by @HaiShaw in #1915
- [router] Impl radix tree and set up CI by @ByronHsu in #1893
- Update CODEOWNERS by @ByronHsu in #1916
- Change judge to classify & Modify make file by @zhaochenyang20 in #1920
- [Doc] improve relative links and structure by @merrymercy in #1924
- support prometheus metrics by @Lzhang-hub in #1853
- [rust] refactor server and router by @ByronHsu in #1922
- minor: Add basic editorconfig and pre-commit hooks to enforce style for whitespaces by @XuehaiPan in #1926
- Add Rust Router Python Binding by @austin362667 in #1891
- [Docs] fix 404 - Contributor Guide by @HaiShaw in #1942
- fix black in pre-commit by @zhaochenyang20 in #1940
- [Doc] fix docs by @merrymercy in #1949
- [Performance, Triton Kernel Args] extend_attention, optimize kern args to _fwd_kernel by @HaiShaw in #1941
- [ENV, ROCm] update environment settings by @HaiShaw in #1939
- Add a timeout for execute-notebook.yml by @merrymercy in #1951
- Update setup_github_runner.md by @merrymercy in #1952
- Monitoring documentation by @binarycrayon in #1933
- Gemma2 reward model support by @aqweteddy in #1954
- Remove the useless to_srt_kwargs by @merrymercy in #1955
- Adjust the reward model's score module and pooler module order to reduce computation by @aqweteddy in #1956
- [Release, ROCm] release ROCm docker build for AMD MI GPUs by @HaiShaw in #1957
- Add sentence_transformers to CI dependency by @merrymercy in #1958
- [minor] Improve code style and compatibility by @merrymercy in #1961
- Update README.md's Slack invitation link by @zhaochenyang20 in #1962
- Updated Instructions on Profiling SGLang Infer System with AMD GPUs by @leishaoSC in #1966
- Fix metrics by @binarycrayon in #1963
- Initialize model_worker_batch variable by @qeternity in #1973
- Introducing SGLang Guru on Gurubase.io by @kursataktas in #1745
- Update README.md by @merrymercy in #1974
- Update pr-test-rust.yml to add a "finish" step by @merrymercy in #1975
- [Minor] Fix a typo in test_torchao.py by @merrymercy in #1976
- Clean up metrics code by @merrymercy in #1972
- [CI] balance unit tests by @merrymercy in #1977
- Specify `zmq` Version Requirement by @HuanzhiMao in #1982
- Simplify prometheus metrics by @merrymercy in #1981
- fix: update pyzmq version by @zhyncs in #1983
- docs: add shm size for docker run by @zhyncs in #1986
- Fix qwen2vl bugs for #1971 and #1897 by @yizhang2077 in #1984
- [CI] Balance unit tests by @merrymercy in #1988
- Add gen-shared-prefix dataset in bench_serving by @ByronHsu in #1990
- [Performance, Triton] Optimize over mask compute to tl.load in fused_moe_kernel by @HaiShaw in #1980
- [rust] cache-aware DP - approx tree by @ByronHsu in #1934
- docs: add slides link in README by @zhyncs in #1997
- Add engine encode by @james-p-xu in #1995
- setup router python binding ci by @ByronHsu in #1999
- Add Engine::encode example by @james-p-xu in #2000
- Fix rust unit test and pypi token by @ByronHsu in #2001
- release router from py38 to py312 by @ByronHsu in #2002
- Bump router to 0.0.3 by @ByronHsu in #2004
- run rust test on ubuntu instead of 1-gpu-runner by @ByronHsu in #2003
- support internlm2-reward by @RangiLyu in #1994
- fix sglang_router not found by @ByronHsu in #2005
- [Minor] Remove unused imports by @merrymercy in #2006
- Fix a typo in io_struct.py by @merrymercy in #2008
- Fix weight loading for tied word embedding when TP > 1 by @merrymercy in #2009
- cleanup rust folder by @ByronHsu in #2010
- Filter empty prompt in random bench serving by @ispobock in #2011
- support echo=true and logprobs in the OpenAI API (as used by lm-evaluation-harness with logprobs=1) by @BBuf in #1998
- Fix finish reason by @merrymercy in #2013
- fix a bug in v1_embedding_request by @BBuf in #2014
- fix a prompt-length-too-long bug in test_embedding_models by @BBuf in #2015
- support parallel grammar preprocessing by @DarkSharpness in #1996
- Refactor grammar backend by @merrymercy in #2018
- Fix grammar backend for tensor parallelism by @merrymercy in #2020
- Release v0.3.5.post1 by @merrymercy in #2022
- Do not let invalid grammar crash the server by @merrymercy in #2023
- Fix dependency and error message for xgrammar by @merrymercy in #2024
- set content to empty string by @chottolabs in #2026
- chore: enable LTO and optimization in the release profile by @ethe in #2028
- Add download_dir ServerArgs property by @pjyi2147 in #2027
- Github runner instructions for AMD by @HaiShaw in #2031
- Fix torch.compile for MoE by @merrymercy in #2033
- Fix unit tests by @merrymercy in #2034
- Fix outlines version by @merrymercy in #2036
- Expose no_stop_trim and skip_special_tokens in openai api by @merrymercy in #2039
- Offline LLM Engine Benchmark Throughput by @zolinthecow in #1968
- fix: align enable_overlap_scheduler naming between code and docs by @w1ndseeker in #2038
- Fix the default arguments of bench_offline_throughput.py & simplify detokenizer manager by @merrymercy in #2042
- benchmark json schema by @DarkSharpness in #2030
- Fix json benchmark by @merrymercy in #2043
- [Fix] Adjust default chunked prefill size and cuda graph max bs according to GPU memory capacity by @merrymercy in #2044
- Release v0.3.5.post2 by @merrymercy in #2046
- fix a small typo in docs by @BBuf in #2047
- Fix core (MI300X) with --enable-overlap by @HaiShaw in #2048
- Add Tensor Parallel to torch_native_llama by @kwen2501 in #1876
- Add get_amdgpu_memory_capacity() by @HaiShaw in #2049
- Fix weight update for data parallelism by @merrymercy in #2050
- Support DP MLA by @ispobock in #1970
- Fix illegal memory access in overlap mode & Use more fused triton kernels for building meta data by @merrymercy in #2051
- chore: update torch v2.5.1 by @zhyncs in #1849
- Revert "chore: update torch v2.5.1" by @merrymercy in #2063
- Remove monkey_patch_vllm_dummy_weight_loader by @merrymercy in #2064
- Deprecate --disable-flashinfer and --disable-flashinfer-sampling by @merrymercy in #2065
- Support cuda graph for DP attention by @ispobock in #2061
- Rename argument `--disable-nan-detection` to `--enable-nan-detection` by @merrymercy in #2066
- [Performance] Update xgrammar-related constrained decoding by @DarkSharpness in #2056
- add phi-3 small support by @Tushar-ml in #2062
- [Minor] Fix styles for overlap mode by @merrymercy in #2068
- Fix cuda illegal memory access in overlap mode by @merrymercy in #2070
- Tune the threshold for accuracy tests in CI by @merrymercy in #2071
- Crash the CI jobs on model import errors by @merrymercy in #2072
- support setting role as 'tool' by @yukavio in #2075
- feat: update torch 2.5.1 by @zhyncs in #2069
- Rename layer_idx to layer_id for consistency by @janimo in #2078
- Fix chunked prefill with output logprob by @merrymercy in #2083
- Allow passing extra request body to bench_offline_throughput.py by @merrymercy in #2085
- Simplify logits penalizer by @merrymercy in #2086
- Use cuda event wait and synchronization instead of busy waiting by @merrymercy in #2089
- Fix: incorrect top_logprobs in chat completion by @ajwaitz in #2088
- minor: update gsm8k eval by @zhyncs in #2091
- Use native fp8 format on MI300X by @HaiShaw in #2094
- minor: add dataset dump and questions shuffle by @zhyncs in #2093
- Make constrained decoding work for overlap scheduler by @merrymercy in #2095
- Set schedule policy more conservative for DP attention by @ispobock in #2096
- Enable overlap by default by @merrymercy in #2067
- Update nightly-eval.yml by @merrymercy in #2100
- [feat] Add session control by @Ying1123 in #2073
- Allow skipping warmup in bench_offline_throughput.py by @merrymercy in #2103
- Move test_session_id.py to playground by @merrymercy in #2104
- Enable overlap scheduler by default for the triton attention backend by @merrymercy in #2105
- Error out when torchao-config option is not recognized by @jerryzh168 in #2107
- Turn off autotune for scaled mm for fp8 dynamic quant in torchao by @jerryzh168 in #2116
- ROCm: Fix MoE padding for non-FP8 cases by @HaiShaw in #2111
- Add support for Qwen2-VL-based embedding models by @james-p-xu in #2055
- [router] add base_gpu_id server args & merged radix tree python reference by @ByronHsu in #2115
- Fix #2037 - Context length check does not take into account pad tokens for visual models by @jakep-allenai in #2106
- Rename sglang.bench_latency to sglang.bench_one_batch by @merrymercy in #2118
- Benchmark with Pytorch Profiler easily by @bjmsong in #2110
- [minor] Clean up unused imports by @merrymercy in #2122
- minor: update gsm8k threshold by @zhyncs in #2125
- chore: bump v0.3.6 by @zhyncs in #2120
## New Contributors
- @zolinthecow made their first contribution in #1776
- @BBuf made their first contribution in #1778
- @DarkSharpness made their first contribution in #1752
- @hliuca made their first contribution in #1799
- @liuyanyi made their first contribution in #1823
- @DanielC12321 made their first contribution in #1833
- @geeker-smallwhite made their first contribution in #1855
- @yichiche made their first contribution in #1871
- @inakineitor made their first contribution in #1902
- @Lzhang-hub made their first contribution in #1853
- @XuehaiPan made their first contribution in #1926
- @austin362667 made their first contribution in #1891
- @binarycrayon made their first contribution in #1933
- @aqweteddy made their first contribution in #1954
- @leishaoSC made their first contribution in #1966
- @kursataktas made their first contribution in #1745
- @HuanzhiMao made their first contribution in #1982
- @james-p-xu made their first contribution in #1995
- @RangiLyu made their first contribution in #1994
- @chottolabs made their first contribution in #2026
- @ethe made their first contribution in #2028
- @w1ndseeker made their first contribution in #2038
- @kwen2501 made their first contribution in #1876
- @Tushar-ml made their first contribution in #2062
- @yukavio made their first contribution in #2075
- @ajwaitz made their first contribution in #2088
- @jakep-allenai made their first contribution in #2106
- @bjmsong made their first contribution in #2110
**Full Changelog**: v0.3.4.post1...v0.3.6