Releases · sgl-project/sglang

04 Jul 06:35

Ying1123

v0.1.18

2f11936

Release v0.1.18

Highlights

2x large batch prefill improvement with the new flashinfer kernels #579
Multi-node tensor parallelism #550
New model support: ChatGLM #516

What's Changed

Fix missing numpy dependency in pyproject.toml by @fpreiss in #524
Fix RAG nb, parea setup (parea -> parea-ai) by @fpreiss in #525
[Minor] Correct Optional type hints in api by @fpreiss in #526
Add ChatGLM Model Support by @Qubitium in #516
Fix Regression: Disable p2p for 4090 by @ZX-ModelCloud in #531
Decode Incrementally by @hnyls2002 in #517
Fix dependency by @merrymercy in #538
Fix dependency & crash issues by @Ying1123 in #539
Higher priority for user input of max_prefill_tokens & format by @Ying1123 in #540
Add disk cache for loading ShareGPT dataset. by @hnyls2002 in #542
Fix tp worker only checking req[0] for stream by @Qubitium in #546
Fix the Jump-Forward with Chinese by @hnyls2002 in #551
Update fused_moe by @merrymercy in #553
Multi-node Tensor Parallelism by @Ying1123 in #550
Update flashinfer to 0.0.5 by @merrymercy in #554
Follow-up fixes for flashinfer 0.0.5 by @merrymercy in #556
Fix latency benchmark by @hnyls2002 in #557
Clean up logits processor by @merrymercy in #558
Update test_flashinfer by @hnyls2002 in #560
Allow running with vllm==0.4.3 by @merrymercy in #561
Add a new argument log_level_http to control the HTTP logging by @merrymercy in #563
Add sglang.bench_latency for offline benchmark by @merrymercy in #564
Warmup cublas by @merrymercy in #566
Increase the thread count limit for tp worker managers. by @merrymercy in #567
Update readme by @merrymercy in #568
Expose dtype argument by @merrymercy in #569
Update benchmark script by @Ying1123 in #571
Minor fix in compiler & format by @ZackZeng999 in #545
Update run_batch interface and max_prefill_tokens by @Ying1123 in #574
Fix flashinfer version by @PanJason in #576
[BugFix] gemma loading weights "lm_head.weight" key error by @dhgarcia in #577
Turn on flashinfer by default by @Ying1123 in #578
fix the broken server args by @hnyls2002 in #585
2x performance improvement for large prefill & Fix workspace conflicts by @Ying1123 in #579

New Contributors

@fpreiss made their first contribution in #524
@ZackZeng999 made their first contribution in #545
@PanJason made their first contribution in #576
@dhgarcia made their first contribution in #577

Full Changelog: v0.1.17...v0.1.18

Contributors

Qubitium, dhgarcia, and 7 other contributors


08 Jun 02:58

merrymercy

v0.1.17

e8a2327

Release v0.1.17

Highlights

Add data parallelism #480
Add speculative execution for OpenAI API #250
Update vllm to v0.4.3 for new quantization features #511
Better error handling (#457, #449, #514)

What's Changed

[Feat] Add llava qwen, llava mistral by @kcz358 in #419
Format code by @hnyls2002 in #441
Add finish_reason to OpenAI API by @mgerstgrasser in #446
Simplify port allocation by @merrymercy in #447
Add PUT for generate api by @Ying1123 in #448
Improve error handling & abort disconnected requests by @merrymercy in #449
Fix the broken --disable-radix-cache by @hnyls2002 in #451
openai chat speculative execution by @ChuyueSun in #250
Fix openai speculative execution by @Ying1123 in #456
Abort disconnected requests by @merrymercy in #457
Rename api_num_spec_tokens -> num_api_spec_tokens by @merrymercy in #458
Use model loader from vllm by @merrymercy in #459
port fp8 mixtral by @merrymercy in #460
fix test bug in srt_llava_next_test.py by @bingwork in #470
Add the instruction link to the LLaVA-NeXT-Video at README by @ZhangYuanhan-AI in #463
Improve logging & add logit cap by @merrymercy in #471
Optimize retract by @hnyls2002 in #440
Add benchmark scripts by @Ying1123 in #476
[Feat/Fix] Refactoring Llava models into single file by @Luodian in #475
Improve benchmark scripts & rename some scripts by @merrymercy in #477
Improve benchmark scripts & add more models by @merrymercy in #484
Support data parallelism (static) by @Ying1123 in #480
Make the server random by default by @merrymercy in #488
Revert "Make the server random by default" by @Ying1123 in #492
update the script: examples/usage/llava_video/srt_example_llava_v.sh by @ZhangYuanhan-AI in #491
Make the server random by default by @merrymercy in #493
Update vllm to v0.4.3 by @merrymercy in #511
remove redundant pad_input_ids function by @amosyou in #500
Litellm Backend by @huyiwen in #502
Fix rid state map leak + Refactor .finished by @Qubitium in #505
Crash the server when error or OOM happens by @merrymercy in #514
Update version to 0.1.17 by @merrymercy in #515

New Contributors

@kcz358 made their first contribution in #419
@mgerstgrasser made their first contribution in #446
@bingwork made their first contribution in #470
@amosyou made their first contribution in #500
@huyiwen made their first contribution in #502

Full Changelog: v0.1.16...v0.1.17

Contributors

Qubitium, huyiwen, and 10 other contributors


14 May 00:36

merrymercy

v0.1.16

e0ae5d4

v0.1.16

Highlights

Support more models: DBRX, Command-R, Gemma
Support llava-video (#423, https://llava-vl.github.io/blog/2024-04-30-llava-next-video/)
Cache performance improvements (#418, #364)
Marlin quantization kernels
Many bug fixes
Update dependencies to be compatible with their latest versions

What's Changed

Fix Runtime missing some ServerArgs options by @Qubitium in #281
adding the triton docker build minimal example by @amirarsalan90 in #242
Fix flashinfer >= 0.0.3 compat by @Qubitium in #282
Fix Incorrect CURL Request Example in README by @amirarsalan90 in #287
enable marlin kernels by @qeternity in #286
Fix env (docker) compat due to file usage by @Qubitium in #288
Fix marlin model loading compat with autogptq by @Liurl21 in #290
Fix outlines-0.0.35 incompatibility by @ZhouGongZaiShi in #291
[Fix/Potential Bugs] Can not correctly import models in python/sglang/srt/models by @Luodian in #311
Use Anthropic messages API by @janimo in #304
Add StableLM model. by @janimo in #301
Support oai in benchmark/mmlu by @merrymercy in #323
Update version to v0.1.14 by @merrymercy in #324
Cleanup codebase: removed unnecessary code/logic by @Qubitium in #298
Update dependencies by @janimo in #326
Openrouter usage example by @janimo in #327
model_rpc style improvement by @hnyls2002 in #293
model_runner simplify by @hnyls2002 in #329
Logprobs Refactor by @hnyls2002 in #331
DBRX support by @hnyls2002 in #337
Add support for new autogptq quant_config.checkpoint_format by @Qubitium in #332
Fix llava parallelism/fork bug by @lockon-n in #315
Eliminate 2 gpu ops during sampling when logit_bias is zero by @hnyls2002 in #343
Revert "Eliminate 2 gpu ops during sampling when logit_bias is zero" by @hnyls2002 in #345
Eliminate 2 gpu ops during sampling when logit_bias is zero by @Qubitium in #338
Add timeout to get_meta_info by @SimoneRaponi in #346
Fix typos in infer_batch.py by @tom-doerr in #354
Time cost utils by @hnyls2002 in #355
Update README.md by @eltociear in #358
support command-r by @ZhouXingg in #369
Fix issue #367 – System message not supported for Anthropic (anthropic.BadRequestError) by @fronx in #368
Update model support in readme by @Ying1123 in #370
Optimize radix tree matching by @ispobock in #364
Reduce overhead when fork(1) by @hnyls2002 in #375
llama3 instruct template by @qeternity in #372
add .isort.cfg by @hnyls2002 in #378
Revert removing the unused imports by @hnyls2002 in #385
Benchmark Updates by @hnyls2002 in #382
Improve performance when running with full parallel by @hnyls2002 in #394
Minor: style improvement of radix_cache and memory_pool by @hnyls2002 in #395
Format Benchmark Code by @hnyls2002 in #399
Fix chatml template by @merrymercy in #406
Adding RAG tracing & eval cookbook using Parea by @joschkabraun in #390
SamplingParams add "spaces_between_special_tokens" argument by @ZhouXingg in #392
Organize Benchmark by @hnyls2002 in #381
Add Cohere Command R chat template by @noah-kim-theori in #411
Fix sync() when fork(1) by @hnyls2002 in #412
Include finish reason in meta info response by @qeternity in #415
Make public APIs more standard. by @hnyls2002 in #416
Compat with latest VLLM 0.4.2 main + fork.number rename + Flashinfer 0.0.4 by @Qubitium in #380
Optimize the memory usage of logits processor by @merrymercy in #420
Clean up by @merrymercy in #422
Fix logit processor bugs by @merrymercy in #427
Minor fix for the import path by @merrymercy in #428
Move openai api server into a separate file by @merrymercy in #429
Fix flashinfer by @merrymercy in #430
Update version to 0.1.15 by @merrymercy in #431
Misc fixes by @merrymercy in #432
Allow input_ids in the input of the /generate endpoint by @lolipopshock in #363
Improve error handling by @merrymercy in #433
Cache optimizations by @hnyls2002 in #418
Update readme by @merrymercy in #434
Raise errors for prompts that are too long by @merrymercy in #436
support llava video by @ZhangYuanhan-AI in #426
Fix streaming by @merrymercy in #437
Update version to 0.1.16 by @merrymercy in #438

New Contributors

@Qubitium made their first contribution in #281
@amirarsalan90 made their first contribution in #242
@Liurl21 made their first contribution in #290
@ZhouGongZaiShi made their first contribution in #291
@Luodian made their first contribution in #311
@janimo made their first contribution in #304
@lockon-n made their first contribution in #315
@SimoneRaponi made their first contribution in #346
@tom-doerr made their first contribution in #354
@ZhouXingg made their first contribution in #369
@fronx made their first contribution in #368
@ispobock made their first contribution in #364
@joschkabraun made their first contribution in #390
@noah-kim-theori made their first contribution in #411
@lolipopshock made their first contribution in #363
@ZhangYuanhan-AI made their first contribution in #426

Full Changelog: v0.1.13...v0.1.16

Contributors

fronx, janimo, and 19 other contributors


11 Mar 12:52

merrymercy

v0.1.13

4aa5dd2

Release v0.1.13

Highlights

Gemma Support by @hnyls2002 in #256
Add Together and AzureOpenAI examples by @merrymercy in #184

What's Changed

correct a mistake on the README.md by @yaya-sy in #182
correct reference dtype openai.py by @yaya-sy in #181
Add Together and AzureOpenAI examples by @merrymercy in #184
Fix server launch for jupyter notebook by @merrymercy in #186
Refactor decoding logprob and add completion_tokens_wo_jump_forward by @comaniac in #189
Pin outlines version by @comaniac in #196
Adjust outlines version. by @hnyls2002 in #200
Update README.md by @eltociear in #207
Added the ability to Modify the Context Length by @psych0v0yager in #210
Fix logprobs with logprob_start_len by @comaniac in #193
Support outlines > 0.0.31 by @comaniac in #219
Fix stop str merging by @hnyls2002 in #225
Fix interpreter.py get_var(var_name) in text iter when stream is not enabled by @exceedzhang in #198
fix chatml template by @qeternity in #195
Upload agent_calls.jsonl download link by @hnyls2002 in #226
Fix addr reuse in check_port by @hnyls2002 in #253
Add SSL Cert Functionality by @nivibilla in #224
Refactor ChatTemplate for Enhanced Clarity and Efficiency by @cubxxw in #201
Add set_var to interpreter.py by @1024th in #263
Add logo by @merrymercy in #275
Fix qwen config by @hnyls2002 in #261
replace skip_embed with input_embeds by @TideDra in #222
Gemma Support by @hnyls2002 in #256
Improve gemma and documentations by @merrymercy in #278
Organize server_args by @hnyls2002 in #277
Add Support for API Key Authentication by @alessiodallapiazza in #230
Fix RuntimeEndpoint by @merrymercy in #279
Update version to v0.1.13 by @merrymercy in #280

New Contributors

@psych0v0yager made their first contribution in #210
@exceedzhang made their first contribution in #198
@qeternity made their first contribution in #195
@cubxxw made their first contribution in #201
@1024th made their first contribution in #263
@TideDra made their first contribution in #222
@alessiodallapiazza made their first contribution in #230

Full Changelog: v0.1.12...v0.1.13

Contributors

alessiodallapiazza, exceedzhang, and 11 other contributors


11 Feb 14:49

merrymercy

v0.1.12

624b21e

Release v0.1.12

Highlights

Fast JSON Decoding (blog)
Output logprobs for decoding tokens
Multiple bug fixes

What's Changed

Fix no-cache mode by @Ying1123 in #136
Support Faster JSON decoding for llava by @hnyls2002 in #137
Fix undefined variable by @yaya-sy in #142
jump-forward rename by @hnyls2002 in #144
Add warmup to SRT server by @comaniac in #146
add openai error handler with retry and logger by @ChuyueSun in #148
Temporary fix OpenAI API for Pydantic v1/v2 by @comaniac in #153
Add gptq quantization model support by @Arcmoon-Hu in #141
Support decode token logprobs by @comaniac in #130
Format code & move functions by @merrymercy in #155
[Submodule] Change FlashInfer to import by @comaniac in #156
add --disable-disk-cache by @hnyls2002 in #160
Add Auth Token to RuntimeEndpoint by @nivibilla in #162
Fix BaseCache metric by @comaniac in #170
import outlines by @hnyls2002 in #168
Fix token usage with jump forward by @comaniac in #174
Support extra field regex in OpenAI API by @comaniac in #172
Fix the chat template for llava-v1.6-34b & format code by @merrymercy in #177
Update version to 0.1.12 by @merrymercy in #178

New Contributors

@yaya-sy made their first contribution in #142
@ChuyueSun made their first contribution in #148
@nivibilla made their first contribution in #162

Full Changelog: v0.1.11...v0.1.12

Contributors

comaniac, Ying1123, and 6 other contributors


03 Feb 10:57

Ying1123

v0.1.11

f6bfe3a

Release v0.1.11

Highlights

Serve the official release demo of LLaVA v1.6 (blog)
Support Yi-VL (example)
Faster JSON decoding (blog)
Support QWen 2

What's Changed

Fix the error message and dependency of openai backend by @merrymercy in #71
Add an async example by @Ying1123 in #37
Add a note about triton version for older GPUs by @merrymercy in #72
Support load fine-tuned LLaVA model by @isaac-vidas in #80
Support qwen model and solve some problems by @Arcmoon-Hu in #75
Fix after QWen support by @merrymercy in #82
Fix the chat template for QWen by @merrymercy in #83
Fix SRT endpoint api json syntax by @CSWellesSun in #84
Return logprob for choices by @merrymercy in #87
Add health endpoint to SGLang runtime server by @isaac-vidas in #90
Llava-hd Support by @caoshiyi in #92
Bump the version to v0.1.8 by @merrymercy in #93
Improve Chinese character streaming when the last char is half Chinese word. by @haotian-liu in #95
Handle grayscale images in expand2square by @isaac-vidas in #97
support speculative execution for openai API by @parasol-aser in #48
fix batch error for llava-hd by @caoshiyi in #98
Dynamic model class loading by @comaniac in #101
Flush Cache API by @hnyls2002 in #103
Fix Mistral model loading by @comaniac in #108
Improve the control of streaming and improve the first token latency in streaming by @merrymercy in #117
Add qwen2 by @JustinLin610 in #114
Format code by @merrymercy in #118
Update quick start examples by @merrymercy in #120
Improve docs & Add JSON decode example by @merrymercy in #121
[Feature] Adds basic support for image content in OpenAI chat routes by @fozziethebeat in #113
[Feature] Allow specifying all ports to use in advance by @Ja1Zhou in #116
Add cache metrics by @comaniac in #119
Fix model loading & format code by @merrymercy in #125
Add city doc benchmark mode by @hnyls2002 in #129
Yi-VL Model by @BabyChouSr in #112
Fix is_multimodal_model judge by @hnyls2002 in #132
Add max_prefill_num_token into server arguments by @Ying1123 in #133
Release 0.1.11 by @Ying1123 in #134

New Contributors

@isaac-vidas made their first contribution in #80
@Arcmoon-Hu made their first contribution in #75
@CSWellesSun made their first contribution in #84
@haotian-liu made their first contribution in #95
@parasol-aser made their first contribution in #48
@JustinLin610 made their first contribution in #114
@fozziethebeat made their first contribution in #113
@Ja1Zhou made their first contribution in #116

Full Changelog: v0.1.6...v0.1.11

Contributors

fozziethebeat, parasol-aser, and 12 other contributors


21 Jan 10:09

merrymercy

v0.1.6

cc3ada9

Release v0.1.6

Major features

Add OpenAI-compatible API server (Completion and ChatCompletion)
Fix sgl.select

All PRs

Support v1/chat/completions by @comaniac in #50
Fix select and normalized logprobs by @merrymercy in #67
Bump version to 0.1.5 by @merrymercy in #33
Use HTTP link in 3rdparty module by @comaniac in #42
Document sampling parameters by @merrymercy in #45
Increase interpreter parallelism by @merrymercy in #46
Add a llava example by @merrymercy in #47
Support stream=True in v1/completions by @comaniac in #49
Format code & Improve readme by @merrymercy in #52
Fix the possible bug of decode out of memory by @hnyls2002 in #36
Improve error message & Add vicuna template by @merrymercy in #57
Update README.md by @eltociear in #58
Disk FSM cache and adjust code. by @hnyls2002 in #63
Fix select by @merrymercy in #64
Bump version to 0.1.6 by @merrymercy in #68

New Contributors

@comaniac made their first contribution in #42
@eltociear made their first contribution in #58

Full Changelog: v0.1.5...v0.1.6

Contributors

comaniac, merrymercy, and 2 other contributors


18 Jan 02:40

merrymercy

v0.1.5

22ec7bc

Release v0.1.5

What's Changed

Fix for T4 GPUs by @Ying1123 in #16
Gemini Backend by @caoshiyi in #9
Tweak mem fraction by @merrymercy in #20
Add option to return metadata in async streaming by @BabyChouSr in #18
Expose more arguments to control the scheduling policy by @merrymercy in #32
Rename image_url to image_file by @BabyChouSr in #15
Improve docs by @merrymercy in #17
Improve docs & Rename Gemini -> VertexAI by @merrymercy in #19
Fix streaming by @merrymercy in #30

New Contributors

@BabyChouSr made their first contribution in #15
@Ying1123 made their first contribution in #16
@caoshiyi made their first contribution in #9

Full Changelog: v0.1.3...v0.1.5

Contributors

Ying1123, merrymercy, and 2 other contributors

