Release v0.3.6

zhyncs released this 22 Nov 11:36
9a00e6f

Highlights

  • Reduce CPU overhead by enabling overlap scheduler by default. 1.1x higher throughput. (#2105, #2067, #2095)
  • Support data parallelism for attention and MLA. 1.5x higher decoding throughput. (#1970, #2061)
  • Cache-aware load balancer. 4x higher cache hit rate. (#1934)
  • Support xgrammar backend for grammar-guided decoding (#2056)
  • Support Prometheus metrics (#1853, #1981)
  • Support torch 2.5.1 (#2069) and torch-native tensor parallelism (#1876)
  • Support graceful termination (#1838) and watchdog (#1816)
  • Support notebook-style documentation (https://sgl-project.github.io/)
  • Add an offline benchmark script (#1968)
  • Bug, deadlock, NaN, and OOM fixes (#2083, #1850, #1800, #1779, #1789, #1858)
  • New models: Phi3-small (#2062), Gemma-2 reward model (#1954), GPT-2 (#1833)
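
To illustrate the idea behind the cache-aware load balancer highlight, here is a minimal sketch of prefix-based routing: a request goes to the worker that has already served the longest matching prompt prefix, so that worker's prefix cache is more likely to be reused, and falls back to the least-loaded worker otherwise. This is not the SGLang implementation; the class `CacheAwareRouter`, the helper `shared_prefix_len`, the `min_match_ratio` threshold, and the worker names are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def shared_prefix_len(a: str, b: str) -> int:
    """Return the number of leading characters that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class CacheAwareRouter:
    """Toy prefix-aware router (hypothetical; not SGLang's actual router)."""

    def __init__(self, workers, min_match_ratio=0.5):
        self.workers = list(workers)
        # Assumed threshold: minimum fraction of the prompt that must match
        # a previously served prompt before cache affinity wins over load.
        self.min_match_ratio = min_match_ratio
        self.history = defaultdict(list)          # worker -> prompts served
        self.load = {w: 0 for w in self.workers}  # worker -> requests routed

    def route(self, prompt: str) -> str:
        # Find the worker whose history shares the longest prefix with prompt.
        best_worker, best_len = None, 0
        for w in self.workers:
            match = max(
                (shared_prefix_len(prompt, p) for p in self.history[w]),
                default=0,
            )
            if match > best_len:
                best_worker, best_len = w, match

        # Fall back to the least-loaded worker when the match is too short.
        if best_worker is None or best_len < self.min_match_ratio * len(prompt):
            best_worker = min(self.workers, key=lambda w: self.load[w])

        self.history[best_worker].append(prompt)
        self.load[best_worker] += 1
        return best_worker
```

In this sketch, two prompts that share a long system prompt land on the same worker, while an unrelated prompt is sent to whichever worker has handled the fewest requests. A production router would typically match on tokens with a radix tree rather than scanning raw strings.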

Full Changelog: v0.3.4.post1...v0.3.6