Release v0.2.13

@Ying1123 released this on 16 Aug at 05:16 (commit 5bd9537)

Highlights

  • New Features: Window attention support for Gemma-2 (#1056 #1090 #1112), chunked prefill enabled by default (#1040 #984), and support for all sampling penalties (#973).
  • New Models: Support for the embedding model e5-mistral (#983 #987 #988 #997 #1014) and a comprehensive OpenAI-compatible API.
  • Performance: Accelerated Multi-head Latent Attention (MLA), bringing a 2x end-to-end improvement on DeepSeek-V2 (#905).
  • More CI Tests: Accuracy tests (multiple benchmarks), unit tests (APIs, model implementations), E2E tests (high-pressure and performance tests), and MoE tests.
  • Refactoring and Fixes: More modular code, better stability, and more kernels from FlashInfer (#907).
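
These notes do not say how the sampling penalties from #973 are computed. As a rough illustration only, the sketch below applies three penalties in the form they are commonly defined elsewhere: the OpenAI-style additive frequency and presence penalties, and the CTRL-style multiplicative repetition penalty. The function `apply_penalties`, its parameters, and the order in which the penalties are combined are all hypothetical assumptions, not SGLang's implementation.

```python
from collections import Counter


def apply_penalties(logits, generated_ids, frequency_penalty=0.0,
                    presence_penalty=0.0, repetition_penalty=1.0):
    """Hypothetical sketch: return a penalized copy of `logits`.

    `logits` is a list indexed by token id; `generated_ids` lists the
    token ids produced so far. The input list is not modified.
    """
    counts = Counter(generated_ids)
    out = list(logits)
    for token_id, count in counts.items():
        # Assumed CTRL-style repetition penalty on the raw logit: with
        # repetition_penalty > 1, positive logits shrink and negative
        # logits grow more negative, so the token becomes less likely.
        if out[token_id] > 0:
            out[token_id] /= repetition_penalty
        else:
            out[token_id] *= repetition_penalty
        # Assumed OpenAI-style additive penalties: frequency scales with
        # how often the token appeared, presence is applied once.
        out[token_id] -= count * frequency_penalty + presence_penalty
    return out


if __name__ == "__main__":
    logits = [2.0, 1.0, -1.0, 0.5]
    generated = [0, 0, 2]  # token 0 seen twice, token 2 once
    print(apply_penalties(logits, generated, frequency_penalty=0.5,
                          presence_penalty=0.25, repetition_penalty=2.0))
    # prints [-0.25, 1.0, -2.75, 0.5]
```

With all penalties at their neutral defaults (0.0, 0.0, 1.0) the logits pass through unchanged, and tokens that were never generated are never touched.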

What's Changed

New Contributors

Full Changelog: v0.2.9...v0.2.13