
v0.7.3

Released by github-actions on 20 Feb 17:08 · ed6e907

Highlights

🎉 253 commits from 93 contributors, including 29 new contributors!

  • DeepSeek enhancements:
    • Support for DeepSeek Multi-Token Prediction, 1.69x speedup in low QPS scenarios (#12755)
    • AMD support: DeepSeek tunings, yielding 17% latency reduction (#13199)
    • Using FlashAttention3 for MLA (#12807)
    • Align the expert selection code path with official implementation (#13474)
    • Optimize moe_align_block_size for deepseek_v3 (#12850)
  • V1 Engine:
    • LoRA Support (#10957, #12883)
    • Logprobs and prompt logprobs support (#9880), min_p sampling support (#13191), logit_bias in v1 Sampler (#13079)
    • Use msgpack for core request serialization (#12918)
    • Pipeline parallelism support (#12996, #13353, #13472, #13417, #13315)
    • Metrics enhancements: GPU prefix cache hit rate % gauge (#12592), iteration_tokens_total histogram (#13288), several request timing histograms (#12644)
    • Initial speculative decoding support with ngrams (#12193, #13365)
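Of the new V1 sampler features above, min_p sampling is simple to illustrate: keep only tokens whose probability is at least min_p times the most likely token's probability, then renormalize. A minimal pure-Python sketch of that rule (not vLLM's actual kernel; the function name is illustrative):

```python
import math

def min_p_filter(logits: list[float], min_p: float) -> list[float]:
    """Apply the min_p cutoff to a logit vector and return
    renormalized probabilities. Tokens below min_p * max_prob
    are zeroed out."""
    m = max(logits)
    probs = [math.exp(x - m) for x in logits]   # stable softmax, unnormalized
    total = sum(probs)
    probs = [p / total for p in probs]
    threshold = min_p * max(probs)              # cutoff scales with top token
    kept = [p if p >= threshold else 0.0 for p in probs]
    z = sum(kept)
    return [p / z for p in kept]
```

With a high min_p the distribution collapses onto the dominant tokens; with a low min_p only the long tail is pruned, which is the appeal over a fixed top-k.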

Model Support

  • Enhancements to Qwen2.5-VL: BNB support (#12944), LoRA (#13261), Optimizations (#13155)
  • Support Unsloth Dynamic 4bit BnB quantization (#12974)
  • IBM/NASA Prithvi Geospatial model (#12830)
  • Support Mamba2 (Codestral Mamba) (#9292), Bamba Model (#10909)
  • Ultravox Model: Support v0.5 Release (#12912)
  • transformers backend:
    • Enable quantization support for transformers backend (#12960)
    • Set torch_dtype in TransformersModel (#13088)
  • VLM:
    • Implement merged multimodal processor for Mllama (#11427), GLM4V (#12449), Molmo (#12966)
    • Separate text-only and vision variants of the same model architecture (#13157)

Hardware Support

  • Pluggable platform-specific scheduler (#13161)
  • NVIDIA: Support nvfp4 quantization (#12784)
  • AMD:
    • Per-Token-Activation Per-Channel-Weight FP8 (#12501)
    • Tuning for Mixtral on MI325 and Qwen MoE on MI300 (#13503), Mixtral8x7B on MI300 (#13577)
    • Add initial ROCm support to V1 (#12790)
  • TPU: V1 Support (#13049)
  • Neuron: Support Longer Sequences in NKI-based Flash PagedAttention and Improve Efficiency (#12921)
  • Gaudi:
    • Support Contiguous Cache Fetch (#12139)
    • Enable long-contexts + LoRA support (#12812)

Engine Features

  • Add sleep and wake-up endpoints, with v1 support (#12987)
  • Add /v1/audio/transcriptions OpenAI API endpoint (#12909)
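The transcription endpoint follows the OpenAI audio API shape: a multipart/form-data POST carrying a `model` field and a `file` field. A standard-library-only sketch of such a request against a local vLLM server — the host, port, and model name are assumptions, not values from these notes:

```python
import io
import urllib.request
import uuid

def build_multipart(model: str, filename: str, audio: bytes) -> tuple[str, bytes]:
    """Build a multipart/form-data body with `model` and `file` fields,
    the shape the OpenAI-style transcription route expects."""
    boundary = uuid.uuid4().hex
    body = io.BytesIO()
    body.write((f"--{boundary}\r\n"
                'Content-Disposition: form-data; name="model"\r\n\r\n'
                f"{model}\r\n").encode())
    body.write((f"--{boundary}\r\n"
                f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
                "Content-Type: application/octet-stream\r\n\r\n").encode())
    body.write(audio)
    body.write(f"\r\n--{boundary}--\r\n".encode())
    return f"multipart/form-data; boundary={boundary}", body.getvalue()

def transcribe(audio: bytes, filename: str = "sample.wav",
               base_url: str = "http://localhost:8000") -> bytes:
    """POST audio bytes to /v1/audio/transcriptions (needs a running server)."""
    content_type, payload = build_multipart("openai/whisper-large-v3",  # assumed model
                                            filename, audio)
    req = urllib.request.Request(
        f"{base_url}/v1/audio/transcriptions",
        data=payload,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

In practice the official `openai` client can target the same route by pointing its `base_url` at the vLLM server.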

Performance

  • Reduce TTFT with concurrent partial prefills (#10235)
  • LoRA - Refactor sgmv kernels (#13110)

Others

  • Make vLLM compatible with veRL (#12824)
  • Fixes for cases of FA2 illegal memory access error (#12848)
  • Choice-based structured output with xgrammar (#12632)
  • Run v1 benchmark and integrate with PyTorch OSS benchmark database (#13068)
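The idea behind choice-based structured output is constrained decoding: at each step, mask the logits so only tokens that keep the output a prefix of one allowed choice can be sampled. A toy character-level sketch of that loop (real grammar backends like xgrammar operate on model token IDs, and all names here are illustrative):

```python
def allowed_next_chars(prefix: str, choices: list[str]) -> set[str]:
    """Characters that extend `prefix` toward at least one allowed choice."""
    return {c[len(prefix)] for c in choices
            if c.startswith(prefix) and len(c) > len(prefix)}

def constrained_decode(choices: list[str], pick) -> str:
    """Greedy decode loop: `pick(prefix, options)` chooses among the
    allowed characters, standing in for argmax over masked logits."""
    out = ""
    while out not in choices:
        out += pick(out, allowed_next_chars(out, choices))
    return out
```

Whatever the model "prefers" at each step, the mask guarantees the final string is one of the allowed choices, which is why the mode is useful for classification-style outputs.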

What's Changed

New Contributors

Full Changelog: v0.7.2...v0.7.3