What's Changed
- [JSON mode]: FormatEnforcer uses cudaMallocHost for scores buffer by @WangNorthSea in #56
- [A16W8 & A8W8]: further optimization for Ampere A16W8 fused gemm kernel; fix lora doc by @wyajieha in #58
- [Multimodal]: Support LLM quantization with GPTQ and AXWY by @x574chen in #60
- [PKG]: Reduce package size by only compiling flash-attn src with hdim128 by @laiwenzh in #62
- [MOE]: add high performance moe kernel; fix a16w8 compile bug for sm<80 by @laiwenzh in #67
Full Changelog: v2.0.0...v2.1.0