chore(main): release 0.3.0 #698

Open · wants to merge 1 commit into main

Conversation

github-actions[bot]
Contributor

🤖 I have created a release beep boop

0.3.0 (2024-12-25)

Features

  • add MultiLevelCascadeAttentionWrapper API (#462) (1e37989)
  • add rotary_dim argument to rope APIs for partial apply rope (#599) (eb9bc71)
  • add use_tensor_cores option to decode kernels to accelerate GQA (#317) (3b50dd5)
  • add a use_softmax field in variant class (#533) (d81af97)
  • add accept num, emit num metric for ChainSpeculativeSampling (#450) (fa38b5e)
  • add an option non_blocking to plan function (#622) (560af6f)
  • add gelu_and_mul (#474) (9ee26e7)
  • add gemma_rmsnorm and gemma_fused_add_rmsnorm (#477) (1a6b17e)
  • add group gemm operators (#282) (e08ba42)
  • add group size 3 to GQA decode dispatch (#558) (6227562)
  • add JIT compilation support for FA3 templates (#672) (d4e8d79)
  • add llama 3.1 style rope (#401) (4c89dec)
  • Add mask to merge_state_in_place (#372) (e14fa81)
  • add mma instructions for fp8 (#179) (d305798)
  • adding sm_scale field for all attention APIs (#145) (85d4018)
  • allow the cascade kernels to be executed using varying sequence lengths (#627) (92ac440)
  • append attention kernels for fp8 kv-cache (#420) (906c2f5)
  • CUDAGraph compatibility of multi-level cascade inference APIs (#586) (2332e8a)
  • customize logits_soft_cap value (#339) (a2498f5)
  • decouple float and int workspace buffer (#442) (a7ee566)
  • deterministic sampling (#417) (0dd801d)
  • enable head_dim=256 for attention kernels (#132) (0372acc)
  • expose decoupled kv-cache to pytorch api (#383) (457a0ae)
  • expose pytorch api for block sparse attention (#375) (4bba6fa)
  • fix the maximal grid dimension in prefill planning with CUDA graphs (#639) (86ca89a)
  • Fused GPU sampling kernel for joint top-k & top-p sampling (#374) (6e028eb)
  • improve the precision of the FusedAddRMSNormKernel function (#587) (c7dc921)
  • initial support of distributed operators (#289) (03553da)
  • initial support of logits hook (#298) (ab1e2ad)
  • JIT compilation (#507) (3613a5b)
  • mma rowsum for fp8 (#180) (5af935c)
  • modify group-gemm stage number (#497) (52dab1d)
  • more sampling operator options (#431) (68df9c4)
  • non-contiguous query with paged kv cache (#553) (89f2c4a)
  • non-inplace rope operators (#405) (74ffba1)
  • pass a dynamic token count to the cascade kernels (#635) (5fe9f7d)
  • pytorch api of fp8 kv-cache (#156) (66ee066)
  • Separate Q and KV dtypes for decode (#286) (5602659)
  • simplify prefill JIT compilation (#605) (fe4f898)
  • sliding window attention (#406) (28cffd3)
  • specify gemm backend (#648) (0cc1a51)
  • support ALiBi (#146) (383518b)
  • support any num_heads for get_alibi_slope (#200) (b217a6f)
  • support bmm fp8 (#469) (f1c0b68)
  • support cached cos/sin in rope APIs (#585) (83e541d)
  • support cuda graph for batched multi-query(prefill/append) attention (#275) (83ceb67)
  • support cuda graph for batched multi-query(prefill/append) attention (#277) (24cc583)
  • support custom attention mask in prefill/append attention kernels (#266) (7304282)
  • support fused add rmsnorm (#419) (b781513) (see the usage sketch after this list)
  • support fused gelu tanh mul (#434) (2c9d1c3)
  • support fused silu mul (#427) (ea0ba9a)
  • support huggingface transformer style rope interface (#568) (4f40420)
  • support non-contiguous (packed) input for prefill kernels (#404) (68c3719)
  • support sm90 cutlass group gemm (#509) (794bdda)
  • torch custom_op fix for rope (#569) (3e104bc)
  • torch custom_op support: norm (#552) (f6e0010)
  • torch.compile and custom_op support (#554) (9bf916f)
  • warmup for jit kernel tests (#629) (8f5f349)

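The fused norm and activation entries above are exposed as small PyTorch-facing ops. Below is a minimal sketch of how they fit together, assuming the `flashinfer.norm` and `flashinfer.activation` module paths and the in-place semantics of `fused_add_rmsnorm`; exact signatures can differ between releases, so treat them as assumptions and verify against the installed version.

```python
# Illustrative only: module paths and signatures are assumptions based on the
# public FlashInfer docs and may not match this exact release.
import torch
import flashinfer

num_tokens, hidden_size, intermediate_size = 8, 4096, 11008
dtype, device = torch.float16, "cuda"

x = torch.randn(num_tokens, hidden_size, dtype=dtype, device=device)
residual = torch.randn_like(x)
w = torch.ones(hidden_size, dtype=dtype, device=device)

# Fused residual-add + RMSNorm (#419): updates x and residual in place.
flashinfer.norm.fused_add_rmsnorm(x, residual, w, eps=1e-6)

# Gemma-style RMSNorm (#477): scales by (1 + weight) rather than weight.
y = flashinfer.norm.gemma_rmsnorm(x, w, eps=1e-6)

# Fused SiLU-and-mul for gated MLPs (#427): the input packs [gate, up] along
# the last dimension and the output has half that width.
gate_up = torch.randn(num_tokens, 2 * intermediate_size, dtype=dtype, device=device)
h = flashinfer.activation.silu_and_mul(gate_up)
assert h.shape == (num_tokens, intermediate_size)
```

gelu_and_mul (#474) and gelu_tanh_and_mul (#434) from the list above appear to follow the same packed-input convention.
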
Bug Fixes

Performance Improvements

  • accelerate alibi (#365) (4f0a9f9)
  • accelerate gqa performance (#356) (e56ddad)
  • accelerate JIT compilation speed (#618) (eaf73fd)
  • change minimal kv_chunk_size back to 128 (#329) (f237f5f)
  • Dense and sparse customizable flashattention-3 template (#667) (51236c9)
  • faster fp8->fp16 dequantization for pre sm_90 arch (#439) (c93f647)
  • fix prefill kernel performance degradation (step 1) (#602) (595cf60)
  • fix the performance issue of append_paged_kv_cache (#588) (e15f7c9)
  • improve parallelism in RoPE with pos_ids (#609) (ff05155)
  • improve plan performance by using non-blocking memcpy (#547) (41ebe6d) (see the plan/run sketch after this list)
  • initial cuda graph support (#256) (7e9cc7f)
  • more options for kv tile size (#336) (bf2a6c7)
  • multiply q by sm_scale in decode kernels (#144) (660c559)
  • Optimize tensor conversions in C++ code to avoid unnecessary copies (#366) (1116237)
  • reduce the read and write of shared memory in the FusedAddRMSNormKernel (#592) (2043ca2)
  • reduce total_num_tiles_q by one (#644) (553ace5)
  • remove unnecessary contiguous operation in block sparse attention (#561) (7a7ad46)
  • slight optimization on f16->f8 fragment layout swizzling (#453) (0d61871)
  • slight optimization on fragment layout swizzle (#458) (7c397cb)
  • slight optimization on merge states (#313) (701c813)
  • speedup jit compilation of prefill attention kernels (#632) (a059586)
  • split kv-cache for prefill/append kernels (#310) (f0bb0a3)
  • use 1x4 warp layout for small query length (#322) (4e89b4d)
  • use cuda-core implementation for io-bound block-sparse attention (#560) (3fbf028)
  • use packed bit array for attention mask (#308) (3d43dc9)
  • use persistent kernel for merging attention states (#459) (be6bf5b)
  • use stmatrix in epilogue for sm90+ (#380) (c6f20d1)

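Several of the performance items above target the plan step of FlashInfer's plan/run split (non-blocking memcpy during planning, CUDA graph support, reduced tile counts). The sketch below shows that workflow for batch decode; the wrapper name is real, but the argument order of `plan()`/`run()`, the NHD paged-KV layout, and the `use_tensor_cores` flag (from #317 in the feature list) are assumptions to check against the version you run.

```python
# Sketch only: the wrapper class exists in FlashInfer, but the plan()/run()
# signatures below are assumptions; consult the docs for the exact arguments.
import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim = 32, 8, 128
page_size, pages_per_req, batch_size = 16, 4, 4
max_num_pages = batch_size * pages_per_req

# Shared workspace buffer for the scheduler and the kernels (128 MB here).
workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
decode = flashinfer.BatchDecodeWithPagedKVCacheWrapper(
    workspace,
    kv_layout="NHD",
    use_tensor_cores=True,  # GQA acceleration toggle (#317)
)

# Page table: each request owns a contiguous run of pages in this toy setup.
kv_indptr = torch.arange(batch_size + 1, dtype=torch.int32, device="cuda") * pages_per_req
kv_indices = torch.arange(max_num_pages, dtype=torch.int32, device="cuda")
kv_last_page_len = torch.full((batch_size,), page_size, dtype=torch.int32, device="cuda")

# plan() does the per-batch scheduling work once; run() can then be invoked
# repeatedly (and captured in a CUDA graph) for the same batch shape.
decode.plan(
    kv_indptr, kv_indices, kv_last_page_len,
    num_qo_heads, num_kv_heads, head_dim, page_size,
    q_data_type=torch.float16,
)

kv_cache = torch.randn(
    max_num_pages, 2, page_size, num_kv_heads, head_dim,
    dtype=torch.float16, device="cuda",
)
q = torch.randn(batch_size, num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
out = decode.run(q, kv_cache)  # (batch_size, num_qo_heads, head_dim)
```

Per #622 in the feature list, plan() also accepts a non_blocking option so its host-to-device copies do not force a synchronization; #547 above looks like the related internal memcpy change.
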
This PR was generated with Release Please. See documentation.

@yzh119
Collaborator

yzh119 commented Dec 25, 2024

Seems you messed up again, let's find an LLM-based tool instead :)
