chore(main): release 0.3.0 #698

Open · wants to merge 1 commit into main

Conversation

github-actions[bot]
Contributor

🤖 I have created a release beep boop

0.3.0 (2024-12-25)

Features

  • add MultiLevelCascadeAttentionWrapper API (#462) (1e37989)
  • add rotary_dim argument to rope APIs for partial apply rope (#599) (eb9bc71)
  • add use_tensor_cores option to decode kernels to accelerate GQA (#317) (3b50dd5)
  • add a use_softmax field in variant class (#533) (d81af97)
  • add accept num, emit num metric for ChainSpeculativeSampling (#450) (fa38b5e)
  • add an option non_blocking to plan function (#622) (560af6f)
  • add gelu_and_mul (#474) (9ee26e7)
  • add gemma_rmsnorm and gemma_fused_add_rmsnorm (#477) (1a6b17e)
  • add group gemm operators (#282) (e08ba42)
  • add group size 3 to GQA decode dispatch (#558) (6227562)
  • add JIT compilation support for FA3 templates (#672) (d4e8d79)
  • add llama 3.1 style rope (#401) (4c89dec)
  • Add mask to merge_state_in_place (#372) (e14fa81)
  • add mma instructions for fp8 (#179) (d305798)
  • adding sm_scale field for all attention APIs (#145) (85d4018)
  • allow the cascade kernels to be executed using varying sequence lengths (#627) (92ac440)
  • append attention kernels for fp8 kv-cache (#420) (906c2f5)
  • CUDAGraph compatibility of multi-level cascade inference APIs (#586) (2332e8a)
  • customize logits_soft_cap value (#339) (a2498f5)
  • decouple float and int workspace buffer (#442) (a7ee566)
  • deterministic sampling (#417) (0dd801d)
  • enable head_dim=256 for attention kernels (#132) (0372acc)
  • expose decoupled kv-cache to pytorch api (#383) (457a0ae)
  • expose pytorch api for block sparse attention (#375) (4bba6fa)
  • fix the maximal grid dimension in prefill planning with CUDA graphs (#639) (86ca89a)
  • Fused GPU sampling kernel for joint top-k & top-p sampling (#374) (6e028eb)
  • improve the precision of the FusedAddRMSNormKernel function (#587) (c7dc921)
  • initial support of distributed operators (#289) (03553da)
  • initial support of logits hook (#298) (ab1e2ad)
  • JIT compilation (#507) (3613a5b)
  • mma rowsum for fp8 (#180) (5af935c)
  • modify group-gemm stage number (#497) (52dab1d)
  • more sampling operator options (#431) (68df9c4)
  • non-contiguous query with paged kv cache (#553) (89f2c4a)
  • non-inplace rope operators (#405) (74ffba1)
  • pass a dynamic token count to the cascade kernels (#635) (5fe9f7d)
  • pytorch api of fp8 kv-cache (#156) (66ee066)
  • Separate Q and KV dtypes for decode (#286) (5602659)
  • simplify prefill JIT compilation (#605) (fe4f898)
  • sliding window attention (#406) (28cffd3)
  • specify gemm backend (#648) (0cc1a51)
  • support ALiBi (#146) (383518b)
  • support any num_heads for get_alibi_slope (#200) (b217a6f)
  • support bmm fp8 (#469) (f1c0b68)
  • support cached cos/sin in rope APIs (#585) (83e541d)
  • support cuda graph for batched multi-query(prefill/append) attention (#275) (83ceb67)
  • support cuda graph for batched multi-query(prefill/append) attention (#277) (24cc583)
  • support custom attention mask in prefill/append attention kernels (#266) (7304282)
  • support fused add rmsnorm (#419) (b781513) (see the usage sketch after this list)
  • support fused gelu tanh mul (#434) (2c9d1c3)
  • support fused silu mul (#427) (ea0ba9a)
  • support huggingface transformer style rope interface (#568) (4f40420)
  • support non-contiguous (packed) input for prefill kernels (#404) (68c3719)
  • support sm90 cutlass group gemm (#509) (794bdda)
  • torch custom_op fix for rope (#569) (3e104bc)
  • torch custom_op support: norm (#552) (f6e0010)
  • torch.compile and custom_op support (#554) (9bf916f)
  • warmup for jit kernel tests (#629) (8f5f349)

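The fused norm and activation entries above are exposed as small PyTorch-facing ops. Below is a minimal sketch of how they fit together, assuming the `flashinfer.norm` and `flashinfer.activation` module paths and the in-place semantics of `fused_add_rmsnorm`; exact signatures can differ between releases, so treat them as assumptions and verify against the installed version.

```python
# Illustrative only: module paths and signatures are assumptions based on the
# public FlashInfer docs and may not match this exact release.
import torch
import flashinfer

num_tokens, hidden_size, intermediate_size = 8, 4096, 11008
dtype, device = torch.float16, "cuda"

x = torch.randn(num_tokens, hidden_size, dtype=dtype, device=device)
residual = torch.randn_like(x)
w = torch.ones(hidden_size, dtype=dtype, device=device)

# Fused residual-add + RMSNorm (#419): updates x and residual in place.
flashinfer.norm.fused_add_rmsnorm(x, residual, w, eps=1e-6)

# Gemma-style RMSNorm (#477): scales by (1 + weight) rather than weight.
y = flashinfer.norm.gemma_rmsnorm(x, w, eps=1e-6)

# Fused SiLU-and-mul for gated MLPs (#427): the input packs [gate, up] along
# the last dimension and the output has half that width.
gate_up = torch.randn(num_tokens, 2 * intermediate_size, dtype=dtype, device=device)
h = flashinfer.activation.silu_and_mul(gate_up)
assert h.shape == (num_tokens, intermediate_size)
```

gelu_and_mul (#474) and gelu_tanh_and_mul (#434) from the list above appear to follow the same packed-input convention.
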
Bug Fixes

Performance Improvements

  • accelerate alibi (#365) (4f0a9f9)
  • accelerate gqa performance (#356) (e56ddad)
  • accelerate JIT compilation speed (#618) (eaf73fd)
  • change minimal kv_chunk_size back to 128 (#329) (f237f5f)
  • Dense and sparse customizable flashattention-3 template (#667) (51236c9)
  • faster fp8->fp16 dequantization for pre sm_90 arch (#439) (c93f647)
  • fix prefill kernel performance degradation (step 1) (#602) (595cf60)
  • fix the performance issue of append_paged_kv_cache (#588) (e15f7c9)
  • improve parallelism in RoPE with pos_ids (#609) (ff05155)
  • improve plan performance by using non-blocking memcpy (#547) (41ebe6d) (see the plan/run sketch after this list)
  • initial cuda graph support (#256) (7e9cc7f)
  • more options for kv tile size (#336) (bf2a6c7)
  • multiply q by sm_scale in decode kernels (#144) (660c559)
  • Optimize tensor conversions in C++ code to avoid unnecessary copies (#366) (1116237)
  • reduce the read and write of shared memory in the FusedAddRMSNormKernel (#592) (2043ca2)
  • reduce total_num_tiles_q by one (#644) (553ace5)
  • remove unnecessary contiguous operation in block sparse attention (#561) (7a7ad46)
  • slight optimization on f16->f8 fragment layout swizzling (#453) (0d61871)
  • slight optimization on fragment layout swizzle (#458) (7c397cb)
  • slight optimization on merge states (#313) (701c813)
  • speedup jit compilation of prefill attention kernels (#632) (a059586)
  • split kv-cache for prefill/append kernels (#310) (f0bb0a3)
  • use 1x4 warp layout for small query length (#322) (4e89b4d)
  • use cuda-core implementation for io-bound block-sparse attention (#560) (3fbf028)
  • use packed bit array for attention mask (#308) (3d43dc9)
  • use persistent kernel for merging attention states (#459) (be6bf5b)
  • use stmatrix in epilogue for sm90+ (#380) (c6f20d1)

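Several of the performance items above target the plan step of FlashInfer's plan/run split (non-blocking memcpy during planning, CUDA graph support, reduced tile counts). The sketch below shows that workflow for batch decode; the wrapper name is real, but the argument order of `plan()`/`run()`, the NHD paged-KV layout, and the `use_tensor_cores` flag (from #317 in the feature list) are assumptions to check against the version you run.

```python
# Sketch only: the wrapper class exists in FlashInfer, but the plan()/run()
# signatures below are assumptions; consult the docs for the exact arguments.
import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim = 32, 8, 128
page_size, pages_per_req, batch_size = 16, 4, 4
max_num_pages = batch_size * pages_per_req

# Shared workspace buffer for the scheduler and the kernels (128 MB here).
workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
decode = flashinfer.BatchDecodeWithPagedKVCacheWrapper(
    workspace,
    kv_layout="NHD",
    use_tensor_cores=True,  # GQA acceleration toggle (#317)
)

# Page table: each request owns a contiguous run of pages in this toy setup.
kv_indptr = torch.arange(batch_size + 1, dtype=torch.int32, device="cuda") * pages_per_req
kv_indices = torch.arange(max_num_pages, dtype=torch.int32, device="cuda")
kv_last_page_len = torch.full((batch_size,), page_size, dtype=torch.int32, device="cuda")

# plan() does the per-batch scheduling work once; run() can then be invoked
# repeatedly (and captured in a CUDA graph) for the same batch shape.
decode.plan(
    kv_indptr, kv_indices, kv_last_page_len,
    num_qo_heads, num_kv_heads, head_dim, page_size,
    q_data_type=torch.float16,
)

kv_cache = torch.randn(
    max_num_pages, 2, page_size, num_kv_heads, head_dim,
    dtype=torch.float16, device="cuda",
)
q = torch.randn(batch_size, num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
out = decode.run(q, kv_cache)  # (batch_size, num_qo_heads, head_dim)
```

Per #622 in the feature list, plan() also accepts a non_blocking option so its host-to-device copies do not force a synchronization; #547 above looks like the related internal memcpy change.
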
This PR was generated with Release Please. See documentation.

@yzh119
Collaborator

yzh119 commented Dec 25, 2024

Seems you messed up again, let's find an LLM-based tool instead :)
