feat: add `rotary_dim` argument to rope APIs for partial apply rope #599

yzh119 · 2024-11-10T05:45:04Z

This PR implements the final piece of #530, so that rotary embedding can be applied to only the first `rotary_dim` dimensions of each head instead of the entire head dimension.
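
For reference, the intended semantics can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the FlashInfer kernel: the function name `apply_partial_rope`, the `theta` default, and the non-interleaved (rotate-half) pairing inside the rotary slice are all hypothetical choices made for illustration.

```python
import math


def apply_partial_rope(x, pos, rotary_dim, theta=10000.0):
    """Rotate the first `rotary_dim` entries of one head vector; pass the rest through.

    Hypothetical reference helper. Assumes the non-interleaved layout, where
    element i is paired with element i + rotary_dim // 2.
    """
    head_dim = len(x)
    if rotary_dim % 2 != 0 or not 0 < rotary_dim <= head_dim:
        raise ValueError("rotary_dim must be even and in (0, head_dim]")

    half = rotary_dim // 2
    out = list(x)  # entries at index >= rotary_dim are copied unchanged
    for i in range(half):
        # Frequencies are derived from rotary_dim, not head_dim.
        angle = pos * theta ** (-2.0 * i / rotary_dim)
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[i], x[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out


# An 8-dim head where only the first 4 dims are rotated.
head = [1.0, 0.0, 0.0, 0.0, 5.0, 6.0, 7.0, 8.0]
rotated = apply_partial_rope(head, pos=1, rotary_dim=4)
print(rotated[:4], rotated[4:])
```

With `rotary_dim == head_dim` the sketch reduces to ordinary full-head RoPE, which is why the new argument can be treated as a generalization of the previous behavior.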

We also add a simple benchmark for RoPE. Results on H100:

batch_size:   1, append_len:     1, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: False, latency: 23us, throughput:   0.876GB/s
batch_size:   1, append_len:     1, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: True, latency: 26us, throughput:   0.801GB/s
batch_size:   1, append_len:   128, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: False, latency: 27us, throughput:  95.735GB/s
batch_size:   1, append_len:   128, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: True, latency: 27us, throughput:  95.639GB/s
batch_size:   1, append_len:  1024, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: False, latency: 31us, throughput: 672.889GB/s
batch_size:   1, append_len:  1024, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: True, latency: 32us, throughput: 662.972GB/s
---
batch_size:  19, append_len:     1, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: False, latency: 27us, throughput:  14.559GB/s
batch_size:  19, append_len:     1, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: True, latency: 27us, throughput:  14.435GB/s
batch_size:  19, append_len:   128, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: False, latency: 37us, throughput: 1339.450GB/s
batch_size:  19, append_len:   128, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: True, latency: 37us, throughput: 1340.399GB/s
batch_size:  19, append_len:  1024, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: False, latency: 148us, throughput: 2696.563GB/s
batch_size:  19, append_len:  1024, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: True, latency: 148us, throughput: 2689.104GB/s
---
batch_size:  99, append_len:     1, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: False, latency: 27us, throughput:  74.186GB/s
batch_size:  99, append_len:     1, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: True, latency: 27us, throughput:  74.452GB/s
batch_size:  99, append_len:   128, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: False, latency: 110us, throughput: 2350.830GB/s
batch_size:  99, append_len:   128, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: True, latency: 110us, throughput: 2359.814GB/s
batch_size:  99, append_len:  1024, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: False, latency: 717us, throughput: 2895.389GB/s
batch_size:  99, append_len:  1024, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: True, latency: 718us, throughput: 2891.385GB/s
---
batch_size: 128, append_len:     1, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: False, latency: 27us, throughput:  95.449GB/s
batch_size: 128, append_len:     1, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: True, latency: 27us, throughput:  95.646GB/s
batch_size: 128, append_len:   128, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: False, latency: 130us, throughput: 2576.101GB/s
batch_size: 128, append_len:   128, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: True, latency: 130us, throughput: 2582.447GB/s
batch_size: 128, append_len:  1024, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: False, latency: 924us, throughput: 2906.154GB/s
batch_size: 128, append_len:  1024, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: True, latency: 925us, throughput: 2903.484GB/s
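
The throughput column can be approximately reproduced from the other columns. The sketch below assumes 2-byte (fp16) elements and that every q and k element is read once and written once; neither assumption is stated in the PR, and `rope_throughput_gbps` is a hypothetical helper, not part of the benchmark script.

```python
def rope_throughput_gbps(batch_size, append_len, num_qo_heads, num_kv_heads,
                         head_dim, latency_us, bytes_per_elem=2):
    """Estimate effective memory throughput in GB/s (1 GB = 1e9 bytes).

    Assumption: each q/k element is read once and written once.
    """
    num_tokens = batch_size * append_len
    num_elems = num_tokens * (num_qo_heads + num_kv_heads) * head_dim
    bytes_moved = 2 * num_elems * bytes_per_elem  # one read + one write
    return bytes_moved / (latency_us * 1e-6) / 1e9


# Last configuration in the log: reported 2906.154 GB/s at 924us.
print(round(rope_throughput_gbps(128, 1024, 32, 8, 128, 924)))
```

Under these assumptions the estimate comes out at about 2905 GB/s for that row. The small gap from the reported figure is consistent with the latency being rounded to whole microseconds in the log.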


🤖 I have created a release *beep* *boop*

---

## [0.2.0](v0.1.6...v0.2.0) (2024-12-17)

[Release Blog](https://flashinfer.ai/2024/12/16/flashinfer-v02-release.html).

### Features

* add `rotary_dim` argument to rope APIs for partial apply rope ([#599](#599)) ([eb9bc71](eb9bc71))
* add a `use_softmax` field in variant class ([#533](#533)) ([d81af97](d81af97))
* add an option `non_blocking` to plan function ([#622](#622)) ([560af6f](560af6f))
* add gemma_rmsnorm and gemma_fused_add_rmsnorm ([#477](#477)) ([1a6b17e](1a6b17e))
* add group size 3 to GQA decode dispatch ([#558](#558)) ([6227562](6227562))
* add JIT compilation support for FA3 templates ([#672](#672)) ([d4e8d79](d4e8d79))
* allow the cascade kernels to be executed using varying sequence lengths ([#627](#627)) ([92ac440](92ac440))
* CUDAGraph compatibility of multi-level cascade inference APIs ([#586](#586)) ([2332e8a](2332e8a))
* fix the maximal grid dimension in prefill planning with CUDA graphs ([#639](#639)) ([86ca89a](86ca89a))
* improve the precision of the FusedAddRMSNormKernel function ([#587](#587)) ([c7dc921](c7dc921))
* JIT compilation ([#507](#507)) ([3613a5b](3613a5b))
* modify group-gemm stage number ([#497](#497)) ([52dab1d](52dab1d))
* non-contiguous query with paged kv cache ([#553](#553)) ([89f2c4a](89f2c4a))
* pass a dynamic token count to the cascade kernels ([#635](#635)) ([5fe9f7d](5fe9f7d))
* simplify prefill JIT compilation ([#605](#605)) ([fe4f898](fe4f898))
* specify gemm backend ([#648](#648)) ([0cc1a51](0cc1a51))
* support cached cos/sin in rope APIs ([#585](#585)) ([83e541d](83e541d))
* support huggingface transformer style rope interface ([#568](#568)) ([4f40420](4f40420))
* support sm90 cutlass group gemm ([#509](#509)) ([794bdda](794bdda))
* torch custom_op fix for rope ([#569](#569)) ([3e104bc](3e104bc))
* torch custom_op support: norm ([#552](#552)) ([f6e0010](f6e0010))
* torch.compile and custom_op support ([#554](#554)) ([9bf916f](9bf916f))
* warmup for jit kernel tests ([#629](#629)) ([8f5f349](8f5f349))

### Bug Fixes

* AOT compiler flags on non-sm90 ([#522](#522)) ([0aa4726](0aa4726))
* batch decode kernel redundant store output to gmem ([#505](#505)) ([90e42a7](90e42a7))
* compatible with torch 2.2 ([#478](#478)) ([ac41d1b](ac41d1b))
* #452 ([b53a46f](b53a46f))
* remove redundant load ([#495](#495)) ([2de16b0](2de16b0))
* update bmm fp8 test ([#487](#487)) ([45eac04](45eac04))

### Performance Improvements

* accelerate JIT compilation speed ([#618](#618)) ([eaf73fd](eaf73fd))
* Dense and sparse customizable flashattention-3 template ([#667](#667)) ([51236c9](51236c9))
* fix prefill kernel performance degradation (step 1) ([#602](#602)) ([595cf60](595cf60))
* fix the performance issue of `append_paged_kv_cache` ([#588](#588)) ([e15f7c9](e15f7c9))
* improve parallelism in RoPE with pos_ids ([#609](#609)) ([ff05155](ff05155))
* improve plan performance by using non-blocking memcpy ([#547](#547)) ([41ebe6d](41ebe6d))
* reduce the read and write of shared memory in the FusedAddRMSNormKernel ([#592](#592)) ([2043ca2](2043ca2))
* reduce total_num_tiles_q by one ([#644](#644)) ([553ace5](553ace5))
* remove unnecessary contiguous operation in block sparse attention ([#561](#561)) ([7a7ad46](7a7ad46))
* speedup jit compilation of prefill attention kernels ([#632](#632)) ([a059586](a059586))
* use cuda-core implementation for io-bound block-sparse attention ([#560](#560)) ([3fbf028](3fbf028))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Zihao Ye <expye@outlook.com>

yzh119 added 3 commits November 10, 2024 05:27

upd

4b5a4de

upd

1722ac6

add bench

2fb96b5

yzh119 merged commit eb9bc71 into main Nov 10, 2024

yzh119 mentioned this pull request Nov 10, 2024

Support vLLM-style rope #530

Closed

github-actions bot mentioned this pull request Nov 10, 2024

chore(main): release 0.2.0 #476

Merged

yzh119 deleted the rope-dim branch November 10, 2024 08:46

yzh119 mentioned this pull request Nov 10, 2024

hotfix: fix rope tvm wrapper #601

Merged

yzh119 added a commit that referenced this pull request Nov 10, 2024

hotfix: fix rope tvm wrapper (#601)

3dd9405

The TVM wrapper was broken in #599 because of API changes; this PR fixes the issue.

github-actions bot mentioned this pull request Dec 25, 2024

chore(main): release 0.3.0 #698

Open
