Fix the maximal grid dimension in prefill planning with CUDA graphs #639

nandor · 2024-11-25T09:38:01Z

Previously, differences in the contents of qo_indptr could cause block sizes to vary across CUDA graph invocations, resulting in illegal memory accesses.

This PR alters the calculation of the block size to find a reasonable maximum based on the longest sequence.

The maximum token count is fixed in plan on the Python side and passed along to scheduler.cuh to derive the other parameters.
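To illustrate why the launch geometry has to be bounded, here is a minimal sketch in plain Python. It is not the formula used in scheduler.cuh: the helper names `num_query_tiles` and `max_query_tiles`, the tile size, and the bound itself are assumptions made for illustration. It shows that two batches with the same shape but different qo_indptr contents need a different number of query tiles, while an upper bound derived only from values fixed at capture time (a maximum token count and the batch size) covers both.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return (a + b - 1) // b


def num_query_tiles(qo_indptr: list[int], tile_size: int) -> int:
    """Tiles needed for one concrete batch.

    The result depends on the *contents* of qo_indptr, so it can change
    between replays of a captured CUDA graph.
    """
    return sum(
        ceil_div(qo_indptr[i + 1] - qo_indptr[i], tile_size)
        for i in range(len(qo_indptr) - 1)
    )


def max_query_tiles(max_total_tokens: int, batch_size: int, tile_size: int) -> int:
    """Hypothetical upper bound using only capture-time constants.

    Each request can leave at most one partially filled tile, so the sum of
    per-request tile counts never exceeds ceil(total / tile) + batch_size.
    """
    return ceil_div(max_total_tokens, tile_size) + batch_size


# Two batches with 4 requests and 16 tokens in total, but different layouts.
uniform = [0, 4, 8, 12, 16]    # four requests of 4 tokens each
skewed = [0, 16, 16, 16, 16]   # one request of 16 tokens, three empty

tiles_uniform = num_query_tiles(uniform, tile_size=16)
tiles_skewed = num_query_tiles(skewed, tile_size=16)
bound = max_query_tiles(max_total_tokens=16, batch_size=4, tile_size=16)

print(tiles_uniform, tiles_skewed, bound)
```

Sizing the grid from `bound` rather than from the per-batch tile count keeps the launch configuration identical across replays, at the cost of launching some blocks that have no work to do.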

This ensures correctness under CUDA graphs, but split-kv is now always used when CUDA graphs are enabled, which may degrade performance when CUDA graphs are used with a fixed qo_indptr. For varying qo_indptr, however, CUDA graphs deliver a 4x performance improvement for prefill on models such as Llama 3.2-1B.

yzh119

Thanks for doing this. I expect this PR to benefit speculative decoding as well.

Left some comments for readability.

python/flashinfer/prefill.py

python/flashinfer/decode.py

src/test_batch_prefill.cu

src/test_cascade.cu

src/bench_cascade.cu

src/bench_batch_prefill.cu

src/bench_batch_decode.cu

yzh119

LGTM, much appreciated!

🤖 I have created a release *beep* *boop*

---

## [0.2.0](v0.1.6...v0.2.0) (2024-12-17)

[Release Blog](https://flashinfer.ai/2024/12/16/flashinfer-v02-release.html).

### Features

* add `rotary_dim` argument to rope APIs for partial apply rope ([#599](#599)) ([eb9bc71](eb9bc71))
* add a `use_softmax` field in variant class ([#533](#533)) ([d81af97](d81af97))
* add an option `non_blocking` to plan function ([#622](#622)) ([560af6f](560af6f))
* add gemma_rmsnorm and gemma_fused_add_rmsnorm ([#477](#477)) ([1a6b17e](1a6b17e))
* add group size 3 to GQA decode dispatch ([#558](#558)) ([6227562](6227562))
* add JIT compilation support for FA3 templates ([#672](#672)) ([d4e8d79](d4e8d79))
* allow the cascade kernels to be executed using varying sequence lenghts ([#627](#627)) ([92ac440](92ac440))
* CUDAGraph compatibility of multi-level cascade inference APIs ([#586](#586)) ([2332e8a](2332e8a))
* fix the maximal grid dimension in prefill planning with CUDA graphs ([#639](#639)) ([86ca89a](86ca89a))
* improve the precision of the FusedAddRMSNormKernel function ([#587](#587)) ([c7dc921](c7dc921))
* JIT compilation ([#507](#507)) ([3613a5b](3613a5b))
* modify group-gemm stage number ([#497](#497)) ([52dab1d](52dab1d))
* non-contiguous query with paged kv cache ([#553](#553)) ([89f2c4a](89f2c4a))
* pass a dynamic token count to the cascade kernels ([#635](#635)) ([5fe9f7d](5fe9f7d))
* simplify prefill JIT compilation ([#605](#605)) ([fe4f898](fe4f898))
* specify gemm backend ([#648](#648)) ([0cc1a51](0cc1a51))
* support cached cos/sin in rope APIs ([#585](#585)) ([83e541d](83e541d))
* support huggingface transformer style rope interface ([#568](#568)) ([4f40420](4f40420))
* support sm90 cutlass group gemm ([#509](#509)) ([794bdda](794bdda))
* torch custom_op fix for rope ([#569](#569)) ([3e104bc](3e104bc))
* torch custom_op support: norm ([#552](#552)) ([f6e0010](f6e0010))
* torch.compile and custom_op support ([#554](#554)) ([9bf916f](9bf916f))
* warmup for jit kernel tests ([#629](#629)) ([8f5f349](8f5f349))

### Bug Fixes

* AOT compiler flags on non-sm90 ([#522](#522)) ([0aa4726](0aa4726))
* batch decode kernel redundant store output to gmem ([#505](#505)) ([90e42a7](90e42a7))
* compatible with torch 2.2 ([#478](#478)) ([ac41d1b](ac41d1b))
* #452 ([b53a46f](b53a46f))
* remove redundant load ([#495](#495)) ([2de16b0](2de16b0))
* update bmm fp8 test ([#487](#487)) ([45eac04](45eac04))

### Performance Improvements

* accelerate JIT compilation speed ([#618](#618)) ([eaf73fd](eaf73fd))
* Dense and sparse customizable flashattention-3 template ([#667](#667)) ([51236c9](51236c9))
* fix prefill kernel performance degradation (step 1) ([#602](#602)) ([595cf60](595cf60))
* fix the performance issue of `append_paged_kv_cache` ([#588](#588)) ([e15f7c9](e15f7c9))
* improve parallelism in RoPE with pos_ids ([#609](#609)) ([ff05155](ff05155))
* improve plan performance by using non-blocking memcpy ([#547](#547)) ([41ebe6d](41ebe6d))
* reduce the read and write of shared memory in the FusedAddRMSNormKernel ([#592](#592)) ([2043ca2](2043ca2))
* reduce total_num_tiles_q by one ([#644](#644)) ([553ace5](553ace5))
* remove unnecessary contiguous operation in block sparse attention ([#561](#561)) ([7a7ad46](7a7ad46))
* speedup jit compilation of prefill attention kernels ([#632](#632)) ([a059586](a059586))
* use cuda-core implemention for io-bound block-sparse attention ([#560](#560)) ([3fbf028](3fbf028))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Zihao Ye <expye@outlook.com>

yzh119 reviewed Nov 25, 2024

nandor force-pushed the nandor/prefill branch from 93155ca to 46e8cde Compare November 25, 2024 14:14

nandor marked this pull request as draft November 25, 2024 15:30

nandor force-pushed the nandor/prefill branch from 46e8cde to d89b23e Compare November 25, 2024 18:35

nandor marked this pull request as ready for review November 25, 2024 18:36

yzh119 approved these changes Nov 25, 2024

yzh119 merged commit 86ca89a into flashinfer-ai:main Nov 25, 2024

github-actions bot mentioned this pull request Nov 25, 2024

chore(main): release 0.2.0 #476

Merged

yzh119 mentioned this pull request Nov 25, 2024

CUDA Graph support for prefill kernels with varying qo_indptr #626

Closed

github-actions bot mentioned this pull request Dec 25, 2024

chore(main): release 0.3.0 #698

Open
