
CUDA Graph support for prefill kernels with varying qo_indptr #626

Closed
nandor opened this issue Nov 21, 2024 · 1 comment

nandor commented Nov 21, 2024

Presently, the sequence lengths derived from `qo_indptr` determine kernel launch parameters that are supposed to be frozen once a CUDA graph is captured.
This even includes `split_kv`.

In particular, `total_num_tiles_q` depends on the contents of `qo_indptr`, not only on its shape:

```cpp
uint32_t total_num_tiles_q = 0;
...
```

The desire is to fix the batch size (and implicitly the number of elements in `qo_indptr`), while varying the actual sequence lengths within a fixed budget of tokens that `qo_indptr` points to. For example, the same CUDA graph should be able to process two prefill requests whose lengths sum to 2048, and also another set of prefill requests whose lengths sum to anything less than 2048.

When CUDA graphs are enabled, these parameters should be pinned to an upper bound at capture time, and the actual values should be passed in dynamically at replay time.

nandor added a commit to nandor/flashinfer that referenced this issue Nov 21, 2024
The cascade kernels can take a dynamic sequence length in order to allow
the number of tokens to vary when executed under CUDA graphs.

This is the first step towards implementing CUDA graph support for arbitrary `qo_indptr` contents, as tracked by flashinfer-ai#626.
yzh119 pushed a commit that referenced this issue Nov 23, 2024
… lengths (#627)

The cascade kernels can take a dynamic sequence length in order to allow
the number of tokens to vary when executed under CUDA graphs.

This is the first step towards implementing CUDA graph support for
arbitrary `qo_indptr` contents, as tracked by #626.

yzh119 commented Nov 25, 2024

#627, #635 and #639 should have fixed this issue, kudos to @nandor.
