
CUDA Graph support for prefill kernels with varying qo_indptr #626

Closed
nandor opened this issue Nov 21, 2024 · 1 comment

nandor commented Nov 21, 2024

Presently, the sequence lengths derived from `qo_indptr` determine kernel launch parameters that are supposed to be frozen once a CUDA graph is captured.
This even includes `split_kv`.

In particular, `total_num_tiles_q` depends on the contents of `qo_indptr`, not only on its shape:

```cpp
uint32_t total_num_tiles_q = 0;
...
```

The desire is to fix the batch size (and implicitly the number of elements in `qo_indptr`), while varying the actual sequence lengths within a fixed budget of tokens that `qo_indptr` points to. For example, the same CUDA graph should be able to process two prefill requests whose lengths sum to 2048, and also another set of prefill requests whose lengths sum to anything less than 2048.

When CUDA graphs are enabled, these parameters should be pinned to an upper bound at capture time, and the actual values should be passed in dynamically at replay time.

nandor added a commit to nandor/flashinfer that referenced this issue Nov 21, 2024
The cascade kernels can take a dynamic sequence length in order to allow
the number of tokens to vary when executed under CUDA graphs.

This is the first step towards implementing CUDA graph support for arbitrary `qo_indptr` contents, as tracked by flashinfer-ai#626.
yzh119 pushed a commit that referenced this issue Nov 23, 2024
… lengths (#627)

The cascade kernels can take a dynamic sequence length in order to allow
the number of tokens to vary when executed under CUDA graphs.

This is the first step towards implementing CUDA graph support for
arbitrary `qo_indptr` contents, as tracked by #626.

yzh119 commented Nov 25, 2024

#627, #635 and #639 should have fixed this issue, kudos to @nandor.
