CUDA Graph support for prefill kernels with varying qo_indptr
#626
Comments
nandor added a commit to nandor/flashinfer that referenced this issue on Nov 21, 2024
The cascade kernels can take a dynamic sequence length in order to allow the number of tokens to vary when executed under CUDA graphs. This is the first step towards implementing CUDA graph support for arbitrary `qo_indptr` contents, as tracked by flashinfer-ai#626.
Presently, the sequence lengths from `qo_indptr` determine kernel parameters which are supposed to be frozen for CUDA graphs. This even includes `split_kv`. In particular, `total_num_tiles_q` depends on the contents of `qo_indptr`, not only its shape:

flashinfer/include/flashinfer/attention/scheduler.cuh, line 475 in 9cba9fb

The goal is to fix the batch size (and implicitly the number of elements in `qo_indptr`), while letting the individual sequence lengths vary within a fixed total number of tokens that `qo_indptr` points to. For example, the same CUDA graph should be able to process two prefill requests whose lengths sum to 2048 and another set of prefill requests whose lengths sum to anything less than 2048.

When CUDA graphs are enabled, these parameters should be tied to an upper bound, and the actual values should be passed in dynamically.
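
To illustrate why the contents of `qo_indptr` matter, here is a minimal Python sketch. It assumes the tile count is the sum of per-request ceilings of the query length divided by a tile size; the helper names, the tile size of 128, and the bound formula are illustrative assumptions, not FlashInfer's actual code.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division for non-negative a and positive b."""
    return (a + b - 1) // b


def total_num_tiles_q(qo_indptr: list[int], cta_tile_q: int) -> int:
    """Hypothetical model: one ceil per request, summed over the batch."""
    return sum(
        ceil_div(qo_indptr[i + 1] - qo_indptr[i], cta_tile_q)
        for i in range(len(qo_indptr) - 1)
    )


def max_total_num_tiles_q(batch_size: int, max_total_tokens: int, cta_tile_q: int) -> int:
    """Upper bound that depends only on frozen quantities.

    ceil(x) <= floor(x) + 1 for each request, and the floors sum to at most
    floor(max_total_tokens / cta_tile_q), so the bound holds for any split.
    """
    return max_total_tokens // cta_tile_q + batch_size


CTA_TILE_Q = 128  # assumed tile size, for illustration only

# Same shape (batch size 2) and same total (2048 tokens), different contents.
even_split = total_num_tiles_q([0, 1024, 2048], CTA_TILE_Q)
skewed_split = total_num_tiles_q([0, 1, 2048], CTA_TILE_Q)
bound = max_total_num_tiles_q(batch_size=2, max_total_tokens=2048, cta_tile_q=CTA_TILE_Q)

print(even_split, skewed_split, bound)
```

Under these assumptions the two splits produce different tile counts even though batch size and total token count are identical, while the bound depends only on quantities that stay fixed across graph replays. A graph could then be captured with a launch sized for the bound, with the actual count read at run time.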