
[Feature request] Add an optional cpu_indptr/cpu_qo_indptr parameter to the plan method to avoid a synchronous device-to-host copy. #565

reyoung opened this issue Oct 28, 2024 · 3 comments

reyoung (Contributor) commented Oct 28, 2024

The current plan methods appear to include unnecessary device-to-CPU copies, as seen in the following snippets:

indptr_host = indptr.to("cpu")

# NOTE(Zihao): only required if qo_indptr/paged_kv_indptr are device tensors
qo_indptr_host = qo_indptr.to("cpu")
paged_kv_indptr_host = paged_kv_indptr.to("cpu")

These copies are unnecessary because qo_indptr and append_indptr should be initialized during data preparation and originally live on the CPU. Therefore, the plan method could accept an optional CPU indptr parameter, and tensor.to("cpu") should only be invoked when that parameter is None.
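
A minimal sketch of this idea, assuming a hypothetical paged_kv_indptr_cpu keyword argument (the name and signature are illustrative, not the actual flashinfer API):

import torch
from typing import Optional

def plan(paged_kv_indptr: torch.Tensor,
         paged_kv_indptr_cpu: Optional[torch.Tensor] = None) -> torch.Tensor:
    # Reuse the caller-supplied host copy; only fall back to .to("cpu"),
    # which is a blocking D2H copy for GPU inputs, when none was given.
    if paged_kv_indptr_cpu is None:
        paged_kv_indptr_cpu = paged_kv_indptr.to("cpu")
    # ... the host-side planning logic would read paged_kv_indptr_cpu here ...
    return paged_kv_indptr_cpu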

yzh119 (Collaborator) commented Oct 28, 2024

Hi @reyoung, you are right: if the indptr arrays are already on the CPU, these operations are not necessary. However, we never documented that these arrays should be on the CPU, and some of the frameworks that integrate flashinfer have legacy code that passes GPU tensors, so the conversion is mainly kept for backward compatibility.

If the tensors are already on the CPU, then indptr_host = indptr.to("cpu") is a no-op and will not result in a copy:

>>> import torch
>>> x = torch.rand(12)
>>> x.data_ptr()
94804334735872
>>> y = x.to("cpu")
>>> y.data_ptr()
94804334735872
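
For contrast, a quick check (assuming a CUDA device is available) shows that the same call on a GPU tensor does perform a blocking device-to-host copy into new host storage:

import torch

if torch.cuda.is_available():
    x = torch.rand(12, device="cuda")
    y = x.to("cpu")                       # synchronous device-to-host copy
    print(x.data_ptr() == y.data_ptr())   # False: a new host buffer was allocated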

reyoung (Contributor, Author) commented Oct 28, 2024

However, when indptr is a CPU tensor, an unnecessary CPU -> device copy is invoked here:

self._qo_indptr_buf = qo_indptr.to(self.device, non_blocking=True)

It seems that the current API cannot avoid a cross-device copy in either case.
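
To illustrate the dilemma from the caller's side (hypothetical caller code, not flashinfer internals): both copies typically already exist before plan is called, yet the current API forces one extra transfer in either direction:

import torch

# Both copies usually exist on the caller's side before plan() is called.
qo_indptr_cpu = torch.tensor([0, 4, 9], dtype=torch.int32)
qo_indptr_gpu = qo_indptr_cpu.cuda() if torch.cuda.is_available() else qo_indptr_cpu

# Passing qo_indptr_gpu: plan() performs a synchronous D2H copy for its host-side logic.
# Passing qo_indptr_cpu: plan() performs an H2D copy into self._qo_indptr_buf.
# Either way, one redundant transfer happens even though both copies are already at hand.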

reyoung (Contributor, Author) commented Oct 28, 2024

I found that kv_indptr should always be on the CPU, while qo_indptr is used on both CPU and CUDA.

Maybe it is better to only add a cpu_qo_indptr parameter.
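
A minimal sketch of this refined proposal, assuming hypothetical parameter names (not the actual flashinfer signature): the optional host copy serves the planning logic, while the device tensor feeds the internal buffer, so no extra transfer is forced when the caller already holds both:

import torch
from typing import Optional, Tuple

def plan(qo_indptr: torch.Tensor,
         cpu_qo_indptr: Optional[torch.Tensor] = None) -> Tuple[torch.Tensor, torch.Tensor]:
    # Host side: reuse the caller-supplied CPU copy, else fall back to a D2H copy.
    qo_indptr_host = cpu_qo_indptr if cpu_qo_indptr is not None else qo_indptr.to("cpu")
    # Device side: reuse qo_indptr if it is already on the GPU, else copy H2D
    # (this branch assumes a CUDA device is present).
    qo_indptr_buf = qo_indptr if qo_indptr.is_cuda else qo_indptr.to("cuda", non_blocking=True)
    return qo_indptr_host, qo_indptr_buf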
