Motivation
Currently vLLM uses the PYBIND11_MODULE macro to bind C++/CUDA kernels to Python, with the binding code found in csrc/pybind.cpp (a simplified sketch of this pattern is shown after the list below). This means calls to these kernels bypass the torch dispatcher (more information on the torch dispatcher can be found here and here). While bypassing the torch dispatcher works, using it has a few distinct advantages, namely:
1. Better integration with the PyTorch profiler
2. A more natural way to support CPU-only inference or other hardware in the future
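For reference, the existing binding pattern looks roughly like the sketch below. The signature is simplified and purely illustrative; the real definitions in csrc/pybind.cpp bind several kernels with more arguments.

```cpp
// Simplified, illustrative sketch of the current pybind11-style binding
// (the argument list is not the real one). The kernel is exposed as a plain
// Python function, so calls to it never enter the torch dispatcher.
#include <torch/extension.h>

// Declaration of the kernel launcher (defined in a .cu file).
void reshape_and_cache(torch::Tensor& key, torch::Tensor& value,
                       torch::Tensor& key_cache, torch::Tensor& value_cache,
                       torch::Tensor& slot_mapping);

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("reshape_and_cache", &reshape_and_cache,
        "Reshape the key/value tensors and store them in the KV cache.");
}
```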
Regarding 1, at Neural Magic we are working on more in-depth profiling tools within vLLM using the PyTorch profiler. By going through the torch dispatcher (i.e. registering the C++/CUDA kernels using the TORCH_LIBRARY macro instead of PYBIND11_MODULE), we can provide richer traces, since the profiler is able to capture metadata (namely type and shape information) for the inputs to each operation (kernel). Below is an example of the traces we are generating (note: this is a work in progress):
In the final trace column we can see tensor type and shape information; however, this information is only available for TorchOp events (i.e. kernels registered using TORCH_LIBRARY). For example, flash_fwd_kernel and ampere_bf16_s16816gemm... have this shape and type information while vllm::reshape_and_cache_kernel does not, as the former two kernels go through the torch dispatcher while the latter does not.
Regarding 2, at Neural Magic we have ambitions to extend vLLM to support CPU inference, which will require dispatching to either a CPU or a CUDA version of the same kernel depending on where the input tensors reside (this applies to other hardware too, not just CPUs). This is something the torch dispatcher does automatically, removing the need for a chain of if statements.
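To make that concrete, here is a minimal sketch of per-backend registration with the dispatcher. The op name, schema, and implementations are invented purely for illustration; the point is that the dispatcher picks the CPU or CUDA implementation based on the device of the input tensors, with no explicit branching in the calling code:

```cpp
#include <ATen/ATen.h>
#include <torch/library.h>

// Illustrative implementations only: both compute a + scale * b.
// A real vLLM op would launch the hand-written CUDA kernel in the CUDA path.
at::Tensor scaled_add_cpu(const at::Tensor& a, const at::Tensor& b, double scale) {
  return a.add(b, scale);
}

at::Tensor scaled_add_cuda(const at::Tensor& a, const at::Tensor& b, double scale) {
  return a.add(b, scale);
}

// Declare the operator schema once, under a namespace (here "vllm").
TORCH_LIBRARY(vllm, m) {
  m.def("scaled_add(Tensor a, Tensor b, float scale) -> Tensor");
}

// Bind one implementation per backend (dispatch key).
TORCH_LIBRARY_IMPL(vllm, CPU, m) {
  m.impl("scaled_add", &scaled_add_cpu);
}

TORCH_LIBRARY_IMPL(vllm, CUDA, m) {
  m.impl("scaled_add", &scaled_add_cuda);
}

// From Python the op is callable as torch.ops.vllm.scaled_add(a, b, scale);
// CUDA inputs route to scaled_add_cuda and CPU inputs to scaled_add_cpu.
```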
Implementation
There appear to be two primary ways to register operations (kernels) with the torch dispatcher. The first is using C++ and the TORCH_LIBRARY macro mentioned in the motivation. An example of this can be found in the xformers repository, with an SpMM operation being declared here and its implementations being bound for CUDA here and for CPU here. The other way is via Python; xformers also has an example of this for the flash_fwd operation, with the operation declaration found here and the CUDA implementation bound here.
For the implementation, given that vLLM controls the Python-to-C++/CUDA bindings for the kernels in csrc, I think it would be cleaner to go with the TORCH_LIBRARY approach, as it wouldn't require much more boilerplate than the existing PYBIND11_MODULE.
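To sketch what that migration could look like (again with a simplified, hypothetical schema; the real argument lists would be worked out per kernel), the pybind11 binding above would become a schema definition plus a per-backend registration:

```cpp
// Rough, illustrative sketch of the same binding expressed via TORCH_LIBRARY
// instead of PYBIND11_MODULE. The schema string is hypothetical; the real one
// would mirror the existing kernel signatures in csrc.
#include <torch/library.h>

// Declaration of the kernel launcher (defined in a .cu file).
void reshape_and_cache(torch::Tensor& key, torch::Tensor& value,
                       torch::Tensor& key_cache, torch::Tensor& value_cache,
                       torch::Tensor& slot_mapping);

TORCH_LIBRARY(vllm, m) {
  m.def("reshape_and_cache(Tensor key, Tensor value, Tensor! key_cache, "
        "Tensor! value_cache, Tensor slot_mapping) -> ()");
}

TORCH_LIBRARY_IMPL(vllm, CUDA, m) {
  m.impl("reshape_and_cache", &reshape_and_cache);
}

// The op then shows up in the profiler as a TorchOp event with input shapes
// and dtypes, and is callable from Python as torch.ops.vllm.reshape_and_cache(...).
```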