-
-
Notifications
You must be signed in to change notification settings - Fork 5.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops #5047
Conversation
Is it possible to directly change the bind system to solve #4694 ? |
I'm not sure but I can look into that. I don't think torch uses pybind internally for registering operations so it might be possible. |
I did a little investigating and the |
I can make this change. Anything else? |
BTW, what is the initial motivation of changing the binding system to |
This wasn't quite as bad as I thought. I was able to get the custom_ar ops registered with the following changes to the API, so I think I can make the implementation changes also.
|
I am working on a custom backed for torch.compile that will be used to do optimizations on models, e.g. epilog fusion, quantization, etc. I was running into issues with the custom operators that turned out to be solvable by using |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, Bill. Thanks for working on this.
cc @WoosukKwon , @youkaichao , @pcmoritz , @mgoin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, my initial reuqest was to build python agnostic wheels, and this PR achieves the goal, so I give my approval. Please wait for the rest people to comment on the other pespectives.
No problem. Thanks for the review! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great, thanks for doing this! Before merging, can you:
- Rename pybind to something that makes it clearer that TORCH_LIBRARY is being used? E.g. torch_library_bindings
- Investigate if there is an impact on the performance (e.g. by changing float to double and int to int64_t -- btw, why is this needed, do you have some references about this)
Pytorch only supports double and int64_t for scalars. The note is buried in this link: https://pytorch.org/tutorials/advanced/torch_script_custom_ops.html. Also, python represents its float types as C/C++ double precision. See https://docs.python.org/3/library/stdtypes.html I can double check performance but I imagine that pybind11 was already downcasting python double-floats to C floats. |
* upstream/main: (126 commits) [Bugfix][Frontend] Cleanup "fix chat logprobs" (vllm-project#5026) [Bugfix] OpenAI entrypoint limits logprobs while ignoring server defined --max-logprobs (vllm-project#5312) [Misc] Various simplifications and typing fixes (vllm-project#5368) [ci] Fix Buildkite agent path (vllm-project#5392) [Doc] Add documentation for FP8 W8A8 (vllm-project#5388) Bump version to v0.5.0 (vllm-project#5384) [Docs] Alphabetically sort sponsors (vllm-project#5386) [Docs] Add Docs on Limitations of VLM Support (vllm-project#5383) [ci] Mount buildkite agent on Docker container to upload benchmark results (vllm-project#5330) [ci] Use small_cpu_queue for doc build (vllm-project#5331) [Bugfix] Fix LLaVA-NeXT (vllm-project#5380) [Feature][Frontend]: Continued `stream_options` implementation also in CompletionRequest (vllm-project#5319) [Model] Initial support for LLaVA-NeXT (vllm-project#4199) [Misc] Improve error message when LoRA parsing fails (vllm-project#5194) [misc][typo] fix typo (vllm-project#5372) [Frontend][Misc] Enforce Pixel Values as Input Type for VLMs in API Server (vllm-project#5374) [Misc] Update to comply with the new `compressed-tensors` config (vllm-project#5350) [Bugfix] Fix KeyError: 1 When Using LoRA adapters (vllm-project#5164) [Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (vllm-project#5047) [mis][ci/test] fix flaky test in test_sharded_state_loader.py (vllm-project#5361) ...
[rank0]: torch.ops._C.rms_norm(out, input, weight, epsilon) when build the vllm from source, this error happens, how to fix it? |
Hi @changyuanzhangchina , can you provide some more context? Is this from a clean build area? What were you running when you got this error? |
hi, we have found pip install -e . works well. while python setup.py install (develop) can incur the error. thanks. |
Hi @bnellnm, we find there's a small performance regression(5%) with this commit on CPU backend:
The profile data didn't show big time increase in the those kernels so I think the overhead may due to the API change itself. Not sure if this also impact the CUDA backend performance, do you happen to have some ideas on this issue? thanks, -yuan |
@zhouyuan we can try to run this down. Can you post repro instructions for your performance benchmark (scripts and hardware setup)? |
I didn't see any regressions in the CUDA based benchmarks. I don't know why this would only affect CPU performance. |
@tlrmchlsmth @bnellnm thanks for the quick check! The throughput tests(llama2-7b) are running on a 56c SPR server(4th gen Xeon with AMX instructions, link) with Indeed this is an odd issue as from my understanding the tensor does not get changed, only some constants type change(int -> int64_t). Will try to collect more profile data and update back. thanks, -yuan |
Regarding the speed issue, I have a library that has a c++ extention that was linked to python using |
…or custom ops (vllm-project#5047)" This reverts commit 5467ac3.
This PR makes the following changes
PYBIND11_MODULE
withTORCH_LIBRARY
and adds schemas and meta functions (where) needed for all custom (C++/CUDA) kernels.Motivation
TORCH_LIBRARY
is the more official way of defining custom pytorch ops and will also help support use of torch.compile on vllm models. See: [Misc]: Move from usingPYBIND11_MODULE
macro to bind C++/CUDA kernels to python to usingTORCH_LIBRARY
macro neuralmagic/nm-vllm#133Details
Tensor!
indicates modification of an input argument. I did my best to get these correct but it is possible I missed a few, so these could use careful review. I don't think this will be much of an issue untiltorch.compile
starts to be used._custom_ops.py
and added a function to check for support of a particular kernel based on name (is_custom_op_supported
) so optional extensions can be checked at runtime.import
the extension shared libs rather than usetorch.ops.load_library
since that requires a platform and abi specific filename.get_graph_buffer_ipc_meta
needed to be changed so that it could be registered viaTORCH_LIBRARY
. The handles were changed to atorch::Tensor
rather than astd::vector<uint8_t>
. I'm not quite sure of the device registration for thecustom_ar
ops. I don't know if they should be CPU or CUDA (or both).