Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops #5047

Merged
merged 41 commits into from
Jun 9, 2024

Conversation

bnellnm
Copy link
Contributor

@bnellnm bnellnm commented May 25, 2024

This PR makes the following changes

  • replaces the uses of PYBIND11_MODULE with TORCH_LIBRARY and adds schemas and meta functions (where) needed for all custom (C++/CUDA) kernels.
  • The remainder of the custom operators are wrapped in _custom_ops.py, i.e. cuda_utils, cache_ops, moe, custom_ar and punica.
  • The code + libraries are made to use the python stable api. See [Feature]: bind python and c++ through tools other than pybind11 #4694

Motivation

Details

  • The signatures of a number of kernels needed to be modified to adhere to pytorch registration, e.g. int -> int64_t, float -> double
  • Explicit function schemas have been added for kernels that modify their inputs, e.g. Tensor! indicates modification of an input argument. I did my best to get these correct but it is possible I missed a few, so these could use careful review. I don't think this will be much of an issue until torch.compile starts to be used.
  • Any new kernels will need to be registered properly. Generally, kernels that modify their inputs will require a hand written schema (they can't be properly inferred just by the C++ signature). And kernels that return Tensors will require a meta function. The meta functions can simply return an empty Tensor that has the proper shape, type and device. If a meta function is not provided then the operation will cause a graph break when torch compiling.
  • I've wrapped all the kernels in _custom_ops.py and added a function to check for support of a particular kernel based on name (is_custom_op_supported) so optional extensions can be checked at runtime.
  • I've kept the ability to import the extension shared libs rather than use torch.ops.load_library since that requires a platform and abi specific filename.
  • The implementation and signature of get_graph_buffer_ipc_meta needed to be changed so that it could be registered via TORCH_LIBRARY. The handles were changed to a torch::Tensor rather than a std::vector<uint8_t>. I'm not quite sure of the device registration for the custom_ar ops. I don't know if they should be CPU or CUDA (or both).

@youkaichao
Copy link
Member

Is it possible to directly change the bind system to solve #4694 ?

@bnellnm
Copy link
Contributor Author

bnellnm commented May 29, 2024

Is it possible to directly change the bind system to solve #4694 ?

I'm not sure but I can look into that. I don't think torch uses pybind internally for registering operations so it might be possible.

@bnellnm
Copy link
Contributor Author

bnellnm commented May 30, 2024

Is it possible to directly change the bind system to solve #4694 ?

I'm not sure but I can look into that. I don't think torch uses pybind internally for registering operations so it might be possible.

I did a little investigating and the custom_ar would be a problem since the api uses types that TORCH_LIBRARY doesn't support, e.g. std::vector of int64_t/std::string/uint8_t. Maybe some of these could be changed to torch::Tensor instead?

@youkaichao
Copy link
Member

I did a little investigating and the custom_ar would be a problem since the api uses types that TORCH_LIBRARY doesn't support, e.g. std::vector of int64_t/std::string/uint8_t. Maybe some of these could be changed to torch::Tensor instead?

I can make this change.

Anything else?

@youkaichao
Copy link
Member

BTW, what is the initial motivation of changing the binding system to TORCH_LIBRARY?

@bnellnm
Copy link
Contributor Author

bnellnm commented May 31, 2024

I did a little investigating and the custom_ar would be a problem since the api uses types that TORCH_LIBRARY doesn't support, e.g. std::vector of int64_t/std::string/uint8_t. Maybe some of these could be changed to torch::Tensor instead?

I can make this change.

Anything else?

This wasn't quite as bad as I thought. I was able to get the custom_ar ops registered with the following changes to the API, so I think I can make the implementation changes also.

using fptr_t = int64_t;
fptr_t init_custom_ar(torch::Tensor& meta, torch::Tensor& rank_data,
                      const std::vector<std::string>& handles,
                      const std::vector<int64_t>& offsets, int64_t rank,
                      bool full_nvlink);
bool should_custom_ar(torch::Tensor& inp, int64_t max_size, int64_t world_size,
                      bool full_nvlink);
void all_reduce_reg(fptr_t _fa, torch::Tensor& inp, torch::Tensor& out);
void all_reduce_unreg(fptr_t _fa, torch::Tensor& inp, torch::Tensor& reg_buffer,
                      torch::Tensor& out);
void dispose(fptr_t _fa);
int64_t meta_size();
void register_buffer(fptr_t _fa, torch::Tensor& t,
                     const std::vector<std::string>& handles,
                     const std::vector<int64_t>& offsets);
std::tuple<std::vector<std::string>, std::vector<int64_t>> get_graph_buffer_ipc_meta(
    fptr_t _fa);
void register_graph_buffers(fptr_t _fa, const std::vector<std::string>& handles,
                            const std::vector<std::vector<int64_t>>& offsets);

@bnellnm
Copy link
Contributor Author

bnellnm commented May 31, 2024

BTW, what is the initial motivation of changing the binding system to TORCH_LIBRARY?

I am working on a custom backed for torch.compile that will be used to do optimizations on models, e.g. epilog fusion, quantization, etc. I was running into issues with the custom operators that turned out to be solvable by using TORCH_LIBRARY instead of PYBIND11_MODULE.

@bnellnm bnellnm changed the title Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops [Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops Jun 1, 2024
@bnellnm bnellnm marked this pull request as ready for review June 1, 2024 21:50
@bnellnm bnellnm marked this pull request as draft June 4, 2024 15:48
csrc/pybind.cpp Outdated Show resolved Hide resolved
@bnellnm bnellnm marked this pull request as ready for review June 4, 2024 19:30
Copy link
Contributor

@SageMoore SageMoore left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, Bill. Thanks for working on this.

@bnellnm
Copy link
Contributor Author

bnellnm commented Jun 4, 2024

cc @WoosukKwon , @youkaichao , @pcmoritz , @mgoin

Copy link
Member

@youkaichao youkaichao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, my initial reuqest was to build python agnostic wheels, and this PR achieves the goal, so I give my approval. Please wait for the rest people to comment on the other pespectives.

@bnellnm
Copy link
Contributor Author

bnellnm commented Jun 5, 2024

LGTM, my initial reuqest was to build python agnostic wheels, and this PR achieves the goal, so I give my approval. Please wait for the rest people to comment on the other pespectives.

No problem. Thanks for the review!

csrc/cache.h Show resolved Hide resolved
csrc/cpu/pybind.cpp Outdated Show resolved Hide resolved
csrc/moe/moe_ops.cpp Outdated Show resolved Hide resolved
Copy link
Collaborator

@pcmoritz pcmoritz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, thanks for doing this! Before merging, can you:

  • Rename pybind to something that makes it clearer that TORCH_LIBRARY is being used? E.g. torch_library_bindings
  • Investigate if there is an impact on the performance (e.g. by changing float to double and int to int64_t -- btw, why is this needed, do you have some references about this)

@bnellnm
Copy link
Contributor Author

bnellnm commented Jun 7, 2024

This looks great, thanks for doing this! Before merging, can you:

  • Rename pybind to something that makes it clearer that TORCH_LIBRARY is being used? E.g. torch_library_bindings
  • Investigate if there is an impact on the performance (e.g. by changing float to double and int to int64_t -- btw, why is this needed, do you have some references about this)

Pytorch only supports double and int64_t for scalars. The note is buried in this link: https://pytorch.org/tutorials/advanced/torch_script_custom_ops.html. Also, python represents its float types as C/C++ double precision. See https://docs.python.org/3/library/stdtypes.html

I can double check performance but I imagine that pybind11 was already downcasting python double-floats to C floats.

csrc/cache.h Show resolved Hide resolved
@mgoin mgoin merged commit 5467ac3 into vllm-project:main Jun 9, 2024
103 checks passed
@mgoin mgoin deleted the torch-library branch June 9, 2024 20:23
dtrifiro pushed a commit to opendatahub-io/vllm that referenced this pull request Jun 10, 2024
robertgshaw2-redhat pushed a commit to neuralmagic/nm-vllm that referenced this pull request Jun 11, 2024
tjohnson31415 added a commit to tjohnson31415/vllm that referenced this pull request Jun 11, 2024
* upstream/main: (126 commits)
  [Bugfix][Frontend] Cleanup "fix chat logprobs" (vllm-project#5026)
  [Bugfix] OpenAI entrypoint limits logprobs while ignoring server defined --max-logprobs (vllm-project#5312)
  [Misc] Various simplifications and typing fixes (vllm-project#5368)
  [ci] Fix Buildkite agent path (vllm-project#5392)
  [Doc] Add documentation for FP8 W8A8 (vllm-project#5388)
  Bump version to v0.5.0 (vllm-project#5384)
  [Docs] Alphabetically sort sponsors (vllm-project#5386)
  [Docs] Add Docs on Limitations of VLM Support (vllm-project#5383)
  [ci] Mount buildkite agent on Docker container to upload benchmark results (vllm-project#5330)
  [ci] Use small_cpu_queue for doc build (vllm-project#5331)
  [Bugfix] Fix LLaVA-NeXT (vllm-project#5380)
  [Feature][Frontend]:  Continued `stream_options` implementation also in CompletionRequest (vllm-project#5319)
  [Model] Initial support for LLaVA-NeXT (vllm-project#4199)
  [Misc] Improve error message when LoRA parsing fails (vllm-project#5194)
  [misc][typo] fix typo (vllm-project#5372)
  [Frontend][Misc] Enforce Pixel Values as Input Type for VLMs in API Server (vllm-project#5374)
  [Misc] Update to comply with the new `compressed-tensors` config (vllm-project#5350)
  [Bugfix] Fix KeyError: 1 When Using LoRA adapters (vllm-project#5164)
  [Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (vllm-project#5047)
  [mis][ci/test] fix flaky test in test_sharded_state_loader.py (vllm-project#5361)
  ...
@changyuanzhangchina
Copy link

[rank0]: torch.ops._C.rms_norm(out, input, weight, epsilon)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 921, in getattr
[rank0]: raise AttributeError(
[rank0]: AttributeError: '_OpNamespace' '_C' object has no attribute 'rms_norm'

when build the vllm from source, this error happens, how to fix it?

@bnellnm
Copy link
Contributor Author

bnellnm commented Jun 11, 2024

[rank0]: torch.ops._C.rms_norm(out, input, weight, epsilon) [rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 921, in getattr [rank0]: raise AttributeError( [rank0]: AttributeError: '_OpNamespace' '_C' object has no attribute 'rms_norm'

when build the vllm from source, this error happens, how to fix it?

Hi @changyuanzhangchina , can you provide some more context? Is this from a clean build area? What were you running when you got this error?

@changyuanzhangchina
Copy link

[rank0]: torch.ops._C.rms_norm(out, input, weight, epsilon) [rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 921, in getattr [rank0]: raise AttributeError( [rank0]: AttributeError: '_OpNamespace' '_C' object has no attribute 'rms_norm'
when build the vllm from source, this error happens, how to fix it?

Hi @changyuanzhangchina , can you provide some more context? Is this from a clean build area? What were you running when you got this error?

hi, we have found pip install -e . works well.

while python setup.py install (develop) can incur the error.

thanks.

@zhouyuan
Copy link
Contributor

Hi @bnellnm, we find there's a small performance regression(5%) with this commit on CPU backend:

  • on 5467ac3, the throughput performance is 502 tok/s
  • on 5d7e3d0, the throughput performance is 533 tok/s

The profile data didn't show big time increase in the those kernels so I think the overhead may due to the API change itself. Not sure if this also impact the CUDA backend performance, do you happen to have some ideas on this issue?

CC @bigPYJ1151 @WoosukKwon

thanks, -yuan

@tlrmchlsmth
Copy link
Collaborator

tlrmchlsmth commented Jun 15, 2024

@zhouyuan we can try to run this down.

Can you post repro instructions for your performance benchmark (scripts and hardware setup)?
Also if you could file an issue that would be super helpful.

@bnellnm
Copy link
Contributor Author

bnellnm commented Jun 15, 2024

Hi @bnellnm, we find there's a small performance regression(5%) with this commit on CPU backend:

  • on 5467ac3, the throughput performance is 502 tok/s
  • on 5d7e3d0, the throughput performance is 533 tok/s

The profile data didn't show big time increase in the those kernels so I think the overhead may due to the API change itself. Not sure if this also impact the CUDA backend performance, do you happen to have some ideas on this issue?

CC @bigPYJ1151 @WoosukKwon

thanks, -yuan

I didn't see any regressions in the CUDA based benchmarks. I don't know why this would only affect CPU performance.

@zhouyuan
Copy link
Contributor

zhouyuan commented Jun 15, 2024

@tlrmchlsmth @bnellnm thanks for the quick check! The throughput tests(llama2-7b) are running on a 56c SPR server(4th gen Xeon with AMX instructions, link) with benchmark_throughput.py + sharedgpt dataset. the latency test using benchmark_latency.py seems fine and no such regression found in my tests.

Indeed this is an odd issue as from my understanding the tensor does not get changed, only some constants type change(int -> int64_t). Will try to collect more profile data and update back.

thanks, -yuan

joerunde pushed a commit to joerunde/vllm that referenced this pull request Jun 17, 2024
xjpang pushed a commit to xjpang/vllm that referenced this pull request Jun 27, 2024
@MahmoudAshraf97
Copy link
Contributor

Regarding the speed issue, I have a library that has a c++ extention that was linked to python using TORCH_LIBRARY because it was ported from torchaudio, in this commit I switched TORCH_LIBRARY to PYBIND11_MODULE and I gained around 40% speedup, although the code had no intensive math or operations, only tensor reading and accessing, so my guess is that the slowdown in vllm case might be related to how fast both TORCH_LIBRARY and PYBIND11_MODULE handle memory accessing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants