
[Misc] Use torch.compile for basic custom ops #7110

Closed · wants to merge 16 commits

Conversation

WoosukKwon (Collaborator) commented Aug 3, 2024

This PR introduces torch.compile for the following basic custom ops: activations and RMSNorm.

The main goals are:

  1. Reduce the number of custom kernels maintained by vLLM. (I intentionally kept the kernels for now, but will delete the unused ones once people agree with the direction of this PR.) Previous attempt: [Hardware][Intel] Generate custom activation ops using torch.compile for CPU backend. #5446
  2. Optimize GemmaRMSNorm. This leads to 20% throughput improvement for Gemma2-27B on 1xH100.
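For readers skimming the thread, here is a minimal sketch of the idea, assuming hypothetical class and method names (this is not the exact code from the PR): an activation such as silu_and_mul is written as a plain static PyTorch function and wrapped with torch.compile, letting Inductor generate a fused kernel instead of maintaining a hand-written CUDA kernel.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SiluAndMul(nn.Module):
    """Illustrative stand-in for vLLM's SiluAndMul custom op (names assumed)."""

    def __init__(self) -> None:
        super().__init__()
        # Compile the pure-PyTorch implementation once; Inductor can fuse the
        # slicing, SiLU, and multiply into a single kernel.
        self._compiled_forward = torch.compile(self.forward_static)

    @staticmethod
    def forward_static(x: torch.Tensor) -> torch.Tensor:
        # x has shape (..., 2 * d): SiLU on the first half, gated by the second.
        d = x.shape[-1] // 2
        return F.silu(x[..., :d]) * x[..., d:]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self._compiled_forward(x)
```

The same pattern applies to RMSNorm-style ops: express the math in plain PyTorch as a forward_static function and let torch.compile handle fusion.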

github-actions bot commented Aug 3, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which consists of a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build on the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add the ready label to the PR
  • Enable auto-merge

🚀

WoosukKwon (Collaborator, Author) commented:

cc @youkaichao @bigPYJ1151 @bnellnm

bigPYJ1151 (Contributor) left a comment:

Hi @WoosukKwon, this PR looks good! I have verified it on the CPU backend and it worked well with and without multiprocessing.

To make this PR work on the CPU backend, we should add triton >= 3.0.0 to requirements-cpu.txt to avoid an import error. It looks like a bug in torch 2.4.
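Concretely, the suggested fix is a single line added to requirements-cpu.txt (exact placement within the file is up to the maintainers):

```text
triton >= 3.0.0
```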

"""
if not self._is_compiled and not envs.VLLM_TEST_DYNAMO_GRAPH_CAPTURE:
self.forward_static = torch.compile( # type: ignore
self.forward_static,
Contributor commented:
Setting dynamic=True explicitly can reduce recompilations caused by the dynamic batch size. Maybe CUDA behaves similarly.
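As a rough illustration of this suggestion (the function name is made up, and this is a generic RMSNorm rather than the PR's exact code), dynamic=True marks input shapes as symbolic so that varying batch sizes reuse one compiled graph:

```python
import torch


def rms_norm_static(x: torch.Tensor, weight: torch.Tensor,
                    eps: float = 1e-6) -> torch.Tensor:
    # Plain-PyTorch RMSNorm, for illustration only.
    variance = x.pow(2).mean(dim=-1, keepdim=True)
    return x * torch.rsqrt(variance + eps) * weight


# dynamic=True asks Dynamo to treat shapes (e.g. the batch dimension) as
# symbolic, so a changing batch size reuses one compiled graph instead of
# triggering a recompilation for every new shape.
compiled_rms_norm = torch.compile(rms_norm_static, dynamic=True)
```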

Inline review comment on the following lines:

```python
        self.forward_static = torch.compile(  # type: ignore
            self.forward_static,
            options={
                "fx_graph_cache": True,
```
Contributor commented:

Suggested change (delete this line):

```python
                "fx_graph_cache": True,
```

This option causes lock contention when using multiprocessing.

jon-chuang (Contributor) commented Aug 8, 2024:

Maybe you can set a per-process fx_graph_cache

You can set the env var TORCHINDUCTOR_CACHE_DIR

See: https://pytorch.org/tutorials/recipes/torch_compile_caching_tutorial.html
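A minimal sketch of the per-process cache idea, assuming each worker knows its own rank (configure_inductor_cache and rank are hypothetical names, not vLLM API):

```python
import os
import tempfile


def configure_inductor_cache(rank: int) -> None:
    # Point TORCHINDUCTOR_CACHE_DIR at a per-process directory so that
    # concurrent workers do not contend for the same cache file locks.
    cache_dir = os.path.join(tempfile.gettempdir(), f"inductor_cache_rank_{rank}")
    os.makedirs(cache_dir, exist_ok=True)
    os.environ["TORCHINDUCTOR_CACHE_DIR"] = cache_dir


# Call this in each worker process before its first torch.compile call.
```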

WoosukKwon (Collaborator, Author) replied:

Removed. I think we can explore caching in a future PR.

Comment on lines 26 to 27:

```python
    @staticmethod
    def forward_static(x: torch.Tensor) -> torch.Tensor:
```
bnellnm (Contributor) commented Aug 5, 2024:

Does this completely eliminate the custom silu_and_mul kernel? If so, should it be removed from csrc?

Ditto for the rest of the custom activation ops.

WoosukKwon (Collaborator, Author) replied:

That's a good question. I think we can delete most of them, while leaving some (e.g., in csrc/legacy) for potential future use?

hmellor mentioned this pull request Aug 7, 2024
Inline review comment on the following lines:

```python
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # forward_native() is too complex to be optimized by torch.compile.
        # Fall back to the custom C++ kernel.
        return self.forward_cuda(positions, query, key, offsets)
```
Contributor commented:

Isn't it weird that forward_cpu calls into forward_cuda? 🤔

Contributor commented:

Maybe rename it if it dispatches to the C++ CPU kernel as well?

hmellor (Collaborator) commented Sep 12, 2024:

@WoosukKwon can we expect this PR to merge soon? Is there anything we can do to help it get merged?

hmellor (Collaborator) commented Nov 6, 2024:

@youkaichao has this been superseded by your torch.compile work?

WoosukKwon closed this Nov 7, 2024
WoosukKwon deleted the torch-compile-layers branch Nov 7, 2024 at 19:07
WoosukKwon (Collaborator, Author) commented:

@hmellor Thanks for bringing this up again. Yeah, I think we don't need this PR anymore since our torch.compile integration is more mature now. I think we could turn it on by default after a few weeks? cc @youkaichao

hmellor (Collaborator) commented Nov 7, 2024:

Thanks for the update!

When torch.compile gets enabled by default, could we remove the custom CPU ops that this PR intended to replace with torch.compile equivalents?
