
Support remote inference on Triton + TensorRT or vLLM or TGI #1997

Closed
percyliang opened this issue Nov 10, 2023 · 2 comments · Fixed by #2402

Labels: enhancement (New feature or request), models, p2 (Priority 2: Good to have for release)

Comments

@percyliang (Contributor)

The preferred way to run models is to stand up an inference server (e.g., Triton + TensorRT, vLLM, or TGI) locally and then hit it from HELM as an API. This way, HELM benefits from all the inference optimizations those servers provide. We need to demonstrate a proof of concept and write docs for this.
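
For reference, a minimal sketch of the pattern, using vLLM's OpenAI-compatible server and plain `requests` (the model name, port, and endpoint are vLLM defaults assumed here for illustration; this is not HELM client code, which is what #1975 / #2402 add):

```python
# Sketch: query a locally running vLLM server through its OpenAI-compatible API.
# Start the server first, e.g.:
#   python -m vllm.entrypoints.openai.api_server \
#       --model mistralai/Mistral-7B-v0.1 --port 8000
import requests

BASE_URL = "http://localhost:8000/v1"  # assumed default vLLM port


def complete(prompt: str, max_tokens: int = 64) -> str:
    """Send a single completion request to the local inference server."""
    response = requests.post(
        f"{BASE_URL}/completions",
        json={
            "model": "mistralai/Mistral-7B-v0.1",  # must match the served model
            "prompt": prompt,
            "max_tokens": max_tokens,
            "temperature": 0.0,
        },
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["text"]


if __name__ == "__main__":
    print(complete("The capital of France is"))
```

The same request shape works against any OpenAI-compatible endpoint, so a HELM client built around it would cover vLLM and, with minor changes, TGI as well.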

@yifanmai (Collaborator) commented Nov 10, 2023

Opened draft PR #1975 for vLLM.

yifanmai changed the title from "run fast inference on custom models" to "Support remote inference on Triton + TensorRT or vLLM or TGI" on Jan 9, 2024
yifanmai added the enhancement (New feature or request), p2 (Priority 2: Good to have for release), and models labels on Jan 9, 2024
@yifanmai (Collaborator)

The TGI part is duplicated by #1866.
I don't know of any users asking for Triton currently, so I will deprioritize that.
