
Support remote inference on Triton + TensorRT or vLLM or TGI #1997

Closed
percyliang opened this issue Nov 10, 2023 · 2 comments · Fixed by #2402

Labels: enhancement (New feature or request), models, p2 (Priority 2: Good to have for release)

Comments

@percyliang (Contributor)

The preferred way to run models is to stand up an inference server (e.g., Triton + TensorRT, vLLM, or TGI) locally and then hit it from HELM as an API. This way, HELM benefits from all the inference optimizations those servers provide. We need to demonstrate a proof of concept and write docs for this.
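
For reference, a minimal sketch of the pattern, using vLLM's OpenAI-compatible server and plain `requests` (the model name, port, and endpoint are vLLM defaults assumed here for illustration; this is not HELM client code, which is what #1975 / #2402 add):

```python
# Sketch: query a locally running vLLM server through its OpenAI-compatible API.
# Start the server first, e.g.:
#   python -m vllm.entrypoints.openai.api_server \
#       --model mistralai/Mistral-7B-v0.1 --port 8000
import requests

BASE_URL = "http://localhost:8000/v1"  # assumed default vLLM port


def complete(prompt: str, max_tokens: int = 64) -> str:
    """Send a single completion request to the local inference server."""
    response = requests.post(
        f"{BASE_URL}/completions",
        json={
            "model": "mistralai/Mistral-7B-v0.1",  # must match the served model
            "prompt": prompt,
            "max_tokens": max_tokens,
            "temperature": 0.0,
        },
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["text"]


if __name__ == "__main__":
    print(complete("The capital of France is"))
```

The same request shape works against any OpenAI-compatible endpoint, so a HELM client built around it would cover vLLM and, with minor changes, TGI as well.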

@yifanmai (Collaborator) commented Nov 10, 2023

Opened draft PR #1975 for vLLM.

yifanmai changed the title from "run fast inference on custom models" to "Support remote inference on Triton + TensorRT or vLLM or TGI" on Jan 9, 2024
yifanmai added the enhancement (New feature or request), p2 (Priority 2: Good to have for release), and models labels on Jan 9, 2024
@yifanmai (Collaborator)

The TGI part is duplicated by #1866.
I don't know of any users asking for Triton currently, so I will deprioritize that.
