
[Feature]: Small Model Large Latency Compared to SGLang and TensorRT-LLM #7339

Closed · CambioML opened this issue on Aug 9, 2024 · 2 comments

Labels: duplicate (This issue or pull request already exists), feature request

Comments


CambioML commented Aug 9, 2024

🚀 The feature, motivation and pitch

According to this post, https://lmsys.org/blog/2024-07-25-sglang-llama3/, vLLM appears to be inefficient for small models in both the online and offline benchmarks. What is the bottleneck for vLLM on small-model inference, and will it be addressed so that vLLM catches up with SGLang and TensorRT-LLM performance?
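For context, here is a minimal offline-throughput sketch using vLLM's Python API. The model name, batch size, and sampling parameters below are placeholder assumptions, not the exact setup from the blog post:

```python
import time

from vllm import LLM, SamplingParams

# Placeholder small model; the blog post benchmarked Llama 3 variants.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")

# Synthetic batch; real benchmarks typically sample prompts from a
# shared dataset such as ShareGPT.
prompts = ["Hello, my name is"] * 64
params = SamplingParams(temperature=0.0, max_tokens=128, ignore_eos=True)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(out.outputs[0].token_ids) for out in outputs)
print(f"{generated} tokens in {elapsed:.2f}s -> {generated / elapsed:.1f} tok/s")
```

Fixing the output length with `ignore_eos=True` keeps every request generating the same number of tokens, so the resulting tokens-per-second figure is comparable across engines.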

Alternatives

No response

Additional context

No response

mgoin added the duplicate label on Aug 9, 2024
mgoin (Member) commented Aug 9, 2024

This is a duplicate issue and is currently being tracked/addressed by the Performance Roadmap meta-issue #6801.

robertgshaw2-neuralmagic (Collaborator) commented

Closing per Michael's comment.
