
[Feature]: Small Model Large Latency Compared to SGLang and TensorRT-LLM #7339

Closed · CambioML opened this issue on Aug 9, 2024 · 2 comments

Labels: duplicate (This issue or pull request already exists), feature request

Comments


CambioML commented Aug 9, 2024

🚀 The feature, motivation and pitch

According to this post, https://lmsys.org/blog/2024-07-25-sglang-llama3/, vLLM appears to be inefficient for small models in both the online and offline benchmarks. What is the bottleneck for vLLM on small-model inference, and will it be addressed so that vLLM catches up with SGLang and TensorRT-LLM performance?
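For context, here is a minimal offline-throughput sketch using vLLM's Python API. The model name, batch size, and sampling parameters below are placeholder assumptions, not the exact setup from the blog post:

```python
import time

from vllm import LLM, SamplingParams

# Placeholder small model; the blog post benchmarked Llama 3 variants.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")

# Synthetic batch; real benchmarks typically sample prompts from a
# shared dataset such as ShareGPT.
prompts = ["Hello, my name is"] * 64
params = SamplingParams(temperature=0.0, max_tokens=128, ignore_eos=True)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(out.outputs[0].token_ids) for out in outputs)
print(f"{generated} tokens in {elapsed:.2f}s -> {generated / elapsed:.1f} tok/s")
```

Fixing the output length with `ignore_eos=True` keeps every request generating the same number of tokens, so the resulting tokens-per-second figure is comparable across engines.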

Alternatives

No response

Additional context

No response

mgoin added the duplicate label on Aug 9, 2024
mgoin (Member) commented Aug 9, 2024

This is a duplicate issue and is currently being tracked/addressed by the Performance Roadmap meta-issue #6801.

robertgshaw2-neuralmagic (Collaborator) commented

Closing per Michael's comment.
