Add one example to run batch inference distributed on Ray #2696
Conversation
Co-authored-by: Zhe Zhang <zhz@anyscale.com>
This is a small and clean example. lgtm cc @simon-mo
Signed-off-by: Cheng Su <scnju13@gmail.com>
Let's verify this works with tensor parallelism.
@Yard1 - confirmed tensor parallelism works in testing.
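For reference, a minimal sketch of how tensor parallelism could be enabled in the predictor class; the `tensor_parallel_size` value and scheduling notes are illustrative assumptions, not code from this PR:

```python
from vllm import LLM


class LLMPredictor:
    def __init__(self):
        # With tensor_parallel_size > 1, vLLM shards the model weights across
        # multiple GPUs, so each Ray Data actor replica must be given access to
        # that many GPUs (in practice via placement-group scheduling when vLLM
        # itself uses Ray as its distributed backend).
        self.llm = LLM(
            model="meta-llama/Llama-2-7b-chat-hf",
            tensor_parallel_size=2,  # illustrative value
        )
```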
This PR adds an example of running vLLM batch inference in a multi-node environment.
Ray Data is used to orchestrate the workflow (see the sketch below).
Tested with 58k prompts and meta-llama/Llama-2-7b-chat-hf; the whole job took 5 minutes on 10 L4 GPU nodes.
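For context, here is a minimal sketch of the pattern the example follows, assuming prompts are read from a hypothetical S3 text file; the paths, batch size, and concurrency values are illustrative, and the exact `map_batches` arguments may differ slightly across Ray versions:

```python
from typing import Dict

import numpy as np
import ray
from vllm import LLM, SamplingParams


class LLMPredictor:
    def __init__(self):
        # Each actor replica loads one copy of the model onto its assigned GPU.
        self.llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")
        self.sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    def __call__(self, batch: Dict[str, np.ndarray]) -> Dict[str, list]:
        # Run generation on a batch of prompts and return prompt/output pairs.
        outputs = self.llm.generate(batch["text"].tolist(), self.sampling_params)
        return {
            "prompt": [o.prompt for o in outputs],
            "generated_text": [o.outputs[0].text for o in outputs],
        }


# Read prompts into a Ray Dataset (hypothetical bucket path).
ds = ray.data.read_text("s3://my-bucket/prompts.txt")

# Apply the predictor with one GPU per actor; Ray schedules the replicas
# across all nodes in the cluster (e.g. 10 L4 GPU nodes).
ds = ds.map_batches(
    LLMPredictor,
    concurrency=10,  # number of vLLM actor replicas (illustrative)
    num_gpus=1,      # GPUs reserved per replica
    batch_size=32,   # prompts per call (illustrative)
)

# Write results back out as Parquet (hypothetical output path).
ds.write_parquet("s3://my-bucket/outputs")
```

The key design point is that Ray Data handles sharding the prompt set, scheduling the actor pool across nodes, and streaming results to storage, while each actor simply wraps a local vLLM `LLM` instance.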