[Ray Integration] Integrate vllm with experimental accelerated DAG API #2201

Closed
rkooo567 wants to merge 17 commits from the pathway-integration branch

Conversation

@rkooo567 (Collaborator) commented Dec 19, 2023

Hi, we have an experimental accelerated DAG developed based on REP ray-project/enhancements#48.

TL;DR: we are implementing a compiled DAG API that can reduce the control plane overhead of Ray (see _compiled_dag_init_dag for details). Our microbenchmark shows a 35x reduction in control plane overhead for scatter-gather workloads, which is how vllm implements tensor parallelism, i.e., sending a single input to N actors and gathering the results from all of them.
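
For readers unfamiliar with the API, here is a minimal sketch of that scatter-gather pattern, assuming the nightly ray.dag interface from around this time (InputNode, MultiOutputNode, .bind(), experimental_compile()); the TPWorker actor and its forward method are hypothetical stand-ins for vllm's Ray workers, and the channel begin_read()/end_read() calls reflect the experimental interface of that era.

```python
import ray
from ray.dag import InputNode, MultiOutputNode

@ray.remote
class TPWorker:
    """Hypothetical stand-in for one tensor-parallel vllm worker."""
    def forward(self, inputs):
        return f"shard output for {inputs!r}"

ray.init()
workers = [TPWorker.remote() for _ in range(4)]

# Build the scatter-gather DAG once: a single input fanned out to N actors,
# with all N outputs gathered back on the driver.
with InputNode() as inp:
    dag = MultiOutputNode([w.forward.bind(inp) for w in workers])

# Compiling the DAG ahead of time is what removes the per-step control plane
# overhead of issuing N separate .remote() calls and ray.get()s every step.
compiled_dag = dag.experimental_compile()

# Per-step execution reuses the compiled DAG; with the experimental channel
# interface of this era, outputs were read via begin_read()/end_read().
channels = compiled_dag.execute("batch-0")
outputs = [chan.begin_read() for chan in channels]
for chan in channels:
    chan.end_read()
print(outputs)
```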

PR summary

  • Support a path to run compiled DAG.
  • Use pickle for serialization instead of Ray's default serializer (cloudpickle). pickle is much cheaper for serializing the inputs vllm sends each step; cloudpickle is mainly useful when you have large data that can be zero-copied (see the sketch below).
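
As a concrete (hypothetical) illustration of the serialization point: the driver pickles the small per-step control structure with the stdlib pickle module and the worker unpickles it, rather than routing it through Ray's default cloudpickle-based serializer. The payload shape and function names below are made up for the example.

```python
import pickle

# Driver side: the per-step input vllm broadcasts to workers is small control
# metadata, so stdlib pickle is cheaper than cloudpickle here. (cloudpickle's
# zero-copy path mainly pays off for large buffers, which this is not.)
step_input = {"prompt_token_ids": [[1, 2, 3]], "blocks_to_swap_in": {}}  # hypothetical payload
payload = pickle.dumps(step_input)

# Worker side: deserialize the bytes received over the compiled DAG channel
# before running the model step.
def run_step(serialized_input: bytes):
    step_input = pickle.loads(serialized_input)
    # ... the real worker would execute the model with step_input here ...
    return pickle.dumps(step_input)  # results go back as plain pickle bytes too

print(pickle.loads(run_step(payload)))
```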

Benchmark

The experimental feature can be used with nightly Ray for evaluation. I ran the Llama 2 7B latency benchmark on g5.12xlarge (4 A10 GPUs) with TP=4 and 10 iterations:

python benchmark_latency.py --use-ray-compiled-dag --tensor-parallel-size 4 --num-iters 10 --model "meta-llama/Llama-2-7b-hf"

and got the following results:

TP=4, compiled DAG (pathway): 1.7389498465017823
TP=4, default: 2.0027053648009314

which is about a 13% improvement ((2.0027 - 1.7389) / 2.0027 ≈ 13.2%).

Limitations

Note that the current nightly implementation has several limitations. We are working on follow-ups, and the feature could be enabled by default once they are implemented.

Followup

I am planning a follow-up PR for these two features after https://github.com/ray-project/ray/pull/41943/files is merged. The ETA is 12/26 (when I am back from OOO). Please let me know if the async engine is a high-priority item for evaluation.

  • Better error handling
  • Support async engine

@rkooo567 rkooo567 closed this Dec 19, 2023
@rkooo567 rkooo567 reopened this Dec 19, 2023
@rkooo567 rkooo567 force-pushed the pathway-integration branch from 560c80f to d0721ac Compare December 19, 2023 12:31
@rkooo567 rkooo567 closed this Dec 19, 2023
@rkooo567 rkooo567 reopened this Dec 19, 2023
@rkooo567 rkooo567 changed the title todo [Ray Integration] Integrate vllm with experimental accelerated DAG Dec 19, 2023
@rkooo567 rkooo567 changed the title [Ray Integration] Integrate vllm with experimental accelerated DAG [Ray Integration] Integrate vllm with experimental accelerated DAG API Dec 19, 2023
@rkooo567 (Collaborator, Author) commented Dec 19, 2023

Q: Please let me know where I should add tests. I was thinking of adding a flag to test_models.py, but it seems like it doesn't really test the Ray config (where TP > 1).

@njhill (Member) commented Jan 13, 2024

Presumably this is obsolete now that #2221 is merged?

@rkooo567 (Collaborator, Author) commented:

@njhill we decided to contribute the feature disabled by default. We will productionize it internally within Anyscale and consider re-enabling it in the future.

@rkooo567 (Collaborator, Author) commented:

Decided to close this in favor of #2471.

@rkooo567 rkooo567 closed this Jan 18, 2024