[RFC]: Single Program Multiple Data (SPMD) Worker Control Plane #6556
Comments
cc @LiuXiaoxuanPKU for DSD
@ruisearch42 sorry for the delay but I have a few questions. Coming from TGI, which takes an SPMD approach, I actually saw the way that the driver worker participates in the collective communications as an advantage in terms of reduced data movement.
To clarify, we're talking specifically about TP>1 for the draft model here, right? Not for draft TP=1 and target TP>1? Could you elaborate on how SPMD in particular solves this? If the decision is per top-level spec-decoding step, why can't that be included in the existing metadata that's broadcast from the driver?
Is the thinking that the layers of the draft model itself would also be distributed between nodes with PP? Could you explain why the PP ranks don't have access to the same information with the current implementation? I am probably missing some obvious things here so apologies in advance!
First, to lay out the problem clearly:
Succinctly, we need to (1) communicate the result of any dynamic speculation policy from rank0 to nonzero ranks, and (2) communicate the proposal tokens from rank0 to nonzero ranks. To implement this, there are two different options I can think of: broadcast this information from rank0 to the nonzero ranks, or run the same speculation logic on every rank (SPMD) so that no extra communication is needed.
SPMD is better than the alternative for two reasons:
The downside of SPMD is that we waste energy on the machine, since we expend the FLOPs for the draft model on each GPU instead of on only one GPU. This is an acceptable tradeoff given current requirements.
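To make the SPMD option concrete, here is a minimal sketch (not vLLM's actual code; the speculation policy, draft model, and target model are hypothetical stand-ins passed in by the caller) of how each rank would satisfy requirements (1) and (2) without any rank0-to-nonzero-rank communication:

```python
def spmd_spec_decode_step(shared_inputs, speculation_policy, draft_model, target_model):
    # Every rank runs this identical function on identical inputs.
    # (1) Each rank evaluates the dynamic speculation policy locally; since the
    #     inputs are identical, every rank reaches the same decision.
    num_speculative_tokens = speculation_policy(shared_inputs)
    # (2) Each rank runs the draft model itself (the redundant FLOPs mentioned
    #     above), so the proposal tokens never need to be broadcast.
    proposals = draft_model.propose(shared_inputs, num_speculative_tokens)
    # Scoring with the target model then proceeds identically on all ranks.
    return target_model.score(shared_inputs, proposals)
```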
The primary concern here is separation of concerns, so that the proposal method can use whichever deployment configuration best fits what the user wants. So if they are using PP for the target model and PP is also good for the draft model, they are free to do so without refactoring the framework. I am not sure if this will be a popular configuration or one we support, but one could imagine such a scenario when the user is trying to balance the speculative workload across PP ranks. We could implement the control-flow communication for requirements (1) and (2) above in PP, or we could simply communicate the initial state to the world and have each rank perform work according to its identity.
Thanks @cadedaniel for taking the time to explain this in so much detail. Still trying to wrap my head around it fully. I guess I still don't see why the driver can't participate, with a single metadata broadcast per top-level step (as already happens for non-spec-decode), and within each top-level step the actions can still be mirrored (and can include proposer + target, or staged chain, etc.), including in the driver.
Yeah, what you describe is exactly the goal. This is "spmd" since each rank runs the same program. The question is where that metadata broadcast happens -- for spec decode we need it to happen above the worker entrypoint so that workers may assume all peers have the same information.
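As a rough illustration of that ordering (hypothetical names, plain Ray actor calls rather than vLLM's actual engine code): the engine hands the identical request to every worker before the worker entrypoint runs, so nothing inside `execute_model` has to broadcast.

```python
import ray

def engine_step(workers, execute_model_req):
    # The "metadata broadcast" happens here, above the worker entrypoint:
    # every rank receives an identical execute_model_req, so code inside the
    # workers' execute_model can assume all peers hold the same information.
    futures = [w.execute_model.remote(execute_model_req) for w in workers]
    return ray.get(futures)
```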
The first PR is out: #7109
Motivation.
TL;DR: Introduce SPMD-style control plane to improve control plane architecture and optimize performance.
For distributed inference, vLLM currently leverages a “driver-worker”, along with other workers. As shown in the diagram below, this driver-worker is in the same process as the driver. It prepares the arguments, then broadcasts them to all other workers to execute the sharded model, leveraging NCCL as the control plane.
This architecture has a few drawbacks. First, the driver-worker needs to participate in the NCCL group and execute the model. Since NCCL broadcast is a synchronous operation, this creates interference with other driver functionality such as scheduling and affects performance.
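For illustration, the existing pattern looks roughly like this (a sketch only; `prepare_inputs` and `execute_shard` are hypothetical stand-ins for the driver-worker's argument preparation and sharded model execution):

```python
import torch.distributed as dist

def driver_worker_step(rank, scheduler_output, prepare_inputs, execute_shard):
    # Rank 0 (the driver-worker) prepares the arguments; other ranks have
    # nothing to prepare and simply wait for the broadcast.
    payload = [prepare_inputs(scheduler_output)] if rank == 0 else [None]
    # Synchronous broadcast over the collective-communication group: the
    # driver-worker blocks here, which is the interference described above.
    dist.broadcast_object_list(payload, src=0)
    # Every rank, including the driver-worker, then executes its model shard.
    return execute_shard(payload[0])
```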
Moreover, this architecture makes it difficult to support speculative decoding: the result of any dynamic speculation policy and the proposed draft tokens have to be communicated from the driver-worker to the other ranks at every step.
Proposed Change.
We propose an architecture change to support SPMD-style control plane, as shown in the diagram below.
Specifically, we remove the argument preparation and model execution functionality from the driver and make all workers SPMD-style: the LLMEngine/driver now passes the input to all the SPMD workers via a Ray DAG channel (shared memory), and each worker prepares its arguments and executes its model shard. The results are passed back to the driver via the Ray DAG channel as well.
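A minimal sketch of this wiring, assuming Ray's experimental compiled-DAG API (`InputNode`, `MultiOutputNode`, `experimental_compile`); `SpmdWorker` and `run_model_shard` are hypothetical stand-ins rather than vLLM's actual classes:

```python
import ray
from ray.dag import InputNode, MultiOutputNode

@ray.remote(num_gpus=1)
class SpmdWorker:
    def execute_model(self, execute_model_req):
        # Each worker prepares its own arguments and executes its model
        # shard; there is no special driver-worker on this path.
        return run_model_shard(execute_model_req)  # hypothetical helper

tensor_parallel_size = 4
workers = [SpmdWorker.remote() for _ in range(tensor_parallel_size)]

# Wire the same engine input to every SPMD worker; the compiled DAG moves the
# data over shared-memory channels instead of an in-band NCCL broadcast.
with InputNode() as engine_input:
    dag = MultiOutputNode([w.execute_model.bind(engine_input) for w in workers])
compiled_dag = dag.experimental_compile()

# One engine step: execute_model_req would come from the engine's scheduler.
shard_outputs = ray.get(compiled_dag.execute(execute_model_req))
```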
Roadmap
SPMD functionality and optimizations:
Features to build on top of SPMD:
After comprehensive benchmarking and optimizations, SPMD will become the default and the NCCL-based control plane code path will be cleaned up.
Feedback Period.
No response
CC List.
@youkaichao @stephanie-wang @rkooo567 @cadedaniel
Any Other Things.
No response