[RFC]: Single Program Multiple Data (SPMD) Worker Control Plane #6556

Closed · Tracked by #6801
ruisearch42 opened this issue Jul 19, 2024 · 8 comments

@ruisearch42 (Collaborator)

Motivation.

TL;DR: Introduce an SPMD-style control plane to improve the control plane architecture and optimize performance.

For distributed inference, vLLM currently uses a “driver-worker” alongside the other workers. As shown in the diagram below, this driver-worker lives in the same process as the driver. It prepares the arguments, then broadcasts them to all other workers to execute the sharded model, using NCCL as the control plane.

[Diagram: current architecture. The driver-worker, colocated with the driver, broadcasts prepared arguments to the other workers over NCCL.]

This architecture has a few drawbacks. First, the driver-worker must participate in the NCCL group and execute the model. Since NCCL broadcast is a synchronous operation, this interferes with other driver functionality such as scheduling, hurting performance.
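For concreteness, here is a minimal sketch of this broadcast-based pattern, assuming a torch.distributed process group with an NCCL backend; `run_model_shard` is a hypothetical stand-in, and the real control plane broadcasts richer metadata than a single tensor:

```python
import torch
import torch.distributed as dist

def run_model_shard(model_input):
    # Hypothetical stand-in for executing this rank's model shard.
    return model_input

def driver_worker_step(model_input: torch.Tensor):
    # Rank 0 (the driver-worker) prepares arguments and broadcasts them.
    # dist.broadcast is synchronous, so the driver process is blocked here
    # instead of doing other work such as scheduling the next batch.
    dist.broadcast(model_input, src=0)
    return run_model_shard(model_input)

def worker_step(shape, dtype, device):
    # Ranks > 0 block until rank 0's broadcast arrives, then execute.
    model_input = torch.empty(shape, dtype=dtype, device=device)
    dist.broadcast(model_input, src=0)
    return run_model_shard(model_input)
```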

Moreover, this architecture makes it difficult to support speculative decoding. Specifically:

  1. The speculative decoding framework may not run the draft model if Dynamic Speculative Decoding (DSD) or another policy is enabled. In this case, the decision of whether to run the draft model must be communicated to the other ranks, so DSD cannot work with TP>1 unless there is additional communication (which incurs latency overhead).
  2. Pipeline parallelism can be composed within the speculative decoding framework. However, the speculative tokens must be sent to all workers, e.g. cross-node. With SPMD, all PP ranks have access to the same information, and we don't need any communication on top of normal PP. This is important for latency.

Proposed Change.

We propose an architecture change to support an SPMD-style control plane, as shown in the diagram below.

[Diagram: proposed architecture. The LLMEngine/driver passes inputs to all SPMD workers over a Ray DAG channel; each worker prepares arguments and executes its model shard.]

Specifically, we remove the argument-preparation and model-execution functionality from the driver and make all workers SPMD-style: the LLMEngine/driver now passes the input to all SPMD workers via a Ray DAG channel (shared memory), and each worker prepares its arguments and executes its model shard. The results are passed back to the driver via the Ray DAG channel as well.
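A rough sketch of this wiring using Ray's DAG API is shown below; the `Worker` class and its methods are illustrative stand-ins rather than vLLM's actual interfaces, and this assumes a Ray version where `experimental_compile` (accelerated DAG) is available:

```python
import ray
from ray.dag import InputNode, MultiOutputNode

@ray.remote
class Worker:
    def __init__(self, rank: int):
        self.rank = rank

    def execute_model(self, inputs):
        # Each SPMD worker prepares its own arguments from the shared
        # inputs and executes its model shard; no NCCL broadcast of
        # control metadata is required.
        return f"rank {self.rank} executed {inputs!r}"  # placeholder

tp_size = 2
workers = [Worker.remote(rank) for rank in range(tp_size)]

# The driver writes the input once; the DAG channel (shared memory) fans
# it out to every worker, and the results flow back the same way.
with InputNode() as inputs:
    dag = MultiOutputNode([w.execute_model.bind(inputs) for w in workers])
compiled_dag = dag.experimental_compile()

outputs = ray.get(compiled_dag.execute("batch-0"))
```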

Roadmap

SPMD functionality and optimizations:

Features to build on top of SPMD:

  • Pipeline parallelism with Ray accelerated DAG
  • Speculative decoding

After comprehensive benchmarking and optimization, SPMD will become the default, and the NCCL-based control plane code path will be cleaned up.

Feedback Period.

No response

CC List.

@youkaichao @stephanie-wang @rkooo567 @cadedaniel

Any Other Things.

No response

@cadedaniel (Collaborator)

cc @LiuXiaoxuanPKU for DSD

@njhill (Member)

njhill commented Jul 31, 2024

@ruisearch42 sorry for the delay, but I have a few questions. Coming from TGI, which takes an SPMD approach, I actually saw the way the driver worker participates in the collective communications as an advantage in terms of reduced data movement.

  1. The speculative decoding framework may not run the draft model if Dynamic Speculative Decoding (DSD) or another policy is enabled. In this case, the decision of whether to run the draft model must be communicated to the other ranks, so DSD cannot work with TP>1 unless there is additional communication (which incurs latency overhead).

To clarify, we're talking specifically about TP>1 for the draft model here, right? Not for draft tp=1 and target tp>1?

Could you elaborate on how SPMD in particular solves this? If the decision is per top-level spec-decoding step, why can't that be included in the existing metadata that's broadcast from the driver?

  2. Pipeline parallelism can be composed within the speculative decoding framework. However, the speculative tokens must be sent to all workers, e.g. cross-node. With SPMD, all PP ranks have access to the same information, and we don't need any communication on top of normal PP. This is important for latency.

Is the thinking that the layers of the draft model itself would also be distributed between nodes with PP? Could you explain why the PP ranks don't have access to the same information with the current implementation?

I am probably missing some obvious things here so apologies in advance!

@cadedaniel (Collaborator)

  1. The speculative decoding framework may not run the draft model if Dynamic Speculative Decoding (DSD) or another policy is enabled. In this case, the decision of whether to run the draft model must be communicated to the other ranks, so DSD cannot work with TP>1 unless there is additional communication (which incurs latency overhead).

To clarify, we're talking specifically about TP>1 for the draft model here, right? Not for draft tp=1 and target tp>1?

Could you elaborate on how SPMD in particular solves this? If the decision is per top-level spec-decoding step, why can't that be included in the existing metadata that's broadcast from the driver?

First, to lay out the problem clearly:

  1. When draft_tp>1 and target_tp>1, the non-zero ranks do not know whether they should run the draft model for a given input. This is because there can be a policy to turn off speculation which only rank0 is privy to (currently either DSD or if all sequences are too long for the proposal model).
  2. When draft_tp=1 and target_tp>1, the non-zero ranks do not know what the proposal tokens are, because only rank0 runs the draft model. These must be communicated in some way to the other ranks so that they may form a batch for scoring with the target model.

Succinctly, we need to (1) communicate the result of any dynamic speculation policy from rank0 to nonzero ranks, and (2) communicate the proposal tokens from rank0 to nonzero ranks.

To implement this, there are two different options I can think of:

  • Communicate the same input to all workers, SPMD-style, so that all workers have perfect information. No subsequent control-flow communication is required, since dynamic speculation policies can be made deterministic, and all ranks already have access to the sampled tokens because the sampler does an allgather to get logits. (A toy sketch of this option follows below.)
  • Add explicit control-flow communication after the dynamic speculation policy runs, and another explicit control-flow communication after the draft model runs to communicate the proposal tokens. All ranks then have the dynamic speculation policy decision and the proposal tokens.

SPMD is better than the alternative for two reasons:

  • Latency: we only need to write the input metadata once to shared memory; then all TP processes can read it and execute their control flow without any further communication. You cannot do better than this in terms of latency, and building equivalently low-latency control communication on the GPU requires more complex software engineering.
  • Separation of concerns: Once the input metadata information is communicated, the worker logic can execute unobstructed without further control plane communication. This simplifies the software (don't have to deal with control-flow communication deadlocks!), and also allows composability of workers within workers for more advanced algorithms such as staged speculative decoding.

The downside of SPMD is that we waste energy on the machine: we expend the FLOPs for the draft model on every GPU instead of on only one GPU. This is an acceptable tradeoff given current requirements.
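To make the first option concrete, here is a toy sketch of a deterministic speculation policy evaluated independently on every rank; all names are hypothetical, not vLLM's actual spec-decode interfaces:

```python
from dataclasses import dataclass

@dataclass
class SpecDecodeInputs:
    seq_lens: list[int]     # per-sequence lengths in the batch
    max_proposal_len: int   # longest sequence the draft model supports

def should_speculate(inputs: SpecDecodeInputs) -> bool:
    # Deterministic: depends only on the shared inputs, so every rank
    # computes the same decision without any control-flow communication.
    return all(l < inputs.max_proposal_len for l in inputs.seq_lens)

# Hypothetical stand-ins for the actual model calls:
def run_draft_model(inputs): return ["<proposal tokens>"]
def score_with_target_model(inputs, proposals): return ["<verified tokens>"]
def run_target_model(inputs): return ["<decoded token>"]

def spmd_worker_step(inputs: SpecDecodeInputs):
    # Runs identically on every rank: with SPMD the draft model is
    # replicated, so each rank takes the same branch locally.
    if should_speculate(inputs):
        proposals = run_draft_model(inputs)
        return score_with_target_model(inputs, proposals)
    return run_target_model(inputs)
```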

  2. Pipeline parallelism can be composed within the speculative decoding framework. However, the speculative tokens must be sent to all workers, e.g. cross-node. With SPMD, all PP ranks have access to the same information, and we don't need any communication on top of normal PP. This is important for latency.

Is the thinking that the layers of the draft model itself would also be distributed between nodes with PP? Could you explain why the PP ranks don't have access to the same information with the current implementation?

I am probably missing some obvious things here so apologies in advance!

The primary concern here is separation of concerns, so that the proposal method can use the deployment configuration that best matches what the user wants. If they are using PP for the target model and PP is also a good fit for the draft model, they are free to do so without refactoring the framework.

I am not sure whether this will be a popular configuration or one we support, but one could imagine such a scenario when the user is trying to balance the speculative workload across PP ranks. We could add control-flow communication for requirements (1) and (2) above in PP, or we could simply communicate the initial state to the world and have each rank perform work according to its identity.

@njhill (Member)

njhill commented Aug 2, 2024

Thanks @cadedaniel for taking the time to explain this in so much detail.

Still trying to wrap my head around it fully. I guess I still don't see why the driver can't participate, with a single metadata broadcast per top-level step (as already happens for non-spec-decode), and within each top-level step the actions can still be mirrored (and can include proposer + target, or a staged chain, etc.), including in the driver.

@cadedaniel (Collaborator)

cadedaniel commented Aug 4, 2024

Still trying to wrap my head around it fully. I guess I still don't see why the driver can't participate, with a single metadata broadcast per top-level step (as already happens for non-spec-decode), and within each top-level step the actions can still be mirrored (and can include proposer + target, or a staged chain, etc.), including in the driver.

Yeah, what you describe is exactly the goal. This is "SPMD" since each rank runs the same program. The question is where that metadata broadcast happens: for spec decode we need it to happen above the worker entrypoint, so that workers may assume all peers have the same information.

@rkooo567 (Collaborator)

rkooo567 commented Aug 6, 2024

The first PR is out #7109


github-actions bot commented Nov 6, 2024

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions bot added the stale label on Nov 6, 2024

github-actions bot commented Dec 6, 2024

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!

github-actions bot closed this as not planned on Dec 6, 2024