Optimize the KV transfer pipe implementation #3
Main changes
Avoid extra `cudaMemcpyH2D` calls by removing `to(self.device)` from the transfer path.
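As a rough illustration only (this is not the actual vLLM pipe code; the class, method names, and metadata layout below are hypothetical), one way to keep the memcpy count down is to pack all per-tensor metadata into a single fixed-size tensor so the send path pays one H2D copy instead of moving several small CPU tensors with `to(self.device)`:

```python
# Hypothetical sketch, not the real implementation: one packed metadata
# tensor -> one cudaMemcpyH2D, and the KV payload is assumed to already
# live on the NCCL device so it needs no .to(self.device) copy at all.
import torch
import torch.distributed as dist

MAX_DIMS = 8  # fixed metadata layout: [ndim, shape..., zero padding]

class KVPipeSketch:
    def __init__(self, device: torch.device, peer_rank: int):
        self.device = device
        self.peer_rank = peer_rank

    def send_tensor(self, t: torch.Tensor) -> None:
        assert t.dim() <= MAX_DIMS
        meta_cpu = torch.zeros(1 + MAX_DIMS, dtype=torch.int64)
        meta_cpu[0] = t.dim()
        meta_cpu[1:1 + t.dim()] = torch.tensor(t.shape, dtype=torch.int64)
        meta = meta_cpu.to(self.device)          # single H2D copy for all metadata
        dist.send(meta, dst=self.peer_rank)
        dist.send(t, dst=self.peer_rank)         # payload already on self.device

    def recv_tensor(self, dtype: torch.dtype) -> torch.Tensor:
        meta = torch.zeros(1 + MAX_DIMS, dtype=torch.int64, device=self.device)
        dist.recv(meta, src=self.peer_rank)
        meta_cpu = meta.cpu()                    # single D2H copy for all metadata
        ndim = int(meta_cpu[0])
        shape = meta_cpu[1:1 + ndim].tolist()
        buf = torch.empty(shape, dtype=dtype, device=self.device)
        dist.recv(buf, src=self.peer_rank)
        return buf
```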
Performance benchmark
The numbers below were measured between two A40 GPUs WITHOUT NVLink.
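A measurement like this can be reproduced with a simple two-rank timing harness along the following lines (a sketch under assumptions, not the script used for the numbers above; it assumes `torch.distributed` is already initialized with the NCCL backend and each rank is pinned to one GPU):

```python
# Hypothetical timing harness: wall time of a single NCCL send/recv of a
# ~4 MB BF16 tensor (2000 tokens x 1024 hidden dims x 1 layer).
import torch
import torch.distributed as dist

def time_transfer_ms(rank: int, tokens: int = 2000, hidden: int = 1024) -> float:
    device = torch.device("cuda", 0)  # assumes CUDA_VISIBLE_DEVICES pins one GPU per rank
    t = torch.randn(tokens, hidden, dtype=torch.bfloat16, device=device)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    if rank == 0:
        dist.send(t, dst=1)
    else:
        dist.recv(t, src=0)
    end.record()
    torch.cuda.synchronize()
    # In practice, warm up first and average over many iterations.
    return start.elapsed_time(end)
```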
Notes
We should still avoid sending too many tensors. The current implementation needs a single H2D/D2H memcpy when dealing with metadata, and each H2D/D2H call incurs a fixed 100~200us overhead. (The overhead mainly comes from synchronization between the NCCL stream and the default stream.)
It takes 400~500us for NCCL to send/recv a tensor of ~4 MB (2000 tokens * 1024 hidden dimensions * 1 layer * BF16), which means the H2D/D2H memcpy overhead can be as large as 20~30%.
This overhead shrinks as the tensor grows larger: when sending 80 layers together, the estimated overhead is only 2~3%.
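For reference, the back-of-the-envelope arithmetic behind these sizes (assuming BF16 = 2 bytes per element; the exact percentages depend on how the fixed per-call cost is amortized):

```python
# Rough size arithmetic behind the figures above (assumption: BF16 = 2 bytes/element).
BYTES_PER_BF16 = 2

def kv_mib(tokens: int, hidden: int, layers: int) -> float:
    """KV payload size in MiB."""
    return tokens * hidden * layers * BYTES_PER_BF16 / 2**20

one_layer = kv_mib(2000, 1024, 1)       # ~3.9 MiB -> 400~500us over NCCL (no NVLink)
eighty_layers = kv_mib(2000, 1024, 80)  # ~312 MiB -> transfer time dominates

# A fixed 100~200us H2D/D2H cost next to a 400~500us transfer is what gives the
# 20~30% figure; amortized over an 80-layer transfer it drops to the low single
# digits (the note above estimates 2~3%).
print(f"{one_layer:.1f} MiB per layer, {eighty_layers:.0f} MiB for 80 layers")
```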
@KuntaiDu Please also test it in NVLink environments and let me know if there are any unexpected problems.