Optimize the KV transfer pipe implementation #3
Main changes
Avoid extra `cudaMemcpyH2D` calls by removing `to(self.device)` from the transfer path.
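As a rough illustration only (this is not the actual vLLM pipe code; the class, method names, and metadata layout below are hypothetical), one way to keep the memcpy count down is to pack all per-tensor metadata into a single fixed-size tensor so the send path pays one H2D copy instead of moving several small CPU tensors with `to(self.device)`:

```python
# Hypothetical sketch, not the real implementation: one packed metadata
# tensor -> one cudaMemcpyH2D, and the KV payload is assumed to already
# live on the NCCL device so it needs no .to(self.device) copy at all.
import torch
import torch.distributed as dist

MAX_DIMS = 8  # fixed metadata layout: [ndim, shape..., zero padding]

class KVPipeSketch:
    def __init__(self, device: torch.device, peer_rank: int):
        self.device = device
        self.peer_rank = peer_rank

    def send_tensor(self, t: torch.Tensor) -> None:
        assert t.dim() <= MAX_DIMS
        meta_cpu = torch.zeros(1 + MAX_DIMS, dtype=torch.int64)
        meta_cpu[0] = t.dim()
        meta_cpu[1:1 + t.dim()] = torch.tensor(t.shape, dtype=torch.int64)
        meta = meta_cpu.to(self.device)          # single H2D copy for all metadata
        dist.send(meta, dst=self.peer_rank)
        dist.send(t, dst=self.peer_rank)         # payload already on self.device

    def recv_tensor(self, dtype: torch.dtype) -> torch.Tensor:
        meta = torch.zeros(1 + MAX_DIMS, dtype=torch.int64, device=self.device)
        dist.recv(meta, src=self.peer_rank)
        meta_cpu = meta.cpu()                    # single D2H copy for all metadata
        ndim = int(meta_cpu[0])
        shape = meta_cpu[1:1 + ndim].tolist()
        buf = torch.empty(shape, dtype=dtype, device=self.device)
        dist.recv(buf, src=self.peer_rank)
        return buf
```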
Performance benchmark
The numbers below were measured between two A40 GPUs WITHOUT NVLink.
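A measurement like this can be reproduced with a simple two-rank timing harness along the following lines (a sketch under assumptions, not the script used for the numbers above; it assumes `torch.distributed` is already initialized with the NCCL backend and each rank is pinned to one GPU):

```python
# Hypothetical timing harness: wall time of a single NCCL send/recv of a
# ~4 MB BF16 tensor (2000 tokens x 1024 hidden dims x 1 layer).
import torch
import torch.distributed as dist

def time_transfer_ms(rank: int, tokens: int = 2000, hidden: int = 1024) -> float:
    device = torch.device("cuda", 0)  # assumes CUDA_VISIBLE_DEVICES pins one GPU per rank
    t = torch.randn(tokens, hidden, dtype=torch.bfloat16, device=device)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    if rank == 0:
        dist.send(t, dst=1)
    else:
        dist.recv(t, src=0)
    end.record()
    torch.cuda.synchronize()
    # In practice, warm up first and average over many iterations.
    return start.elapsed_time(end)
```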
Notes
We should still avoid sending too many tensors. The current implementation needs a single H2D/D2H memcpy when dealing with metadata, and each H2D/D2H call incurs a fixed 100~200us overhead. (The overhead mainly comes from synchronization between the NCCL stream and the default stream.)
It takes 400~500us for NCCL to send/recv a tensor of ~4 MB (2000 tokens * 1024 hidden dimensions * 1 layer * BF16), which means the H2D/D2H memcpy overhead can be as large as 20~30%.
This overhead shrinks as the tensor grows larger: when sending 80 layers together, the estimated overhead is only 2~3%.
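For reference, the back-of-the-envelope arithmetic behind these sizes (assuming BF16 = 2 bytes per element; the exact percentages depend on how the fixed per-call cost is amortized):

```python
# Rough size arithmetic behind the figures above (assumption: BF16 = 2 bytes/element).
BYTES_PER_BF16 = 2

def kv_mib(tokens: int, hidden: int, layers: int) -> float:
    """KV payload size in MiB."""
    return tokens * hidden * layers * BYTES_PER_BF16 / 2**20

one_layer = kv_mib(2000, 1024, 1)       # ~3.9 MiB -> 400~500us over NCCL (no NVLink)
eighty_layers = kv_mib(2000, 1024, 80)  # ~312 MiB -> transfer time dominates

# A fixed 100~200us H2D/D2H cost next to a 400~500us transfer is what gives the
# 20~30% figure; amortized over an 80-layer transfer it drops to the low single
# digits (the note above estimates 2~3%).
print(f"{one_layer:.1f} MiB per layer, {eighty_layers:.0f} MiB for 80 layers")
```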
@KuntaiDu Please also test it in NVLink environments and let me know if there are any unexpected problems.