[coll] Add nccl. #9726
Conversation
cc @rongou.
```cpp
  } << [&] { return nccl->Block(); };
}

[[nodiscard]] Result NCCLColl::AllgatherV(Comm const& comm, common::Span<std::int8_t const> data,
```
I think the original implementation is based on this: https://arxiv.org/abs/1812.05964. Have you looked at the performance implications?
Not yet; it's quite difficult to set up a benchmark for the existing one since it doesn't support thread-based multi-GPU and doesn't have a Python interface.
I have implemented both algorithms for both CPU and GPU. For GPU, the bcast is indeed faster for small-size communication. For CPU the difference seems to be minimal, probably just due to my very primitive implementation. I think I will leave both implementations here for now and defer the formal benchmark.
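For reference, a minimal sketch of what the rotating-broadcast variant might look like over NCCL, assuming contiguous per-rank segments; the name `BroadcastAllgatherV` and the offset/size bookkeeping are illustrative, not the actual code in this PR:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

#include <cuda_runtime.h>
#include <nccl.h>

// Hypothetical sketch of the rotating-broadcast allgather-v: in round r, rank
// r broadcasts its own segment into everyone's recv buffer at offsets[r].
// ncclBroadcast reads from `data` on the root and writes into `recv` on all
// ranks, so no send->recv memcpy is required. Error handling elided.
ncclResult_t BroadcastAllgatherV(std::int8_t const* data, std::int8_t* recv,
                                 std::vector<std::size_t> const& sizes,
                                 std::vector<std::size_t> const& offsets, int rank,
                                 int world, ncclComm_t comm, cudaStream_t stream) {
  ncclGroupStart();  // aggregate all broadcasts into one grouped launch
  for (int r = 0; r < world; ++r) {
    void const* send = (r == rank) ? static_cast<void const*>(data)
                                   : static_cast<void const*>(recv + offsets[r]);
    ncclBroadcast(send, recv + offsets[r], sizes[r], ncclInt8, /*root=*/r, comm,
                  stream);
  }
  return ncclGroupEnd();
}
```

The small-message advantage presumably comes from aggregating everything into a single grouped launch, whereas a ring pays world-1 rounds of latency.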
Yeah for GPU the memcpy can be expensive.
I haven't looked through nsys; are you talking about the memcpy inside ncclSend/Recv?
Different from the existing one for allgather-v, this uses a ring-based custom implementation instead of the rotating broadcast.
Force-pushed from b09f516 to f08a958.
src/collective/coll.cu (outdated)
```cpp
  CHECK(nccl);
  // get worker offset
  detail::AllgatherVOffset(sizes, recv_segments);
  // copy data
```
Is this still needed for the broadcast implementation?
Yes, we need to copy the data from the send buffer to the recv buffer.
The original implementation doesn't have a copy, right? I think this is only needed for the ring implementation.
I can try to detect the pointer ranges and see whether send and recv overlap, then use in-place when possible. Is this what you mean?
In the broadcast implementation, there are two separate parameters `data` and `recv`, so `data` is broadcast into `recv` and there's no need to copy first; while in the ring implementation there is only `recv`. I think all you need to do here is to move the memcpy under the `kRing` case.
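Something along these lines, as a sketch only; the enum `AllgatherVAlgo` and the surrounding function are hypothetical stand-ins for whatever coll.cu actually uses:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

#include <cuda_runtime.h>
#include <nccl.h>

// Hypothetical stand-in for the algorithm enum used in coll.cu.
enum class AllgatherVAlgo { kRing, kBcast };

// Only the ring path gathers in place out of `recv`, so only it needs the
// worker's own segment copied into place first.
void DispatchAllgatherV(AllgatherVAlgo algo, std::int8_t const* data,
                        std::size_t n_bytes, std::int8_t* recv,
                        std::vector<std::size_t> const& recv_segments, int rank,
                        cudaStream_t stream) {
  switch (algo) {
    case AllgatherVAlgo::kRing:
      // Copy this worker's segment into its slot in the recv buffer before
      // the ring rounds start exchanging segments in place.
      cudaMemcpyAsync(recv + recv_segments[rank], data, n_bytes,
                      cudaMemcpyDeviceToDevice, stream);
      // ... ring rounds over ncclSend/ncclRecv (see the sketch below) ...
      break;
    case AllgatherVAlgo::kBcast:
      // ncclBroadcast reads from `data` on the root directly (see the sketch
      // above); no preliminary copy is needed.
      break;
  }
}
```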
Ah, got it, thank you for the explanation!
Different from the existing one for allgather-v, this uses a ring-based custom implementation instead of rotating broadcasts. We can use a similar path for bit-wise allreduce; I will leave that as a future to-do item.
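For the record, a minimal sketch of a ring allgather-v built on NCCL point-to-point primitives; the helper name and bookkeeping are hypothetical, and error handling is elided (the real implementation is in this PR's coll.cu):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

#include <cuda_runtime.h>
#include <nccl.h>

// Hypothetical sketch of a ring allgather-v. Each worker's own segment has
// already been copied into `recv` at offsets[rank]; in round r, every rank
// forwards the segment it most recently received to its right neighbour.
ncclResult_t RingAllgatherV(std::int8_t* recv, std::vector<std::size_t> const& sizes,
                            std::vector<std::size_t> const& offsets, int rank,
                            int world, ncclComm_t comm, cudaStream_t stream) {
  int const next = (rank + 1) % world;
  int const prev = (rank + world - 1) % world;
  for (int r = 0; r < world - 1; ++r) {
    int const send_seg = (rank + world - r) % world;  // starts with own segment
    int const recv_seg = (prev + world - r) % world;  // what prev sends this round
    ncclGroupStart();  // pair the send and recv so they cannot deadlock
    ncclSend(recv + offsets[send_seg], sizes[send_seg], ncclInt8, next, comm, stream);
    ncclRecv(recv + offsets[recv_seg], sizes[recv_seg], ncclInt8, prev, comm, stream);
    ncclGroupEnd();
  }
  return ncclSuccess;
}
```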