[coll] Add nccl. #9726
Conversation
cc @rongou.
```cpp
  } << [&] { return nccl->Block(); };
}

[[nodiscard]] Result NCCLColl::AllgatherV(Comm const& comm, common::Span<std::int8_t const> data,
```
I think the original implementation is based on this: https://arxiv.org/abs/1812.05964. Have you looked at the performance implications?
Not yet; it's quite difficult to set up a benchmark for the existing one since it doesn't support thread-based multi-GPU and doesn't have a Python interface.
I have implemented both algorithms for both CPU and GPU. For GPU, the bcast is indeed faster for small-size communication. For CPU the difference seems to be minimal, probably just due to my very primitive implementation. I think I will leave both implementations here for now and defer the formal benchmark.
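For reference, a minimal sketch of what the rotating-broadcast variant might look like over NCCL, assuming contiguous per-rank segments; the name `BroadcastAllgatherV` and the offset/size bookkeeping are illustrative, not the actual code in this PR:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

#include <cuda_runtime.h>
#include <nccl.h>

// Hypothetical sketch of the rotating-broadcast allgather-v: in round r, rank
// r broadcasts its own segment into everyone's recv buffer at offsets[r].
// ncclBroadcast reads from `data` on the root and writes into `recv` on all
// ranks, so no send->recv memcpy is required. Error handling elided.
ncclResult_t BroadcastAllgatherV(std::int8_t const* data, std::int8_t* recv,
                                 std::vector<std::size_t> const& sizes,
                                 std::vector<std::size_t> const& offsets, int rank,
                                 int world, ncclComm_t comm, cudaStream_t stream) {
  ncclGroupStart();  // aggregate all broadcasts into one grouped launch
  for (int r = 0; r < world; ++r) {
    void const* send = (r == rank) ? static_cast<void const*>(data)
                                   : static_cast<void const*>(recv + offsets[r]);
    ncclBroadcast(send, recv + offsets[r], sizes[r], ncclInt8, /*root=*/r, comm,
                  stream);
  }
  return ncclGroupEnd();
}
```

The small-message advantage presumably comes from aggregating everything into a single grouped launch, whereas a ring pays world-1 rounds of latency.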
Yeah for GPU the memcpy can be expensive.
I haven't looked through nsys; are you talking about the memcpy inside ncclSend/Recv?
Different from the existing one for allgather-v, this uses a ring-based custom implementation instead of the rotating broadcast.
Force-pushed from b09f516 to f08a958.
src/collective/coll.cu (outdated)
```cpp
  CHECK(nccl);
  // get worker offset
  detail::AllgatherVOffset(sizes, recv_segments);
  // copy data
```
Is this still needed for the broadcast implementation?
Yes, we need to copy the data from the send buffer to the recv buffer.
The original implementation doesn't have a copy, right? I think this is only needed for the ring implementation.
I can try to detect the pointer ranges and see whether send and recv overlap, then use in-place when possible. Is this what you mean?
In the broadcast implementation, there are two separate parameters `data` and `recv`, so `data` is broadcast into `recv` and there's no need to copy first; while in the ring implementation there is only `recv`. I think all you need to do here is to move the memcpy under the `kRing` case.
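Something along these lines, as a sketch only; the enum `AllgatherVAlgo` and the surrounding function are hypothetical stand-ins for whatever coll.cu actually uses:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

#include <cuda_runtime.h>
#include <nccl.h>

// Hypothetical stand-in for the algorithm enum used in coll.cu.
enum class AllgatherVAlgo { kRing, kBcast };

// Only the ring path gathers in place out of `recv`, so only it needs the
// worker's own segment copied into place first.
void DispatchAllgatherV(AllgatherVAlgo algo, std::int8_t const* data,
                        std::size_t n_bytes, std::int8_t* recv,
                        std::vector<std::size_t> const& recv_segments, int rank,
                        cudaStream_t stream) {
  switch (algo) {
    case AllgatherVAlgo::kRing:
      // Copy this worker's segment into its slot in the recv buffer before
      // the ring rounds start exchanging segments in place.
      cudaMemcpyAsync(recv + recv_segments[rank], data, n_bytes,
                      cudaMemcpyDeviceToDevice, stream);
      // ... ring rounds over ncclSend/ncclRecv (see the sketch below) ...
      break;
    case AllgatherVAlgo::kBcast:
      // ncclBroadcast reads from `data` on the root directly (see the sketch
      // above); no preliminary copy is needed.
      break;
  }
}
```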
Ah, got it, thank you for the explanation!
Different from the existing one for allgather-v, this uses a ring-based custom implementation instead of rotating broadcasts. We can use a similar path for bit-wise allreduce; I will leave that as a future to-do item.
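For the record, a minimal sketch of a ring allgather-v built on NCCL point-to-point primitives; the helper name and bookkeeping are hypothetical, and error handling is elided (the real implementation is in this PR's coll.cu):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

#include <cuda_runtime.h>
#include <nccl.h>

// Hypothetical sketch of a ring allgather-v. Each worker's own segment has
// already been copied into `recv` at offsets[rank]; in round r, every rank
// forwards the segment it most recently received to its right neighbour.
ncclResult_t RingAllgatherV(std::int8_t* recv, std::vector<std::size_t> const& sizes,
                            std::vector<std::size_t> const& offsets, int rank,
                            int world, ncclComm_t comm, cudaStream_t stream) {
  int const next = (rank + 1) % world;
  int const prev = (rank + world - 1) % world;
  for (int r = 0; r < world - 1; ++r) {
    int const send_seg = (rank + world - r) % world;  // starts with own segment
    int const recv_seg = (prev + world - r) % world;  // what prev sends this round
    ncclGroupStart();  // pair the send and recv so they cannot deadlock
    ncclSend(recv + offsets[send_seg], sizes[send_seg], ncclInt8, next, comm, stream);
    ncclRecv(recv + offsets[recv_seg], sizes[recv_seg], ncclInt8, prev, comm, stream);
    ncclGroupEnd();
  }
  return ncclSuccess;
}
```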