
[coll] Add nccl. #9726

Merged
merged 8 commits into dmlc:master from rabit-nccl on Oct 28, 2023
Conversation

trivialfis
Member

@trivialfis trivialfis commented Oct 26, 2023

Different from the existing implementation of allgather-v, this uses a custom ring-based implementation instead of rotating broadcasts. We can use a similar path for bit-wise allreduce; I will leave that as a future to-do item.
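The ring scheme described above can be sketched as a plain Python simulation (hypothetical function name; the actual implementation is C++ on top of ncclSend/ncclRecv): each rank first copies its own variable-size chunk into its recv buffer, then over n-1 rounds each rank receives one segment from its left neighbour until every buffer holds the full concatenation.

```python
def ring_allgather_v(chunks):
    """Simulate a ring-based allgather-v over n ranks with variable-size
    chunks (a Python stand-in for the C++/NCCL code in this PR)."""
    n = len(chunks)
    # Each rank copies its own chunk into its recv buffer first
    # (the local memcpy discussed further down in this PR).
    recv = [[chunks[i] if r == i else None for r in range(n)] for i in range(n)]
    for step in range(n - 1):
        for rank in range(n):
            left = (rank - 1) % n
            # Segment forwarded along the ring this round; models
            # receiving it from the left neighbour's buffer.
            seg = (rank - 1 - step) % n
            recv[rank][seg] = recv[left][seg]
    return [b"".join(recv[i]) for i in range(n)]
```

After n-1 rounds every rank holds the same concatenated result, with each rank sending and receiving one segment per round instead of relying on n full broadcasts.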

@trivialfis trivialfis changed the title [coll Add nccl. [coll] Add nccl. Oct 26, 2023
@trivialfis
Member Author

cc @rongou .

} << [&] { return nccl->Block(); };
}

[[nodiscard]] Result NCCLColl::AllgatherV(Comm const& comm, common::Span<std::int8_t const> data,
Contributor

I think the original implementation is based on this: https://arxiv.org/abs/1812.05964. Have you looked at the performance implications?
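For comparison, the rotating-broadcast formulation that the existing implementation appears to follow can be simulated the same way (hypothetical name; a Python stand-in for the C++/NCCL code): in round r, rank r acts as the broadcast root for its own chunk, which lands directly in every rank's recv buffer, so no up-front local copy is needed.

```python
def bcast_allgather_v(chunks):
    """Simulate allgather-v built from n rotating broadcasts
    (a Python stand-in, not the actual implementation)."""
    n = len(chunks)
    recv = [[None] * n for _ in range(n)]
    for root in range(n):
        for rank in range(n):
            # Models a broadcast rooted at `root`: the data is written
            # straight into each rank's recv slot, including the root's
            # own slot, so there is no separate local memcpy.
            recv[rank][root] = chunks[root]
    return [b"".join(recv[i]) for i in range(n)]
```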

Member Author

Not yet; it's quite difficult to set up a benchmark for the existing one, since it doesn't support thread-based multi-GPU and doesn't have a Python interface.

Member Author

@trivialfis trivialfis Oct 27, 2023

I have implemented both algorithms for both CPU and GPU. For GPU, the broadcast is indeed faster for small-size communication. For CPU the difference seems minimal, though that may just be due to my very primitive implementation. I think I will leave both implementations here for now and defer the formal benchmark.

Contributor

Yeah for GPU the memcpy can be expensive.

Member Author

@trivialfis trivialfis Oct 27, 2023

I haven't looked through nsys yet; are you talking about the memcpy inside ncclSend/Recv?


CHECK(nccl);
// get worker offset
detail::AllgatherVOffset(sizes, recv_segments);
// copy data
Contributor

Is this still needed for the broadcast implementation?

Member Author

Yes, we need to copy the data from the send buffer to the recv buffer.

Contributor

The original implementation doesn't have a copy, right? I think this is only needed for the ring implementation.

Member Author

I can try to detect the pointer ranges and see whether send and recv overlap, then operate in place when possible. Is this what you mean?

Contributor

In the broadcast implementation there are two separate parameters, data and recv, so data is broadcast directly into recv and there is no need to copy first; in the ring implementation there is only recv. I think all you need to do here is move the memcpy under the kRing case.
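The suggestion above can be sketched as follows (a Python stand-in with hypothetical names; the real code is C++ dispatching between the kRing and broadcast paths): only the ring branch performs the up-front copy of the rank's own chunk into its recv buffer.

```python
def allgather_v(algo, rank, chunks):
    """Sketch of moving the local memcpy under the ring case only.

    algo: "ring" or "bcast" (hypothetical dispatch values).
    Returns this rank's fully assembled recv buffer.
    """
    n = len(chunks)
    recv = [None] * n
    if algo == "ring":
        # The memcpy lives under the kRing case only: the ring never
        # sends a rank its own segment, so it must be copied in first.
        recv[rank] = chunks[rank]
        for step in range(n - 1):
            seg = (rank - 1 - step) % n  # segment received from the left
            recv[seg] = chunks[seg]
    else:
        # Broadcast path: each round's root writes straight into recv,
        # including the root's own slot, so no up-front copy is needed.
        for root in range(n):
            recv[root] = chunks[root]
    return b"".join(recv)
```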

Member Author

Ah, got it, thank you for the explanation!

@trivialfis trivialfis merged commit 6755179 into dmlc:master Oct 28, 2023
21 checks passed
@trivialfis trivialfis deleted the rabit-nccl branch October 28, 2023 08:34