Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[aDAG] Allow custom NCCL group for aDAG #47141

Merged
merged 8 commits into from
Sep 6, 2024

Conversation

ruisearch42
Copy link
Contributor

@ruisearch42 ruisearch42 commented Aug 14, 2024

Why are these changes needed?

Allow custom NCCL group for aDAG so that we can reuse what the user already created.

Marking NcclGroupInterface as DeveloperAPI for now. After validation by using it in vLLM we can change to alpha stability.

vLLM prototype: vllm-project/vllm#7568

Related issue number

Closes #45593

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

python/ray/experimental/channel/nccl_group_interface.py Outdated Show resolved Hide resolved
python/ray/dag/dag_node.py Outdated Show resolved Hide resolved
"""

@abstractmethod
def get_rank(self, actor: ray.actor.ActorHandle) -> int:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how does it find the next rank correctly? (is there any assumption in the rank order that could be different from communicator that's passed?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to get the next rank? As long as there is a map from Actor to rank it will be sufficient?

Copy link
Contributor

@rkooo567 rkooo567 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good. Some last comments

python/ray/dag/compiled_dag_node.py Outdated Show resolved Hide resolved
@@ -776,7 +783,16 @@ def _preprocess(self) -> None:
if None in nccl_actors:
raise ValueError("Driver cannot participate in the NCCL group.")
if nccl_actors and self._nccl_group_id is None:
self._nccl_group_id = _init_nccl_group(nccl_actors)
if self._custom_nccl_group:
self._nccl_group_id = _set_nccl_group(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of having if/else here, is there any way to make the default cupy nccl group to follow the same interface?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ChannelContext.nccl_groups has the type Dict[str, "ActorGroup"], that is sufficient?

python/ray/experimental/channel/actor_group.py Outdated Show resolved Hide resolved
python/ray/experimental/channel/actor_group.py Outdated Show resolved Hide resolved
@ruisearch42 ruisearch42 force-pushed the custom_nccl branch 4 times, most recently from ef33b3e to bcaae80 Compare August 29, 2024 23:51
@ruisearch42 ruisearch42 marked this pull request as ready for review August 30, 2024 04:20
@ruisearch42 ruisearch42 added the go add ONLY when ready to merge, run all tests label Aug 30, 2024
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Copy link
Contributor

@rkooo567 rkooo567 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have 3 questions; Let's merge it after addressing these.

  1. Is "custom_nccl_channel" necessary? I wonder if we can just make it overwrite default nccl channel (so remove the term custom nccl channel, and we just allow to overwrite nccl channel using a handle)?
  2. Can you add tests with torch distributed (similar to vllm)?
  3. I wonder how the API will be changed with overlap compute/comm.

python/ray/dag/compiled_dag_node.py Outdated Show resolved Hide resolved
python/ray/experimental/channel/conftest.py Show resolved Hide resolved
python/ray/experimental/channel/gpu_communicator.py Outdated Show resolved Hide resolved
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
wip
Signed-off-by: Ubuntu <ubuntu@ip-172-31-15-128.us-west-2.compute.internal>
@ruisearch42 ruisearch42 marked this pull request as draft September 5, 2024 06:11
Ubuntu added 4 commits September 5, 2024 21:53
up
Signed-off-by: Ubuntu <ubuntu@ip-172-31-15-128.us-west-2.compute.internal>
up
Signed-off-by: Ubuntu <ubuntu@ip-172-31-15-128.us-west-2.compute.internal>
up
Signed-off-by: Ubuntu <ubuntu@ip-172-31-15-128.us-west-2.compute.internal>
up
Signed-off-by: Ubuntu <ubuntu@ip-172-31-15-128.us-west-2.compute.internal>
@ruisearch42 ruisearch42 marked this pull request as ready for review September 5, 2024 23:44
Copy link
Contributor

@rkooo567 rkooo567 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last comments. Let's merge it after addressing it

up
Signed-off-by: Ubuntu <ubuntu@ip-172-31-15-128.us-west-2.compute.internal>
@rkooo567 rkooo567 enabled auto-merge (squash) September 6, 2024 19:09
@rkooo567 rkooo567 merged commit bbeee55 into ray-project:master Sep 6, 2024
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[core][experimental] Allow custom NCCL group declaration in accelerated DAG
2 participants