Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[aDAG] Raise proper error message for nccl within the same actor #47250

Merged
merged 8 commits into from
Aug 22, 2024

Conversation

ruisearch42
Copy link
Contributor

@ruisearch42 ruisearch42 commented Aug 21, 2024

Why are these changes needed?

When NCCL type hint is used between methods from the same actor, misleading error message is raised:

(MyActor pid=95377) AssertionError: Channel 70fb3f47-dfcc-47f8-81b7-53a5fcf56dce does not exist in the buffer.

We should raise error message with proper information and is actionable.

Note that when user specifies a NCCL type hint between methods of the same actor, we don't want to implicitly change to use IntroProcessChannel underneath, which would make the behavior of aDAG ambiguous. Instead, we should raise a clear error and the user would be able to easily fix their aDAG program by removing the type hint.

Related issue number

Closes #47235

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Copy link
Member

@kevin85421 kevin85421 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be duplicate with:

logger.error(
"Detected a deadlock caused by using NCCL channels to "
f"transfer data between the task `{method}` and "
f"its downstream method `{downstream_method}` on the same "
f"actor {actor_handle}. Please remove "
'`TorchTensorType(transport="nccl")` between '
"DAG nodes on the same actor."
)

@ruisearch42
Copy link
Contributor Author

This seems to be duplicate with:

logger.error(
"Detected a deadlock caused by using NCCL channels to "
f"transfer data between the task `{method}` and "
f"its downstream method `{downstream_method}` on the same "
f"actor {actor_handle}. Please remove "
'`TorchTensorType(transport="nccl")` between '
"DAG nodes on the same actor."
)

Good call @kevin85421
I think this is by default enabled, right?

@woshiyyya just to confirm, you got the misleading error message when RAY_ADAG_ENABLE_DETECT_DEADLOCK is manually turned off?

Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
@woshiyyya
Copy link
Member

woshiyyya commented Aug 21, 2024

@ruisearch42 yes I manually set RAY_ADAG_ENABLE_DETECT_DEADLOCK=0.

@rkooo567
Copy link
Contributor

@woshiyyya why do you need to manually set this now? do we understand the problem?

@woshiyyya
Copy link
Member

woshiyyya commented Aug 21, 2024

@rkooo567 yeah I think we still need to disable it.

Screenshot 2024-08-21 at 1 39 07 PM

In the distMM DAG, we still have a "crossing" : text1.agg_act -> vision1.bwd and vision1.agg_act -> text1.bwd here. If we don't disable it, the deadlock detection algorithm will raise an error, because it is designed for the ADAG before this PR: #46911.

Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
@kevin85421
Copy link
Member

kevin85421 commented Aug 21, 2024

I think we just need to update the deadlock detection, and then we can consider using IntraProcessChannel directly when we find two DAG nodes are on the same actor.

@ruisearch42
Copy link
Contributor Author

I think we just need to update the deadlock detection.

I think we will need a new algorithm? How long would that take? I feel that's not trivial and may take some time, right?

@kevin85421
Copy link
Member

How about we make the change as lightweight as possible? For example, we can add an assert in:

for tensor in tensors:
# TODO: If there are multiple readers, can replace with a
# broadcast.
for rank in self._reader_ranks:
self._nccl_group.send(tensor, rank)

instead to make sure the sender / receiver are not the same rank.

@kevin85421
Copy link
Member

In addition, it is better not to update preprocess in my opinion. preprocess is a recursive function. Updating it may introduce more complexity than our imagination.

Copy link
Member

@kevin85421 kevin85421 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thought: separate the check from _detect_deadlock, and then checked before _detect_deadlock.

from ray.dag.constants import RAY_ADAG_ENABLE_DETECT_DEADLOCK
if RAY_ADAG_ENABLE_DETECT_DEADLOCK and self._detect_deadlock():
raise ValueError(
"This DAG cannot be compiled because it will deadlock on NCCL "
"calls. If you believe this is a false positive, please disable "
"the graph verification by setting the environment variable "
"RAY_ADAG_ENABLE_DETECT_DEADLOCK to 0 and file an issue at "
"https://github.com/ray-project/ray/issues/new/."
)

That is,

        from ray.dag.constants import RAY_ADAG_ENABLE_DETECT_DEADLOCK

        # detect whether using NCCL to pass tensors between DAG nodes on the same actor.

        if RAY_ADAG_ENABLE_DETECT_DEADLOCK and self._detect_deadlock():
            raise ValueError(
                "This DAG cannot be compiled because it will deadlock on NCCL "
                "calls. If you believe this is a false positive, please disable "
                "the graph verification by setting the environment variable "
                "RAY_ADAG_ENABLE_DETECT_DEADLOCK to 0 and file an issue at "
                "https://github.com/ray-project/ray/issues/new/."
            )

@ruisearch42
Copy link
Contributor Author

How about we make the change as lightweight as possible? For example, we can add an assert in:

for tensor in tensors:
# TODO: If there are multiple readers, can replace with a
# broadcast.
for rank in self._reader_ranks:
self._nccl_group.send(tensor, rank)

instead to make sure the sender / receiver are not the same rank.

Interesting thought. We want to have compile time checks rather than runtime checks though.

@ruisearch42
Copy link
Contributor Author

In addition, it is better not to update preprocess in my opinion. preprocess is a recursive function. Updating it may introduce more complexity than our imagination.

That's the standard place where we do input validations so I wouldn't worry about it.

Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
@kevin85421
Copy link
Member

Interesting thought. We want to have compile time checks rather than runtime checks though.

Yep, that's why I prefer to detect in deadlock detection. I just thought as a workaround. I prefer to make it as light-weight as possible.

That's the standard place where we do input validations so I wouldn't worry about it.

It's ok for me. It's just my personal preference to unify the validation logic so that we can easily manage it.

I will start reviewing another part. Would you mind opening an issue to track the progress of the follow-up?

Copy link
Contributor

@rkooo567 rkooo567 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have special preference if we should do this in preprocess vs get_or_compile.

Besides, can you

  • simplify tests
  • there's one more nit comment

python/ray/dag/compiled_dag_node.py Outdated Show resolved Hide resolved
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
@ruisearch42 ruisearch42 added the go add ONLY when ready to merge, run all tests label Aug 22, 2024
@rkooo567 rkooo567 enabled auto-merge (squash) August 22, 2024 00:41
@github-actions github-actions bot disabled auto-merge August 22, 2024 05:10
@can-anyscale can-anyscale merged commit aa3a0a7 into ray-project:master Aug 22, 2024
4 of 5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Misleading error message when decorating channel from an actor to itself with NCCL type hint
5 participants