[aDAG] Raise proper error message for nccl within the same actor #47250
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
This seems to be duplicate with:
ray/python/ray/dag/compiled_dag_node.py
Lines 1408 to 1415 in c50e3b6
logger.error(
    "Detected a deadlock caused by using NCCL channels to "
    f"transfer data between the task `{method}` and "
    f"its downstream method `{downstream_method}` on the same "
    f"actor {actor_handle}. Please remove "
    '`TorchTensorType(transport="nccl")` between '
    "DAG nodes on the same actor."
)
Good call @kevin85421 @woshiyyya just to confirm, you got the misleading error message when
@ruisearch42 yes I manually set
@woshiyyya why do you need to manually set this now? do we understand the problem?
@rkooo567 yeah I think we still need to disable it. In the distMM DAG, we still have a "crossing" :
I think we just need to update the deadlock detection, and then we can consider using IntraProcessChannel directly when we find two DAG nodes are on the same actor.
I think we will need a new algorithm? How long would that take? I feel that's not trivial and may take some time, right?
How about we make the change as lightweight as possible? For example, we can add an assert in:
ray/python/ray/experimental/channel/torch_tensor_nccl_channel.py
Lines 376 to 380 in 63d6af3
instead to make sure the sender / receiver are not the same rank.
In addition, it is better not to update
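For illustration, the lightweight assert suggested above could look roughly like the sketch below. This is not the code in `torch_tensor_nccl_channel.py`; the function name `assert_distinct_ranks` and its parameters are hypothetical, and it only assumes that the channel knows the writer's rank and the readers' ranks when it is constructed.

```python
from typing import List


def assert_distinct_ranks(writer_rank: int, reader_ranks: List[int]) -> None:
    """Hypothetical guard: reject an NCCL channel whose writer is also a reader."""
    if writer_rank in reader_ranks:
        raise ValueError(
            f"NCCL channel has the same rank ({writer_rank}) as both sender and "
            "receiver. Please remove "
            '`TorchTensorType(transport="nccl")` between DAG nodes on the same actor.'
        )
```

Note that a guard like this only fires once the channel is being built, so it is closer to a runtime check than to DAG validation.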
Another thought: separate the check from _detect_deadlock, and then run it before _detect_deadlock.
ray/python/ray/dag/compiled_dag_node.py
Lines 1067 to 1076 in c50e3b6
from ray.dag.constants import RAY_ADAG_ENABLE_DETECT_DEADLOCK

if RAY_ADAG_ENABLE_DETECT_DEADLOCK and self._detect_deadlock():
    raise ValueError(
        "This DAG cannot be compiled because it will deadlock on NCCL "
        "calls. If you believe this is a false positive, please disable "
        "the graph verification by setting the environment variable "
        "RAY_ADAG_ENABLE_DETECT_DEADLOCK to 0 and file an issue at "
        "https://github.com/ray-project/ray/issues/new/."
    )
That is,
from ray.dag.constants import RAY_ADAG_ENABLE_DETECT_DEADLOCK

# detect whether using NCCL to pass tensors between DAG nodes on the same actor.

if RAY_ADAG_ENABLE_DETECT_DEADLOCK and self._detect_deadlock():
    raise ValueError(
        "This DAG cannot be compiled because it will deadlock on NCCL "
        "calls. If you believe this is a false positive, please disable "
        "the graph verification by setting the environment variable "
        "RAY_ADAG_ENABLE_DETECT_DEADLOCK to 0 and file an issue at "
        "https://github.com/ray-project/ray/issues/new/."
    )
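A minimal sketch of what such a separate check might look like, so that it runs regardless of RAY_ADAG_ENABLE_DETECT_DEADLOCK. Everything here is hypothetical and independent of Ray: the `Edge` dataclass and `validate_nccl_edges` are illustration-only names, not part of `compiled_dag_node.py`, and the sketch assumes the compiler can enumerate DAG edges together with the actor on each side and the transport requested by the type hint.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Edge:
    """Hypothetical view of one DAG edge: upstream task -> downstream task."""
    upstream_actor: str
    upstream_method: str
    downstream_actor: str
    downstream_method: str
    transport: Optional[str] = None  # "nccl" when the NCCL type hint is set


def validate_nccl_edges(edges: List[Edge]) -> None:
    """Raise an actionable error for the first NCCL edge within one actor."""
    for e in edges:
        if e.transport == "nccl" and e.upstream_actor == e.downstream_actor:
            raise ValueError(
                "NCCL channels cannot transfer data between the task "
                f"`{e.upstream_method}` and its downstream method "
                f"`{e.downstream_method}` on the same actor "
                f"{e.upstream_actor}. Please remove "
                '`TorchTensorType(transport="nccl")` between '
                "DAG nodes on the same actor."
            )
```

Because the check only inspects edges, it does not need the deadlock-detection algorithm and could run before `_detect_deadlock` at compile time.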
Interesting thought. We want to have compile-time checks rather than runtime checks though.
That's the standard place where we do input validations so I wouldn't worry about it.
Yep, that's why I prefer to detect it in deadlock detection. I just thought of it as a workaround. I prefer to make it as lightweight as possible.
It's ok for me. It's just my personal preference to unify the validation logic so that we can easily manage it. I will start reviewing another part. Would you mind opening an issue to track the progress of the follow-up?
I don't have special preference if we should do this in preprocess vs get_or_compile.
Besides, can you
- simplify tests
- there's one more nit comment
Why are these changes needed?
When an NCCL type hint is used between methods of the same actor, a misleading error message is raised:
We should raise an error message that carries the proper information and is actionable.
Note that when the user specifies an NCCL type hint between methods of the same actor, we don't want to implicitly switch to IntraProcessChannel underneath, which would make the behavior of aDAG ambiguous. Instead, we should raise a clear error so that the user can easily fix their aDAG program by removing the type hint.
Related issue number
Closes #47235