[aDAG] Raise proper error message for nccl within the same actor #47250

ruisearch42 · 2024-08-21T16:08:17Z

Why are these changes needed?

When an NCCL type hint is used between methods of the same actor, a misleading error message is raised:

(MyActor pid=95377) AssertionError: Channel 70fb3f47-dfcc-47f8-81b7-53a5fcf56dce does not exist in the buffer.

We should raise an error message that is informative and actionable.

Note that when the user specifies an NCCL type hint between methods of the same actor, we don't want to implicitly switch to IntraProcessChannel underneath, which would make the behavior of aDAG ambiguous. Instead, we should raise a clear error so that the user can easily fix their aDAG program by removing the type hint.
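To make the intended behavior concrete, here is a minimal sketch of a compile-time check of this kind. It is not the code in this PR: `Edge`, `check_no_same_actor_nccl`, and the string actor IDs are hypothetical stand-ins for aDAG's internal task and type-hint structures.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Edge:
    """Hypothetical stand-in for one upstream -> downstream DAG edge."""

    upstream_actor: str
    upstream_method: str
    downstream_actor: str
    downstream_method: str
    transport: str  # e.g. "nccl" or "auto"


def check_no_same_actor_nccl(edges: Iterable[Edge]) -> None:
    """Raise an actionable error if an NCCL edge stays within one actor."""
    for e in edges:
        if e.transport == "nccl" and e.upstream_actor == e.downstream_actor:
            raise ValueError(
                f"NCCL transport is requested between `{e.upstream_method}` "
                f"and `{e.downstream_method}` on the same actor "
                f"{e.upstream_actor}. Please remove the NCCL type hint "
                "between DAG nodes on the same actor."
            )
```

The check fails fast instead of silently substituting another channel type, which matches the reasoning above: the user sees exactly which pair of methods to fix.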

Related issue number

Closes #47235

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Signed-off-by: Rui Qiao <ruisearch42@gmail.com>

python/ray/dag/compiled_dag_node.py

python/ray/dag/tests/experimental/test_torch_tensor_dag.py

python/ray/dag/compiled_dag_node.py

kevin85421

This seems to duplicate:

ray/python/ray/dag/compiled_dag_node.py

Lines 1408 to 1415 in c50e3b6

        logger.error(
            "Detected a deadlock caused by using NCCL channels to "
            f"transfer data between the task `{method}` and "
            f"its downstream method `{downstream_method}` on the same "
            f"actor {actor_handle}. Please remove "
            '`TorchTensorType(transport="nccl")` between '
            "DAG nodes on the same actor."
        )

ruisearch42 · 2024-08-21T18:46:42Z

This seems to duplicate:

ray/python/ray/dag/compiled_dag_node.py

Lines 1408 to 1415 in c50e3b6

        logger.error(
            "Detected a deadlock caused by using NCCL channels to "
            f"transfer data between the task `{method}` and "
            f"its downstream method `{downstream_method}` on the same "
            f"actor {actor_handle}. Please remove "
            '`TorchTensorType(transport="nccl")` between '
            "DAG nodes on the same actor."
        )

Good call @kevin85421
I think this is by default enabled, right?

@woshiyyya just to confirm, you got the misleading error message when RAY_ADAG_ENABLE_DETECT_DEADLOCK is manually turned off?

woshiyyya · 2024-08-21T20:31:56Z

@ruisearch42 yes I manually set RAY_ADAG_ENABLE_DETECT_DEADLOCK=0.
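For reference, a flag like this is typically read once from the environment, so it has to be set before the process imports the module that reads it. The sketch below shows that pattern with a hypothetical helper, `read_bool_flag`. The default of "enabled" follows the assumption voiced above, and the exact parsing in `ray.dag.constants` is also an assumption.

```python
import os


def read_bool_flag(name: str, default: bool = True) -> bool:
    """Hypothetical helper: "1" means on, any other set value means off."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw == "1"


# Disable the check for this process, mirroring
# `RAY_ADAG_ENABLE_DETECT_DEADLOCK=0 python my_script.py` in a shell.
os.environ["RAY_ADAG_ENABLE_DETECT_DEADLOCK"] = "0"
detect_deadlock_enabled = read_bool_flag("RAY_ADAG_ENABLE_DETECT_DEADLOCK")
```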

rkooo567 · 2024-08-21T20:35:24Z

@woshiyyya why do you need to manually set this now? do we understand the problem?

woshiyyya · 2024-08-21T20:41:41Z

@rkooo567 yeah I think we still need to disable it.

In the distMM DAG, we still have a "crossing": text1.agg_act -> vision1.bwd and vision1.agg_act -> text1.bwd here. If we don't disable it, the deadlock detection algorithm will raise an error, because it was designed for aDAG as it was before this PR: #46911.

kevin85421 · 2024-08-21T23:06:12Z

I think we just need to update the deadlock detection, and then we can consider using IntraProcessChannel directly when we find two DAG nodes are on the same actor.

ruisearch42 · 2024-08-21T23:09:59Z

I think we just need to update the deadlock detection.

I think we will need a new algorithm? How long would that take? I feel that's not trivial and may take some time, right?

kevin85421 · 2024-08-21T23:17:14Z

How about we make the change as lightweight as possible? For example, we can add an assert in:

ray/python/ray/experimental/channel/torch_tensor_nccl_channel.py

Lines 376 to 380 in 63d6af3

        for tensor in tensors:
            # TODO: If there are multiple readers, can replace with a
            # broadcast.
            for rank in self._reader_ranks:
                self._nccl_group.send(tensor, rank)

instead to make sure the sender / receiver are not the same rank.
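A minimal sketch of what such a runtime guard could look like. `FakeNcclGroup` and `write_tensors` are hypothetical stand-ins so the example runs without GPUs or NCCL. They are not the actual channel implementation, and strings stand in for tensors.

```python
from typing import List, Tuple


class FakeNcclGroup:
    """Hypothetical stand-in that records sends instead of calling NCCL."""

    def __init__(self, self_rank: int) -> None:
        self.self_rank = self_rank
        self.sent: List[Tuple[str, int]] = []

    def send(self, tensor: str, rank: int) -> None:
        self.sent.append((tensor, rank))


def write_tensors(
    group: FakeNcclGroup, tensors: List[str], reader_ranks: List[int]
) -> None:
    """Send each tensor to each reader, rejecting a send to our own rank."""
    for tensor in tensors:
        for rank in reader_ranks:
            # Runtime guard: sender and receiver must be different ranks,
            # otherwise the actor would be waiting on itself.
            assert rank != group.self_rank, (
                f"Sender and receiver are both rank {rank}; remove the NCCL "
                "type hint between DAG nodes on the same actor."
            )
            group.send(tensor, rank)
```

A guard like this only fires once the DAG is already executing, which is the trade-off against a compile-time check.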

kevin85421 · 2024-08-21T23:18:13Z

In addition, it is better not to update preprocess in my opinion. preprocess is a recursive function. Updating it may introduce more complexity than we expect.

kevin85421

Another thought: separate the check from _detect_deadlock, and run it before _detect_deadlock.

ray/python/ray/dag/compiled_dag_node.py

Lines 1067 to 1076 in c50e3b6

        from ray.dag.constants import RAY_ADAG_ENABLE_DETECT_DEADLOCK

        if RAY_ADAG_ENABLE_DETECT_DEADLOCK and self._detect_deadlock():
            raise ValueError(
                "This DAG cannot be compiled because it will deadlock on NCCL "
                "calls. If you believe this is a false positive, please disable "
                "the graph verification by setting the environment variable "
                "RAY_ADAG_ENABLE_DETECT_DEADLOCK to 0 and file an issue at "
                "https://github.com/ray-project/ray/issues/new/."
            )

That is,

        from ray.dag.constants import RAY_ADAG_ENABLE_DETECT_DEADLOCK

        # detect whether using NCCL to pass tensors between DAG nodes on the same actor.

        if RAY_ADAG_ENABLE_DETECT_DEADLOCK and self._detect_deadlock():
            raise ValueError(
                "This DAG cannot be compiled because it will deadlock on NCCL "
                "calls. If you believe this is a false positive, please disable "
                "the graph verification by setting the environment variable "
                "RAY_ADAG_ENABLE_DETECT_DEADLOCK to 0 and file an issue at "
                "https://github.com/ray-project/ray/issues/new/."
            )
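One reading of this suggestion, sketched with a hypothetical, stripped-down class: the same-actor NCCL check runs unconditionally, and only the general deadlock detection stays behind the flag. `CompiledDagSketch` and its boolean inputs are assumptions for illustration and do not mirror the real `CompiledDAG` internals.

```python
class CompiledDagSketch:
    """Hypothetical model of the compile-time validation order."""

    def __init__(
        self,
        has_same_actor_nccl: bool,
        would_deadlock: bool,
        detect_deadlock_enabled: bool = True,
    ) -> None:
        self.has_same_actor_nccl = has_same_actor_nccl
        self.would_deadlock = would_deadlock
        self.detect_deadlock_enabled = detect_deadlock_enabled

    def _check_same_actor_nccl(self) -> None:
        if self.has_same_actor_nccl:
            raise ValueError(
                "NCCL transport is used between DAG nodes on the same actor. "
                "Please remove the NCCL type hint between those nodes."
            )

    def _detect_deadlock(self) -> bool:
        return self.would_deadlock

    def validate(self) -> None:
        # Always runs, even when deadlock detection is disabled.
        self._check_same_actor_nccl()
        # The general deadlock detection stays behind the flag.
        if self.detect_deadlock_enabled and self._detect_deadlock():
            raise ValueError(
                "This DAG cannot be compiled because it will deadlock on "
                "NCCL calls."
            )
```

With this ordering, a user who turns the flag off to work around a false positive still gets the actionable same-actor error.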

ruisearch42 · 2024-08-21T23:39:41Z

How about we make the change as lightweight as possible? For example, we can add an assert in:

ray/python/ray/experimental/channel/torch_tensor_nccl_channel.py

Lines 376 to 380 in 63d6af3

        for tensor in tensors:
            # TODO: If there are multiple readers, can replace with a
            # broadcast.
            for rank in self._reader_ranks:
                self._nccl_group.send(tensor, rank)

instead to make sure the sender / receiver are not the same rank.

Interesting thought. We want to have compile time checks rather than runtime checks though.

ruisearch42 · 2024-08-21T23:40:39Z

In addition, it is better not to update preprocess in my opinion. preprocess is a recursive function. Updating it may introduce more complexity than we expect.

That's the standard place where we do input validations so I wouldn't worry about it.

kevin85421 · 2024-08-21T23:58:39Z

Interesting thought. We want to have compile time checks rather than runtime checks though.

Yep, that's why I prefer to do this check in deadlock detection. I just thought of it as a workaround. I prefer to make it as lightweight as possible.

That's the standard place where we do input validations so I wouldn't worry about it.

It's ok for me. It's just my personal preference to unify the validation logic so that we can easily manage it.

I will start reviewing another part. Would you mind opening an issue to track the progress of the follow-up?

rkooo567

I don't have a strong preference on whether we do this in preprocess or get_or_compile.

Besides, can you:

- simplify the tests
- address the one remaining nit comment

Commit cc885a8: [aDAG] Raise proper error message for nccl within the same actor

ruisearch42 assigned rkooo567, kevin85421 and woshiyyya Aug 21, 2024

rkooo567 reviewed Aug 21, 2024 (python/ray/dag/compiled_dag_node.py, python/ray/dag/tests/experimental/test_torch_tensor_dag.py)

kevin85421 reviewed Aug 21, 2024

Commit f792b6a: up

ruisearch42 added 2 commits August 21, 2024 15:27

Commit bb476c5: up

Commit a4d97ee: up

kevin85421 reviewed Aug 21, 2024

Commit 31f19f7: up

rkooo567 approved these changes Aug 21, 2024 (python/ray/dag/compiled_dag_node.py)

kevin85421 reviewed Aug 22, 2024

ruisearch42 added 2 commits August 21, 2024 17:20

Commit c742765: up

Commit 9107501: up

ruisearch42 added the "go" label (add ONLY when ready to merge, run all tests) Aug 22, 2024

kevin85421 approved these changes Aug 22, 2024

rkooo567 enabled auto-merge (squash) August 22, 2024 00:41

Commit f6ca317: Merge branch 'master' into actor_self_nccl

github-actions bot disabled auto-merge August 22, 2024 05:10

can-anyscale merged commit aa3a0a7 into ray-project:master Aug 22, 2024 (4 of 5 checks passed)

ruisearch42 assigned ruisearch42 and unassigned rkooo567, kevin85421 and woshiyyya Aug 22, 2024
