[Core][aDAG] support multi readers in multi node when dag is created from an actor #47601

rkooo567 · 2024-09-11T06:24:05Z

Why are these changes needed?

Currently, when a DAG is created from an actor, we use a different mechanism than when it is created from a driver. In a driver we create a ProxyActor, whereas in an actor we just use the actor itself.

This inconsistency is error-prone. For example, I found that when we support multiple readers across multiple nodes, we hit a deadlock: the driver actor needs to call ray.get(a.allocate_channel.remote()) for a downstream actor, while the downstream actor calls ray.get(driver_actor.create_ref.remote()).

This PR fixes the issue by making ProxyActor the default mechanism, even when a DAG is created inside an actor.
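The circular wait described above can be reproduced without Ray. The following is a toy simulation, not the aDAG implementation: `ToyActor`, `compile_from_actor`, and the timeouts are hypothetical names and values chosen for illustration, and each "actor" is modeled as a single-worker thread pool so that it can run only one task at a time. In the real system there is no timeout, so the circular wait would hang instead of raising.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


class ToyActor:
    """Hypothetical stand-in for a single-threaded actor: one task at a time."""

    def __init__(self) -> None:
        self._mailbox = ThreadPoolExecutor(max_workers=1)

    def submit(self, fn):
        return self._mailbox.submit(fn)

    def stop(self) -> None:
        self._mailbox.shutdown(wait=False)


def compile_from_actor(use_proxy: bool, timeout_s: float = 0.5) -> str:
    driver_actor, downstream, proxy = ToyActor(), ToyActor(), ToyActor()
    # Assumption for this sketch: whoever "owns" create_ref is either the
    # driver actor itself (old behavior) or a separate proxy (new behavior).
    ref_owner = proxy if use_proxy else driver_actor

    def create_ref() -> str:
        return "ref"

    def allocate_channel() -> str:
        # The downstream actor blocks until the ref owner answers.
        return ref_owner.submit(create_ref).result(timeout=timeout_s)

    def compile_dag() -> str:
        # The driver actor blocks until the downstream actor answers.
        return downstream.submit(allocate_channel).result(timeout=2 * timeout_s)

    try:
        driver_actor.submit(compile_dag).result(timeout=4 * timeout_s)
        return "ok"
    except FutureTimeout:
        # The timeout stands in for what would otherwise be a hang.
        return "deadlock"
    finally:
        for actor in (driver_actor, downstream, proxy):
            actor.stop()
```

With `use_proxy=False` the driver actor is busy inside `compile_dag` and can never serve `create_ref`, so the call times out; with `use_proxy=True` the request is served by a separate worker and compilation completes.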

Related issue number

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

kevin85421

Overall, it looks good to me. My only concern is how much overhead will be introduced with the additional proxy actor? The overhead may be significant, especially since the main use case for creating a DAG from an actor is Ray Serve, which is sensitive to latency.

python/ray/experimental/channel/shared_memory_channel.py

python/ray/dag/compiled_dag_node.py

kevin85421

LGTM. My only concern is how much overhead will be introduced with the additional proxy actor when the driver is an actor?

ruisearch42 · 2024-09-13T23:21:08Z

python/ray/dag/compiled_dag_node.py

@@ -66,27 +66,32 @@ def do_allocate_channel(
    self,
    reader_and_node_list: List[Tuple["ray.actor.ActorHandle", str]],
    typ: ChannelOutputType,
+    read_by_adag_driver: bool,


Is this equivalent to is_adag_output_channel? Should we use that name, which seems easier to follow?

Can we give it a default value of False, since most cases would be False?

I think it is better not to have a default, as it is an important flag that can cause hangs if set incorrectly.

Regarding the name, I thought is_adag_output_channel exposes more implementation detail than this one (that read by driver == output), and it was the original name before I changed it. I don't have a strong preference. Let me know if you want me to change it.

The problem with adag_driver is that it's not a well-defined concept, so you need to explain it everywhere, and it can be confused with the Ray driver. I don't have a better idea right now, so I will leave it to you to decide.

Hmm, IMO it is reasonable wording, so unless there's strong pushback, I will probably just keep it.
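The argument for leaving the flag without a default can be illustrated with a small sketch. The function below is hypothetical (it is not the signature from the diff, only loosely modeled on it): a required keyword-only flag makes every call site state its intent, and omitting it fails immediately with a TypeError instead of silently picking a value that might hang later.

```python
from typing import List


def allocate_channel_sketch(
    readers: List[str],
    *,
    read_by_adag_driver: bool,  # deliberately no default: callers must choose
) -> dict:
    """Hypothetical helper that only records how it was called."""
    return {"readers": readers, "read_by_adag_driver": read_by_adag_driver}


def call_without_flag() -> str:
    try:
        allocate_channel_sketch(["actor_a"])  # flag omitted on purpose
    except TypeError:
        return "rejected"
    return "accepted"
```

Calling `allocate_channel_sketch(["actor_a"], read_by_adag_driver=False)` works as expected, while `call_without_flag()` shows the omission being caught at the call site.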

ruisearch42 · 2024-09-13T23:23:22Z

python/ray/dag/compiled_dag_node.py

-                        "Compiled DAGs currently require the InputNode() to be the "
-                        "driver process or an actor method. Ray task is not supported."
-                    )
+        def _get_proxy_actor() -> "ray.actor.ActorHandle":


nit: _create_proxy_actor

python/ray/dag/compiled_dag_node.py

ruisearch42 · 2024-09-13T23:28:05Z

python/ray/dag/class_node.py

@@ -246,6 +246,9 @@ def _execute_impl(self, *args, **kwargs):
    def __str__(self) -> str:
        return get_dag_node_str(self, f"{self._method_name}()")

+    def __repr__(self) -> str:


Question: why is this needed?
(__repr__ is not added in other DAGNode classes)

This was to make debugging easier when the node is nested inside a list (in that case, __repr__ is used).
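The behavior behind this answer is standard Python: formatting a list calls `repr()` on each element, so a class that only defines `__str__` still prints as an opaque object when nested. A minimal sketch, using hypothetical `Node` classes rather than the real DAGNode types:

```python
class Node:
    """Hypothetical node that defines only __str__."""

    def __init__(self, method_name: str) -> None:
        self._method_name = method_name

    def __str__(self) -> str:
        return f"Node({self._method_name}())"


class NodeWithRepr(Node):
    """Same node, but __repr__ delegates to __str__."""

    def __repr__(self) -> str:
        return self.__str__()


plain = str([Node("fwd")])            # falls back to the default object repr
readable = str([NodeWithRepr("fwd")])  # uses the custom __repr__
```

Here `readable` is a human-friendly string, while `plain` contains the default `<... object at 0x...>` form, which is what makes nested nodes hard to read in debug output.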

rkooo567 · 2024-09-13T23:31:18Z

> LGTM. My only concern is how much overhead will be introduced with the additional proxy actor when the driver is an actor?

I believe this mechanism is not supposed to introduce any additional delay (it is how driver -> actor works). What's the potential overhead coming from?

ruisearch42

Otherwise LGTM

kevin85421 · 2024-09-14T00:40:31Z

> I believe this mechanism is not supposed to introduce any additional delay (it is how driver -> actor works). What's the potential overhead coming from?

Without this PR: actor 1 -> actor 2 -> actor 1
With this PR: actor 1 -> actor 2 -> proxy actor -> actor 1

It needs to go through an additional proxy actor in this case.

rkooo567 · 2024-09-14T01:22:20Z

@kevin85421 Does it affect the runtime performance? IIUC, we are not going through proxy actor when we read, so it should be okay.

kevin85421 · 2024-09-14T02:13:06Z

> IIUC, we are not going through proxy actor when we read, so it should be okay.

Oh, I missed that part. Thanks!

Ubuntu added 4 commits September 11, 2024 06:23

ip

0451573

done

f38f88b

fixed

5a281bc

fixed test failures.

941f902

rkooo567 assigned kevin85421 and ruisearch42 Sep 11, 2024

kevin85421 reviewed Sep 11, 2024

python/ray/experimental/channel/shared_memory_channel.py

python/ray/experimental/channel/shared_memory_channel.py (outdated)

python/ray/dag/compiled_dag_node.py (outdated)

done

820185d

kevin85421 approved these changes Sep 13, 2024

ruisearch42 reviewed Sep 13, 2024

ruisearch42 approved these changes Sep 13, 2024

rkooo567 added 2 commits September 13, 2024 22:44

Merge branch 'master' into ed-test-

9c376a1

addressed code review

8574e82

rkooo567 enabled auto-merge (squash) September 14, 2024 08:11

github-actions bot added the "go" label (add ONLY when ready to merge, run all tests) Sep 14, 2024

rkooo567 merged commit 4b2f6a0 into ray-project:master Sep 14, 2024
6 of 7 checks passed
