[Core][aDAG] support multi readers in multi node when dag is created from an actor #47601
Overall, it looks good to me. My only concern is how much overhead will be introduced with the additional proxy actor? The overhead may be significant, especially since the main use case for creating a DAG from an actor is Ray Serve, which is sensitive to latency.
LGTM. My only concern is how much overhead will be introduced with the additional proxy actor when the driver is an actor?
@@ -66,27 +66,32 @@ def do_allocate_channel(
    self,
    reader_and_node_list: List[Tuple["ray.actor.ActorHandle", str]],
    typ: ChannelOutputType,
    read_by_adag_driver: bool,
Is this equivalent to is_adag_output_channel? Should we use that name, which seems easier to follow?
Can we have a default value of False, as most of the cases would be false?
I think it is better not to have a default, as it is an important flag that can cause hangs if set incorrectly.
Regarding the name, I thought is_adag_output_channel exposes more implementation detail than this (that read by driver == output), and that was the original name before I changed it to this one. I don't have a strong preference. Let me know if you want me to change it.
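The argument for leaving out a default can be illustrated with a small sketch. This is not the signature from the PR: allocate_channel below is a hypothetical stand-in, and making the flag keyword-only is an extra assumption added here so that the call site has to spell out the value.

```python
from typing import List


def allocate_channel(readers: List[str], *, read_by_adag_driver: bool) -> str:
    """Hypothetical stand-in; the flag is required and keyword-only."""
    reader_kind = "driver" if read_by_adag_driver else "actor"
    return f"channel for {len(readers)} {reader_kind} reader(s)"


# The caller must state the flag explicitly.
ok = allocate_channel(["a", "b"], read_by_adag_driver=False)

# Forgetting the flag fails immediately instead of silently picking a default.
try:
    allocate_channel(["a", "b"])
    missing_flag_error = None
except TypeError as exc:
    missing_flag_error = exc
```

With a default of False, the second call would succeed and a wrong value would only surface later, possibly as a hang.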
The problem with adag_driver is that it's not a well-defined concept, so you need to explain it everywhere, and it can be confused with the Ray driver. I don't have a good idea right now, so I will leave it to you to decide.
Hmm imo it is a reasonable wording, so unless there's strong pushback, I will probably just keep it
python/ray/dag/compiled_dag_node.py
Outdated
            "Compiled DAGs currently require the InputNode() to be the "
            "driver process or an actor method. Ray task is not supported."
        )
def _get_proxy_actor() -> "ray.actor.ActorHandle":
nit: _create_proxy_actor
@@ -246,6 +246,9 @@ def _execute_impl(self, *args, **kwargs):
    def __str__(self) -> str:
        return get_dag_node_str(self, f"{self._method_name}()")

    def __repr__(self) -> str:
Question: why is this needed?
(repr is not added in other DAGNode)
This was to make debugging easier when the node is nested inside a list (in that case, repr is used).
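For context, here is a minimal sketch of the Python behavior being referred to: formatting a list calls repr() on each element, not str(). The two classes are hypothetical stand-ins, not Ray's DAG node classes, and delegating __repr__ to __str__ is an assumption, since the body of the new method is not shown in the hunk.

```python
class NodeWithStrOnly:
    """Hypothetical node that only defines __str__."""

    def __init__(self, method_name: str):
        self._method_name = method_name

    def __str__(self) -> str:
        return f"ClassMethodNode({self._method_name}())"


class NodeWithRepr(NodeWithStrOnly):
    """Same node, but repr() delegates to the readable string (assumed)."""

    def __repr__(self) -> str:
        return self.__str__()


# Lists format their elements with repr(), so __str__ alone is not used here.
plain_text = str([NodeWithStrOnly("fwd")])  # default "<... object at 0x...>" form
repr_text = str([NodeWithRepr("fwd")])      # "[ClassMethodNode(fwd())]"
```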
I believe this mechanism is not supposed to introduce any additional delay (it is how driver -> actor works). What's the potential overhead coming from?
Otherwise LGTM
It needs to go through an additional proxy actor in this case.
@kevin85421 Does it affect the runtime performance? IIUC, we are not going through proxy actor when we read, so it should be okay.
Oh, I missed that part. Thanks!
…from an actor (ray-project#47601) Currently, when a DAG is created from an actor, we are using different mechanism from a driver. In a driver we create a ProxyActor vs actor we are just using the actor itself. This inconsistent mechanism is prone to error. As an example, I found when we support multi reader in multi node, we have deadlock because the driver actor needs to call ray.get(a.allocate_channel.remote()) for a downstream actor while the downstream actor calls ray.get(driver_actor.create_ref.remote()). This fixes the issue by making ProxyActor as the default mechanism even when a dag is created inside an actor. Signed-off-by: ujjawal-khare <ujjawal.khare@dream11.com>
Why are these changes needed?
Currently, when a DAG is created from an actor, we use a different mechanism than when it is created from a driver. In a driver we create a ProxyActor, whereas in an actor we just use the actor itself.
This inconsistent mechanism is error-prone. For example, when supporting multiple readers across multiple nodes, I found a deadlock: the driver actor needs to call ray.get(a.allocate_channel.remote()) for a downstream actor while the downstream actor calls ray.get(driver_actor.create_ref.remote()).
This PR fixes the issue by making ProxyActor the default mechanism even when a DAG is created inside an actor.
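The circular wait described above can be reproduced without Ray. The sketch below is a simulation under stated assumptions, not the code from this PR: each SimActor is a hypothetical single-threaded executor standing in for an actor that handles one call at a time, a blocking .result() stands in for ray.get(), and short timeouts are used so the hang is reported rather than lasting forever.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


class SimActor:
    """Hypothetical stand-in for an actor: one worker thread, one call at a time."""

    def __init__(self) -> None:
        self._pool = ThreadPoolExecutor(max_workers=1)

    def remote(self, fn, *args):
        return self._pool.submit(fn, *args)

    def shutdown(self) -> None:
        self._pool.shutdown(wait=True)


def compile_dag_from_actor(use_proxy: bool, timeout: float = 0.5) -> str:
    driver, downstream, proxy = SimActor(), SimActor(), SimActor()
    # Without a proxy, the downstream actor calls back into the driver actor.
    ref_owner = proxy if use_proxy else driver

    def create_ref() -> str:
        return "ref"

    def allocate_channel() -> str:
        # Runs on the downstream actor and blocks on the ref owner.
        return ref_owner.remote(create_ref).result(timeout=timeout)

    def compile_dag() -> str:
        # Runs on the driver actor's only thread and blocks on downstream.
        return downstream.remote(allocate_channel).result(timeout=timeout)

    try:
        return driver.remote(compile_dag).result(timeout=4 * timeout)
    except FutureTimeout:
        # create_ref was queued behind compile_dag, which was waiting for it.
        return "deadlock"
    finally:
        for actor in (driver, downstream, proxy):
            actor.shutdown()


without_proxy = compile_dag_from_actor(use_proxy=False)
with_proxy = compile_dag_from_actor(use_proxy=True)
```

In this simulation the call without a proxy times out because create_ref is queued on the same busy thread that is waiting for it, while routing create_ref to a separate proxy lets both calls finish.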