[Serve] Recover PENDING_INITIALIZATION status actor #33890
Conversation
for actor in all_current_actors:
    if name in actor["name"]:
        return actor["name"]
return None
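The helper above matches by substring because Serve actor names embed prefixes around the deployment name. A minimal sketch of that behavior; the actor-name format and `all_current_actors` shape here are illustrative assumptions, not the actual output of Ray's state APIs:

```python
# Hypothetical stand-in for the test helper above: find the first actor
# whose full name contains the given substring, else return None.
def get_actor_name(all_current_actors, name):
    for actor in all_current_actors:
        if name in actor["name"]:
            return actor["name"]
    return None

actors = [
    {"name": "SERVE_REPLICA::deployment_a#xqHW"},  # illustrative name format
    {"name": "SERVE_CONTROLLER_ACTOR"},
]
print(get_actor_name(actors, "deployment_a"))  # SERVE_REPLICA::deployment_a#xqHW
print(get_actor_name(actors, "missing"))       # None
```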
class V1:
    def __init__(self):
        signal2.send.remote()
        ray.get(signal.wait.remote())
Suggested change:
-        ray.get(signal.wait.remote())
+        await signal.wait.remote()
I think it makes no difference for this test case.
Is the controller able to call the replica during recovery while the replica is stuck on `ray.get()`, since it's a blocking call? If the test wants to exercise the buggy code path, it needs to be able to call into the replica actor while the replica is still waiting on `signal.wait.remote()`.
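The reviewer's concern can be illustrated with plain asyncio, no Ray required: a blocking call inside a coroutine stalls the whole event loop, so no other handler (such as the controller's metadata call) can run until it returns, whereas an awaited wait yields control. This is a hedged analogue with made-up names, not the actual test:

```python
import asyncio
import time

# Plain-asyncio analogue of the review point: a blocking wait inside a
# coroutine (like ray.get) stalls the event loop; `await` yields instead.

async def handler():
    # Stands in for a call into the actor, e.g. get_metadata().
    return "served"

async def blocking_init(results):
    time.sleep(0.2)          # like ray.get(): blocks the whole event loop
    results.append("init-done")

async def awaiting_init(event, results):
    await event.wait()       # like `await signal.wait.remote()`: yields
    results.append("init-done")

async def main():
    # Case 1: blocking init runs to completion before the handler can run.
    results = []
    init = asyncio.ensure_future(blocking_init(results))
    await asyncio.sleep(0)   # hand control to the init task
    results.append(await handler())
    await init
    blocked_order = results  # ["init-done", "served"]

    # Case 2: awaiting init suspends, so the handler runs first.
    results = []
    event = asyncio.Event()
    init = asyncio.ensure_future(awaiting_init(event, results))
    await asyncio.sleep(0)
    results.append(await handler())
    event.set()
    await init
    async_order = results    # ["served", "init-done"]
    return blocked_order, async_order

blocked, nonblocked = asyncio.run(main())
print(blocked, nonblocked)
```

In the blocking case the "init" work finishes before the handler ever gets a turn; in the awaiting case the handler is served while initialization is still pending, which is the property the test needs.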
python/ray/serve/_private/replica.py
Outdated
# under initialization. Relying on DeploymentStateManager to
# timeout the replica.
while self.replica is None:
    await asyncio.sleep(0.2)
Should this be a constant?
updated.
@@ -497,7 +497,10 @@ def check_ready(self) -> Tuple[ReplicaStartupStatus, Optional[DeploymentVersion]
                 self._allocated_obj_ref
             )
         except Exception:
-            logger.exception(f"Exception in deployment '{self._deployment_name}'")
+            logger.exception(
+                f"Exception in replica '{self._replica_tag}',"
Suggested change:
-                f"Exception in replica '{self._replica_tag}',"
+                f"Exception in replica '{self._replica_tag}', "
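The suggested trailing space matters because adjacent string literals in Python are concatenated with no separator inserted between them. A small demonstration (the second literal's text is made up for illustration):

```python
tag = "xqHW"

# Adjacent (f-)string literals concatenate with no separator:
without_space = f"Exception in replica '{tag}'," f"possible cause follows."
with_space = f"Exception in replica '{tag}', " f"possible cause follows."

print(without_space)  # Exception in replica 'xqHW',possible cause follows.
print(with_space)     # Exception in replica 'xqHW', possible cause follows.
```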
python/ray/serve/_private/replica.py
Outdated
# When the self.replica variable is None, potentially replica is still
# under initialization. Relying on DeploymentStateManager to
# timeout the replica.
I'm a little lost on why this change is needed. Is it because the constructor might still be running while we are recovering?
If so, can we use an explicit `asyncio.Event` instead of implementing our own sleeping logic here?
e.g., `self._initialized = asyncio.Event()`
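A minimal sketch of the reviewer's suggestion, assuming an event set at the end of initialization; `ReplicaActor`, `initialize`, and `get_metadata` are illustrative stand-ins, not the real `ray.serve` replica implementation:

```python
import asyncio

# Hypothetical replica wrapper: callers wait on an asyncio.Event instead of
# polling `while self.replica is None: await asyncio.sleep(0.2)`.
class ReplicaActor:
    def __init__(self):
        self.replica = None
        self._initialized = asyncio.Event()

    async def initialize(self, metadata):
        # Simulates the (possibly slow) user constructor finishing.
        await asyncio.sleep(0.01)
        self.replica = metadata
        self._initialized.set()  # wake up all waiters at once

    async def get_metadata(self):
        # Suspends until initialization signals completion; no polling loop.
        await self._initialized.wait()
        return self.replica

async def main():
    actor = ReplicaActor()
    init = asyncio.ensure_future(actor.initialize({"version": "v1"}))
    # Issued while the "constructor" is still running:
    metadata = await actor.get_metadata()
    await init
    return metadata

print(asyncio.run(main()))  # {'version': 'v1'}
```

Compared with the sleep loop, the event wakes waiters immediately when initialization completes instead of up to 0.2 s later, and it makes the "not yet initialized" state explicit.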
Yes — the constructor might still be running while we are recovering. Ah, I forgot about this asyncio primitive! Updated.
Nice work! Could you give some more detail on why this error appears when the controller is dead and a replica is initializing? I don't understand why that causes `self.replica` to be `None`.
Great work on the changes!
@sihanwang41 a Serve AIR integration test is failing, maybe related?
^^ @zhe-thoughts observed this bug from the release test; here is the fix.
master is unfrozen, so no need for Zhe's approval right now. Ready to merge.
When a replica is in the initializing state and the controller dies, the user will see:

```
(ServeController pid=458318) ERROR 2023-03-29 08:44:42,547 controller 458318 deployment_state.py:500 - Exception in deployment 'xqHWInctmH'
(ServeController pid=458318) Traceback (most recent call last):
(ServeController pid=458318)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/serve/_private/deployment_state.py", line 489, in check_ready
(ServeController pid=458318)     deployment_config, version = ray.get(self._ready_obj_ref)
(ServeController pid=458318)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
(ServeController pid=458318)     return func(*args, **kwargs)
(ServeController pid=458318)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/worker.py", line 2508, in get
(ServeController pid=458318)     raise value.as_instanceof_cause()
(ServeController pid=458318) ray.exceptions.RayTaskError(AttributeError): ray::ServeReplica:xqHWInctmH.get_metadata() (pid=458148, ip=172.31.126.222, repr=<ray.serve._private.replica.ServeReplica:xqHWInctmH object at 0x7f57ae0f0110>)
(ServeController pid=458318)   File "/home/ray/anaconda3/lib/python3.7/concurrent/futures/_base.py", line 428, in result
(ServeController pid=458318)     return self.__get_result()
(ServeController pid=458318)   File "/home/ray/anaconda3/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
(ServeController pid=458318)     raise self._exception
(ServeController pid=458318)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/serve/_private/replica.py", line 249, in get_metadata
(ServeController pid=458318)     return self.replica.deployment_config, self.replica.version
(ServeController pid=458318) AttributeError: 'NoneType' object has no attribute 'deployment_config'
```

This fixes the NoneType bug on recovery: no matter what state the actor is in, the user will no longer see this traceback. Instead:
1. The user will see the slow-startup warning, then the replica will be terminated and a new replica provisioned.
2. If the actor succeeds while in `PENDING_INITIALIZATION`, no error pops out.

This was observed in long_running_serve_failure: https://console.anyscale-staging.com/o/anyscale-internal/projects/prj_qC3ZfndQWYYjx2cz8KWGNUL4/clusters/ses_u7xeve33e2djg9grgr9qcs9l4x?command-history-section=command_history

**Note**: If the user's constructor initialization just needs a very long time to finish, it is recommended to increase `SERVE_SLOW_STARTUP_WARNING_S` (if the user doesn't change it, the deployment manager will terminate the old replica and start a new replica after the timeout).

Co-authored-by: shrekris-anyscale <92341594+shrekris-anyscale@users.noreply.github.com>
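The `AttributeError` in the traceback comes down to dereferencing `self.replica` before the constructor has assigned it. A minimal, self-contained reproduction; the class and attribute names are illustrative stand-ins for the Serve internals:

```python
# Hypothetical miniature of the recovery bug: get_metadata() runs while
# self.replica has not yet been set by initialization, so attribute access
# on None raises AttributeError.
class ReplicaWrapper:
    def __init__(self):
        self.replica = None  # assigned later, once user __init__ finishes

    def get_metadata(self):
        return self.replica.deployment_config, self.replica.version

w = ReplicaWrapper()
try:
    w.get_metadata()
except AttributeError as e:
    print(e)  # 'NoneType' object has no attribute 'deployment_config'
```

The fix makes callers wait for initialization to complete (or time out via the deployment state manager) instead of touching the half-constructed state.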
Why are these changes needed?
When a replica is in the initializing state and the controller is dead, the user sees an `AttributeError: 'NoneType' object has no attribute 'deployment_config'` traceback from the controller. This change fixes the bug so that no traceback is shown regardless of the actor's state: either the slow-startup warning appears and the replica is replaced, or the actor succeeds while in `PENDING_INITIALIZATION` and no error pops out.

This was observed in long_running_serve_failure: https://console.anyscale-staging.com/o/anyscale-internal/projects/prj_qC3ZfndQWYYjx2cz8KWGNUL4/clusters/ses_u7xeve33e2djg9grgr9qcs9l4x?command-history-section=command_history

Note: If user constructor initialization just needs a very long time to finish, it is recommended to increase `SERVE_SLOW_STARTUP_WARNING_S` (if the user doesn't change it, the deployment manager will terminate the old replica and start a new replica after the timeout).

Related issue number
Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I've added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.