[Serve] Recover PENDING_INITIALIZATION status actor #33890

Merged: 15 commits merged into ray-project:master from recover_failure on Apr 4, 2023

Conversation

@sihanwang41 (Contributor) commented Mar 29, 2023:

Why are these changes needed?

When a replica is in the initializing state and the controller dies, the user will see:

```
(ServeController pid=458318) ERROR 2023-03-29 08:44:42,547 controller 458318 deployment_state.py:500 - Exception in deployment 'xqHWInctmH'
(ServeController pid=458318) Traceback (most recent call last):
(ServeController pid=458318)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/serve/_private/deployment_state.py", line 489, in check_ready
(ServeController pid=458318)     deployment_config, version = ray.get(self._ready_obj_ref)
(ServeController pid=458318)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
(ServeController pid=458318)     return func(*args, **kwargs)
(ServeController pid=458318)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/worker.py", line 2508, in get
(ServeController pid=458318)     raise value.as_instanceof_cause()
(ServeController pid=458318) ray.exceptions.RayTaskError(AttributeError): ray::ServeReplica:xqHWInctmH.get_metadata() (pid=458148, ip=172.31.126.222, repr=<ray.serve._private.replica.ServeReplica:xqHWInctmH object at 0x7f57ae0f0110>)
(ServeController pid=458318)   File "/home/ray/anaconda3/lib/python3.7/concurrent/futures/_base.py", line 428, in result
(ServeController pid=458318)     return self.__get_result()
(ServeController pid=458318)   File "/home/ray/anaconda3/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
(ServeController pid=458318)     raise self._exception
(ServeController pid=458318)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/serve/_private/replica.py", line 249, in get_metadata
(ServeController pid=458318)     return self.replica.deployment_config, self.replica.version
(ServeController pid=458318) AttributeError: 'NoneType' object has no attribute 'deployment_config'
```
  • Fix the NoneType bug during recovery: regardless of the actor's state, the user no longer sees this traceback. Instead:
  1. A slow-startup warning is logged, then the replica is terminated and a new one is provisioned.
  2. If the actor finishes successfully while in PENDING_INITIALIZATION, no error is raised (see the sketch below).
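
For context, a minimal sketch of the fixed code path, with hypothetical names (the real implementation lives in `ray/serve/_private/replica.py` and is more involved): instead of dereferencing `self.replica` immediately, `get_metadata` now waits until initialization has populated it, so a recovering controller no longer hits the `AttributeError`.

```python
import asyncio


class ReplicaActorSketch:
    """Hypothetical, simplified stand-in for Serve's replica actor wrapper."""

    def __init__(self):
        # Set only after the user's deployment constructor finishes.
        self.replica = None

    async def initialize(self, deployment_def):
        # Runs the (potentially slow) user constructor.
        self.replica = deployment_def()

    async def get_metadata(self):
        # Before the fix this returned self.replica.deployment_config
        # unconditionally, raising AttributeError during recovery if the
        # constructor was still running. Now it waits; a constructor that
        # never finishes is handled by the controller's slow-startup
        # timeout rather than here.
        while self.replica is None:
            await asyncio.sleep(0.2)
        return self.replica.deployment_config, self.replica.version
```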

This was observed in the long_running_serve_failure release test: https://console.anyscale-staging.com/o/anyscale-internal/projects/prj_qC3ZfndQWYYjx2cz8KWGNUL4/clusters/ses_u7xeve33e2djg9grgr9qcs9l4x?command-history-section=command_history

Note: If the user's constructor legitimately needs a long time to finish, it is recommended to increase `SERVE_SLOW_STARTUP_WARNING_S` (if it is left unchanged, the deployment manager will terminate the old replica and start a new one after the timeout).
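
As a hedged example (assuming `SERVE_SLOW_STARTUP_WARNING_S` is read from the environment when `ray.serve` is imported, as in `ray/serve/_private/constants.py`; verify against your Ray version), the threshold could be raised like this:

```python
import os

# Must be set before `ray.serve` is imported, because the constant is
# read from the environment at import time. Value is in seconds.
os.environ["SERVE_SLOW_STARTUP_WARNING_S"] = "300"

from ray import serve  # noqa: E402
```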

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@sihanwang41 changed the title from "[Serve] Controller recover initializing replica" to "[Serve] Raise proper exception when replica recover from initializing" on Mar 29, 2023
@sihanwang41 force-pushed the recover_failure branch 3 times, most recently from 0c7e38b to 7034d32 on March 30, 2023 03:24
…ecover bug

Signed-off-by: Sihan Wang <sihanwang41@gmail.com>
@sihanwang41 changed the title from "[Serve] Raise proper exception when replica recover from initializing" to "[Serve] Recover PENDING_INITIALIZATION status actor" on Mar 30, 2023
Signed-off-by: Sihan Wang <sihanwang41@gmail.com>
Signed-off-by: Sihan Wang <sihanwang41@gmail.com>
Signed-off-by: Sihan Wang <sihanwang41@gmail.com>
Signed-off-by: Sihan Wang <sihanwang41@gmail.com>
```python
for actor in all_current_actors:
    if name in actor["name"]:
        return actor["name"]
return None
```
Contributor left a suggested change on:

```python
return None
```

```python
class V1:
    def __init__(self):
        signal2.send.remote()
        ray.get(signal.wait.remote())
```
Contributor suggested a change:

```diff
- ray.get(signal.wait.remote())
+ await signal.wait.remote()
```

Contributor Author: I think it makes no difference for this test case.

@zcin (Contributor) commented Apr 3, 2023:

Is the controller able to call the replica during recovery while the replica is stuck on `ray.get()`, since it's a blocking call? If the test wants to exercise the buggy code path, it needs to be able to call into the replica actor while the replica is waiting on `signal.wait.remote()`.
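
For background on zcin's point, here is a hedged sketch of the event-loop behavior in question (hypothetical actor and signal names, with the signal standing in for Ray's test-utility SignalActor): a blocking `ray.get()` inside an async actor's method holds that actor's event loop, so no other call on the actor can be served until it returns, whereas `await` yields control back.

```python
import ray


@ray.remote
class AsyncActorSketch:
    async def wait_blocking(self, signal):
        # Blocks this actor's event loop: while this runs, concurrent
        # calls such as ping() cannot be scheduled on the actor.
        ray.get(signal.wait.remote())

    async def wait_async(self, signal):
        # Yields at the await point, so concurrent calls like ping()
        # are still served while waiting.
        await signal.wait.remote()

    async def ping(self):
        return "pong"
```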

python/ray/serve/tests/test_controller_recovery.py (outdated thread, resolved)
```python
# under initialization. Relying on DeploymentStateManager to
# timeout the replica.
while self.replica is None:
    await asyncio.sleep(0.2)
```
Contributor: Should this be a constant?

Contributor Author: Updated.

```diff
@@ -497,7 +497,10 @@ def check_ready(self) -> Tuple[ReplicaStartupStatus, Optional[DeploymentVersion]
             self._allocated_obj_ref
         )
     except Exception:
-        logger.exception(f"Exception in deployment '{self._deployment_name}'")
+        logger.exception(
+            f"Exception in replica '{self._replica_tag}',"
```
Contributor suggested a change:

```diff
- f"Exception in replica '{self._replica_tag}',"
+ f"Exception in replica '{self._replica_tag}', "
```

Comment on lines 251 to 253:

```python
# When the self.replica variable is None, potentially replica is still
# under initialization. Relying on DeploymentStateManager to
# timeout the replica.
```
Contributor: I'm a little lost as to why this change is needed. Is it because the constructor might still be running while we are recovering?

Contributor: If so, can we use an explicit `asyncio.Event` instead of implementing our own sleeping logic here?

Contributor: e.g., `self._initialized = asyncio.Event()`

@sihanwang41 (Contributor Author) commented Apr 3, 2023:

> I'm a little lost as to why this change is needed. Is it because the constructor might still be running while we are recovering?

Yes. Super, I forgot about this asyncio primitive! Updated.
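
For reference, a minimal sketch of the `asyncio.Event` pattern the thread converged on (hypothetical names; the merged code in `ray/serve/_private/replica.py` may differ in detail): the event is set once user initialization finishes, and `get_metadata` awaits it instead of polling.

```python
import asyncio


class ReplicaWrapperSketch:
    def __init__(self):
        self.replica = None
        self._initialized = asyncio.Event()

    async def initialize(self, deployment_def):
        # Potentially slow user __init__.
        self.replica = deployment_def()
        self._initialized.set()

    async def get_metadata(self):
        # Wait for initialization to complete instead of sleeping in a
        # loop; a stuck constructor is still handled by the controller's
        # slow-startup timeout, not here.
        await self._initialized.wait()
        return self.replica.deployment_config, self.replica.version
```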

Signed-off-by: Sihan Wang <sihanwang41@gmail.com>
Signed-off-by: Sihan Wang <sihanwang41@gmail.com>
@shrekris-anyscale (Contributor) left a comment:

Nice work! Could you give some more detail on why this error appears when the controller is dead and a replica is initializing? I don't understand why that causes `self.replica` to be `None`.

Resolved review threads:
  • python/ray/serve/_private/deployment_state.py
  • python/ray/serve/_private/replica.py (outdated)
  • python/ray/serve/tests/test_controller_recovery.py (outdated)
  • python/ray/serve/tests/test_controller_recovery.py (outdated)
  • python/ray/serve/tests/test_controller_recovery.py (outdated)
  • python/ray/serve/_private/replica.py (outdated)
sihanwang41 and others added 7 commits April 3, 2023 12:04
Signed-off-by: Sihan Wang <sihanwang41@gmail.com>
Co-authored-by: shrekris-anyscale <92341594+shrekris-anyscale@users.noreply.github.com>
Signed-off-by: Sihan Wang <sihanwang41@gmail.com>
Co-authored-by: shrekris-anyscale <92341594+shrekris-anyscale@users.noreply.github.com>
Signed-off-by: Sihan Wang <sihanwang41@gmail.com>
Co-authored-by: shrekris-anyscale <92341594+shrekris-anyscale@users.noreply.github.com>
Signed-off-by: Sihan Wang <sihanwang41@gmail.com>
Co-authored-by: shrekris-anyscale <92341594+shrekris-anyscale@users.noreply.github.com>
Signed-off-by: Sihan Wang <sihanwang41@gmail.com>
Signed-off-by: Sihan Wang <sihanwang41@gmail.com>
Signed-off-by: Sihan Wang <sihanwang41@gmail.com>
Signed-off-by: Sihan Wang <sihanwang41@gmail.com>
@shrekris-anyscale (Contributor) left a comment:

Great work on the changes!

@edoakes (Contributor) commented Apr 4, 2023:

@sihanwang41 a Serve AIR integration test is failing, maybe related?

@sihanwang41 (Contributor Author):

@edoakes it is not related; Cindy's PR #34039 also fails for the same reason. It looks like a separate issue.

@sihanwang41 sihanwang41 closed this Apr 4, 2023
@sihanwang41 sihanwang41 reopened this Apr 4, 2023
@sihanwang41 (Contributor Author):

^^ @zhe-thoughts observed this bug in the release test; here is the fix.

@sihanwang41 (Contributor Author):

Master is unfrozen, so we don't need Zhe's approval right now. Ready to merge.

@edoakes edoakes merged commit b9f3b8d into ray-project:master Apr 4, 2023
sihanwang41 added a commit to sihanwang41/ray that referenced this pull request Apr 11, 2023
clarng pushed a commit that referenced this pull request Apr 12, 2023
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023