Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[serve] Non-atomic shutdown logic may lead to unexpected behavior #36325

Closed
edoakes opened this issue Jun 12, 2023 · 6 comments · Fixed by #36927
Closed

[serve] Non-atomic shutdown logic may lead to unexpected behavior #36325

edoakes opened this issue Jun 12, 2023 · 6 comments · Fixed by #36927
Assignees
Labels
bug Something that is supposed to be working; but isn't P0 Issues that should be fixed in short order release-blocker P0 Issue that blocks the release serve Ray Serve Related Issue

Comments

@edoakes
Copy link
Contributor

edoakes commented Jun 12, 2023

The shutdown logic currently happens in three steps from the client:

  1. we call shutdown.remote() on the controller, which signals that the proxies and deployment should be shut down
  2. we wait for all deployments to be shut down
  3. we ray.kill the controller

However, because these three operations all run from the client, the user could interrupt them and cause the controller to be shut down but not killed. This can lead to an unexpected state, such as the controller silently ignoring updates or no longer health checking HTTP proxies.

We should make the shutdown logic atomic from the perspective of the controller: this should set a shutting_down flag that causes the normal update loop to shut down the components and then have the controller exit itself using exit_actor. We should also reject any state updates once the controller is shutting down.

@edoakes edoakes added bug Something that is supposed to be working; but isn't P1 Issue that should be fixed within a few weeks serve Ray Serve Related Issue labels Jun 12, 2023
@edoakes
Copy link
Contributor Author

edoakes commented Jun 12, 2023

Next step: @GeneDer to reproduce the "phantom HTTP proxy" issue we saw by manually triggering this issue.

@edoakes
Copy link
Contributor Author

edoakes commented Jun 12, 2023

Let's also please improve the logging around shutdown when we fix this.

@GeneDer
Copy link
Contributor

GeneDer commented Jun 12, 2023

This is confirmed with the following step to reproduce:

  1. add a long sleep of 10s https://github.com/ray-project/ray/blob/ray-2.5.0/python/ray/serve/_private/client.py#L110
  2. start ray locally with ray start --head
  3. run ray with config locally serve run config.yaml
  4. checked both http://localhost:8000/ and http://localhost:8265/#/serve looks good at this point
  5. control + c twice to force kill
  6. http://localhost:8000/ is not reachable at this point
  7. http://localhost:8265/#/serve continue to show proxy healthy and the deployment is running
  8. http://localhost:8265/#/actors is showing HTTPProxyActor has a state of DEAD

@edoakes edoakes added P0 Issues that should be fixed in short order and removed P1 Issue that should be fixed within a few weeks labels Jul 7, 2023
@edoakes
Copy link
Contributor Author

edoakes commented Jul 7, 2023

This has been happening frequently on anyscale workspaces (reported by customers and internal folks), so bumping it to a P0 release blocker.

@edoakes edoakes added the release-blocker P0 Issue that blocks the release label Jul 7, 2023
@zhe-thoughts
Copy link
Collaborator

Small point: let's close release-blocking issues only after the cherry-pick PR is merged into the release branch

@zhe-thoughts zhe-thoughts reopened this Jul 7, 2023
@GeneDer
Copy link
Contributor

GeneDer commented Jul 10, 2023

Cherry-pick PR is merged #37211

@GeneDer GeneDer closed this as completed Jul 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something that is supposed to be working; but isn't P0 Issues that should be fixed in short order release-blocker P0 Issue that blocks the release serve Ray Serve Related Issue
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants