[serve] Non-atomic shutdown logic may lead to unexpected behavior #36325
Comments
Next step: @GeneDer to reproduce the "phantom HTTP proxy" issue we saw by manually triggering this issue.
Let's also please improve the logging around shutdown when we fix this.
This is confirmed with the following step to reproduce:
This has been happening frequently on Anyscale workspaces (reported by customers and internal folks), so bumping it to a P0 release blocker.
Small point: let's close release-blocking issues only after the cherry-pick PR is merged into the release branch.
The cherry-pick PR is merged: #37211
The shutdown logic currently happens in three steps from the client:
- `shutdown.remote()` on the controller, which signals that the proxies and deployments should be shut down
- `ray.kill` the controller

However, because these three operations all run from the client, the user could interrupt them and cause the controller to be shut down but not killed. This can lead to an unexpected state, such as the controller silently ignoring updates or no longer health checking HTTP proxies.
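The failure mode can be illustrated with a small, Ray-free simulation. This is only a sketch: `ToyController` and `client_shutdown` are hypothetical stand-ins invented for illustration, not Serve's actual classes, and the two methods merely mimic the `shutdown.remote()` and `ray.kill` calls named above.

```python
class ToyController:
    """Hypothetical stand-in for the Serve controller (not Ray's real class)."""

    def __init__(self):
        self.components_running = True  # proxies and deployments
        self.alive = True               # the controller itself

    def shutdown(self):
        # Analogous to shutdown.remote(): tear down proxies and deployments.
        self.components_running = False

    def kill(self):
        # Analogous to ray.kill on the controller.
        self.alive = False


def client_shutdown(controller, interrupt_before_kill=False):
    """Client-driven shutdown made of separate calls, so it is not atomic."""
    controller.shutdown()
    if interrupt_before_kill:
        # Simulates the user interrupting the client between the calls.
        raise KeyboardInterrupt("interrupted between shutdown and kill")
    controller.kill()


controller = ToyController()
try:
    client_shutdown(controller, interrupt_before_kill=True)
except KeyboardInterrupt:
    pass

# The controller is now shut down but not killed.
print(controller.alive, controller.components_running)
```

After the simulated interrupt, the toy controller is still alive but has no running components, which corresponds to the "shut down but not killed" state described above.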
We should make the `shutdown` logic atomic from the perspective of the controller: this should set a `shutting_down` flag that causes the normal update loop to shut down the components and then have the controller exit itself using `exit_actor`. We should also reject any state updates once the controller is shutting down.
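One possible shape for this proposal, sketched without Ray so it runs standalone. Everything here is an assumption made for illustration: `ToyController`, `ShuttingDownError`, `apply_update`, and `run_control_loop_step` are hypothetical names, and the `exited` attribute merely stands in for the controller calling `exit_actor` on itself.

```python
class ShuttingDownError(RuntimeError):
    """Hypothetical error for state updates that arrive during shutdown."""


class ToyController:
    """Hypothetical controller whose shutdown is driven by its own loop."""

    def __init__(self):
        self.shutting_down = False
        self.components = {"proxy", "deployment"}
        self.config = {}
        self.exited = False

    def shutdown(self):
        # Single client-facing call: it only records the intent to shut down.
        self.shutting_down = True

    def apply_update(self, key, value):
        # Reject any state update once shutdown has started.
        if self.shutting_down:
            raise ShuttingDownError("controller is shutting down")
        self.config[key] = value

    def run_control_loop_step(self):
        """One iteration of the normal update loop."""
        if not self.shutting_down:
            return  # normal reconciliation would happen here
        if self.components:
            self.components.pop()  # tear down one component per iteration
        else:
            # Stand-in for the controller exiting itself via exit_actor.
            self.exited = True


controller = ToyController()
controller.apply_update("replicas", 2)
controller.shutdown()

try:
    controller.apply_update("replicas", 3)
    rejected = False
except ShuttingDownError:
    rejected = True

# The controller finishes the shutdown itself; the client is no longer involved.
while not controller.exited:
    controller.run_control_loop_step()

print(rejected, controller.config, controller.components)
```

Because the client makes only one call, an interrupt on the client side after `shutdown()` returns cannot leave the controller half shut down: the remaining teardown and the exit are owned by the controller's own loop.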