[Serve] Fix serve non atomic shutdown #36927

Merged (29 commits) on Jul 7, 2023

@GeneDer GeneDer commented Jun 28, 2023

Why are these changes needed?

Currently, we rely on the client to wait for all resources to be cleaned up before it shuts off the controller. If the user interrupts the client process during that wait, the shutdown can be left incomplete. This PR moves the shutdown logic into the event loop, where it is triggered by a _shutting_down flag on the controller. Even if the client process is interrupted, the controller continues shutting down all the resources and then kills itself.
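
A minimal sketch of the idea, for illustration only. This is not the actual Serve controller code: `ToyController`, `graceful_shutdown`, `run_control_loop`, and `_shutdown_step` are hypothetical names, and only the `_shutting_down` flag comes from the description above. The point is that the client merely flips the flag, and the loop owned by the controller finishes the teardown on its own.

```python
import asyncio


class ToyController:
    """Hypothetical stand-in for the controller; not Ray's actual code."""

    def __init__(self, resources):
        # Names of resources that are still alive.
        self.resources = list(resources)
        self._shutting_down = False
        self.killed = False

    def graceful_shutdown(self):
        # The client only flips the flag; it does not have to wait.
        self._shutting_down = True

    async def run_control_loop(self, interval_s=0.01):
        # The loop keeps running until the controller has killed itself.
        while not self.killed:
            if self._shutting_down:
                self._shutdown_step()
            await asyncio.sleep(interval_s)

    def _shutdown_step(self):
        # Tear down one resource per iteration; exit once none remain.
        if self.resources:
            self.resources.pop()
        else:
            self.killed = True


async def demo():
    controller = ToyController(["app", "deployment", "http_proxy"])
    loop_task = asyncio.create_task(controller.run_control_loop())
    controller.graceful_shutdown()
    # Simulate an interrupted client: nothing else is called from here on.
    await loop_task
    return controller


if __name__ == "__main__":
    finished = asyncio.run(demo())
    print(finished.killed, finished.resources)
```

Because the teardown lives in the loop rather than in the caller, interrupting the caller after the flag is set does not leave resources behind in this sketch.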

Related issue number

Closes: #36325

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

GeneDer added 11 commits June 27, 2023 11:49
…up and add tests

Signed-off-by: Gene Su <e870252314@gmail.com>
GeneDer force-pushed the fix-serve-non-atomic-shutdown branch from 8b82976 to 7f88970 on June 30, 2023 17:49
GeneDer added 4 commits June 30, 2023 12:45

GeneDer commented Jun 30, 2023

Tested this locally by adding a long sleep in the client and deploying a test app.

Screenshot 2023-06-30 at 1 21 50 PM Screenshot 2023-06-30 at 1 21 52 PM

Shut off the client by pressing Control+C twice.

Screenshot 2023-06-30 at 1 22 00 PM

Went back to the dashboard and saw that Serve had shut down and all the actors were in the DEAD state.

Screenshot 2023-06-30 at 1 22 15 PM Screenshot 2023-06-30 at 1 22 13 PM

GeneDer changed the title from "[Not ready for review][Serve] Fix serve non atomic shutdown" to "[Serve] Fix serve non atomic shutdown" on Jun 30, 2023

@shrekris-anyscale shrekris-anyscale left a comment

Could you parametrize this test, so it also runs with two SIGINT commands to make sure that the behavior is as expected when the user kills Serve before it shuts down correctly?
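
One way the requested parametrization might look, as a rough stdlib-only sketch. Everything here is hypothetical (`FakeController`, `client_shutdown`, `check_shutdown_completes`); it does not use Ray or the PR's real test, and the interrupt is simulated by raising `KeyboardInterrupt` directly rather than sending a real SIGINT. In a pytest suite the loop at the bottom would typically become a `pytest.mark.parametrize` over the number of interrupts.

```python
import threading
import time


class FakeController:
    """Hypothetical controller that tears itself down on a background thread."""

    def __init__(self, num_resources):
        self.num_resources = num_resources
        self.alive = True
        self._thread = None

    def graceful_shutdown(self):
        # Idempotent: repeated requests do not start a second teardown.
        if self._thread is None:
            self._thread = threading.Thread(target=self._teardown)
            self._thread.start()

    def _teardown(self):
        # Release resources one at a time, then mark the controller dead.
        while self.num_resources > 0:
            time.sleep(0.01)
            self.num_resources -= 1
        self.alive = False

    def join(self, timeout_s=5.0):
        self._thread.join(timeout_s)


def client_shutdown(controller, num_interrupts):
    """Request shutdown, then get interrupted num_interrupts times."""
    controller.graceful_shutdown()
    for _ in range(num_interrupts):
        try:
            # Stand-in for the user pressing Ctrl+C while the client waits.
            raise KeyboardInterrupt
        except KeyboardInterrupt:
            pass  # The client stops waiting; the controller keeps going.


def check_shutdown_completes(num_interrupts):
    controller = FakeController(num_resources=3)
    client_shutdown(controller, num_interrupts)
    controller.join()
    return controller.alive, controller.num_resources


if __name__ == "__main__":
    # Cover both the single-interrupt and double-interrupt cases.
    for count in (1, 2):
        print(count, check_shutdown_completes(count))
```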

GeneDer added 3 commits July 3, 2023 12:21
GeneDer added 2 commits July 5, 2023 12:06
@edoakes edoakes left a comment

Core logic looks good

GeneDer added 4 commits July 5, 2023 16:00
…tdown

GeneDer added 2 commits July 6, 2023 14:55

GeneDer commented Jul 6, 2023

@edoakes I just addressed all the comments. Hopefully all tests are still passing 🙏

… the actor is still alive

@edoakes edoakes merged commit 267b14e into ray-project:master Jul 7, 2023
edoakes pushed a commit to edoakes/ray that referenced this pull request Jul 7, 2023
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
@GeneDer GeneDer deleted the fix-serve-non-atomic-shutdown branch July 7, 2023 18:56
bveeramani pushed a commit that referenced this pull request Jul 7, 2023
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Gene Der Su <e870252314@gmail.com>
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
Development

Successfully merging this pull request may close these issues.

[serve] Non-atomic shutdown logic may lead to unexpected behavior