Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[serve] appropriately handle actor died and actor unavailable in all cases #47008

Merged
merged 10 commits into from
Aug 12, 2024

Conversation

zcin
Copy link
Contributor

@zcin zcin commented Aug 8, 2024

Why are these changes needed?

Sometimes the replicas that a router holds a handle to died, and sometimes they are temporarily unreachable because of network issues. Previously we handled ActorDiedError and ActorUnavailableError when the router actively probed a replica.

Breakdown of scenarios for a replica:
(1) cache is turned off
(2) cache is turned on and replica's cached entry hasn't expired
(3) cache is turned on and replica's cached entry has expired

In case (1) and (3), we actively probe the replica, so the errors are dealt with appropriately. However in case (2), we do not handle the errors so we could repeatedly try to send requests to a dead replica or an unavailable one, and this can persist for up to 10s (or however long the cache timeout is).

This PR adds appropriate handling of ActorDiedError and ActorUnavailableError upon receiving the system message after sending a request.

  • For ActorDiedError, the replica will be removed from the router's replica set.
  • For ActorUnavailableError, the replica's cache entry will be invalidated so that the router will not try to send requests to it without actively probing.

Example log printed by router:

2024-08-08 11:37:14,390 WARNING router.py:528 -- Replica(id='q3ji1fgv', deployment='Dummy', app='default') will not be considered for future requests because it has died.

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
@zcin zcin marked this pull request as ready for review August 8, 2024 18:37
python/ray/serve/_private/replica_scheduler/common.py Outdated Show resolved Hide resolved
python/ray/serve/tests/test_failure.py Outdated Show resolved Hide resolved
Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
python/ray/serve/_private/replica_scheduler/common.py Outdated Show resolved Hide resolved
python/ray/serve/_private/router.py Outdated Show resolved Hide resolved
python/ray/serve/tests/unit/test_router.py Outdated Show resolved Hide resolved
@edoakes
Copy link
Contributor

edoakes commented Aug 9, 2024

High level looks good, only stylistic comments

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
@zcin zcin added the go add ONLY when ready to merge, run all tests label Aug 11, 2024
@edoakes edoakes merged commit 9622ca9 into ray-project:master Aug 12, 2024
6 checks passed
@zcin zcin deleted the handle-actor-error branch August 21, 2024 15:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants