
[serve] Faster detection of dead replicas #47237

Merged
merged 9 commits into ray-project:master on Sep 12, 2024

Conversation

zcin
Contributor

@zcin zcin commented Aug 21, 2024

Why are these changes needed?

Detect replica death earlier on handles/routers. Currently, routers only process replica death if the actor death error is thrown during active probing or on a system message. This PR makes two changes:

  1. Covers one more case: process replica death if the error is thrown while a request was being processed on the replica.
  2. Improves handling: if the error is detected on the system message, meaning the router found out the replica is dead after assigning a request to that replica, retry the request.
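The two behaviors above can be sketched in a minimal, self-contained way (hypothetical names, not the actual Ray Serve classes): a router that marks a replica dead as soon as the death error surfaces and, because the error arrived before the request executed, safely retries on another replica.

```python
# Hypothetical sketch of the routing behavior described above; the real
# implementation lives in python/ray/serve/_private/router.py and uses
# Ray actor handles rather than these stand-in classes.

class ActorDiedError(Exception):
    """Stand-in for Ray's actor death error."""

class Replica:
    def __init__(self, replica_id, alive=True):
        self.replica_id = replica_id
        self.alive = alive

    def handle_request(self, request):
        if not self.alive:
            raise ActorDiedError(self.replica_id)
        return f"{request}:ok"

class Router:
    def __init__(self, replicas):
        self.replicas = list(replicas)

    def _mark_dead(self, replica):
        # Change 1: process replica death as soon as the error is thrown,
        # so the replica is excluded from all future scheduling.
        self.replicas.remove(replica)

    def route(self, request):
        # Change 2: if the death error arrives before the request actually
        # executed on the replica, it is safe to retry on another replica.
        while self.replicas:
            replica = self.replicas[0]
            try:
                return replica.handle_request(request)
            except ActorDiedError:
                self._mark_dead(replica)
        raise RuntimeError("no live replicas")
```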

Performance evaluation

(master results pulled from https://buildkite.com/ray-project/release/builds/21404#01917375-2b1e-4cba-9380-24e557a42a42)

Latency:

| metric | master | this PR | % change |
| -- | -- | -- | -- |
| http_p50_latency | 3.9672044999932154 | 3.9794859999986443 | 0.31 |
| http_1mb_p50_latency | 4.283115999996312 | 4.1375990000034335 | -3.4 |
| http_10mb_p50_latency | 8.212248500001351 | 8.056774499998198 | -1.89 |
| grpc_p50_latency | 2.889802499964844 | 2.845889500008525 | -1.52 |
| grpc_1mb_p50_latency | 6.320479999999407 | 9.85005449996379 | 55.84 |
| grpc_10mb_p50_latency | 92.12763850001693 | 106.14903449999247 | 15.22 |
| handle_p50_latency | 1.7775379999420693 | 1.6373455000575632 | -7.89 |
| handle_1mb_p50_latency | 2.797253500034458 | 2.7225929999303844 | -2.67 |
| handle_10mb_p50_latency | 11.619127000017215 | 11.39100950001648 | -1.96 |

Throughput:

| metric | master | this PR | % change |
| -- | -- | -- | -- |
| http_avg_rps | 359.14 | 357.81 | -0.37 |
| http_100_max_ongoing_requests_avg_rps | 507.21 | 515.71 | 1.68 |
| grpc_avg_rps | 506.16 | 485.92 | -4.0 |
| grpc_100_max_ongoing_requests_avg_rps | 506.13 | 486.47 | -3.88 |
| handle_avg_rps | 604.52 | 641.66 | 6.14 |
| handle_100_max_ongoing_requests_avg_rps | 1003.45 | 1039.15 | 3.56 |
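The % change columns above can be sanity-checked with a one-liner, assuming the convention `(this_pr - master) / master * 100` rounded to two decimals:

```python
# Recompute the "% change" column of the benchmark tables.
def pct_change(master: float, this_pr: float) -> float:
    return round((this_pr - master) / master * 100, 2)

# Examples from the tables:
#   pct_change(359.14, 357.81)                    -> -0.37  (http_avg_rps)
#   pct_change(6.320479999999407, 9.85005449996379) -> 55.84 (grpc_1mb_p50_latency)
```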

Results: everything except the grpc results is within noise. The grpc results have always been relatively noisy (see the screenshot below), so they are also within the noise we've been seeing. There is also no reason the changes in this PR would increase latency only for grpc and not for http or handle, so IMO this is safe.
![Screenshot 2024-08-21 at 11 54 55 AM](https://github.com/user-attachments/assets/6c7caa40-ae3c-417b-a5bf-332e2d6ca378)

Related issue number

closes #47219

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

zcin added 4 commits August 20, 2024 17:20
…request

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
@zcin zcin marked this pull request as ready for review August 21, 2024 18:57
@zcin zcin requested a review from edoakes August 21, 2024 18:58
Contributor

@edoakes edoakes left a comment


Looks good, only nits. Thanks for the detailed benchmarking, it's fantastic! 💪

python/ray/serve/_private/router.py
Comment on lines +505 to +508

```python
logger.warning(
    f"{replica_id} will not be considered for future "
    "requests because it has died."
)
```
Contributor

hm probably this message should just be logging inside the callback so it's consistent across callsites

Contributor Author

do you mean to unify the logging when an error is received on system message (line 559) vs during actual request (line 505)?

Contributor

yes and also during active probing

Contributor Author

@zcin zcin Sep 11, 2024

@edoakes Hmm, took a look at the code and it might be hard to unify. For active probing, tasks are launched and processed in the scheduler, so it seems more straightforward to deal with exceptions from probing tasks directly in the scheduler instead of using a Ray object ref callback.
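The reviewer's suggestion (unifying the "replica has died" logging across the request-error, system-message, and probing callsites) could look roughly like this sketch, with hypothetical names and a stub scheduler standing in for the real one:

```python
# Hypothetical sketch, not the actual Ray Serve code: a single
# on_replica_death callback that owns the logging, so every callsite
# produces a consistent message.
import logging

logger = logging.getLogger("serve.router")

class Scheduler:
    """Stub standing in for the real replica scheduler."""
    def __init__(self):
        self.dead = set()

    def on_replica_actor_died(self, replica_id):
        self.dead.add(replica_id)

def make_on_replica_death(scheduler):
    def on_replica_death(replica_id):
        # Remove the replica from scheduling, then log in one place.
        scheduler.on_replica_actor_died(replica_id)
        logger.warning(
            f"{replica_id} will not be considered for future "
            "requests because it has died."
        )
    return on_replica_death
```

Each callsite (request error path, system-message path, and, if feasible, active probing) would then call the same callback instead of logging independently.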

python/ray/serve/_private/router.py (outdated)
zcin added 2 commits August 22, 2024 08:41
Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
@zcin zcin self-assigned this Aug 27, 2024
Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
@zcin zcin added the go add ONLY when ready to merge, run all tests label Sep 11, 2024
Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
@zcin zcin merged commit f48f821 into ray-project:master Sep 12, 2024
5 checks passed
can-anyscale added a commit that referenced this pull request Sep 12, 2024
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
Labels
go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Serve] Subsequent requests to crashed deployment result in error
2 participants