# [serve] Faster detection of dead replicas (ray-project#47237)
## Why are these changes needed?

Detect replica death earlier on handles/routers. Currently, routers only process replica death if the actor death error is thrown during active probing or via a system message. This PR makes two changes:
1. Cover one more case: process replica death if the error is thrown _while_ a request was being processed on the replica.
2. Improve handling: if the error is detected on the system message, meaning the router found out the replica was dead only after assigning a request to it, retry the request (see the sketch below).
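
Below is a minimal sketch of the retry-on-death idea from item 2, in plain asyncio. `choose_replica`, `send_request`, and `ReplicaDiedError` are hypothetical stand-ins for Serve's router internals, not its actual API:

```python
# Sketch: a router loop that retries a request when it learns the chosen
# replica died after assignment, instead of failing the request.
import asyncio
import random


class ReplicaDiedError(Exception):
    """Stand-in for ray.exceptions.ActorDiedError."""


async def choose_replica(replicas):
    # Real routers use smarter scheduling; random choice is enough here.
    return random.choice(replicas)


async def send_request(replica, request):
    if replica["dead"]:
        raise ReplicaDiedError(f"replica {replica['id']} is dead")
    return f"{replica['id']} handled {request}"


async def assign_request_with_retry(replicas, request):
    while True:
        replica = await choose_replica(replicas)
        try:
            return await send_request(replica, request)
        except ReplicaDiedError:
            # The replica died after assignment: drop it from the
            # candidate set and retry on another replica.
            replicas = [r for r in replicas if r is not replica]
            if not replicas:
                raise


async def main():
    replicas = [{"id": "a", "dead": True}, {"id": "b", "dead": False}]
    print(await assign_request_with_retry(replicas, "req-1"))


asyncio.run(main())
```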

### Performance evaluation
(master results pulled from
https://buildkite.com/ray-project/release/builds/21404#01917375-2b1e-4cba-9380-24e557a42a42)

Latency:
| metric | master | this PR | % change |
| -- | -- | -- | -- |
| http_p50_latency | 3.9672044999932154 | 3.9794859999986443 | 0.31 |
| http_1mb_p50_latency | 4.283115999996312 | 4.1375990000034335 | -3.4 |
| http_10mb_p50_latency | 8.212248500001351 | 8.056774499998198 | -1.89 |
| grpc_p50_latency | 2.889802499964844 | 2.845889500008525 | -1.52 |
| grpc_1mb_p50_latency | 6.320479999999407 | 9.85005449996379 | 55.84 |
| grpc_10mb_p50_latency | 92.12763850001693 | 106.14903449999247 | 15.22 |
| handle_p50_latency | 1.7775379999420693 | 1.6373455000575632 | -7.89 |
| handle_1mb_p50_latency | 2.797253500034458 | 2.7225929999303844 | -2.67 |
| handle_10mb_p50_latency | 11.619127000017215 | 11.39100950001648 | -1.96 |
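
(For reference, the % change column is (this PR − master) / master × 100; e.g. for grpc_1mb_p50_latency, (9.8501 − 6.3205) / 6.3205 × 100 ≈ 55.84.)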

Throughput:
| metric | master | this PR | % change |
| -- | -- | -- | -- |
| http_avg_rps | 359.14 | 357.81 | -0.37 |
| http_100_max_ongoing_requests_avg_rps | 507.21 | 515.71 | 1.68 |
| grpc_avg_rps | 506.16 | 485.92 | -4.0 |
| grpc_100_max_ongoing_requests_avg_rps | 506.13 | 486.47 | -3.88 |
| handle_avg_rps | 604.52 | 641.66 | 6.14 |
| handle_100_max_ongoing_requests_avg_rps | 1003.45 | 1039.15 | 3.56 |

Results: everything except the gRPC numbers is within noise. As for the gRPC results, they have always been relatively noisy (see the screenshot below), so these results are also within the variance we've been seeing. There is also no reason the changes in this PR would increase latency only for gRPC and not for HTTP or handle, so IMO this is safe.

![Screenshot 2024-08-21 at 11 54 55 AM](https://github.com/user-attachments/assets/6c7caa40-ae3c-417b-a5bf-332e2d6ca378)

## Related issue number

closes ray-project#47219

---------

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Signed-off-by: ujjawal-khare <ujjawal.khare@dream11.com>
zcin authored and ujjawal-khare committed Oct 15, 2024
1 parent 3db7bde commit 679d392
Showing 2 changed files with 5 additions and 5 deletions.
`python/ray/serve/_private/router.py` (4 additions & 1 deletion)

```diff
@@ -621,7 +621,10 @@ async def assign_request(
                     replica_id
                 )
                 callback = partial(self._process_finished_request, replica_id)
-                ref.add_callback(callback)
+                if isinstance(ref, (ray.ObjectRef, FakeObjectRef)):
+                    ref._on_completed(callback)
+                else:
+                    ref.completed()._on_completed(callback)
 
                 return ref
             except asyncio.CancelledError:
```
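
The diff routes the `_process_finished_request` callback through `_on_completed` on the underlying ref, so it fires when the request completes for any reason, including the replica dying mid-request. A minimal sketch of that completion-callback pattern using plain asyncio futures (`process_finished_request` and `ReplicaDiedError` here are illustrative stand-ins, not Serve internals):

```python
import asyncio


class ReplicaDiedError(Exception):
    pass


def process_finished_request(replica_id: str, fut: asyncio.Future) -> None:
    # Runs on completion whether the request succeeded or the replica
    # died mid-request, so death is noticed without waiting for a probe.
    exc = fut.exception()
    if isinstance(exc, ReplicaDiedError):
        print(f"marking replica {replica_id} as dead")
    else:
        print(f"request on replica {replica_id} finished normally")


async def main() -> None:
    fut = asyncio.get_running_loop().create_future()
    fut.add_done_callback(lambda f: process_finished_request("replica-1", f))
    # Simulate the replica dying while processing the request.
    fut.set_exception(ReplicaDiedError("actor died"))
    await asyncio.sleep(0)  # give the loop a chance to run the callback


asyncio.run(main())
```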
`python/ray/serve/tests/test_failure.py` (1 addition & 4 deletions)

```diff
@@ -258,9 +258,6 @@ def make_blocked_request():
     assert ray.get(blocked_ref) == "hi"
 
 
-@pytest.mark.skipif(
-    sys.platform == "win32", reason="ActorDiedError not raised properly on windows."
-)
 @pytest.mark.parametrize("die_during_request", [False, True])
 def test_replica_actor_died(serve_instance, die_during_request):
     """Test replica death paired with delayed handle notification.
@@ -298,7 +295,7 @@ def check_health(self):
 
     # Kill one replica.
     if die_during_request:
-        with pytest.raises(RayActorError):
+        with pytest.raises(ActorDiedError):
             h.remote(crash=True).result()
     else:
         replica_to_kill = random.choice(replicas)
```
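
Note on the exception change: in recent Ray versions `ActorDiedError` is a subclass of `RayActorError`, so asserting on it is the strictly more specific check. A one-line verification (requires a Ray installation):

```python
# ActorDiedError specializes RayActorError; the tightened assertion
# still matches the same failure, just more precisely.
from ray.exceptions import ActorDiedError, RayActorError

assert issubclass(ActorDiedError, RayActorError)
```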
