[serve] Faster detection of dead replicas #47237
Conversation
Looks good, only nits. Thanks for the detailed benchmarking, it's fantastic! 💪
```python
logger.warning(
    f"{replica_id} will not be considered for future "
    "requests because it has died."
)
```
Hm, this message should probably just be logged inside the callback so it's consistent across callsites.
Do you mean to unify the logging when an error is received on the system message (line 559) vs. during an actual request (line 505)?
Yes, and also during active probing.
@edoakes Hmm, I took a look at the code and it might be hard to unify. For active probing, tasks are launched and processed in the scheduler, so it seems more straightforward to deal with exceptions from probing tasks directly in the scheduler instead of using a Ray object ref callback.
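To make the suggestion concrete, here is a minimal sketch of what funneling the callsites through a single callback could look like. All names below (`Router`, `_on_replica_died`, the `on_*_error` methods) are hypothetical, not Ray Serve's actual internals:

```python
# Minimal sketch only -- class and method names are illustrative,
# not Ray Serve's actual internals.
import logging

logger = logging.getLogger(__name__)


class Router:
    def __init__(self):
        # replica_id -> replica wrapper; dead replicas are removed here.
        self._replicas = {}

    def _on_replica_died(self, replica_id: str) -> None:
        # Single choke point: every callsite funnels through here, so the
        # warning is logged consistently no matter how death was detected.
        if self._replicas.pop(replica_id, None) is not None:
            logger.warning(
                f"{replica_id} will not be considered for future "
                "requests because it has died."
            )

    def on_system_message_error(self, replica_id: str) -> None:
        # Death detected on the system message, after request assignment.
        self._on_replica_died(replica_id)

    def on_request_error(self, replica_id: str) -> None:
        # Death detected while the request was being processed.
        self._on_replica_died(replica_id)

    def on_probe_error(self, replica_id: str) -> None:
        # Death detected during active probing; per the comment above, this
        # path may be easier to handle directly in the scheduler instead.
        self._on_replica_died(replica_id)
```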
## Why are these changes needed?

Detect replica death earlier on handles/routers. Currently, routers only process replica death if the actor death error is thrown during active probing or on a system message. This PR:

1. Covers one more case: process replica death if the error is thrown _while_ the request was being processed on the replica.
2. Improves handling: if the error is detected on the system message, meaning the router found out the replica died after it assigned a request to that replica, retry the request.

### Performance evaluation

(master results pulled from https://buildkite.com/ray-project/release/builds/21404#01917375-2b1e-4cba-9380-24e557a42a42)

Latency:

| metric | master | this PR | % change |
| -- | -- | -- | -- |
| http_p50_latency | 3.9672044999932154 | 3.9794859999986443 | 0.31 |
| http_1mb_p50_latency | 4.283115999996312 | 4.1375990000034335 | -3.4 |
| http_10mb_p50_latency | 8.212248500001351 | 8.056774499998198 | -1.89 |
| grpc_p50_latency | 2.889802499964844 | 2.845889500008525 | -1.52 |
| grpc_1mb_p50_latency | 6.320479999999407 | 9.85005449996379 | 55.84 |
| grpc_10mb_p50_latency | 92.12763850001693 | 106.14903449999247 | 15.22 |
| handle_p50_latency | 1.7775379999420693 | 1.6373455000575632 | -7.89 |
| handle_1mb_p50_latency | 2.797253500034458 | 2.7225929999303844 | -2.67 |
| handle_10mb_p50_latency | 11.619127000017215 | 11.39100950001648 | -1.96 |

Throughput:

| metric | master | this PR | % change |
| -- | -- | -- | -- |
| http_avg_rps | 359.14 | 357.81 | -0.37 |
| http_100_max_ongoing_requests_avg_rps | 507.21 | 515.71 | 1.68 |
| grpc_avg_rps | 506.16 | 485.92 | -4.0 |
| grpc_100_max_ongoing_requests_avg_rps | 506.13 | 486.47 | -3.88 |
| handle_avg_rps | 604.52 | 641.66 | 6.14 |
| handle_100_max_ongoing_requests_avg_rps | 1003.45 | 1039.15 | 3.56 |

Results: everything except the grpc numbers is within noise. The grpc results have always been relatively noisy (see below), so they are also within the noise we've been seeing. There is also no reason why the changes in this PR would increase latency only for grpc and not for http or handle, so IMO this is safe.

![Screenshot 2024-08-21 at 11 54 55 AM](https://github.com/user-attachments/assets/6c7caa40-ae3c-417b-a5bf-332e2d6ca378)

## Related issue number

Closes ray-project#47219

---------

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Signed-off-by: ujjawal-khare <ujjawal.khare@dream11.com>
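A rough sketch of the control flow the two changes above describe. Every name here (`ActorDiedError`, `choose_replica`, `send_request`, `mark_replica_dead`) is an illustrative placeholder, not Ray Serve's actual API:

```python
# Illustrative sketch of the routing flow described above; all names are
# placeholders, not Ray Serve's actual internals.
class ActorDiedError(Exception):
    """Stand-in for the actor death error raised by Ray."""


async def route_request(router, request, max_attempts: int = 3):
    for _ in range(max_attempts):
        replica = await router.choose_replica(request)
        try:
            # The system message acks that the replica accepted the request.
            # An actor death error here means the replica died right after
            # assignment, so it is safe to retry on another replica (change 2).
            await replica.send_request(request)
        except ActorDiedError:
            router.mark_replica_dead(replica.replica_id)
            continue
        try:
            return await replica.get_response()
        except ActorDiedError:
            # New case covered (change 1): the replica died *while* processing
            # the request. Mark it dead immediately so the router stops
            # routing to it, instead of waiting for a probe or system message.
            router.mark_replica_dead(replica.replica_id)
            raise
    raise RuntimeError(f"No live replica accepted the request "
                       f"after {max_attempts} attempts.")
```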
## Checks

- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I've added a new method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.