[core][dashboard][agent] add configurable timeouts for rt env agent and job_supervisor (#47481)

GcsClient has a configurable timeout, `nums_py_gcs_reconnect_retry`.
In GcsAioClient, however, it defaults to 5 and there is no way to
control it. This PR changes the callers to use the flag
`gcs_rpc_server_reconnect_timeout_s` so that it is configurable. The
flag is already used in agent.py but not in the rt env agent or
job_supervisor. This PR fixes all GcsAioClient callers in the non-test
Python codebase.

Note that head.py passes retry=0, which should mean infinite retries,
but that did not work. This PR fixes it by checking for 0 explicitly.
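
As an illustration of the "0 means infinite retries" convention described in the note, here is a minimal sketch of a retry loop that checks for 0 explicitly. It is not the code from this PR: the helper name `call_with_retry` and the choice of `ConnectionError` are hypothetical and exist only for this example.

```python
import itertools
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(fn: Callable[[], T], nums_reconnect_retry: int) -> T:
    """Call fn, retrying on ConnectionError (hypothetical helper, not a Ray API).

    nums_reconnect_retry == 0 is treated as "retry forever"; a positive
    value is a cap on the number of retries after the first attempt.
    """
    if nums_reconnect_retry < 0:
        raise ValueError("nums_reconnect_retry must be >= 0")

    # The explicit 0 check: without it, range(0 + 1) would allow exactly one
    # attempt, which is the opposite of "infinite retries".
    if nums_reconnect_retry == 0:
        attempts = itertools.count()
    else:
        attempts = range(nums_reconnect_retry + 1)

    last_error = None
    for _ in attempts:
        try:
            return fn()
        except ConnectionError as e:
            last_error = e
    # Only reachable when the retry cap is finite and every attempt failed.
    raise last_error
```

With a finite cap of 5 the helper makes at most 6 attempts, while a cap of 0 keeps calling `fn` until it succeeds.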

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
rynewang authored Sep 4, 2024
1 parent 9781c6c commit e9f7930
Showing 1 changed file with 3 additions and 2 deletions.
5 changes: 3 additions & 2 deletions python/ray/_private/gcs_aio_client.py
@@ -8,6 +8,7 @@
 )
 import ray._private.utils
 from ray._private.ray_constants import env_integer
+import ray
 
 # Number of executor threads. No more than this number of concurrent GcsAioClient calls
 # can happen. Extra requests will need to wait for the existing requests to finish.
@@ -52,8 +53,8 @@ def __init__(
         executor=None,
         nums_reconnect_retry: int = 5,
     ):
-        # See https://github.com/ray-project/ray/blob/d0b46eff9ddcf9ec7256dd3a6dda33e7fb7ced95/python/ray/_raylet.pyx#L2693 # noqa: E501
-        timeout_ms = 1000 * (nums_reconnect_retry + 1)
+        # This must be consistent with GcsClient.__cinit__ in _raylet.pyx
+        timeout_ms = ray._config.py_gcs_connect_timeout_s() * 1000
         self.inner = NewGcsClient.standalone(
             str(address), cluster_id=None, timeout_ms=timeout_ms
         )
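
To make the arithmetic in the second hunk concrete, the following sketch compares the two ways of deriving `timeout_ms`. It deliberately does not import Ray: `legacy_timeout_ms` and `configured_timeout_ms` are hypothetical helper names, and the `connect_timeout_s` argument stands in for whatever `ray._config.py_gcs_connect_timeout_s()` returns in a real process. The value 30 used below is an arbitrary example, not a claim about Ray's default.

```python
def legacy_timeout_ms(nums_reconnect_retry: int = 5) -> int:
    """Old behavior: the timeout is derived from the retry count.

    Hypothetical helper mirroring the removed line
    `timeout_ms = 1000 * (nums_reconnect_retry + 1)`.
    """
    return 1000 * (nums_reconnect_retry + 1)


def configured_timeout_ms(connect_timeout_s: int) -> int:
    """New behavior: the timeout comes from a config value given in seconds.

    Hypothetical helper mirroring the added line; connect_timeout_s is a
    stand-in for ray._config.py_gcs_connect_timeout_s().
    """
    return connect_timeout_s * 1000


# With the default of 5 retries, the old formula always yields 6 seconds.
print(legacy_timeout_ms())        # 6000
# The new formula follows the configured value (30 is an example only).
print(configured_timeout_ms(30))  # 30000
```

The practical difference is that the old value changed only when a caller passed a different `nums_reconnect_retry`, whereas the new value is read from configuration and no longer depends on that argument.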
