[job] Dashboard job log URL doesn't work when dashboard agent listen port changes. #33397

Closed
rickyyx opened this issue Mar 16, 2023 · 5 comments · Fixed by #33834
Labels
bug Something that is supposed to be working; but isn't P0 Issues that should be fixed in short order Ray-2.4

Comments

@rickyyx
Contributor

rickyyx commented Mar 16, 2023

What happened + What you expected to happen

It seems the job log URL is hardcoded for jobs submitted through the job CLI.

Versions / Dependencies

master

Reproduction script

ray start --dashboard-agent-listen-port 6945 --head
ray job submit -- python -c 'import ray; ray.init()'

This yields a dashboard log link that still uses the default port:
[image]
[image]

By comparison, running the driver script directly yields the correct log link.
[image]

Issue Severity

None

@rickyyx rickyyx added the bug Something that is supposed to be working; but isn't label Mar 16, 2023
@alanwguo
Contributor

So the difference here is that the job submission uses this field: driver_agent_http_address, while the other job uses driver_info.

driver_agent_http_address gets its port from ray.worker.global_worker.node.dashboard_agent_listen_port.
https://github.com/ray-project/ray/blob/master/dashboard/modules/job/job_manager.py#L372

This seems to have the wrong port set? @rickyyx or @rkooo567, do you know who sets that value?

@alanwguo
Contributor

Simple repro steps:

ray start --dashboard-agent-listen-port 6945 --head 
ipython
import ray
ray.init()
ray.worker.global_worker.node.dashboard_agent_listen_port

@alanwguo
Contributor

Btw, I tried deploying a service with a custom dashboard-agent-listen-port and ran into an error. Seems like many things break when we try to customize the dashboard agent listen port. CC: @edoakes @sihanwang41

serve deploy serve_config.yaml
Traceback (most recent call last):
  File "/Users/aguo/ws/ray/python/ray/serve/scripts.py", line 185, in deploy
    ServeDeploySchema.parse_obj(config)
  File "pydantic/main.py", line 521, in pydantic.main.BaseModel.parse_obj
  File "pydantic/main.py", line 341, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 3 validation errors for ServeDeploySchema
deployments
  extra fields not permitted (type=value_error.extra)
import_path
  extra fields not permitted (type=value_error.extra)
runtime_env
  extra fields not permitted (type=value_error.extra)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/aguo/opt/anaconda3/envs/ray/lib/python3.7/site-packages/urllib3/connection.py", line 175, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw
  File "/Users/aguo/opt/anaconda3/envs/ray/lib/python3.7/site-packages/urllib3/util/connection.py", line 95, in create_connection
    raise err
  File "/Users/aguo/opt/anaconda3/envs/ray/lib/python3.7/site-packages/urllib3/util/connection.py", line 85, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 61] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/aguo/opt/anaconda3/envs/ray/lib/python3.7/site-packages/urllib3/connectionpool.py", line 710, in urlopen
    chunked=chunked,
  File "/Users/aguo/opt/anaconda3/envs/ray/lib/python3.7/site-packages/urllib3/connectionpool.py", line 398, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/Users/aguo/opt/anaconda3/envs/ray/lib/python3.7/site-packages/urllib3/connection.py", line 239, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "/Users/aguo/opt/anaconda3/envs/ray/lib/python3.7/http/client.py", line 1244, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/Users/aguo/opt/anaconda3/envs/ray/lib/python3.7/http/client.py", line 1290, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/Users/aguo/opt/anaconda3/envs/ray/lib/python3.7/http/client.py", line 1239, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/Users/aguo/opt/anaconda3/envs/ray/lib/python3.7/http/client.py", line 1026, in _send_output
    self.send(msg)
  File "/Users/aguo/opt/anaconda3/envs/ray/lib/python3.7/http/client.py", line 966, in send
    self.connect()
  File "/Users/aguo/opt/anaconda3/envs/ray/lib/python3.7/site-packages/urllib3/connection.py", line 205, in connect
    conn = self._new_conn()
  File "/Users/aguo/opt/anaconda3/envs/ray/lib/python3.7/site-packages/urllib3/connection.py", line 187, in _new_conn
    self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7fb07aeef790>: Failed to establish a new connection: [Errno 61] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/aguo/opt/anaconda3/envs/ray/lib/python3.7/site-packages/requests/adapters.py", line 499, in send
    timeout=timeout,
  File "/Users/aguo/opt/anaconda3/envs/ray/lib/python3.7/site-packages/urllib3/connectionpool.py", line 788, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/Users/aguo/opt/anaconda3/envs/ray/lib/python3.7/site-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=52365): Max retries exceeded with url: /api/ray/version (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fb07aeef790>: Failed to establish a new connection: [Errno 61] Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/aguo/ws/ray/python/ray/dashboard/modules/dashboard_sdk.py", line 242, in _check_connection_and_version_with_url
    r = self._do_request("GET", url)
  File "/Users/aguo/ws/ray/python/ray/dashboard/modules/dashboard_sdk.py", line 290, in _do_request
    **kwargs,
  File "/Users/aguo/opt/anaconda3/envs/ray/lib/python3.7/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/Users/aguo/opt/anaconda3/envs/ray/lib/python3.7/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/Users/aguo/opt/anaconda3/envs/ray/lib/python3.7/site-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/Users/aguo/opt/anaconda3/envs/ray/lib/python3.7/site-packages/requests/adapters.py", line 565, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=52365): Max retries exceeded with url: /api/ray/version (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fb07aeef790>: Failed to establish a new connection: [Errno 61] Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/aguo/opt/anaconda3/envs/ray/bin/serve", line 33, in <module>
    sys.exit(load_entry_point('ray', 'console_scripts', 'serve')())
  File "/Users/aguo/opt/anaconda3/envs/ray/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/Users/aguo/opt/anaconda3/envs/ray/lib/python3.7/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/Users/aguo/opt/anaconda3/envs/ray/lib/python3.7/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/aguo/opt/anaconda3/envs/ray/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/aguo/opt/anaconda3/envs/ray/lib/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/Users/aguo/ws/ray/python/ray/serve/scripts.py", line 190, in deploy
    ServeSubmissionClient(address).deploy_application(config)
  File "/Users/aguo/ws/ray/python/ray/dashboard/modules/serve/sdk.py", line 71, in __init__
    url="/api/ray/version",
  File "/Users/aguo/ws/ray/python/ray/dashboard/modules/dashboard_sdk.py", line 259, in _check_connection_and_version_with_url
    f"Failed to connect to Ray at address: {self._address}."
ConnectionError: Failed to connect to Ray at address: http://localhost:52365

@edoakes
Contributor

edoakes commented Mar 17, 2023

@alanwguo if you change the port, you also need to pass the matching --address to serve deploy.
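Before picking an --address, it can help to check which port is actually accepting connections. Below is a minimal stdlib-only sketch; the helper name is_port_open is hypothetical and not part of Ray, and the two ports are simply the ones that appear in this thread (52365 from the traceback, 6945 from the repro).

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds.

    Hypothetical helper for illustration only; it is not a Ray API.
    """
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError)
        # when nothing is listening on the port.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 52365 is the port from the traceback; 6945 is the custom port from the repro.
    for port in (52365, 6945):
        print(port, is_port_open("localhost", port))
```

If only the custom port reports open, that is the port the address passed to serve deploy would need to use.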

@rickyyx rickyyx assigned rickyyx and unassigned alanwguo and architkulkarni Mar 20, 2023
@rickyyx
Contributor Author

rickyyx commented Mar 21, 2023

> So the difference here is that the job submission uses this field: driver_agent_http_address, while the other job uses driver_info.
>
> driver_agent_http_address gets its port from ray.worker.global_worker.node.dashboard_agent_listen_port. https://github.com/ray-project/ray/blob/master/dashboard/modules/job/job_manager.py#L372
>
> This seems to have the wrong port set? @rickyyx or @rkooo567, do you know who sets that value?

I believe this happens because, when we connect to an existing Ray node (e.g. from a driver script), the node is initialized with a dashboard_agent_listen_port that is just the default from ray_params:

self._dashboard_agent_listen_port = ray_params.dashboard_agent_listen_port

What it should do instead is take the port from the ports cached on the node, just as is done for metrics_agent_port here:

"metrics_agent_port", default_port=ray_params.metrics_agent_port

Working on a fix for this.
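To illustrate the idea, here is a minimal, self-contained sketch of a cached-port lookup. It is not the actual Ray implementation: the function get_cached_port, the file name ports_by_node.json, the node ID "node-1", and the treatment of 52365 (the port from the traceback above) as the default are all assumptions made for this example, and file locking is omitted for brevity.

```python
import json
import os
import tempfile
from typing import Optional

# Assumption: 52365 is the port seen in the traceback earlier in this thread.
DEFAULT_AGENT_LISTEN_PORT = 52365


def get_cached_port(
    session_dir: str,
    node_id: str,
    port_name: str,
    default_port: Optional[int] = None,
) -> int:
    """Return the port recorded for this node, recording default_port on first use.

    Hypothetical sketch: the first process on a node writes its port to a
    shared file; later processes read that value instead of their own default.
    """
    path = os.path.join(session_dir, "ports_by_node.json")
    ports_by_node = {}
    if os.path.exists(path):
        with open(path) as f:
            ports_by_node = json.load(f)

    node_ports = ports_by_node.setdefault(node_id, {})
    if port_name not in node_ports:
        if default_port is None:
            raise ValueError(f"No cached value and no default for {port_name}")
        # First caller wins: persist the port so later callers see it.
        node_ports[port_name] = default_port
        with open(path, "w") as f:
            json.dump(ports_by_node, f)
    return node_ports[port_name]


with tempfile.TemporaryDirectory() as session_dir:
    # The head node was started with --dashboard-agent-listen-port 6945.
    head_port = get_cached_port(
        session_dir, "node-1", "dashboard_agent_listen_port", default_port=6945
    )
    # A driver connecting later only knows the default, but gets the cached value.
    driver_port = get_cached_port(
        session_dir,
        "node-1",
        "dashboard_agent_listen_port",
        default_port=DEFAULT_AGENT_LISTEN_PORT,
    )
    print(head_port, driver_port)  # prints "6945 6945"
```

With a lookup like this, the driver-side node would report the port the agent is really listening on rather than the default passed in through ray_params.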


4 participants