[UI] When a Ray worker node dies, it makes Ray dashboard "cluster" page and "logs" page broken #47668

Closed
WeichenXu123 opened this issue Sep 16, 2024 · 11 comments · Fixed by #47701
Labels
bug: Something that is supposed to be working; but isn't
dashboard: Issues specific to the Ray Dashboard
P0: Issues that should be fixed in short order
release-blocker: P0 Issue that blocks the release

Comments

@WeichenXu123
Contributor

WeichenXu123 commented Sep 16, 2024

What happened + What you expected to happen

When a Ray worker node dies, the Ray dashboard "cluster" page and "logs" page break.

A normal "cluster" page containing a dead node should look like this:
image

But now it looks like this:
image

Also, when a Ray worker node dies, the "logs" page is completely broken and shows nothing.

Versions / Dependencies

This bug was introduced in the Ray nightly build; Ray 2.35 works correctly.

Reproduction script

  1. Install the Ray nightly build.
  2. Start a Ray head node, for example:
ray start --head --node-ip-address=127.0.0.1 --port 9988 --dashboard-host=0.0.0.0
  3. Start a Ray worker node, for example:
ray start --address 127.0.0.1:9988 --block &
  4. Find the Ray worker node's process group and kill (-9) all processes in the group:
ps -ef|grep "ray start --address 127.0.0.1:9988"  # Get the process ID of the "ray start --address 127.0.0.1:9988" command
kill -9 -{xxx} # Kill the process **group** whose ID was printed by the command above
  5. Wait a few seconds, then open the Ray dashboard cluster page. The error occurs.
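
The process-group kill in the steps above can also be done from Python. This is a minimal sketch, not part of the original report: `kill_process_group` is an illustrative helper name, and a sleeping child process stands in for the `ray start --address ... --block` worker. It assumes a POSIX system.

```python
import os
import signal
import subprocess
import sys


def kill_process_group(pid: int) -> int:
    """Send SIGKILL to every process in the group that `pid` belongs to."""
    pgid = os.getpgid(pid)
    os.killpg(pgid, signal.SIGKILL)
    return pgid


if __name__ == "__main__":
    # Stand-in for the Ray worker process (assumption for illustration only):
    # a child that sleeps, started in its own session so that it leads its
    # own process group and the kill does not touch this script.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"],
        start_new_session=True,
    )
    pgid = kill_process_group(child.pid)
    child.wait()
    print("killed process group", pgid, "exit code", child.returncode)
```

Killing the whole group rather than a single PID matters here because `ray start` spawns several processes on the worker node, and the report kills all of them at once.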

The response to the nodes?view=summary request contains an error like:

Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-8ac7e00d-2d10-430a-b39d-ad4447d3a5ed/lib/python3.11/site-packages/ray/dashboard/optional_utils.py", line 224, in _update_cache
    response = task.result()
               ^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-8ac7e00d-2d10-430a-b39d-ad4447d3a5ed/lib/python3.11/site-packages/ray/dashboard/modules/node/node_head.py", line 364, in get_all_nodes
    all_node_summary, nodes_logical_resources = await asyncio.gather(
                                                ^^^^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-8ac7e00d-2d10-430a-b39d-ad4447d3a5ed/lib/python3.11/site-packages/ray/dashboard/datacenter.py", line 171, in get_all_node_summary
    return [
           ^
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-8ac7e00d-2d10-430a-b39d-ad4447d3a5ed/lib/python3.11/site-packages/ray/dashboard/datacenter.py", line 172, in <listcomp>
    await DataOrganizer.get_node_info(node_id, get_summary=True)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-8ac7e00d-2d10-430a-b39d-ad4447d3a5ed/lib/python3.11/site-packages/ray/dashboard/datacenter.py", line 150, in get_node_info
    node_info["status"] = node["stateSnapshot"]["state"]
                          ~~~~^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-8ac7e00d-2d10-430a-b39d-ad4447d3a5ed/lib/python3.11/site-packages/ray/dashboard/utils.py", line 435, in __getitem__
    proxy = self._proxy[item] = make_immutable(self._dict[item])
                                               ~~~~~~~~~~^^^^^^
KeyError: 'stateSnapshot'
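
The last frames show a nested lookup on a node record that has no "stateSnapshot" key. The sketch below reproduces that failure mode with plain dictionaries and contrasts it with a tolerant lookup. It is only an illustration under stated assumptions: the record layout (including the `nodeId` field), the helper names and the "UNKNOWN" fallback are hypothetical, and this is not necessarily how the linked fix resolves it.

```python
from typing import Any, Dict

# Hypothetical node records, NOT Ray's actual data model. Only the
# "stateSnapshot" -> "state" shape is taken from the traceback.
alive_node: Dict[str, Any] = {"nodeId": "node-a", "stateSnapshot": {"state": "ALIVE"}}
dead_node: Dict[str, Any] = {"nodeId": "node-b"}  # no "stateSnapshot" key


def status_strict(node: Dict[str, Any]) -> str:
    # Same access pattern as the failing line in the traceback.
    return node["stateSnapshot"]["state"]


def status_tolerant(node: Dict[str, Any], default: str = "UNKNOWN") -> str:
    # Defensive variant: fall back to a placeholder when a key is absent.
    return node.get("stateSnapshot", {}).get("state", default)


if __name__ == "__main__":
    print(status_strict(alive_node))
    try:
        status_strict(dead_node)
    except KeyError as err:
        print("KeyError:", err)
    print(status_tolerant(dead_node))
```

Because the summary is built for all nodes in one list comprehension, a single record that raises is enough to fail the whole request, which matches the broken cluster page.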

Issue Severity

None

WeichenXu123 added the bug and triage labels Sep 16, 2024
@WeichenXu123
Contributor Author

CC @rkooo567, any ideas?

@rkooo567
Contributor

Can you share the console log when this happens? Also cc @alanwguo @nikitavemuri

@WeichenXu123
Contributor Author

WeichenXu123 commented Sep 16, 2024

> can you share the console log when this happens? Also cc @alanwguo @nikitavemuri

Which log file do you need? The Ray node console log shows nothing abnormal, and the UI logs page is broken.

@rkooo567
Contributor

I meant the UI log page (assuming there must be an exception there?)

@WeichenXu123
Contributor Author

> I meant the UI log page (assuming there must be an exception there? )

No, nothing is displayed. It turns out to be a blank page.

@WeichenXu123
Contributor Author

This is the console error when loading the "logs" page. @rkooo567, any insights?
image

@WeichenXu123
Contributor Author

This is the error when loading the "cluster" page:
image

@WeichenXu123
Contributor Author

@rkooo567
The HTTP request nodes?view=summary failed with the same KeyError: 'stateSnapshot' traceback as in the issue description.
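
The failing request can also be checked outside the browser. This is a minimal sketch under assumptions: the dashboard is taken to listen on Ray's default dashboard port 8265 (the reproduction does not set a dashboard port), the endpoint is assumed to be served at the root path, and the error text is assumed to appear in the response body. `fetch_nodes_summary` and `looks_broken` are illustrative helper names, not part of Ray.

```python
import urllib.error
import urllib.request

# Assumption: Ray's default dashboard port. Adjust if the head node uses another.
DASHBOARD_URL = "http://127.0.0.1:8265"


def fetch_nodes_summary(base_url: str = DASHBOARD_URL, timeout: float = 5.0):
    """Return (http_status, body_text) for GET <base_url>/nodes?view=summary."""
    url = f"{base_url}/nodes?view=summary"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # Non-2xx responses still carry a body, possibly with a traceback.
        return err.code, err.read().decode("utf-8", errors="replace")


def looks_broken(status: int, body: str) -> bool:
    """Heuristic: non-200 status, or the KeyError from this issue in the body."""
    return status != 200 or "KeyError: 'stateSnapshot'" in body


if __name__ == "__main__":
    try:
        status, body = fetch_nodes_summary()
    except urllib.error.URLError as err:
        print("dashboard not reachable:", err.reason)
    else:
        print(status, "BROKEN" if looks_broken(status, body) else "ok")
```

Running this before and after killing the worker's process group gives a quick pass/fail signal for the regression without depending on the UI.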

anyscalesam added the dashboard label Sep 16, 2024
@WeichenXu123
Contributor Author

@anyscalesam Who should we assign this bug to? Thanks! I think this is a critical bug.

jjyao added the P0 and release-blocker labels and removed the triage label Sep 17, 2024
@jjyao
Collaborator

jjyao commented Sep 17, 2024

@rynewang it's likely due to the recent dashboard optimizations.

@jjyao
Collaborator

jjyao commented Sep 17, 2024

I think it's due to #47367
