
[BUG] Ray Dashboard: GPU stats per actor is empty #48312

Open
aviadshimoni opened this issue Oct 29, 2024 · 9 comments
Labels
bug: Something that is supposed to be working; but isn't
dashboard: Issues specific to the Ray Dashboard
triage: Needs triage (eg: priority, bug/not-bug, and owning component)

Comments

@aviadshimoni

What happened + What you expected to happen

Ray deployments don't have GPU/GRAM tracking in the "Actors" section of the Ray dashboard; the stats do appear in the Cluster tab (screenshot attached).
https://ray.slack.com/files/U05D35JHGUV/F07TJAFKMTR/image.png?origin_team=TN4768NRM&origin_channel=CMVUQ1KMX

Is this intended? Are we having issues aggregating GPU stats per actor?

Versions / Dependencies

KubeRay: 1.1.1
Ray Version: 2.34.0
CRD: v1

Docker image: rayproject/ray:2.34.0-py310-gpu

Reproduction script

Deploy any Ray service, open its dashboard, and check the 'Actors' tab in the top navigation bar. A minimal actor-based sketch is shown below.
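
A minimal actor-based sketch of the setup (the class name and the PyTorch workload are illustrative; the original report uses a Ray Serve deployment on KubeRay):

```python
import ray
import torch  # assumption: a CUDA-enabled PyTorch is available in the image

ray.init()  # or ray.init(address="auto") from inside the KubeRay cluster

@ray.remote(num_gpus=1)
class GpuActor:
    def ping(self):
        # Allocate a tensor on the GPU so the actor process actually touches the device.
        x = torch.ones((1024, 1024), device="cuda")
        return float(x.sum())

actor = GpuActor.remote()
print(ray.get(actor.ping.remote()))
# Then open the Ray dashboard and check the GPU/GRAM columns in the "Actors" tab.
```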

Issue Severity

Medium: It is a significant difficulty but I can work around it.

aviadshimoni added the bug and triage labels on Oct 29, 2024
@aviadshimoni
Author

aviadshimoni changed the title from "Ray Dashboard: GPU stats per actor is empty" to "[BUG] Ray Dashboard: GPU stats per actor is empty" on Oct 29, 2024
jcotant1 added the dashboard label on Oct 29, 2024
@marwan116
Contributor

This should now be resolved in Ray 2.35.

@aviadshimoni
Author

@marwan116 any chance you could link the PR that fixes this, so I can learn from it?

@marwan116
Contributor

See this PR:
#46719

Please close out this issue once you have verified that upgrading to 2.35 or beyond resolves it.

@dc914337

Hi, we updated to 2.35 but the issue is still there. Could it be that you meant a newer version? 2.34 wasn't the latest version at the time.
Could you please confirm which version this fix is scheduled for?

@marwan116
Contributor

@aviadshimoni does upgrading to 2.35 resolve your issue? It would help with triaging @dc914337's report.

@dc914337

@marwan116 sorry for the confusion. Aviad and I are working together on this.
I updated Ray in my project, but it didn't resolve the issue.

@alanwguo
Contributor

The GRAM stats require that the actor process actually attaches to the GPU. This can be done when running an actual workload that uses the GPU.

It's not enough for an actor to require a GPU as a resource.
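
A minimal sketch of that distinction (class names and the PyTorch workload are illustrative, not from this thread):

```python
import ray
import torch  # assumption: a CUDA-enabled PyTorch is available in the container

@ray.remote(num_gpus=1)
class IdleActor:
    # Requests a GPU from the scheduler but never touches CUDA, so NVML sees
    # no process on the device and the dashboard shows empty GPU/GRAM columns.
    def ping(self):
        return "ok"

@ray.remote(num_gpus=1)
class BusyActor:
    def __init__(self):
        # Allocating a tensor on the device creates a CUDA context, so the
        # actor's PID shows up in NVML and the dashboard can attribute
        # GPU/GRAM usage to this actor.
        self.buf = torch.zeros((4096, 4096), device="cuda")

    def ping(self):
        return "ok"
```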

Just checking: are you running an actual GPU workload in your manual testing?

@dc914337

Yes, @alanwguo, we are running this model on hundreds of actual GPUs; in fact, it can't run without a GPU.
I also confirmed that, after manually connecting to the pod, NVML can read the information from the GPU.
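
A sketch of that manual check, assuming pynvml (nvidia-ml-py) is installed in the pod:

```python
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
        print(f"GPU {i}: {mem.used / 1024**2:.0f} MiB used of {mem.total / 1024**2:.0f} MiB")
        for p in procs:
            # The dashboard can only attribute GRAM to an actor whose PID appears here.
            print(f"  pid={p.pid} usedGpuMemory={p.usedGpuMemory}")
finally:
    pynvml.nvmlShutdown()
```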
