[BUG] Ray Dashboard: GPU stats per actor is empty #48312
Comments
This should now be resolved in Ray 2.35.
@marwan116 any chance you could link the PR that fixed this so I can learn from it?
See this PR here: Please close out this issue after you have verified that upgrading to 2.35 or later resolves it.
Hi, we updated to 2.35 but the issue is still there. Could it be that you meant some newer version? 2.34 wasn't the latest version at the time.
@aviadshimoni does upgrading to 2.35 resolve your issue? It would help triage with @dc914337.
@marwan116 sorry for the confusion. Aviad and I are working together on this.
The GRAM stats require that the actor process actually attaches to the GPU, which happens when it runs a workload that uses the GPU. It is not enough for an actor to request a GPU as a resource. Just checking: are you running an actual GPU workload in your manual testing?
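To check, independently of the dashboard, whether an actor's process is attached to a GPU, one option is to list the compute processes that `nvidia-smi` reports on the node. The sketch below is only an illustration and is not how the Ray dashboard collects its stats. The helper names `parse_compute_apps` and `gpu_memory_by_pid` are hypothetical, and it assumes `nvidia-smi` is on the `PATH` of the node being inspected.

```python
import subprocess


def parse_compute_apps(csv_text):
    """Map PID -> total GPU memory in MiB from nvidia-smi CSV output.

    Expects the output of --format=csv,noheader,nounits with the fields
    pid,used_memory. A PID that appears on several GPUs is summed.
    """
    usage = {}
    for line in csv_text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip blank or unexpected lines
        pid, mem = int(parts[0]), parts[1]
        # A non-numeric memory value (e.g. "[N/A]") is counted as 0 MiB,
        # but the PID is still recorded as attached to a GPU.
        usage[pid] = usage.get(pid, 0) + (int(mem) if mem.isdigit() else 0)
    return usage


def gpu_memory_by_pid():
    """Run nvidia-smi on the current node and return {pid: used MiB}."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-compute-apps=pid,used_memory",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_compute_apps(result.stdout)
```

Inside an actor method, `os.getpid()` gives the PID to look up in the returned mapping. If that PID is missing, the process has probably not created a GPU context yet. Under Kubernetes, PID namespaces may make the PIDs that `nvidia-smi` reports differ from those seen inside the container, so a mismatch there does not by itself prove the actor is idle.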
Yes, @alanwguo, we are running this model on hundreds of actual GPUs, and, in fact, they can't be run without a GPU.
What happened + What you expected to happen
Ray deployments don’t show GPU/GRAM stats per actor in the “Actors” section of the Ray dashboard, although the stats are shown in the Cluster tab (images attached).
https://ray.slack.com/files/U05D35JHGUV/F07TJAFKMTR/image.png?origin_team=TN4768NRM&origin_channel=CMVUQ1KMX
Is this intended? Or is there an issue aggregating GPU stats per actor?
Versions / Dependencies
KubeRay: 1.1.1
Ray Version: 2.34.0
CRD: v1
Docker image: rayproject/ray:2.34.0-py310-gpu
Reproduction script
Deploy any Ray service, access its dashboard, and open the 'Actors' tab in the top navigation bar.
Issue Severity
Medium: It is a significant difficulty but I can work around it.