-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add hardware stats to train_head #46719
Add hardware stats to train_head #46719
Conversation
Signed-off-by: Alan Guo <aguo@anyscale.com>
Signed-off-by: Alan Guo <aguo@anyscale.com>
Signed-off-by: Alan Guo <aguo@anyscale.com>
nice! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!! Left some comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The train_head
and schema
part looks good to me!
|
||
|
||
@DeveloperAPI | ||
class ProcessStats(BaseModel): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this schema based on an existing one? See questions below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, this is the existing gpus schema for nodes and actors. I agree it's ugly...
I think with export API, we have a chance to really clean this up for the future but let's not bundle it in with the train dashboard changes.
|
||
@DeveloperAPI | ||
class ProcessStats(BaseModel): | ||
cpuPercent: float |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this be 0 to 1 or 0 to 100?
# total memory, free memory, memory used ratio | ||
mem: Optional[List[int]] | ||
memoryInfo: MemoryInfo |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is mem
a list and not individual fields? Why are these not part of memoryInfo
?
utilizationGpu: Optional[float] | ||
memoryUsed: float | ||
memoryTotal: float |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should these be in ProcessGPUUSage
?
class ProcessGPUUsage(BaseModel): | ||
# This gpu usage stats from a process | ||
pid: int | ||
gpuMemoryUsage: int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this give the usage specific to the pid
? In that case should GPUStats
actually take a list of processInfo
s?
Why are these changes needed?
CPU utilization and GPU utilization is useful to see next to Train Workers.
Fixes actor GPU utilization not working due to bug introduced in #41399
Also refactor to use DataOrganizer to re-use more code with rest of dashboard.
Related issue number
fixes bug introduced in #41399
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.