Add hardware stats to train_head #46719

Merged

Contributor

@alanwguo alanwguo commented Jul 20, 2024

Why are these changes needed?

CPU and GPU utilization are useful to see next to Train Workers.
Fixes actor GPU utilization, which stopped working due to a bug introduced in #41399.

Also refactors the code to use DataOrganizer so that more code is shared with the rest of the dashboard.

Related issue number

Fixes a bug introduced in #41399.

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Alan Guo <aguo@anyscale.com>
@alanwguo alanwguo added the "go" label (add ONLY when ready to merge, run all tests) on Jul 24, 2024
@scottsun94
Contributor

nice!

Member

@woshiyyya woshiyyya left a comment

Thank you!! Left some comments

python/ray/train/_internal/state/schema.py (resolved)
python/ray/dashboard/modules/train/train_head.py (outdated, resolved)
python/ray/train/_internal/state/schema.py (resolved)
Member

@woshiyyya woshiyyya left a comment

The train_head and schema part looks good to me!



@DeveloperAPI
class ProcessStats(BaseModel):
Contributor

Is this schema based on an existing one? See questions below

Contributor Author

yeah, this is the existing gpus schema for nodes and actors. I agree it's ugly...

I think with export API, we have a chance to really clean this up for the future but let's not bundle it in with the train dashboard changes.


@DeveloperAPI
class ProcessStats(BaseModel):
    cpuPercent: float
Contributor

Will this be 0 to 1 or 0 to 100?
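
The thread does not record an answer to this question. As a minimal sketch of one way to make the unit explicit for consumers, the helper below converts a percentage into a fraction. The name `percent_to_fraction` is hypothetical, and the assumption that `cpuPercent` is reported on a 0 to 100 scale is not confirmed anywhere in this PR.

```python
def percent_to_fraction(cpu_percent: float) -> float:
    """Convert a percentage (ASSUMED to be on a 0-100 scale) to a fraction.

    Hypothetical helper, not part of the PR. Values above 100 are kept
    as-is, because some tools report per-process CPU usage above 100
    on multi-core machines.
    """
    if cpu_percent < 0:
        raise ValueError("cpu_percent must be non-negative")
    return cpu_percent / 100.0


# Usage: a reading of 50.0 percent becomes the fraction 0.5.
half_core = percent_to_fraction(50.0)
```

Documenting the scale next to the field, or converting at one well-defined boundary like this, would keep the dashboard and the schema from disagreeing about the unit.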

Comment on lines +61 to +63
    # total memory, free memory, memory used ratio
    mem: Optional[List[int]]
    memoryInfo: MemoryInfo
Contributor

Why is mem a list and not individual fields? Why are these not part of memoryInfo?
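
To illustrate what the reviewer is asking for, here is a rough sketch of unpacking the positional `mem` list into named fields. `MemSummary` and `mem_from_list` are hypothetical names, a standard-library dataclass stands in for the Pydantic `BaseModel` used in the diff, and the element order (total, free, used ratio) is taken only from the comment in the hunk above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MemSummary:
    """Hypothetical named-field view of the positional `mem` list."""

    total: int
    free: int
    used_ratio: float  # the diff's comment calls this "memory used ratio"


def mem_from_list(mem: Optional[List[float]]) -> Optional[MemSummary]:
    """Unpack [total, free, used ratio]; order assumed from the diff comment."""
    if mem is None:
        return None
    if len(mem) != 3:
        raise ValueError(f"expected 3 entries, got {len(mem)}")
    total, free, used_ratio = mem
    return MemSummary(total=int(total), free=int(free), used_ratio=float(used_ratio))


# Usage: callers read summary.free instead of remembering that index 1 is "free".
summary = mem_from_list([100, 40, 60])
```

Named fields make the meaning of each value visible at the call site, which is the concern behind the question; the thread itself notes that the list form mirrors an existing schema.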

Comment on lines +77 to +79
    utilizationGpu: Optional[float]
    memoryUsed: float
    memoryTotal: float
Contributor

Should these be in ProcessGPUUsage?

Comment on lines +66 to +69
class ProcessGPUUsage(BaseModel):
    # GPU usage stats for a single process
    pid: int
    gpuMemoryUsage: int
Contributor

Would this give the usage specific to the pid? In that case should GPUStats actually take a list of processInfos?
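
The shape the reviewer suggests could look roughly like the sketch below. It is not the merged schema: standard-library dataclasses stand in for Pydantic's `BaseModel`, the `processInfos` field name and the `gpu_memory_for_pid` helper are hypothetical, and placing `utilizationGpu`, `memoryUsed`, and `memoryTotal` on `GPUStats` is an assumption based on the hunks quoted in this review.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProcessGPUUsage:
    pid: int
    gpuMemoryUsage: int


@dataclass
class GPUStats:
    utilizationGpu: Optional[float]
    memoryUsed: float
    memoryTotal: float
    # Hypothetical field: one entry per process using this GPU.
    processInfos: List[ProcessGPUUsage] = field(default_factory=list)


def gpu_memory_for_pid(gpus: List[GPUStats], pid: int) -> int:
    """Sum gpuMemoryUsage over every GPU for the given pid."""
    return sum(
        proc.gpuMemoryUsage
        for gpu in gpus
        for proc in gpu.processInfos
        if proc.pid == pid
    )


# Usage: a worker with pid 101 that holds memory on two GPUs.
gpus = [
    GPUStats(50.0, 2048.0, 8192.0, [ProcessGPUUsage(101, 1024), ProcessGPUUsage(202, 512)]),
    GPUStats(None, 1024.0, 8192.0, [ProcessGPUUsage(101, 256)]),
]
worker_memory = gpu_memory_for_pid(gpus, 101)
```

With a list per GPU, device-level numbers stay on `GPUStats` while per-process numbers stay on `ProcessGPUUsage`, which would make it clear which values are specific to a pid.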

@matthewdeng matthewdeng merged commit e1e7558 into ray-project:master Jul 26, 2024
5 checks passed
Labels
go (add ONLY when ready to merge, run all tests)
4 participants