Power consumption becomes a problem when we run LLMs in data centers and on Kubernetes.
Referring to the Cloud Native AI white paper, differences across the technology stack make the case more complex.
For example: different GPU devices, different deployment architectures, TEEs from a security point of view, etc.
Recently the Kepler community completed a POC for setting up Tekton on a clean bare-metal (BM) instance on AWS, and there have been other discussions around validating Kepler with a pipeline.
An interesting question is how Kepler can validate itself between the current testing version and the latest stable version.
If that works, it means that with a stable version of Kepler and a pipeline (GitHub Actions, Tekton, etc.) we can build a pattern for measuring power consumption for any project via a CI/CD pipeline.
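Such a pipeline could look roughly like the following GitHub Actions sketch. This is an assumption-laden illustration, not the community's agreed design: the workload manifest, step names, and sleep duration are hypothetical, and it assumes the Kepler Helm chart and kind action are available as shown.

```yaml
# Hypothetical sketch: measure a workload's power via Kepler on a kind cluster.
name: power-measurement
on: [workflow_dispatch]
jobs:
  measure:
    runs-on: ubuntu-latest  # a self-hosted bare-metal runner gives real RAPL readings
    steps:
      - uses: actions/checkout@v4
      - name: Create kind cluster
        uses: helm/kind-action@v1
      - name: Install Kepler
        run: |
          helm repo add kepler https://sustainable-computing-io.github.io/kepler-helm-chart
          helm install kepler kepler/kepler -n kepler --create-namespace
      - name: Run the target workload
        run: kubectl apply -f workload.yaml   # hypothetical workload manifest
      - name: Collect power metrics
        run: |
          kubectl port-forward -n kepler svc/kepler 9102:9102 &
          sleep 60   # let the workload run and metrics accumulate
          curl -s localhost:9102/metrics | grep kepler_container_joules_total
```

Swapping the last two steps for another project's workload and validation logic is what would turn this into the reusable pattern discussed below.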
Outcome
Concept level: a pattern for any project on Kubernetes to measure power consumption via a group of cloud native tools such as Tekton, kind, and Kepler.
Implementation level: the pattern should be implemented flexibly enough to cover different cases, with pluggable parts and a sample code repo to share and reuse as a GitHub Action or via other approaches:
self-owned GitHub runner.
different architectures.
different OSes.
BM/VM.
etc.
Delivery level: a blog post and events to share this pattern.
Ownership level: from the Kepler community, share it with the TAG as common/generic infra?
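The pluggable dimensions above (runner ownership, architecture, OS, BM vs. VM) map naturally onto a CI matrix. A minimal sketch, assuming GitHub Actions is the chosen pipeline (the label values are illustrative, not a decided list):

```yaml
# Hypothetical matrix covering the pluggable dimensions listed above.
jobs:
  measure:
    strategy:
      matrix:
        runner: [ubuntu-latest, self-hosted]   # hosted VM vs. self-owned bare metal
        arch: [amd64, arm64]
    runs-on: ${{ matrix.runner }}
    steps:
      - uses: actions/checkout@v4
      # ... install Kepler and run the workload for this runner/arch combination
```

A Tekton implementation would express the same matrix as parameterized PipelineRuns instead.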
To-Do
The Kepler community completes validating Kepler itself this year.
Refine the Kepler model server's Tekton logic to decouple the workload phase from model server training, so the workload can be reused on its own.
Based on that workload, build a pipeline to validate Kepler between versions.
Find another project to replace the workload and validation parts, as an example of the pattern.
Note: since Kepler's model splits power into idle and dynamic components, a workload is needed for the target project in order to capture both idle and dynamic power changes?
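The idle/dynamic split in the note, and the between-versions check, can be sketched as simple arithmetic: measure an idle baseline first, subtract it from the reading taken while the target workload runs, and compare readings from two Kepler versions within a tolerance. The numbers and the 10% tolerance below are illustrative assumptions, not Kepler's actual API or thresholds.

```python
def dynamic_power(total_watts: float, idle_watts: float) -> float:
    """Dynamic power attributed to the workload: total minus the idle baseline."""
    return max(total_watts - idle_watts, 0.0)

def within_tolerance(a: float, b: float, rel_tol: float = 0.1) -> bool:
    """Do two versions' readings agree within a relative tolerance (assumed 10%)?"""
    return abs(a - b) <= rel_tol * max(abs(a), abs(b), 1e-9)

# Illustrative numbers, not real measurements:
idle = 35.0        # watts measured with no workload running
under_load = 52.5  # watts measured while the target project's workload runs
print(dynamic_power(under_load, idle))  # 17.5
```

A version-validation pipeline would compute `dynamic_power` for the current testing build and the latest stable build on the same workload, then fail the run if `within_tolerance` returns False.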
leonardpahlke changed the title to "[Action] Practice to measure power consumption for a project which CI/CD" on Apr 26, 2024.
cc: @rootfs, @sunya-ch, @marceloamaral, please help correct any mistakes, or we can fix them later on.
Comments
This may take years to complete; maybe we can break down the tasks and work on them in parallel.
Some previous discussion is in sustainable-computing-io/kepler-model-server#212.
See the example https://github.com/sustainable-computing-io/aws_ec2_self_hosted_runner/blob/main/.github/workflows/ci_integration.yml#L35-L73 for setting up Tekton on a newly created EC2 instance.