
[CI][brainstorming] make model training as a github action based on tekton #212

Open
SamYuan1990 opened this issue Dec 25, 2023 · 19 comments


@SamYuan1990 (Collaborator)

As a brainstorming idea: if we package model training as a GitHub Action based on Tekton, could we benefit from others contributing their training results to us? Each contributor could run the GitHub Action on their own self-hosted GitHub runner, targeting their own k8s cluster with Tekton installed.

```yaml
- name: Deploy Tasks and Pipelines
  working-directory: model_training/tekton
  run: |
    kubectl apply -f tasks
    kubectl apply -f tasks/s3-pusher
    kubectl apply -f pipelines
- name: Run Tekton Pipeline
  run: |
    cat <<EOF | kubectl apply -f -
    apiVersion: tekton.dev/v1
    kind: PipelineRun
    metadata:
      name: self-hosted-aws
    spec:
      timeouts:
        pipeline: "6h"
        tasks: "5h50m"
      workspaces:
        - name: mnt
          persistentVolumeClaim:
            claimName: task-pvc
      params:
        - name: PIPELINE_NAME
          value: std_v${VERSION}
        - name: OUTPUT_TYPE
          value: AbsPower
        - name: COS_PROVIDER
          value: aws
        - name: COS_SECRET_NAME
          value: aws-cos-secret
        - name: MACHINE_ID
          value: ${{ inputs.instance_type }}-${{ inputs.ami_id }}
      pipelineRef:
        name: single-train-pipeline
    EOF
    ./hack/k8s_helper.sh wait_for_pipelinerun self-hosted-aws
    df -h
```

@sunya-ch (Contributor) commented Jan 16, 2024

We might prepare another GitHub workflow, triggered on a specific branch name, for pushing a PR with the result from their COS to kepler-model-db.

The steps I have in mind are:

  1. The contributor sets their AWS COS secret on their branch.
  2. When train-model-self-hosted or train is called, the updated model is kept in their COS.
  3. If the branch name contains a keyword such as pr-to-kepler-model-db, a to-be-created step (e.g. pr-to-kepler-model-db) is applied after the model is updated on the COS. This step runs a script to pull the latest image from kepler-model-db, read the model from the COS, and run the export command.

@SamYuan1990 Do you want to work on this?

Note:

  • currently only COS on AWS is available, but we can improve this later
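The branch-keyword flow described above could be sketched roughly like this. This is only a hypothetical workflow fragment: the branch pattern, secret names, and the final export command are assumptions, not the agreed design.

```yaml
# Hypothetical sketch: run the export step only on branches whose name
# contains the keyword "pr-to-kepler-model-db".
name: train-and-export
on:
  push:
    branches:
      - '**pr-to-kepler-model-db**'
jobs:
  train:
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v4
      # ... train-model-self-hosted steps; the updated model lands in the
      # contributor's COS via their AWS secrets ...
  pr-to-kepler-model-db:
    needs: train
    runs-on: ubuntu-latest
    steps:
      - name: Export model from COS to kepler-model-db
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.aws_access_key_id }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.aws_secret_access_key }}
        run: |
          # pull the latest image from kepler-model-db, read the model on
          # COS, and run the export command (exact command TBD)
          echo "export step placeholder"
```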

@sunya-ch sunya-ch added this to the kepler-release-0.7 milestone Jan 17, 2024
@SamYuan1990 (Collaborator, Author)

Let's keep collecting requirements and ideas in this ticket.
I will update my ideas and break down my plan later.

@SamYuan1990 SamYuan1990 self-assigned this Jan 19, 2024
@SamYuan1990 (Collaborator, Author)

[plan diagram image]

Here is my plan. @rootfs , @sunya-ch , @marceloamaral
At a high level, I see three topics:

  1. Greening CI/CD: use Kepler to green the CI/CD pipeline for Kepler itself.
  2. Our test cases on BM/VM.
  3. Tekton-based training.

I am open to implementing these with Tekton.

All three topics build on our current deployment stack, which also applies to a self-hosted instance. (@jiere here)

Note: prometheus/otel + kepler + model server can be deployed with any kind of deployment tooling: helm, operator, or manifest files.

Hence, to achieve this, we need to build new CI tooling and enhance our current tooling:

  1. https://github.com/sustainable-computing-io/aws_ec2_self_hosted_runner to provide us with BM on AWS.
  2. local-dev-cluster to provide and set up the k8s cluster.
  3. kepler-action as a GitHub action running on BM to set up k8s.
  4. A new GitHub action based on Tekton to trigger model server training.

Let's start with Tekton-based training.
The training result is one or more model files, which we could upload to GitHub/Open Data Hub or, for self-hosted setups, to a private artifactory owned by the user; this is open for discussion.

About tests and verification:
I suppose we can reuse the kepler model server's training process as the traffic load on the k8s cluster, either purely for verification or together with some new test cases. IMO, we can't verify Kepler without some workload, so the workload from the training process can be reused.

Third, a green pipeline.
Our community previously wanted to build a green pipeline on top of Kepler, which raises an interesting question:

Can we make Kepler an example of greening the CI/CD pipeline for itself?

We can treat Kepler as a workload (a running job) for the green CI/CD pipeline. Put another way, running Kepler's benchmark tests is part of the workload, just like a traffic load running on k8s; what is special is that the workload comes from Kepler itself. :-)

@sunya-ch (Contributor) commented Jan 23, 2024

Thank you for starting this planning.

There seem to be many points to discuss, but let me first start with the requirements for power modeling.

CICD Test cases for each environment

(A) Test case for BM

0. setup environment

Agree to what you planned:

Hence to achieve that, we need to build new and enhancement with current our CI toolings.

https://github.com/sustainable-computing-io/aws_ec2_self_hosted_runner to provide us with BM on AWS.
local-dev-cluster to provide and set up k8s cluster.
kepler action as a github action running on BM to set up k8s.
A new github action based on Tekton to trigger model server training.

Currently, I reuse the code from local-dev-cluster to create a cluster, with some modification of the kind configuration; I refer to the Kepler deployment from the main repo and customize it to patch the model server.
However, it would be nice if we could upstream the modification to local-dev-cluster and use kepler-operator with the KeplerInternal CR to deploy the model server components.

```yaml
- name: Prepare Cluster
  working-directory: model_training
  run: |
    ./script.sh cluster_up
    cp $HOME/bin/kubectl /usr/local/bin/kubectl
    kubectl get po -A
- name: Install Kepler
  working-directory: model_training
  run: |
    ./script.sh deploy_kepler
    ./script.sh deploy_prom_dependency
    kubectl logs $(kubectl get pods -oname -nkepler) -n kepler | grep "obtain power"
- name: Install Tekton
  run: |
    kubectl apply --filename https://storage.googleapis.com/tekton-releases/pipeline/latest/release.yaml
    ./hack/k8s_helper.sh rollout_ns_status tekton-pipelines
    ./hack/k8s_helper.sh rollout_ns_status tekton-pipelines-resolvers
- name: Prepare PVC
  working-directory: model_training/tekton
  run: |
    kubectl apply -f pvc/hostpath.yaml
- name: Deploy S3 Secret
  run: |
    cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: Secret
    metadata:
      name: aws-cos-secret
    type: Opaque
    stringData:
      accessKeyID: ${{ secrets.aws_access_key_id }}
      accessSecret: ${{ secrets.aws_secret_access_key }}
      regionName: ${{ secrets.aws_region }}
      bucketName: kepler-power-model
    EOF
```

1. verify feature inputs from Kepler (input)

  • verify that all utilization/power metrics have values
  • verify that the utilization value is correct. With a task that runs stress-ng, we can estimate the expected value: utilization should accumulate roughly 1 CPU-second per core per second, so stressing 32 cores for 3 seconds should report ~96 CPU-seconds.
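A minimal sketch of such a check (the Job name and the stress-ng image are placeholders, not part of the actual pipeline): saturate every core for a fixed duration, then compare Kepler's reported CPU time for the pod against cores × duration.

```yaml
# Hypothetical verification Job: stress all cores for 3 seconds.
# Kepler's reported CPU time for this pod should then be close to
# (number of online cores) x 3 seconds, e.g. 32 cores -> ~96 s.
apiVersion: batch/v1
kind: Job
metadata:
  name: stress-utilization-check
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: stress
          image: my-stress-ng:latest  # placeholder image containing stress-ng
          # --cpu 0 spawns one worker per online CPU, driving all cores to ~100%
          command: ["stress-ng", "--cpu", "0", "--timeout", "3s"]
```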

2. verify model training process (process)

  • verify that it can run without error and produces the model

3. verify trained model results (output)

  • verify that the error between the measured and predicted power is below a threshold
  • verify that the trained model can be applied by the power estimator
  • verify the accuracy of the value exported by Kepler, when using the model, against the measured value
    • here we need to update Kepler with a mechanism to disable the measurement even when a power meter is present.

(B) Test case for VM

1. verify feature inputs from Kepler (input)

  • verify that the metrics expected to be available on a VM have values
  • verify that the utilization value is correct

2. verify estimator (output)

  • verify the accuracy of the value exported by Kepler against the power of a similar machine, when using the local estimator
  • verify the accuracy of the value exported by Kepler against the power of a similar machine, when using the sidecar estimator

Integration

trained model delivery

Now, we have CI to push the model to the Kepler project's AWS S3 after training:

```yaml
finally:
  - name: ibmcloud-s3-push
    when:
      - input: "$(params.COS_PROVIDER)"
        operator: in
        values: ["ibmcloud"]
      - input: "$(params.COS_SECRET_NAME)"
        operator: notin
        values: [""]
    workspaces:
      - name: mnt
    params:
      - name: COS_SECRET_NAME
        value: $(params.COS_SECRET_NAME)
      - name: MACHINE_ID
        value: $(params.MACHINE_ID)
    taskRef:
      name: ibmcloud-s3-push
  - name: aws-s3-push
    when:
      - input: "$(params.COS_PROVIDER)"
        operator: in
        values: ["aws"]
      - input: "$(params.COS_SECRET_NAME)"
        operator: notin
        values: [""]
    workspaces:
      - name: mnt
    params:
      - name: COS_SECRET_NAME
        value: $(params.COS_SECRET_NAME)
      - name: MACHINE_ID
        value: $(params.MACHINE_ID)
    taskRef:
      name: aws-s3-push
```

@sunya-ch (Contributor) commented Jan 24, 2024

We also have to think about a CI pipeline for notifying changes that require corresponding changes and support in the other repos.

For example,

kepler changes metrics (name, labels, values) --> notify kepler-model-server
kepler-model-server changes model --> notify kepler-model-db to update the model
kepler-model-db updates --> notify kepler to sync
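One possible mechanism for such cross-repo notifications is GitHub's repository_dispatch API. This is purely a sketch; the event type and the PAT secret name are assumptions.

```yaml
# In the kepler repo: notify kepler-model-server when metrics change.
- name: Notify kepler-model-server
  run: |
    curl -X POST \
      -H "Authorization: token ${{ secrets.NOTIFY_PAT }}" \
      -H "Accept: application/vnd.github+json" \
      https://api.github.com/repos/sustainable-computing-io/kepler-model-server/dispatches \
      -d '{"event_type":"kepler-metrics-changed"}'

# In kepler-model-server: react to the notification with a workflow
# triggered by:
#   on:
#     repository_dispatch:
#       types: [kepler-metrics-changed]
```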

FYI, a simplified communication diagram between the three repos:
[communication diagram image]

It will be added to the README page by #223.

@sunya-ch (Contributor) commented Jan 26, 2024

Here is my current refactoring design.
Now, most components are done except push-pr-to-db. Still, much help is needed.

[ci-plan diagram image]

@SamYuan1990 (Collaborator, Author) commented Jan 29, 2024

@sunya-ch , are your latest comments only about kepler and kepler-model-server? Could you please add other projects, such as peaks, into consideration? I am interested in what it will look like when we add peaks into consideration, and how many components we can reuse.

@sunya-ch (Contributor)

I think we also need people from the peaks project to list their requirements.

We can prepare an action that reuses the integration test, with the kepler image, model_server image, and deployment choice as inputs. There are multiple ways to install: 1. by operator, 2. by manifests, 3. by helm-chart. We may need to prepare all of them for the integration test.

(1) should be included in the operator schedule/push on the related repo; (3) should be in the helm-chart push on the related repo; (2) should run on the kepler and kepler_model_server repos when either has a push to main.
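A reusable workflow could cover the three deployment choices with a matrix. This is a hypothetical fragment: the deploy/test helper scripts named below do not exist yet and are assumptions.

```yaml
# Hypothetical reusable integration-test workflow: the same test body runs
# once per deployment method, with images passed in as inputs.
on:
  workflow_call:
    inputs:
      kepler_image:
        type: string
      model_server_image:
        type: string
jobs:
  integration:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        deploy: [operator, manifests, helm-chart]
    steps:
      - uses: actions/checkout@v4
      - name: Deploy via ${{ matrix.deploy }}
        # hypothetical helper that installs kepler + model server
        run: ./hack/deploy.sh ${{ matrix.deploy }} ${{ inputs.kepler_image }} ${{ inputs.model_server_image }}
      - name: Run integration test
        run: ./hack/integration_test.sh  # hypothetical helper
```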

@SamYuan1990 (Collaborator, Author) commented Feb 14, 2024

Some TODO items after reviewing the Kepler CI fix at sustainable-computing-io/kepler#1239:

  • make the validation image and validation binary part of the release.
  • make the SBOM file part of the release, with the RPM as an attachment.
  • remove the latest tag from releases, as the latest tag is used by the daily job.
  • make the validation binary part of the RPM? @rootfs, wdyt?
  • make the validation image part of the daily job.
  • a clear document for how to deploy and use the validation image. @jiere
  • make the validation image build from the kepler builder.
  • a default image build process, with latest as the tag value for images, that accepts parameters for release, daily, PR build, and validation usage, to reduce copy-pasting.

@SamYuan1990 (Collaborator, Author)

Some ideas for the self-hosted instance repo. IMO, the suggestions below aim at using an Ansible playbook to set up a k8s cluster among the 3 EC2 instances created by the self-hosted instance GHA.

  • set up Ansible from the GHA instance to the EC2 instances created by the self-hosted instance GHA.
  • test network access between the different EC2 instances created by the self-hosted instance GHA.
  • run the Ansible playbook from the GHA instance against the EC2 instances to set up the k8s cluster.
  • export/import the kubectl secret in tmp for CI. (any security question marks here?)
  • run kubectl from the GHA instance to check the cluster status after importing the cluster secret.

Is there any GHA to set up a k8s cluster via Ansible or other CI tools we can reuse, or OCP/container-ready? @rootfs wdyt
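The steps above might be sketched as workflow steps like the following. The inventory path, playbook name, and kubeconfig location are assumptions for illustration only.

```yaml
# Hypothetical fragment: drive Ansible from the GHA runner against the
# three EC2 instances created by the self-hosted instance GHA.
- name: Install Ansible
  run: pip install ansible
- name: Check network access to the EC2 instances
  run: ansible all -i inventory.ini -m ping  # inventory.ini is hypothetical
- name: Set up k8s cluster
  run: ansible-playbook -i inventory.ini k8s-cluster.yaml  # hypothetical playbook
- name: Import kubeconfig and check cluster status
  run: |
    # assumes the playbook exported the kubeconfig to /tmp; note the
    # security question raised above about keeping this secret in tmp
    export KUBECONFIG=/tmp/kubeconfig
    kubectl get nodes
```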

@SamYuan1990 (Collaborator, Author)

Extend local-dev-cluster with the Prometheus operator and Tekton, targeting a specific k8s cluster and decoupled from the kind cluster, so that the Tekton setup can support kepler-model-server.

@SamYuan1990 (Collaborator, Author)

@rootfs , @jiere , @sunya-ch wdyt if we have a repo for kepler validation and kepler model server validation? The new repo would contain:

  • scripts to add traffic to a k8s cluster via Tekton, for either model server training or validation workloads.
  • test cases for validation.

so that

  1. the release of the repo can be used for Kepler's model training and validation on a specific instance.
  2. the release of the repo can be used for investigations for peaks and clever. +@husky-parul , @wangchen615 IMO, when we investigate peaks or clever, we need something (a script?) to build a benchmark, and that benchmark may be an implementation of the cloud native sustainable computing benchmark white paper, as part of [Action to craft a Project Proposal] Benchmarking Whitepaper cncf/tag-env-sustainability#327.

@SamYuan1990 (Collaborator, Author)

@sunya-ch , @rootfs , @marceloamaral can we use https://github.com/medyagh/setup-minikube to set up minikube for the kepler model server training or kepler validation process, instead of kind (k8s in Docker)? wdyt?
If yes, are any volume mount settings needed?
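For reference, a rough sketch of what that might look like. The action inputs shown (driver, cpus, memory) and the mount path are assumptions that would need checking against setup-minikube's documentation.

```yaml
# Hypothetical fragment using medyagh/setup-minikube instead of kind.
- uses: medyagh/setup-minikube@latest
  with:
    driver: docker
    cpus: 4
    memory: 8g
- name: Mount host data into the cluster
  run: |
    # one option for the volume question: a background `minikube mount`
    # exposing local training data inside the minikube VM/container
    nohup minikube mount $PWD/data:/mnt/data &
- name: Verify cluster
  run: kubectl get po -A
```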

@SamYuan1990 (Collaborator, Author) commented Mar 5, 2024

@rootfs , @marceloamaral let's sync up on the https://github.com/kubevirt/kubevirt solution for validation here.
My question is: for the model server we use the CPE framework as the workload... what kind of workload are we going to use for validation?

@SamYuan1990 (Collaborator, Author)

Once sustainable-computing-io/kepler-action#108 is merged, we will try to use the latest kepler-action to integrate with kepler-model-server.

@sunya-ch (Contributor)

> @rootfs , @marceloamaral let's sync up on the https://github.com/kubevirt/kubevirt solution for validation here.
> My question is: for the model server we use the CPE framework as the workload... what kind of workload are we going to use for validation?

@SamYuan1990 Now CPE is obsolete; we use a Tekton task/pipeline to run the stress-ng workload and then collect the data. The stress workload includes stressing the CPU up to 100% on all cores.
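In outline, the stress step could look like the Tekton Task below. This is a sketch only: the task name, image, and parameter are placeholders, not the actual tasks in model_training/tekton/tasks.

```yaml
# Hypothetical Tekton Task that drives all cores to ~100% with stress-ng.
apiVersion: tekton.dev/v1
kind: Task
metadata:
  name: stress-workload
spec:
  params:
    - name: TIMEOUT
      default: "60s"
  steps:
    - name: stress-all-cores
      image: my-stress-ng:latest  # placeholder image containing stress-ng
      script: |
        # --cpu 0 means one worker per online CPU, i.e. stress all cores
        stress-ng --cpu 0 --timeout $(params.TIMEOUT)
```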

@sthaha (Contributor) commented Apr 10, 2024

@SamYuan1990

Based on the discussion about validating the model, the setup we want to achieve for validation is as follows.

Single Bare Metal

  • Bare metal should run VM
  • both the bare metal host and the VM should have Kepler running

Kepler on Bare Metal

  • intel-rapl (power meter)
  • acpi

Kepler on VM

  • kepler (linear regression model weights only)
  • kepler + estimator + (model)
  • kepler + estimator + model-server

@sunya-ch (Contributor)

We should break the tasks in this issue down into separate issues to track the progress.
I created a project for power model validation here:
https://github.com/orgs/sustainable-computing-io/projects/6/views/1

@sunya-ch (Contributor)

We should decide whether to continue working on this process via Tekton or to utilize the metal-ci. This issue overlaps with sustainable-computing-io/kepler#1910.
