
[CI][brainstorming] make model training as a github action based on tekton #212

Open
SamYuan1990 opened this issue Dec 25, 2023 · 19 comments


@SamYuan1990 (Collaborator)

As a brainstorming idea: if we package model training as a GitHub Action based on Tekton, could we benefit from others contributing their training results to us? Each contributor could run the GitHub Action on their own self-hosted GitHub runner, targeting their own k8s cluster with Tekton installed.

```yaml
- name: Deploy Tasks and Pipelines
  working-directory: model_training/tekton
  run: |
    kubectl apply -f tasks
    kubectl apply -f tasks/s3-pusher
    kubectl apply -f pipelines
- name: Run Tekton Pipeline
  run: |
    cat <<EOF | kubectl apply -f -
    apiVersion: tekton.dev/v1
    kind: PipelineRun
    metadata:
      name: self-hosted-aws
    spec:
      timeouts:
        pipeline: "6h"
        tasks: "5h50m"
      workspaces:
        - name: mnt
          persistentVolumeClaim:
            claimName: task-pvc
      params:
        - name: PIPELINE_NAME
          value: std_v${VERSION}
        - name: OUTPUT_TYPE
          value: AbsPower
        - name: COS_PROVIDER
          value: aws
        - name: COS_SECRET_NAME
          value: aws-cos-secret
        - name: MACHINE_ID
          value: ${{ inputs.instance_type }}-${{ inputs.ami_id }}
      pipelineRef:
        name: single-train-pipeline
    EOF
    ./hack/k8s_helper.sh wait_for_pipelinerun self-hosted-aws
    df -h
```

@sunya-ch (Contributor) commented Jan 16, 2024

We might prepare another GitHub workflow, triggered on a specific branch name, for pushing a PR with the result from their COS to kepler-model-db.

The steps I have in mind are:

  1. The contributor sets their AWS COS secret on their branch.
  2. When train-model-self-hosted or train is called, the updated model is kept in their COS.
  3. If the branch name contains a keyword such as pr-to-kepler-model-db, a to-be-created step (e.g. pr-to-kepler-model-db) is applied after the model is updated on the COS. This step runs a script to pull the latest image from kepler-model-db, read the model from the COS, and run the export command.

@SamYuan1990 Do you want to work on this?

Note:

  • currently only COS on AWS is available, but we can improve this later
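The branch-keyword flow described above could be sketched roughly like this. This is only a hypothetical workflow fragment: the branch pattern, secret names, and the final export command are assumptions, not the agreed design.

```yaml
# Hypothetical sketch: run the export step only on branches whose name
# contains the keyword "pr-to-kepler-model-db".
name: train-and-export
on:
  push:
    branches:
      - '**pr-to-kepler-model-db**'
jobs:
  train:
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v4
      # ... train-model-self-hosted steps; the updated model lands in the
      # contributor's COS via their AWS secrets ...
  pr-to-kepler-model-db:
    needs: train
    runs-on: ubuntu-latest
    steps:
      - name: Export model from COS to kepler-model-db
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.aws_access_key_id }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.aws_secret_access_key }}
        run: |
          # pull the latest image from kepler-model-db, read the model on
          # COS, and run the export command (exact command TBD)
          echo "export step placeholder"
```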

@sunya-ch sunya-ch added this to the kepler-release-0.7 milestone Jan 17, 2024
@SamYuan1990 (Collaborator, Author)

Let's keep collecting requirements and ideas in this ticket.
I will update my ideas and break down my plan later.

@SamYuan1990 SamYuan1990 self-assigned this Jan 19, 2024
@SamYuan1990 (Collaborator, Author)

[plan diagram image]

Here is my plan. @rootfs , @sunya-ch , @marceloamaral
At a high level, I see three topics:

  1. Greening CI/CD: use Kepler to green the CI/CD pipeline for Kepler itself.
  2. Our test cases on BM/VM.
  3. Tekton-based training.

I am open to implementing these with Tekton.

All three topics build on our current deployment stack, which also applies to a self-hosted instance. (@jiere here)

Note: prometheus/otel + kepler + model server can be deployed with any kind of deployment tooling: helm, operator, or manifest files.

Hence, to achieve this, we need to build new CI tooling and enhance our current tooling:

  1. https://github.com/sustainable-computing-io/aws_ec2_self_hosted_runner to provide us with BM on AWS.
  2. local-dev-cluster to provide and set up the k8s cluster.
  3. kepler-action as a GitHub action running on BM to set up k8s.
  4. A new GitHub action based on Tekton to trigger model server training.

Let's start with Tekton-based training.
The training result is one or more model files, which we could upload to GitHub/Open Data Hub or, for self-hosted setups, to a private artifactory owned by the user; this is open for discussion.

About tests and verification:
I suppose we can reuse the kepler model server's training process as the traffic load on the k8s cluster, either purely for verification or together with some new test cases. IMO, we can't verify Kepler without some workload, so the workload from the training process can be reused.

Third, a green pipeline.
Our community previously wanted to build a green pipeline on top of Kepler, which raises an interesting question:

Can we make Kepler an example of greening the CI/CD pipeline for itself?

We can treat Kepler as a workload (a running job) for the green CI/CD pipeline. Put another way, running Kepler's benchmark tests is part of the workload, just like a traffic load running on k8s; what is special is that the workload comes from Kepler itself. :-)

@sunya-ch (Contributor) commented Jan 23, 2024

Thank you for starting this planning.

There seem to be many points to discuss, but let me first start with the requirements for power modeling.

CICD Test cases for each environment

(A) Test case for BM

0. setup environment

Agree to what you planned:

Hence to achieve that, we need to build new and enhancement with current our CI toolings.

https://github.com/sustainable-computing-io/aws_ec2_self_hosted_runner to provide us with BM on AWS.
local-dev-cluster to provide and set up k8s cluster.
kepler action as a github action running on BM to set up k8s.
A new github action based on Tekton to trigger model server training.

Currently, I reuse the code from local-dev-cluster to create a cluster, with some modification of the kind configuration; I refer to the Kepler deployment from the main repo and customize it to patch the model server.
However, it would be nice if we could upstream the modification to local-dev-cluster and use kepler-operator with the KeplerInternal CR to deploy the model server components.

```yaml
- name: Prepare Cluster
  working-directory: model_training
  run: |
    ./script.sh cluster_up
    cp $HOME/bin/kubectl /usr/local/bin/kubectl
    kubectl get po -A
- name: Install Kepler
  working-directory: model_training
  run: |
    ./script.sh deploy_kepler
    ./script.sh deploy_prom_dependency
    kubectl logs $(kubectl get pods -oname -nkepler) -n kepler | grep "obtain power"
- name: Install Tekton
  run: |
    kubectl apply --filename https://storage.googleapis.com/tekton-releases/pipeline/latest/release.yaml
    ./hack/k8s_helper.sh rollout_ns_status tekton-pipelines
    ./hack/k8s_helper.sh rollout_ns_status tekton-pipelines-resolvers
- name: Prepare PVC
  working-directory: model_training/tekton
  run: |
    kubectl apply -f pvc/hostpath.yaml
- name: Deploy S3 Secret
  run: |
    cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: Secret
    metadata:
      name: aws-cos-secret
    type: Opaque
    stringData:
      accessKeyID: ${{ secrets.aws_access_key_id }}
      accessSecret: ${{ secrets.aws_secret_access_key }}
      regionName: ${{ secrets.aws_region }}
      bucketName: kepler-power-model
    EOF
```

1. verify feature inputs from Kepler (input)

  • verify that all utilization/power metrics have values
  • verify that the utilization value is correct. With a task that runs stress-ng, we can estimate the expected value: utilization should accumulate roughly 1 CPU-second per core per second, so stressing 32 cores for 3 seconds should report ~96 CPU-seconds.
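A minimal sketch of such a check (the Job name and the stress-ng image are placeholders, not part of the actual pipeline): saturate every core for a fixed duration, then compare Kepler's reported CPU time for the pod against cores × duration.

```yaml
# Hypothetical verification Job: stress all cores for 3 seconds.
# Kepler's reported CPU time for this pod should then be close to
# (number of online cores) x 3 seconds, e.g. 32 cores -> ~96 s.
apiVersion: batch/v1
kind: Job
metadata:
  name: stress-utilization-check
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: stress
          image: my-stress-ng:latest  # placeholder image containing stress-ng
          # --cpu 0 spawns one worker per online CPU, driving all cores to ~100%
          command: ["stress-ng", "--cpu", "0", "--timeout", "3s"]
```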

2. verify model training process (process)

  • verify that it can run without error and produces the model

3. verify trained model results (output)

  • verify that the error between the measured and predicted power is below a threshold
  • verify that the trained model can be applied by the power estimator
  • verify the accuracy of the value exported by Kepler, when using the model, against the measured value
    • here we need to update Kepler with a mechanism to disable the measurement even when a power meter is present.

(B) Test case for VM

1. verify feature inputs from Kepler (input)

  • verify that the metrics expected to be available on a VM have values
  • verify that the utilization value is correct

2. verify estimator (output)

  • verify the accuracy of the value exported by Kepler against the power of a similar machine, when using the local estimator
  • verify the accuracy of the value exported by Kepler against the power of a similar machine, when using the sidecar estimator

Integration

trained model delivery

Now, we have CI to push the model to the Kepler project's AWS S3 after training:

```yaml
finally:
  - name: ibmcloud-s3-push
    when:
      - input: "$(params.COS_PROVIDER)"
        operator: in
        values: ["ibmcloud"]
      - input: "$(params.COS_SECRET_NAME)"
        operator: notin
        values: [""]
    workspaces:
      - name: mnt
    params:
      - name: COS_SECRET_NAME
        value: $(params.COS_SECRET_NAME)
      - name: MACHINE_ID
        value: $(params.MACHINE_ID)
    taskRef:
      name: ibmcloud-s3-push
  - name: aws-s3-push
    when:
      - input: "$(params.COS_PROVIDER)"
        operator: in
        values: ["aws"]
      - input: "$(params.COS_SECRET_NAME)"
        operator: notin
        values: [""]
    workspaces:
      - name: mnt
    params:
      - name: COS_SECRET_NAME
        value: $(params.COS_SECRET_NAME)
      - name: MACHINE_ID
        value: $(params.MACHINE_ID)
    taskRef:
      name: aws-s3-push
```

@sunya-ch (Contributor) commented Jan 24, 2024

We also have to think about a CI pipeline for notifying changes that require corresponding changes and support in the other repos.

For example,

kepler changes metrics (name, labels, values) --> notify kepler-model-server
kepler-model-server changes model --> notify kepler-model-db to update the model
kepler-model-db updates --> notify kepler to sync
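One possible mechanism for such cross-repo notifications is GitHub's repository_dispatch API. This is purely a sketch; the event type and the PAT secret name are assumptions.

```yaml
# In the kepler repo: notify kepler-model-server when metrics change.
- name: Notify kepler-model-server
  run: |
    curl -X POST \
      -H "Authorization: token ${{ secrets.NOTIFY_PAT }}" \
      -H "Accept: application/vnd.github+json" \
      https://api.github.com/repos/sustainable-computing-io/kepler-model-server/dispatches \
      -d '{"event_type":"kepler-metrics-changed"}'

# In kepler-model-server: react to the notification with a workflow
# triggered by:
#   on:
#     repository_dispatch:
#       types: [kepler-metrics-changed]
```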

FYI, a simplified communication diagram between the three repos:
[communication diagram image]

It will be added to the README page by #223.

@sunya-ch (Contributor) commented Jan 26, 2024

Here is my current refactoring design.
Now, most components are done except push-pr-to-db. Still, much help is needed.

[ci-plan diagram image]

@SamYuan1990 (Collaborator, Author) commented Jan 29, 2024

@sunya-ch , are your latest comments only about kepler and kepler-model-server? Could you please add other projects, such as peaks, into consideration? I am interested in what it will look like when we add peaks into consideration, and how many components we can reuse.

@sunya-ch (Contributor)

I think we also need people from the peaks project to list their requirements.

We can prepare an action that reuses the integration test, with the kepler image, model_server image, and deployment choice as inputs. There are multiple ways to install: 1. by operator, 2. by manifests, 3. by helm-chart. We may need to prepare all of them for the integration test.

(1) should be included in the operator schedule/push on the related repo; (3) should be in the helm-chart push on the related repo; (2) should run on the kepler and kepler_model_server repos when either has a push to main.
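A reusable workflow could cover the three deployment choices with a matrix. This is a hypothetical fragment: the deploy/test helper scripts named below do not exist yet and are assumptions.

```yaml
# Hypothetical reusable integration-test workflow: the same test body runs
# once per deployment method, with images passed in as inputs.
on:
  workflow_call:
    inputs:
      kepler_image:
        type: string
      model_server_image:
        type: string
jobs:
  integration:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        deploy: [operator, manifests, helm-chart]
    steps:
      - uses: actions/checkout@v4
      - name: Deploy via ${{ matrix.deploy }}
        # hypothetical helper that installs kepler + model server
        run: ./hack/deploy.sh ${{ matrix.deploy }} ${{ inputs.kepler_image }} ${{ inputs.model_server_image }}
      - name: Run integration test
        run: ./hack/integration_test.sh  # hypothetical helper
```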

@SamYuan1990 (Collaborator, Author) commented Feb 14, 2024

Some TODO items after reviewing the Kepler CI fix at sustainable-computing-io/kepler#1239:

  • make the validation image and validation binary part of the release.
  • make the SBOM file part of the release, with the RPM as an attachment.
  • remove the latest tag from releases, as the latest tag is used by the daily job.
  • make the validation binary part of the RPM? @rootfs, wdyt?
  • make the validation image part of the daily job.
  • a clear document for how to deploy and use the validation image. @jiere
  • make the validation image build from the kepler builder.
  • a default image build process, with latest as the tag value for images, that accepts parameters for release, daily, PR build, and validation usage, to reduce copy-pasting.

@SamYuan1990 (Collaborator, Author)

Some ideas for the self-hosted instance repo. IMO, the suggestions below aim at using an Ansible playbook to set up a k8s cluster among the 3 EC2 instances created by the self-hosted instance GHA.

  • set up Ansible from the GHA instance to the EC2 instances created by the self-hosted instance GHA.
  • test network access between the different EC2 instances created by the self-hosted instance GHA.
  • run the Ansible playbook from the GHA instance against the EC2 instances to set up the k8s cluster.
  • export/import the kubectl secret in tmp for CI. (any security question marks here?)
  • run kubectl from the GHA instance to check the cluster status after importing the cluster secret.

Is there any GHA to set up a k8s cluster via Ansible or other CI tools we can reuse, or OCP/container-ready? @rootfs wdyt
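The steps above might be sketched as workflow steps like the following. The inventory path, playbook name, and kubeconfig location are assumptions for illustration only.

```yaml
# Hypothetical fragment: drive Ansible from the GHA runner against the
# three EC2 instances created by the self-hosted instance GHA.
- name: Install Ansible
  run: pip install ansible
- name: Check network access to the EC2 instances
  run: ansible all -i inventory.ini -m ping  # inventory.ini is hypothetical
- name: Set up k8s cluster
  run: ansible-playbook -i inventory.ini k8s-cluster.yaml  # hypothetical playbook
- name: Import kubeconfig and check cluster status
  run: |
    # assumes the playbook exported the kubeconfig to /tmp; note the
    # security question raised above about keeping this secret in tmp
    export KUBECONFIG=/tmp/kubeconfig
    kubectl get nodes
```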

@SamYuan1990 (Collaborator, Author)

Extend local-dev-cluster with the Prometheus operator and Tekton, targeting a specific k8s cluster and decoupled from the kind cluster, so that the Tekton setup can support kepler-model-server.

@SamYuan1990 (Collaborator, Author)

@rootfs , @jiere , @sunya-ch wdyt if we have a repo for kepler validation and kepler model server validation? The new repo would contain:

  • scripts to add traffic to a k8s cluster via Tekton, for either model server training or validation workloads.
  • test cases for validation.

so that

  1. the release of the repo can be used for Kepler's model training and validation on a specific instance.
  2. the release of the repo can be used for investigations for peaks and clever. +@husky-parul , @wangchen615 IMO, when we investigate peaks or clever, we need something (a script?) to build a benchmark, and that benchmark may be an implementation of the cloud native sustainable computing benchmark white paper, as part of [Action to craft a Project Proposal] Benchmarking Whitepaper cncf/tag-env-sustainability#327.

@SamYuan1990 (Collaborator, Author)

@sunya-ch , @rootfs , @marceloamaral can we use https://github.com/medyagh/setup-minikube to set up minikube for the kepler model server training or kepler validation process, instead of kind (k8s in Docker)? wdyt?
If yes, are any volume mount settings needed?
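For reference, a rough sketch of what that might look like. The action inputs shown (driver, cpus, memory) and the mount path are assumptions that would need checking against setup-minikube's documentation.

```yaml
# Hypothetical fragment using medyagh/setup-minikube instead of kind.
- uses: medyagh/setup-minikube@latest
  with:
    driver: docker
    cpus: 4
    memory: 8g
- name: Mount host data into the cluster
  run: |
    # one option for the volume question: a background `minikube mount`
    # exposing local training data inside the minikube VM/container
    nohup minikube mount $PWD/data:/mnt/data &
- name: Verify cluster
  run: kubectl get po -A
```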

@SamYuan1990 (Collaborator, Author) commented Mar 5, 2024

@rootfs , @marceloamaral let's sync up on the https://github.com/kubevirt/kubevirt solution for validation here.
My question is: for the model server we use the CPE framework as the workload... what kind of workload are we going to use for validation?

@SamYuan1990 (Collaborator, Author)

Once sustainable-computing-io/kepler-action#108 is merged, we will try to use the latest kepler-action to integrate with kepler-model-server.

@sunya-ch (Contributor)

> @rootfs , @marceloamaral let's sync up on the https://github.com/kubevirt/kubevirt solution for validation here.
> My question is: for the model server we use the CPE framework as the workload... what kind of workload are we going to use for validation?

@SamYuan1990 Now CPE is obsolete; we use a Tekton task/pipeline to run the stress-ng workload and then collect the data. The stress workload includes stressing the CPU up to 100% on all cores.
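In outline, the stress step could look like the Tekton Task below. This is a sketch only: the task name, image, and parameter are placeholders, not the actual tasks in model_training/tekton/tasks.

```yaml
# Hypothetical Tekton Task that drives all cores to ~100% with stress-ng.
apiVersion: tekton.dev/v1
kind: Task
metadata:
  name: stress-workload
spec:
  params:
    - name: TIMEOUT
      default: "60s"
  steps:
    - name: stress-all-cores
      image: my-stress-ng:latest  # placeholder image containing stress-ng
      script: |
        # --cpu 0 means one worker per online CPU, i.e. stress all cores
        stress-ng --cpu 0 --timeout $(params.TIMEOUT)
```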

@sthaha (Contributor) commented Apr 10, 2024

@SamYuan1990

Based on the discussion about validating the model, the setup we want to achieve for validation is as follows.

Single Bare Metal

  • Bare metal should run VM
  • both the bare metal host and the VM should have Kepler running

Kepler on Bare Metal

  • intel-rapl (power meter)
  • acpi

Kepler on VM

  • kepler (linear regression model weights only)
  • kepler + estimator + (model)
  • kepler + estimator + model-server

@sunya-ch (Contributor)

We should break the tasks in this issue down into separate issues to track the progress.
I created a project for power model validation here:
https://github.com/orgs/sustainable-computing-io/projects/6/views/1

@sunya-ch (Contributor)

We should decide whether to continue working on this process via Tekton or to utilize the metal-ci. This issue overlaps with sustainable-computing-io/kepler#1910.
