[CI][brainstorming] make model training as a github action based on tekton #212
We might prepare another GitHub workflow on a specific branch name for pushing a PR with the results from their COS to kepler-model-db. The steps I have in mind are:
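Independent of the exact steps, one possible shape for such a workflow is sketched below; the trigger branch, secret names, bucket, and action versions are all assumptions, not the actual configuration.

```yaml
# Hypothetical sketch: publish trained models fetched from COS as a PR
# against kepler-model-db. All names below are placeholders.
name: push-trained-model
on:
  push:
    branches: [model-export]          # assumed trigger branch
jobs:
  open-pr:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          repository: sustainable-computing-io/kepler-model-db
          token: ${{ secrets.MODEL_DB_TOKEN }}   # assumed PAT secret
      - name: download trained model from COS
        run: aws s3 cp "s3://${COS_BUCKET}/models/" models/ --recursive
      - name: open a PR with the result
        uses: peter-evans/create-pull-request@v6
        with:
          title: "Add trained model from CI"
          branch: ci/trained-model
```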
@SamYuan1990 Do you want to work on this? Note: let's keep collecting requirements and ideas in this ticket.
Here is my plan. @rootfs, @sunya-ch, @marceloamaral
All three of those topics are based on our current deployment stack, which also applies to a self-hosted instance (@jiere FYI).
Hence, to achieve that, we need to build new tooling and enhance our current CI tooling.
Let's start with Tekton-based training. As for the third topic, tests and verification: a green pipeline,
if we assume Kepler is a workload or a running job for greening the CI/CD pipeline. From another point of view, running Kepler's benchmark tests is part of the workload, just like a traffic load running on k8s; what is specific here is that the workload comes from Kepler itself. :-)
Thank you for starting this planning. There seem to be many points to discuss, but let me first start with the requirements for power modeling.
CI/CD test cases for each environment
(A) Test case for BM
0. setup environment (agree with what you planned):
Currently, I reuse the code from kepler-model-server/.github/workflows/train-model-self-hosted.yml (Lines 106 to 141 in 055f537)
1. verify feature inputs from Kepler (input)
2. verify model training process (process)
3. verify trained model results (output)
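As a concrete illustration of step 1, a CI step could scrape Kepler's Prometheus endpoint and assert that the expected feature metrics are present; the service name, port, and metric name below are assumptions.

```yaml
- name: verify feature inputs from Kepler (input)
  run: |
    # port-forward the (assumed) exporter service and check one metric exists
    kubectl port-forward -n kepler svc/kepler-exporter 9102:9102 &
    sleep 5
    curl -s http://localhost:9102/metrics | grep -q kepler_container_joules_total
```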
(B) Test case for VM
1. verify feature inputs from Kepler (input)
2. verify estimator (output)
Integration: trained model delivery
Now, we have CI to push the model to the kepler project's AWS S3 after training: kepler-model-server/model_training/tekton/pipelines/single-train.yaml (Lines 275 to 309 in 055f537)
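For reference, the upload step of such a pipeline looks roughly like the following Tekton step sketch (not the actual single-train.yaml content; the image, bucket, and paths are placeholders):

```yaml
- name: upload-model
  image: amazon/aws-cli
  script: |
    # push the packaged training result to the project's S3 bucket
    aws s3 cp /workspace/models/model.zip \
      "s3://kepler-models/$(params.pipeline-name)/model.zip"
```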
We also have to think about a CI pipeline for notifying changes in one repo that require changes and support in another. For example: kepler changes metrics (name, labels, values) --> notify kepler-model-server. FYI, a simplified communication diagram between the three repos will be added to the README page by #223.
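One lightweight way to implement such a notification is a repository_dispatch event sent from kepler's CI; the event name and secret below are assumptions.

```yaml
- name: notify kepler-model-server of metric changes
  run: |
    gh api repos/sustainable-computing-io/kepler-model-server/dispatches \
      -f event_type=kepler-metrics-changed
  env:
    GH_TOKEN: ${{ secrets.CROSS_REPO_TOKEN }}   # assumed PAT with repo scope
```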
@sunya-ch, are your latest comments just for kepler and kepler-model-server? Could you please add other projects such as peaks into consideration? I am interested in what it will look like when we add peaks into consideration, and how many components we can reuse.
I think we also need people from the peaks project to list their requirements. We can prepare an action to reuse the integration test with inputs of the kepler image, model_server image, and deployment choice. There are multiple ways to install: 1. by operator 2. by manifests 3. by helm-chart. We may need to prepare all of them for the integration test.
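The three install paths above could be covered by a single matrix job, sketched below; the manifest paths, chart location, and commands are illustrative assumptions.

```yaml
jobs:
  integration:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        deploy: [operator, manifests, helm-chart]
    steps:
      - name: install kepler via ${{ matrix.deploy }}
        run: |
          case "${{ matrix.deploy }}" in
            operator)   kubectl apply -f kepler-operator/ ;;
            manifests)  kubectl apply -f _output/generated-manifest/ ;;
            helm-chart) helm install kepler ./chart -n kepler --create-namespace ;;
          esac
```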
Some TODO items after reviewing the kepler CI fix at sustainable-computing-io/kepler#1239
Some ideas for the self-hosted-instance repo. IMO, the suggestions below aim at using an Ansible playbook to set up a k8s cluster across 3 EC2 instances created by the self-hosted-instance GHA.
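A minimal Ansible sketch of that idea, assuming one control-plane node and two workers with a kubeadm-based setup; the inventory group names and commands are assumptions.

```yaml
# playbook.yml (sketch): bootstrap a kubeadm cluster across 3 EC2 instances
- hosts: control_plane
  become: true
  tasks:
    - name: initialize the cluster
      command: kubeadm init --pod-network-cidr=10.244.0.0/16
    - name: capture the worker join command
      command: kubeadm token create --print-join-command
      register: join_cmd

- hosts: workers
  become: true
  tasks:
    - name: join the cluster
      command: "{{ hostvars[groups['control_plane'][0]].join_cmd.stdout }}"
```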
Is there any GHA to set up a k8s cluster via Ansible or other CI tools that we can reuse, or an OCP/container-ready one? @rootfs wdyt
Extend local-dev-cluster with the Prometheus operator and Tekton, targeting a specific k8s cluster and decoupled from the kind cluster.
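One possible extension path, using the upstream quick-start manifests against whatever cluster the kubeconfig points at (rather than a kind cluster):

```yaml
- name: install the Prometheus operator
  run: kubectl create -f https://github.com/prometheus-operator/prometheus-operator/releases/latest/download/bundle.yaml
- name: install Tekton Pipelines
  run: kubectl apply -f https://storage.googleapis.com/tekton-releases/pipeline/latest/release.yaml
```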
@rootfs, @jiere, @sunya-ch wdyt if we have a repo for kepler validation and kepler-model-server validation? The new repo would contain:
@sunya-ch, @rootfs, @marceloamaral can we use https://github.com/medyagh/setup-minikube to set up minikube for the kepler-model-server training or kepler validation process instead of kind (k8s in Docker)? wdyt
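Usage would be roughly as follows; the version pin, driver, and resource sizes are assumptions.

```yaml
- name: start minikube
  uses: medyagh/setup-minikube@master   # pin to a release tag in practice
  with:
    driver: docker
    cpus: 4
    memory: 8g
- name: sanity-check the cluster
  run: kubectl get nodes
```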
@rootfs, @marceloamaral let's sync up on the https://github.com/kubevirt/kubevirt solution for validation here.
Once sustainable-computing-io/kepler-action#108 is merged, we will try to use the latest kepler-action to integrate with kepler-model-server.
@SamYuan1990 The CPE is now obsolete, and we use a Tekton task/pipeline to run the stress-ng workload and then collect the data. The stress workload includes stressing the CPU up to 100% on all cores.
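An illustrative Tekton step for that kind of workload; the actual task in kepler-model-server may differ, and the image name is a placeholder.

```yaml
- name: stress
  image: quay.io/example/stress-ng    # placeholder image
  script: |
    # step through CPU load levels, ending at 100% on all cores (--cpu 0)
    for load in 25 50 75 100; do
      stress-ng --cpu 0 --cpu-load "$load" --timeout 60s
    done
```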
Based on the discussion about validating the model, the validation setup we want to achieve is as follows:
- Single Bare Metal
- Kepler on Bare Metal
- Kepler on VM
We should break down the task in this issue into separate issues to track the progress.
We should decide whether to continue working on the process via Tekton or to utilize the metal-ci. This issue overlaps with sustainable-computing-io/kepler#1910.
As brainstorming: if we make model training a GitHub Action based on Tekton, could we enable others to provide their training results to us? Anyone could run the GitHub Action on their own self-hosted GitHub runner, targeting their own k8s cluster with Tekton.
kepler-model-server/.github/workflows/train-model-self-hosted.yml (Lines 142 to 178 in 0609df4)
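The self-hosted-runner idea could be wrapped roughly like this; the pipeline name matches single-train.yaml above, but the parameters and workspace claim are assumptions.

```yaml
name: train-on-your-cluster
on: workflow_dispatch
jobs:
  train:
    runs-on: self-hosted          # the contributor's runner, with cluster access
    steps:
      - name: run the Tekton training pipeline
        run: |
          tkn pipeline start single-train \
            --param pipeline-name=my-train \
            --workspace name=mnt,claimName=task-pvc \
            --showlog
```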