
containerMode option to allow running jobs in k8s instead of docker #1546

Merged
merged 85 commits into actions:master
Jun 28, 2022

Conversation

thboop
Contributor

@thboop thboop commented Jun 17, 2022

Description

This PR adds the ability to use the runner container hooks to invoke container jobs and steps via Kubernetes, rather than relying on the Docker implementation. This allows us to avoid the use of privileged containers while still being able to run container scenarios. By running the job in a separate pod (not on the runner pod), we can safely pass a service account with elevated permissions to the runner pod, so it can dynamically spin up pods and k8s jobs to run the workflow.

If you set containerMode to kubernetes, you are required to provide a service account with the appropriate permissions and a storage class from which we can provision the runner working directory. More details are in the readme updates in this PR.
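For illustration only, a minimal RunnerDeployment sketch of what that configuration might look like; the names here are hypothetical and the README updated in this PR is the authoritative reference:

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-k8s-runner                  # hypothetical name
spec:
  template:
    spec:
      repository: example-org/example-repo  # hypothetical repository
      containerMode: kubernetes
      serviceAccountName: runner-work-sa    # the service account you provide, with the permissions described below
      # plus a workVolumeClaimTemplate for the working directory, shown in the "Volume mount" section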

Volume mount

A work volume mount is used to pass the working directory between pods. By default, we require that the volumes we provision for jobs live on the same worker node as the runner pod. If you use a storage class that supports ReadWriteMany, we will schedule the job pods on any available worker node.
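As a sketch, the work volume is described by the new workVolumeClaimTemplate field (placed under spec.template.spec of the RunnerDeployment); the storage class name below is an assumption, and ReadWriteMany is only needed if job pods may be scheduled on a different node than the runner:

      workVolumeClaimTemplate:
        storageClassName: my-rwx-storage-class  # hypothetical RWX-capable storage class
        accessModes:
          - ReadWriteMany                       # lets job pods land on any worker node
        resources:
          requests:
            storage: 10Gi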

Decisions

  • We could have gone with the manager creating service accounts with the associated permissions, rather than requiring the service account. However, I was worried about increasing the complexity of the manager, and increasing the permissions it has, just for the purpose of creating service accounts. Eventually, I could see us having Helm charts for runners alongside the ARC Helm chart, and in that scenario we can move service account creation into those. However, I am open to the alternative: not requiring user-provided service accounts and instead allowing ARC to create them.
  • Why do we require the secrets scope? (A hedged sketch of such a role follows this list.)
    • We use secrets in two places: to pass the env into steps, because it could contain secrets, and when pulling private registry images.
    • While we do clean these up in the runner container hooks, if the runner pod is killed before the runner is able to clean them up, we need to clean them up in a central location, so we do it in ARC when downscaling runners.
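A minimal sketch of a namespaced Role granting the kinds of permissions described above (pods and jobs to run the work, secrets for step env and registry credentials). The resource and verb lists here are assumptions for illustration; the README updated in this PR is authoritative:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: runner-work-role            # hypothetical name
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/exec", "pods/log"]
    verbs: ["get", "list", "create", "delete"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "list", "create", "delete"]
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get", "list", "create", "delete"]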

Nikola Jokic and others added 30 commits May 19, 2022 12:47

added concurrent cleanup before runner pod is deleted

Check linked pod is in deleting phase status to skip it
@@ -76,6 +76,16 @@ func (r *Runner) Validate() error {
errList = append(errList, field.Invalid(field.NewPath("spec", "repository"), r.Spec.Repository, err.Error()))
}

err = r.Spec.ValidateWorkVolumeClaimTemplate()
Collaborator
@mumoshu mumoshu Jun 24, 2022

Thanks for adopting the existing pattern of ValidateFoo here!

This made me realize that we might want to introduce another centralized function like RunnerSpec.Validate() that calls out to the three Validate* functions in the future, so that it's a bit more maintainable (less chance of accidentally missing a new ValidateFoo call in one of the three places).

// Just take this as a pure comment, not a request for change. It's out of the scope of this pull request. I'd greatly appreciate it if you could submit another PR for that, though!

Contributor Author

That makes a lot of sense, I'll follow up with a separate PR for that if that works for you!

@toast-gear toast-gear added this to the v0.25.0 milestone Jun 24, 2022
Comment on lines 255 to 259
r.List(ctx, &runnerLinkedPodList, client.MatchingLabels(
map[string]string{
"runner-pod": pod.ObjectMeta.Name,
},
))
Collaborator

I believe we need to handle the potential error here. Also, the lack of client.InNamespace seems to result in errors like Failed to watch *v1.Secret: failed to list *v1.Secret: secrets is forbidden: User "system:serviceaccount:actions-runner-system:actions-runner-controller" cannot list resource "secrets" in API group "" at the cluster scope.

I also noticed that our Helm chart doesn't provide the RBAC resources to let ARC list and get secrets, which resulted in the runnerpod reconcile hanging after this line (perhaps it panicked silently? Not sure, but this function neither returned an error nor continued processing other terminating pods).

Collaborator

9cd1272 fixes these issues!

@mumoshu
Collaborator

mumoshu commented Jun 28, 2022

Let me add a fix for this small nit: 758c2a3

Collaborator
@mumoshu mumoshu left a comment

After you've enabled the feature flag on the backend, I was finally able to verify that it's working!
Here's my workflow definition:

name: E2E TestE2E fzowlprkxh
"on":
  push:
    branches:
    - main
jobs:
  test0:
    runs-on: test-fzowlprkxh
    container: "golang:1.18"
    steps:
    - uses: actions/checkout@v2
    - run: go version
    - run: go build .

It's interesting to see that, even though I've set container: "golang:1.18" for running various go commands, somehow nodejs-based actions like actions/checkout@v2 seem to be working as well! In the workflow logs I see it called out to /runner/k8s/index.js so perhaps it created a dedicated nodejs pod for running the action?

Anyway, it did work as advertised and I love it! Thank you for all your efforts to make it happen 🎉

@mumoshu mumoshu merged commit 0386c07 into actions:master Jun 28, 2022
mumoshu added a commit that referenced this pull request Jun 29, 2022
mumoshu added a commit that referenced this pull request Jun 29, 2022
* Use a dedicated pod label to say it is a runner pod

Follow-up for #1546

* Fix PercentageRunnersBusy scaling delay

Ref #1374
@tedchang77

We're currently using GKE Workload Identity to give our runners access to GCP. With this new containerMode, there doesn't seem to be a way to pass a KSA (Kubernetes service account) to the pod that runs the container job?

@mumoshu
Collaborator

mumoshu commented Jul 1, 2022

@tedchang77 Yep, I think so. Honestly, I'm not even sure how it could work or how ARC should support it. There's a similar feature for AWS, and that won't work either. All the data that the runner and job pods share should be contained within the PV-backed work dir. Can you copy the identity-related file(s) to the work dir and point the K8s client (or kubectl run within a container-based job step) at them?

@mumoshu
Collaborator

mumoshu commented Jul 1, 2022

@tedchang77 BTW, could you clarify why you need Workload Identity in the kubernetes container mode? If all you need is to run e.g. kubectl with the workload identity, you might be able to just disable dind, set privileged: false, and use the setup-kubectl action to install kubectl within the runner container, and everything should work fine. In other words, your use case might not require the kubernetes container mode at all.
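A rough sketch of that alternative. The dockerEnabled field and the azure/setup-kubectl action are taken from the ARC and Actions docs respectively, but treat the exact names here as assumptions:

# RunnerDeployment: run without dind, so no privileged sidecar is needed
spec:
  template:
    spec:
      repository: example-org/example-repo  # hypothetical repository
      dockerEnabled: false

# Workflow: install kubectl inside the runner container itself
steps:
  - uses: azure/setup-kubectl@v3
  - run: kubectl get pods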

@tedchang77

Our jobs need to deploy GCP resources like VPCs, GKE, Cloud SQL, etc., which require GCP credentials, and we don't want to store and pass in long-lived SA keys through env variables. For deployments to GKE, even if we download kubectl or have it baked into the image, we still need to have GCP creds.

@tedchang77

tedchang77 commented Jul 1, 2022

One workaround that I can think of (not sure how good it is) is to create a separate namespace for each runner and use Workload Identity to map the default KSA to a GCP SA. If the runner creates the pod and doesn't specify a KSA, it will be assigned the default KSA for the namespace and have the correct GCP permissions. The runner pod wouldn't use Workload Identity, and its KSA would only have permissions to create the job container pods. No need to have any long-lived GCP credentials with this solution.

@mumoshu
Collaborator

mumoshu commented Jul 1, 2022

@tedchang77 Yeah, that might work if you need to use the kubernetes container mode. Or perhaps you don't even need a new namespace. The updated RunnerDeployment spec accepts serviceAccountName, which corresponds to the service account passed to job (not runner) pods. You could associate the workload identity with that service account, and that way any commands like kubectl would have access to the identity-related files associated with the service account you specified. A hedged sketch of that setup follows below.

But it would also just work if you used privileged: false and ran your deploy scripts within the runner pod, without the kubernetes container mode at all. Could you confirm?
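For illustration, a minimal sketch of that suggestion using GKE Workload Identity's iam.gke.io/gcp-service-account annotation; all names are hypothetical, and whether serviceAccountName ends up on the job pods is per the comment above rather than something verified here:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: job-pod-sa                  # hypothetical KSA
  annotations:
    iam.gke.io/gcp-service-account: deployer@my-project.iam.gserviceaccount.com  # hypothetical GSA
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-k8s-runner
spec:
  template:
    spec:
      repository: example-org/example-repo  # hypothetical repository
      containerMode: kubernetes
      serviceAccountName: job-pod-sa        # per the comment above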

@tedchang77

@mumoshu my preference is to separate the runner image from the container job image, but your suggestion should work. We'll try this next week and report back. Thanks for all the help!

@mumoshu
Collaborator

mumoshu commented Jul 1, 2022

@tedchang77 Makes sense. Thanks for the clarification and your support! I'm looking forward to hearing back from you

@tedchang77

@mumoshu our custom runner image has all of our tools installed now, and we can deploy using the custom runner image and the SA that we configured using GKE Workload Identity.

I think it's still possible to use k8s container mode, but it would require us to:

If the runner-container-hooks project implements actions/runner-container-hooks#1, it would solve the second issue. The challenge with k8s container mode is that some of the k8s configurations, like k8s SAs, don't have an equivalent in the Docker container options: https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idcontaineroptions

For now, we will use the custom runner, but if we implement container mode I'll report back on the results. Thanks!

@mumoshu
Collaborator

mumoshu commented Jul 9, 2022

@tedchang77 Hey! Thanks for the feedback.

the challenge with k8s container mode is that some of the k8s configurations, like k8s SAs, don't have an equivalent in the docker container options: https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idcontaineroptions

I'm interested in this part! May I ask about your goal for that once again? Do you want job container pods to share the SA with the runner pod, and/or do you want a separate SA for each job container pod?

My impression was that your goal is the former. And the former could be implemented using some mechanism to copy the SA token, or any other identity-related information, from the runner pod to the job container pods.

That's probably within the scope of ARC and could be a valid feature request, if we can agree on the exact requirements of the feature.

If you want to try building a PoC of it, you could modify entrypoint.sh of the runner image so that the script copies the SA token and/or other identity-related files from, say, /var/run/secrets/kubernetes.io/serviceaccount/* to the "work" directory.

The "work" directory is shared between the runner and the job container pods. In your workflow definition, you could prepend a "cp" to copy the files from the work dir to /var/run/secrets/kubernetes.io/serviceaccount/* of the job container pod, so that a command like kubectl could use those for K8s API authentication. Paths might be incorrect, but the strategy should generally work.

@audunsolemdal

This allows us to avoid the use of privileged containers while still being able to run container scenarios

@thboop Could you elaborate on what you mean by this?

I currently have the following workflow:
Step 1 (GitHub-hosted runner): docker build and tag the image, upload the image artifact
Step 2 (Self-hosted runner as a RunnerDeployment): download the image artifact, run docker load, then docker push the loaded container image to a private container registry

My goal is to avoid running any --privileged containers related to the workflow. Is this possible somehow?

@tedchang77

(quoting @mumoshu's comment above)

Apologies @mumoshu, but I had to put this on hold as some other priorities came up.

I found this PR, which looks like it will solve the problem once it's merged.
