
Add option to use the kubernetes scheduler for workflow pods #111

Merged
merged 5 commits on Oct 31, 2023

Conversation

Wielewout
Contributor

As described in #112, workflow pods aren't going through the Kubernetes scheduler, resulting in failed "Initialize containers" steps if the already selected node is out of resources. With the following changes I was able to get this working.

This PR adds two new environment variables: ACTIONS_RUNNER_USE_KUBE_SCHEDULER and ACTIONS_RUNNER_PREPARE_JOB_TIMEOUT_SECONDS.

When set to true, the former skips setting the node name and lets the Kubernetes scheduler handle the workflow pod. This only works on a single-node cluster (important for the newly added test) or when ReadWriteMany volumes are used.

The latter can be used to increase the timeout, for example to 6 hours from the default of 10 minutes. This is necessary for scenarios where the cluster runs out of resources and we want to handle it more gracefully: when the Kubernetes scheduler is used, the pod will remain pending until resources are available somewhere in the cluster.

A quick fix was also added to skip the name override warning, which always appears when the pod template is used.

This should only be used when RWX volumes are supported or when using a single-node cluster.

If the kube scheduler is used to hold jobs until sufficient resources are available, then the prepare job step needs to wait longer for the workflow pod to be running. This timeout will mostly need to be increased when many jobs are triggered that together exceed the resources available in the cluster; the workflows can then be handled gracefully once sufficient resources become available again.
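For reference, here is a minimal sketch of how these two variables could be wired into a gha-runner-scale-set installation; the values.yaml structure (template.spec with a container named runner) is assumed from the chart and should be adapted to your setup:

# Assumed gha-runner-scale-set values.yaml snippet; adapt to your chart version.
template:
  spec:
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        env:
          - name: ACTIONS_RUNNER_USE_KUBE_SCHEDULER          # let kube-scheduler place the workflow pod
            value: "true"
          - name: ACTIONS_RUNNER_PREPARE_JOB_TIMEOUT_SECONDS # e.g. 6 hours instead of the 10 minute default
            value: "21600"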
@Wielewout requested review from a team as code owners on October 23, 2023 07:59
@Wielewout
Contributor Author

@nikola-jokic addressed the comments.

I had the idea yesterday that this could maybe be taken further with pod affinity (in a separate PR?). ReadWriteOnce volumes will otherwise still fail early, as the workflow pod always gets the node name assigned. But we can go through the scheduler and still end up on the same node by using pod affinity to attract the workflow pod to the runner pod.

The remaining problem is that a unique label needs to be added to the runner pod (e.g. actions.github.com/runner-name=<ephemeral-runner-name-also-pod-name>) within actions-runner-controller when an ephemeral runner creates its pod. The hooks can access the name via the environment (ACTIONS_RUNNER_POD_NAME) to inject it into the affinity expression (which then needs to go under requiredDuringSchedulingIgnoredDuringExecution for ReadWriteOnce volumes). The kube scheduler can then use the label to schedule the workflow pod on the same node as the runner, or keep the pod pending until resources are available.
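For illustration, a minimal sketch of the affinity the hook could inject into the workflow pod spec under this proposal; note that the actions.github.com/runner-name label does not exist yet and is exactly the label proposed above:

# Sketch only: the runner-name label below is the proposed (not yet existing) label.
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - topologyKey: kubernetes.io/hostname   # co-locate with the runner pod's node
        labelSelector:
          matchLabels:
            actions.github.com/runner-name: <ephemeral-runner-name-also-pod-name>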

Let me know if this is something I should look into a bit further 😄

@guillaumevillemont

guillaumevillemont commented Oct 25, 2023

FTR I've been testing this PR for the last couple of days and, aside from some fiddling with EFS permissions in RWX mode, it works just fine and solves my issue. Can't wait to see this merged, thanks a lot 🙏

If you're looking for some examples to document, here is my EFS config in EKS, which seems to be bulletproof:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-dynamic-provisioning
  annotations:
    storageclass.kubernetes.io/is-default-class: 'false'
provisioner: efs.csi.aws.com
parameters:
  basePath: /dynamic_provisioning
  directoryPerms: '777'
  ensureUniqueDirectory: 'true'
  fileSystemId: fs-xxxxxxxxxx
  uid: '0'
  gid: '0'
  provisioningMode: efs-ap
  reuseAccessPoint: 'false'
  subPathPattern: ${.PVC.namespace}/${.PVC.name}
reclaimPolicy: Delete
volumeBindingMode: Immediate

uid/gid set to "0" is required by the workflow pod (when using actions/checkout@v3, for example, which expects the filesystem to be owned by its user, i.e. root), and perms 777 so the runner pod can write to _work while running with uid 1001 and gid 123.
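To complete the picture, a sketch of how a scale set could request a ReadWriteMany work volume from this storage class; the containerMode/kubernetesModeWorkVolumeClaim field names are assumed from the gha-runner-scale-set chart and the storage size is arbitrary:

# Assumed gha-runner-scale-set values.yaml snippet; field names taken from the chart docs.
containerMode:
  type: "kubernetes"
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteMany"]
    storageClassName: "efs-dynamic-provisioning"
    resources:
      requests:
        storage: 10Gi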

@israel-morales

israel-morales commented Oct 27, 2023

Excellent contrib. I will wait for the merge/release before testing on GKE.

This solution is much more elegant than what I would have suggested, which was to attach the -workflow pod to the runner as a sidecar. The sidecar approach would also address the volume mount dependency and ensure both containers are properly scheduled by the kube-scheduler as a single pod, respecting container-level requests/limits. However, I'm sure there are pitfalls with the sidecar logic.

This PR essentially decouples the runner from the workflow container and, IMO, should be the default configuration, with the caveat of having ReadWriteMany volumes as a prerequisite.

Looking forward to this release. Thanks!

@johnjeffers

I've recently come to use Actions after being on GitLab for a few years. The way GitLab handles self-hosted runners in k8s is simple: you can specify whatever Docker image you like in the pipeline code, and that's the one it uses for the runner. There's no concept of runner+worker, no second pod at all. If you want a service, like postgres, redis, or even DinD, it's a sidecar in that pod. So there's no need for ephemeral storage, and no worries about scheduling, because everything happens in a single pod.

I hope that doesn't come off as coming in here and complaining "why can't you be like GitLab". That's not my intention at all. I would just like to understand why things are this way with Actions runners. The runner+worker pattern makes configuring and using self-hosted k8s runners a lot more complicated than I was expecting.

Contributor

@nikola-jokic left a comment


LGTM! Great change, thank you!

@nikola-jokic
Contributor

Hey @johnjeffers,

Thank you for raising the question! The GitHub runner is structured to have a listener and a worker. The RunnerListener picks up the job, evaluates it, and then passes it down to the worker. That way, the runner can be ready to receive a job before the job is known to it (while the runner is idle). What I understood from your message would be equivalent to pulling the listener's responsibilities out into the controller and then creating a runner worker with sidecar containers. Since the listener is part of the runner, the same runner can be configured to run docker or hook commands. Having the listener accept the job allows us to pre-emptively scale up or down (the min-runners field in ARC), so as soon as a job is acquired, a runner is able to pick it up. It is a trade-off with volumes, but the runner is designed to work on any kind of machine, whether inside a virtual machine or inside k8s.

@nikola-jokic
Contributor

@Wielewout as for the suggestion here, it does sound good to me, but it will slightly complicate the hook. The hook was meant to be a simple starting point. However, the suggestion is great in my opinion. So I will propose it to the team and see if there are any objections. Again, thank you for these awesome contributions!
And thank you, @guillaumevillemont, for raising this issue and testing it out!

@israel-morales

israel-morales commented Nov 23, 2023

@Wielewout @nikola-jokic regarding the suggestion here, is there a way to do this with the existing implementation?

I'm afraid RWX is not an option for us, as the minimum size for a single dynamically provisioned GCP Filestore CSI persistent volume is 1 TiB 😞.

Can the actions.github.com/scale-set-name and pod-template-hash labels be used somehow? Thanks!

> k get po --show-labels
NAME                                             READY   STATUS    RESTARTS   AGE   LABELS
gke-runner-default-rp68p-runner-vw49d            1/1     Running   0          36m   actions-ephemeral-runner=True,actions.github.com/organization=companyb,actions.github.com/scale-set-name=gke-runner-default,actions.github.com/scale-set-namespace=gh-actions-runners,app.kubernetes.io/component=runner,app.kubernetes.io/part-of=gha-runner-scale-set,app.kubernetes.io/version=0.7.0,pod-template-hash=74647967fb

@nikola-jokic
Contributor

With the current implementation of the hook, it is not possible; we are setting the node name explicitly. I think having this feature would require a trivial change on the ARC side to set the label on the pod, and potentially an env variable. Then, with the hook extension, I suppose it could be done, but I have to say it would be tricky. I will bring this solution up to the team.

@isker

isker commented Apr 6, 2024

> I would just like to understand why things are this way with Actions runners. The runner+worker pattern makes configuring and using self-hosted k8s runners a lot more complicated than I was expecting.

I'm in the middle of trying to switch a self-hosted k8s CI setup from GitLab to GitHub myself. While I certainly don't know why ARC is architected exactly the way it is, I can say that the requirements imposed by the feature set of GitHub Actions demand a more complex implementation.

The biggest requirement that I can see is that a single job can involve multiple containerized actions executed in series (in the job's steps), each working on the same filesystem (which belongs to the job). And with the kubernetes hooks in this repo, this is done with PersistentVolumes. As far as I know, GitLab has nothing like this.
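For illustration, a hypothetical workflow showing that requirement: the checkout step and a container action run as separate containers in series, yet both operate on the job's workspace, which the hooks back with a PersistentVolume (the runs-on label is an assumed scale set name):

# Hypothetical workflow; separate containerized steps share the job's workspace.
on: push
jobs:
  build:
    runs-on: my-k8s-scale-set        # assumed scale set name
    container: node:20               # jobs in Kubernetes mode run in a job container
    steps:
      - uses: actions/checkout@v4    # writes the repo into the shared workspace
      - uses: docker://alpine:3.19   # separate container action, sees the same files
        with:
          args: ls -la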

I've seen many more usages of PersistentVolumes in the k8s hooks implementation, but that one seems to be the most onerous, from an implementer's perspective. I am dreaming of an implementation of the hooks that simply doesn't support features like that one, so that PVs could be ditched 🌞. It seems that the maintainers have tried to find alternatives to using PVs, to no avail.

As it is, I agree with you that it seems pretty hard to achieve the same kinds of outcomes with ARC's k8s mode as you could with GitLab's k8s runner, with respect to things like autoscaling and scheduling.
