workflow runner pods fail instantly if pods are unschedulable #3647

jonathan-fileread · 2024-07-05T21:36:58Z

Checks

I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
I am using charts that are officially provided

Controller Version

0.9.2

Deployment Method

Helm

Checks

This isn't a question or user support case (For Q&A and community support, go to Discussions).
I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

Install ARC Controller + Runner set 0.9.2
define ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE with the podTemplate, and containerMode: "Kubernetes"
define a pod template like this

apiVersion: v1
data:
  default.yml: |
    "apiVersion": "v1"
    "kind": "PodTemplate"
    "metadata":
      "name": "runner-pod-template"
    "spec":
      "containers":
      - "name": "$job"
        "resources":
          "limits":
            "cpu": "3000m"
          "requests":
            "cpu": "3000m"

Describe the bug

GHA jobs fail instantly if a pod is unscheduable due to waiting for node to become available (if the resource request for CPU/Memory is high, waiting for the node autoscaler)

Describe the expected behavior

There should be a timeout field either in the runner set or container hooks podtemplate that allows the workflow pod to wait for x minutes till the pod is scheduled after another node is alive.

Additional Context

template:
  spec:
    initContainers:
      - name: kube-init
        image: ghcr.io/actions/actions-runner:latest
        command: ["/bin/sh", "-c"]
        args:
          - |
            sudo chown -R 1001:123 /home/runner/_work
        volumeMounts:
          - name: work
            mountPath: /home/runner/_work
    securityContext:
      fsGroup: 123 ## needed to resolve permission issues with mounted volume. https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors#error-access-to-the-path-homerunner_work_tool-is-denied
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        env:
        - name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
          value: /home/runner/pod-templates/default.yml
        - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
          value: "false"  ## To allow jobs without a job container to run, set ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER to false on your runner container. This instructs the runner to disable this check.
        volumeMounts:
        - name: pod-templates
          mountPath: /home/runner/pod-templates
          readOnly: true
    volumes:
      - name: work
        ephemeral:
          volumeClaimTemplate:
            spec:
              accessModes: ["ReadWriteOnce"]
              storageClassName: "managed-csi"
              resources:
                requests:
                  storage: ${local.volume_claim_size}
      - name: pod-templates
        configMap:
          name: "runner-pod-template"
    

containerMode:
  type: "kubernetes"  ## type can be set to dind or kubernetes
  ## the following is required when containerMode.type=kubernetes
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteOnce"]
    # For local testing, use https://github.com/openebs/dynamic-localpv-provisioner/blob/develop/docs/quickstart.md to provide dynamic provision volume with storageClassName: openebs-hostpath
    storageClassName: "managed-csi"
    resources:
      requests:
        storage: 50Gi


Pod Template YAML:
apiVersion: v1
data:
  default.yml: |
    "apiVersion": "v1"
    "kind": "PodTemplate"
    "metadata":
      "name": "runner-pod-template"
    "spec":
      "containers":
      - "name": "$job"
        "resources":
          "limits":
            "cpu": "3000m"
          "requests":
            "cpu": "3000m"

Controller Logs

https://gist.github.com/jonathan-fileread/602f6d5fd948bf505a2fa7f5dbd78069

Runner Pod Logs

https://gist.githubusercontent.com/jonathan-fileread/96db9941abc5faba985aae78ef6b3760/raw/196644c97c7698e51bf6ae9b50dbf769dd4f1825/gistfile1.txt

github-actions · 2024-07-05T21:37:24Z

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

ropelli · 2024-07-10T12:38:53Z

Similar issue: #3630

jonathan-fileread added bug Something isn't working gha-runner-scale-set Related to the gha-runner-scale-set mode needs triage Requires review from the maintainers labels Jul 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

workflow runner pods fail instantly if pods are unschedulable #3647

workflow runner pods fail instantly if pods are unschedulable #3647

jonathan-fileread commented Jul 5, 2024

github-actions bot commented Jul 5, 2024

ropelli commented Jul 10, 2024

workflow runner pods fail instantly if pods are unschedulable #3647

workflow runner pods fail instantly if pods are unschedulable #3647

Comments

jonathan-fileread commented Jul 5, 2024

Checks

Controller Version

Deployment Method

Checks

To Reproduce

Describe the bug

Describe the expected behavior

Additional Context

Controller Logs

Runner Pod Logs

github-actions bot commented Jul 5, 2024

ropelli commented Jul 10, 2024