workflow runner pods fail instantly if pods are unschedulable #3647

Open
4 tasks done
jonathan-fileread opened this issue Jul 5, 2024 · 2 comments
Labels
bug Something isn't working gha-runner-scale-set Related to the gha-runner-scale-set mode needs triage Requires review from the maintainers

Checks

Controller Version

0.9.2

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

Install the ARC controller and runner scale set, version 0.9.2.
Set containerMode type to "kubernetes" and define ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE so that it points at the pod template.
Define a pod template like this:

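apiVersion: v1
kind: ConfigMap
metadata:
  # Must match the configMap name referenced in the runner pod's volumes.
  name: runner-pod-template
data:
  default.yml: |
    "apiVersion": "v1"
    "kind": "PodTemplate"
    "metadata":
      "name": "runner-pod-template"
    "spec":
      "containers":
      # "$job" targets the job container created by the container hook.
      - "name": "$job"
        "resources":
          "limits":
            "cpu": "3000m"
          "requests":
            "cpu": "3000m"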

Describe the bug

GHA jobs fail instantly when the workflow pod is unschedulable, for example when its CPU or memory request is high and it has to wait for the node autoscaler to bring up a new node.
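For anyone triaging this, the state described above can be recognized from the pod object itself: Kubernetes reports a `PodScheduled` condition with status `False` and reason `Unschedulable` while the pod stays in the `Pending` phase. The following is a minimal sketch, not part of ARC or the container hooks; the function name `unschedulable_reason` and the example pod are hypothetical, and the input is assumed to be a pod parsed from `kubectl get pod <name> -o json`.

```python
from typing import Any, Dict, Optional


def unschedulable_reason(pod: Dict[str, Any]) -> Optional[str]:
    """Return the scheduler's message if the pod is Pending and unschedulable, else None."""
    status = pod.get("status", {})
    if status.get("phase") != "Pending":
        return None
    for cond in status.get("conditions", []):
        if (
            cond.get("type") == "PodScheduled"
            and cond.get("status") == "False"
            and cond.get("reason") == "Unschedulable"
        ):
            return cond.get("message", "")
    return None


# Hypothetical pod object, trimmed to the fields the function reads.
example_pod = {
    "status": {
        "phase": "Pending",
        "conditions": [
            {
                "type": "PodScheduled",
                "status": "False",
                "reason": "Unschedulable",
                "message": "0/3 nodes are available: 3 Insufficient cpu.",
            }
        ],
    }
}

print(unschedulable_reason(example_pod))
```

A pod in this state is not failed; it is waiting for capacity, which is why failing the job immediately is surprising.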

Describe the expected behavior

There should be a timeout field, either in the runner scale set or in the container hooks pod template, that lets the workflow pod wait up to x minutes to be scheduled once a new node becomes available.
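To make the request concrete, here is a rough sketch of the waiting behavior being asked for: poll the pod phase and only give up after a configurable timeout. This is an illustration under stated assumptions, not the container hook's actual implementation. The names `wait_while_pending`, `get_phase`, `timeout_seconds`, and `poll_interval_seconds` are all hypothetical, and `get_phase` stands in for whatever call returns the pod's current phase.

```python
import time
from typing import Callable


def wait_while_pending(
    get_phase: Callable[[], str],
    timeout_seconds: float,
    poll_interval_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until the pod leaves the Pending phase or the timeout expires.

    Returns the first non-Pending phase (for example "Running").
    Raises TimeoutError if the pod is still Pending at the deadline.
    Note that Pending also covers image pulls after scheduling, so this
    waits for the pod to start, not only for a node to be assigned.
    """
    deadline = clock() + timeout_seconds
    while True:
        phase = get_phase()
        if phase != "Pending":
            return phase
        if clock() >= deadline:
            raise TimeoutError(
                f"pod still Pending after {timeout_seconds} seconds"
            )
        sleep(poll_interval_seconds)
```

The `sleep` and `clock` parameters are injectable only so the loop can be exercised without real waiting; a real hook would read the timeout from configuration, which is the field this issue asks for.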

Additional Context

template:
  spec:
    initContainers:
      - name: kube-init
        image: ghcr.io/actions/actions-runner:latest
        command: ["/bin/sh", "-c"]
        args:
          - |
            sudo chown -R 1001:123 /home/runner/_work
        volumeMounts:
          - name: work
            mountPath: /home/runner/_work
    securityContext:
      fsGroup: 123 ## needed to resolve permission issues with mounted volume. https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors#error-access-to-the-path-homerunner_work_tool-is-denied
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        env:
        - name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
          value: /home/runner/pod-templates/default.yml
        - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
          value: "false"  ## To allow jobs without a job container to run, set ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER to false on your runner container. This instructs the runner to disable this check.
        volumeMounts:
        - name: pod-templates
          mountPath: /home/runner/pod-templates
          readOnly: true
    volumes:
      - name: work
        ephemeral:
          volumeClaimTemplate:
            spec:
              accessModes: ["ReadWriteOnce"]
              storageClassName: "managed-csi"
              resources:
                requests:
                  storage: ${local.volume_claim_size}
      - name: pod-templates
        configMap:
          name: "runner-pod-template"
    

containerMode:
  type: "kubernetes"  ## type can be set to dind or kubernetes
  ## the following is required when containerMode.type=kubernetes
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteOnce"]
    # For local testing, use https://github.com/openebs/dynamic-localpv-provisioner/blob/develop/docs/quickstart.md to provide dynamic provision volume with storageClassName: openebs-hostpath
    storageClassName: "managed-csi"
    resources:
      requests:
        storage: 50Gi


Controller Logs

https://gist.github.com/jonathan-fileread/602f6d5fd948bf505a2fa7f5dbd78069

Runner Pod Logs

https://gist.githubusercontent.com/jonathan-fileread/96db9941abc5faba985aae78ef6b3760/raw/196644c97c7698e51bf6ae9b50dbf769dd4f1825/gistfile1.txt

github-actions bot commented Jul 5, 2024

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

ropelli commented Jul 10, 2024

Similar issue: #3630
