
Occasionally jobs are stuck in "waiting for a runner to come online" #3649

Open
victor-smg opened this issue Jul 8, 2024 · 7 comments
Labels
bug (Something isn't working), gha-runner-scale-set (Related to the gha-runner-scale-set mode), needs triage (Requires review from the maintainers)

Comments

victor-smg commented Jul 8, 2024

Controller Version

0.9.3

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

There are no clear reproduction steps for the bug, as it is occasional.
The team has noticed it occurs most often in the morning.

Describe the bug

In the example from our CI pipeline:

  • At 06:49 a job was supposed to be scheduled on the gm2.dind.large-b4j7w runner after the job "get-ecr-login-password"
  • For 40 minutes the job "unit-tests" was stuck waiting for a runner
  • At 07:40:43 the controller logged "Created new ephemeral runner" for a runner named gm2.dind.large-b4j7w-runner-bsk5p, which was then used to run the "unit-tests" job

We have multiple cases of jobs waiting a long time for a runner, without any clear explanation.

Describe the expected behavior

  • Job is triggered
  • ARC controller creates the runner
  • Runner picks up the job and starts running it

Additional Context

We have tried tuning the 'minWorkers' configuration of the runner set to prevent the waiting issue.

The runner sets are deployed with Helm using this values.yaml file (deploy command sketched after the values):

githubConfigUrl: ${github_config_url}

githubConfigSecret:
  github_app_id: "${github_app_id}"
  github_app_installation_id: "${github_app_installation_id}"
  github_app_private_key: |
    ${github_app_private_key}

containerMode:
  type: "kubernetes"
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteOnce"]
    storageClassName: "github-arc"
    resources:
      requests:
        storage: 10Gi

minWorkers: 2

listenerTemplate:
  metadata:
    annotations:
      prometheus.io/scrape: "true"
      prometheus.io/path: "/metrics"
      prometheus.io/port: "8080"
  spec:
    containers:
    - name: listener
      securityContext:
        runAsUser: 1000
      resources:
        limits:
          memory: 64Mi
        requests:
          cpu: 100m
          memory: 64Mi
    nodeSelector:
      "karpenter.sh/nodepool": runner
    tolerations:
      - key: "karpenter.sh/nodepool"
        value: "runner"
        operator: "Equal"
        effect: "NoSchedule"

template:
  metadata:
    annotations:
      prometheus.io/scrape: "true"
      prometheus.io/path: "/metrics"
      prometheus.io/port: "8080"
  spec:
    securityContext:
      fsGroup: 1001
    serviceAccount: "${sa_name}"
    containers:
    - name: runner
      image: REDACTED/github-arc-runner:0.7
      command: ["/home/runner/run.sh"]
      env:
        - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
          value: "false"
      resources:
        limits:
          cpu: "${resources.cpu}"
          memory: "${resources.memory}"
        requests:
          cpu: "${resources.cpu}"
          memory: "${resources.memory}"
    nodeSelector:
      "karpenter.sh/nodepool": runner
    tolerations:
      - key: "karpenter.sh/nodepool"
        value: "runner"
        operator: "Equal"
        effect: "NoSchedule"

Controller Logs

The logs are best read filtered on "gm2.dind.large-b4j7w":

"https://gist.github.com/victor-smg/1c7ed78e9702dd2389e98725c9ac37f5.js"

Runner Pod Logs

No relevant logs from the runner, as the lack of a runner is the issue itself.
Contributor

github-actions bot commented Jul 8, 2024

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

@andresrsanchez

We are now facing this issue as well, on the same version.

@krzyzakp

A bit more info that may help.
So far we thought it was mostly a morning problem, but we are being impacted right now as well. In theory there was only a GitHub Copilot outage in the last hour, nothing about Actions or the API, but one of our runners is still NOT fetching jobs. It seems like autoscalingrunnersets is not doing its job correctly.

krzyzakp@X1Carbon:/home/krzyzakp $ k describe -n github-arc autoscalingrunnersets.actions.github.com gm2.small
Name:         gm2.small
Namespace:    github-arc
Labels:       actions.github.com/organization=tutti-ch
              actions.github.com/scale-set-name=gm2.small
              actions.github.com/scale-set-namespace=github-arc
              app.kubernetes.io/component=autoscaling-runner-set
              app.kubernetes.io/instance=gm2.small
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=gm2.small
              app.kubernetes.io/part-of=gha-rs
              app.kubernetes.io/version=0.9.3
              helm.sh/chart=gha-rs-0.9.3
Annotations:  actions.github.com/cleanup-github-secret-name: gm2.small-gha-rs-github-secret
              actions.github.com/cleanup-kubernetes-mode-role-binding-name: gm2.small-gha-rs-kube-mode
              actions.github.com/cleanup-kubernetes-mode-role-name: gm2.small-gha-rs-kube-mode
              actions.github.com/cleanup-kubernetes-mode-service-account-name: gm2.small-gha-rs-kube-mode
              actions.github.com/cleanup-manager-role-binding: gm2.small-gha-rs-manager
              actions.github.com/cleanup-manager-role-name: gm2.small-gha-rs-manager
              actions.github.com/runner-group-name: Default
              actions.github.com/runner-scale-set-name: gm2.small
              actions.github.com/values-hash: <REDACTED in comment>
              meta.helm.sh/release-name: gm2.small
              meta.helm.sh/release-namespace: github-arc
              runner-scale-set-id: 2
API Version:  actions.github.com/v1alpha1
Kind:         AutoscalingRunnerSet
Metadata:
  Creation Timestamp:  2024-05-22T12:57:11Z
  Finalizers:
    autoscalingrunnerset.actions.github.com/finalizer
  Generation:        32
  Resource Version:  326465633
  UID:               b6f45bbd-9c25-475c-b633-ebd4290365e3
Spec:
  Github Config Secret:  gm2.small-gha-rs-github-secret
  Github Config URL:     https://github.com/tutti-ch
  Listener Template:
    Metadata:
      Annotations:
        prometheus.io/path:    /metrics
        prometheus.io/port:    8080
        prometheus.io/scrape:  true
    Spec:
      Containers:
        Name:  listener
        Resources:
          Limits:
            Memory:  64Mi
          Requests:
            Cpu:     100m
            Memory:  64Mi
        Security Context:
          Run As User:  1000
      Node Selector:
        karpenter.sh/nodepool:  runner
      Tolerations:
        Effect:    NoSchedule
        Key:       karpenter.sh/nodepool
        Operator:  Equal
        Value:     runner
  Template:
    Metadata:
      Annotations:
        prometheus.io/path:    /metrics
        prometheus.io/port:    8080
        prometheus.io/scrape:  true
    Spec:
      Containers:
        Command:
          /home/runner/run.sh
        Env:
          Name:   ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
          Value:  false
          Name:   ACTIONS_RUNNER_CONTAINER_HOOKS
          Value:  /home/runner/k8s/index.js
          Name:   ACTIONS_RUNNER_POD_NAME
          Value From:
            Field Ref:
              Field Path:  metadata.name
        Image:             668273420038.dkr.ecr.eu-central-1.amazonaws.com/github-arc-runner:0.7
        Name:              runner
        Resources:
          Limits:
            Cpu:     500m
            Memory:  1Gi
          Requests:
            Cpu:     500m
            Memory:  1Gi
        Volume Mounts:
          Mount Path:  /home/runner/_work
          Name:        work
      Node Selector:
        karpenter.sh/nodepool:  runner
      Restart Policy:           Never
      Security Context:
        Fs Group:            1001
      Service Account:       gm.small
      Service Account Name:  gm2.small-gha-rs-kube-mode
      Tolerations:
        Effect:    NoSchedule
        Key:       karpenter.sh/nodepool
        Operator:  Equal
        Value:     runner
      Volumes:
        Ephemeral:
          Volume Claim Template:
            Spec:
              Access Modes:
                ReadWriteOnce
              Resources:
                Requests:
                  Storage:         10Gi
              Storage Class Name:  github-arc
        Name:                      work
Status:
  Current Runners:            23
  Failed Ephemeral Runners:   22
  Pending Ephemeral Runners:  1
  Running Ephemeral Runners:  0
Events:                       <none>

You can clearly see that there are 23 runners, 22 of them failed, but even if I restart the job and try to get hold of a pod, I don't see any.
What's more surprising, other runner sets using exactly the same config (except mem/cpu limits) are working fine, even the bigger ones, so I don't think it's a lack of node capacity.
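
To see at a glance which runners are failed versus running, listing the EphemeralRunner resources and their pods helps (a sketch; assumes the ARC CRDs are installed, as in the describe output above):

# List the ephemeral runners in the namespace with their phases,
# then cross-check which ones actually have a pod.
kubectl get ephemeralrunners -n github-arc
kubectl get pods -n github-arc | grep 'gm2.small'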

On top of that, in the logs I see (read from bottom to top):

2024-07-11T15:20:19Z	INFO	EphemeralRunner	Deleting the ephemeral runner pod	{"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}, "podId": "61019e6a-6221-4d3d-8fa3-689cad9a2396"}
2024-07-11T15:20:19Z	INFO	EphemeralRunner	Ephemeral runner pod has finished, but the runner still exists in the service. Deleting the pod to restart it.	{"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}}
2024-07-11T15:20:19Z	INFO	EphemeralRunner	Runner exists in GitHub service	{"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}, "runnerId": 604044}
2024-07-11T15:20:19Z	INFO	EphemeralRunner	Checking if runner exists in GitHub service	{"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}, "runnerId": 604044}
2024-07-11T15:20:15Z	INFO	AutoscalingRunnerSet	Find existing ephemeral runner set	{"version": "0.9.3", "autoscalingrunnerset": {"name":"gm2.small","namespace":"github-arc"}, "name": "gm2.small-zh6h4", "specHash": "75f975979"}
2024-07-11T15:20:15Z	INFO	EphemeralRunner	Ephemeral runner container is still running	{"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}}
2024-07-11T15:20:15Z	INFO	EphemeralRunnerSet	Scaling comparison	{"version": "0.9.3", "ephemeralrunnerset": {"name":"gm2.small-zh6h4","namespace":"github-arc"}, "current": 23, "desired": 23}
2024-07-11T15:20:15Z	INFO	EphemeralRunnerSet	Ephemeral runner counts	{"version": "0.9.3", "ephemeralrunnerset": {"name":"gm2.small-zh6h4","namespace":"github-arc"}, "pending": 0, "running": 1, "finished": 0, "failed": 22, "deleting": 0}
2024-07-11T15:20:15Z	INFO	EphemeralRunnerSet	Updating status with current runners count	{"version": "0.9.3", "ephemeralrunnerset": {"name":"gm2.small-zh6h4","namespace":"github-arc"}, "count": 23}
2024-07-11T15:20:15Z	INFO	EphemeralRunnerSet	Scaling comparison	{"version": "0.9.3", "ephemeralrunnerset": {"name":"gm2.small-zh6h4","namespace":"github-arc"}, "current": 23, "desired": 23}
2024-07-11T15:20:15Z	INFO	EphemeralRunnerSet	Ephemeral runner counts	{"version": "0.9.3", "ephemeralrunnerset": {"name":"gm2.small-zh6h4","namespace":"github-arc"}, "pending": 0, "running": 1, "finished": 0, "failed": 22, "deleting": 0}
2024-07-11T15:20:15Z	INFO	EphemeralRunner	Updated ephemeral runner status with pod phase	{"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}}
2024-07-11T15:20:15Z	INFO	EphemeralRunner	Updating ephemeral runner status with pod phase	{"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}, "statusPhase": "Running", "statusReason": "", "statusMessage": ""}
2024-07-11T15:20:15Z	INFO	EphemeralRunner	Ephemeral runner container is still running	{"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}}
2024-07-11T15:20:09Z	INFO	EphemeralRunner	Ephemeral runner container is still running	{"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}}
2024-07-11T15:20:09Z	INFO	EphemeralRunnerSet	Scaling comparison	{"version": "0.9.3", "ephemeralrunnerset": {"name":"gm2.small-zh6h4","namespace":"github-arc"}, "current": 23, "desired": 23}
2024-07-11T15:20:09Z	INFO	EphemeralRunnerSet	Ephemeral runner counts	{"version": "0.9.3", "ephemeralrunnerset": {"name":"gm2.small-zh6h4","namespace":"github-arc"}, "pending": 1, "running": 0, "finished": 0, "failed": 22, "deleting": 0}
2024-07-11T15:20:09Z	INFO	EphemeralRunner	Updated ephemeral runner status with pod phase	{"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}}
2024-07-11T15:20:09Z	INFO	EphemeralRunner	Updating ephemeral runner status with pod phase	{"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}, "statusPhase": "Pending", "statusReason": "", "statusMessage": ""}
2024-07-11T15:20:09Z	INFO	EphemeralRunner	Ephemeral runner container is still running	{"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}}
2024-07-11T15:20:09Z	INFO	EphemeralRunner	Waiting for runner container status to be available	{"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}}
2024-07-11T15:20:02Z	INFO	EphemeralRunner	Waiting for runner container status to be available	{"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}}
2024-07-11T15:20:02Z	INFO	EphemeralRunner	Waiting for runner container status to be available	{"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}}
2024-07-11T15:20:02Z	INFO	EphemeralRunner	Created ephemeral runner pod	{"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}, "runnerScaleSetId": 2, "runnerName": "gm2.small-zh6h4-runner-v9sch", "runnerId": 604044, "configUrl": "https://github.com/tutti-ch", "podName": "gm2.small-zh6h4-runner-v9sch"}
2024-07-11T15:20:02Z	INFO	EphemeralRunner	Created new pod spec for ephemeral runner	{"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}}
2024-07-11T15:20:02Z	INFO	EphemeralRunner	Creating new pod for ephemeral runner	{"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}}
2024-07-11T15:20:02Z	INFO	EphemeralRunner	Creating new EphemeralRunner pod.	{"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}}

It's super strange that the pod was running for just a few seconds and then was deleted.
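
When a pod disappears that quickly, its last state and the namespace events usually say why (a sketch; the pod name is taken from the logs above and may already be gone, hence the events fallback):

# Inspect the short-lived runner pod, then check recent namespace events.
kubectl describe pod -n github-arc gm2.small-zh6h4-runner-v9sch
kubectl get events -n github-arc --sort-by=.lastTimestamp | tail -n 30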

@mohsen0

mohsen0 commented Jul 12, 2024

Is there a workaround for this? The issue effectively causes a very poor experience.

@keisari-ch

Experiencing this issue as well with a new deployment on AKS.
The runner is never deployed; the only custom thing is that there is a nodeSelector.

Controller logs

2024-07-22T09:51:19Z    INFO    EphemeralRunnerSet      Ephemeral runner counts {"version": "0.9.3", "ephemeralrunnerset": {"name":"arc-runner-itplat-jdf4g","namespace":"arc-runner-itplat"}, "pending": 0, "running": 0, "finished": 0, "failed": 1, "deleting": 0}
2024-07-22T09:51:19Z    INFO    EphemeralRunnerSet      Scaling comparison      {"version": "0.9.3", "ephemeralrunnerset": {"name":"arc-runner-itplat-jdf4g","namespace":"arc-runner-itplat"}, "current": 1, "desired": 1}
2024-07-22T09:51:19Z    INFO    AutoscalingRunnerSet    Find existing ephemeral runner set      {"version": "0.9.3", "autoscalingrunnerset": {"name":"arc-runner-itplat","namespace":"arc-runner-itplat"}, "name": "arc-runner-itplat-jdf4g", "specHash": "84bc4c7fd5"}
2024-07-22T09:52:09Z    INFO    EphemeralRunnerSet      Ephemeral runner counts {"version": "0.9.3", "ephemeralrunnerset": {"name":"arc-runner-itplat-jdf4g","namespace":"arc-runner-itplat"}, "pending": 0, "running": 0, "finished": 0, "failed": 1, "deleting": 0}
2024-07-22T09:52:09Z    INFO    EphemeralRunnerSet      Scaling comparison      {"version": "0.9.3", "ephemeralrunnerset": {"name":"arc-runner-itplat-jdf4g","namespace":"arc-runner-itplat"}, "current": 1, "desired": 1}
2024-07-22T09:52:09Z    INFO    AutoscalingRunnerSet    Find existing ephemeral runner set      {"version": "0.9.3", "autoscalingrunnerset": {"name":"arc-runner-itplat","namespace":"arc-runner-itplat"}, "name": "arc-runner-itplat-jdf4g", "specHash": "84bc4c7fd5"}
2024-07-22T09:52:59Z    INFO    EphemeralRunnerSet      Ephemeral runner counts {"version": "0.9.3", "ephemeralrunnerset": {"name":"arc-runner-itplat-jdf4g","namespace":"arc-runner-itplat"}, "pending": 0, "running": 0, "finished": 0, "failed": 1, "deleting": 0}
2024-07-22T09:52:59Z    INFO    EphemeralRunnerSet      Scaling comparison      {"version": "0.9.3", "ephemeralrunnerset": {"name":"arc-runner-itplat-jdf4g","namespace":"arc-runner-itplat"}, "current": 1, "desired": 1}
2024-07-22T09:52:59Z    INFO    AutoscalingRunnerSet    Find existing ephemeral runner set      {"version": "0.9.3", "autoscalingrunnerset": {"name":"arc-runner-itplat","namespace":"arc-runner-itplat"}, "name": "arc-runner-itplat-jdf4g", "specHash": "84bc4c7fd5"}

Listener logs

2024-07-22T09:51:19Z    INFO    listener-app.worker.kubernetesworker    Preparing EphemeralRunnerSet update     {"json": "{\"spec\":{\"patchID\":0,\"replicas\":1}}"}
2024-07-22T09:51:19Z    INFO    listener-app.worker.kubernetesworker    Ephemeral runner set scaled.    {"namespace": "arc-runner-itplat", "name": "arc-runner-itplat-jdf4g", "replicas": 1}
2024-07-22T09:51:19Z    INFO    listener-app.listener   Getting next message    {"lastMessageID": 0}
2024-07-22T09:52:09Z    INFO    listener-app.worker.kubernetesworker    Calculated target runner count  {"assigned job": 0, "decision": 1, "min": 0, "max": 2147483647, "currentRunnerCount": 1, "jobsCompleted": 0}
2024-07-22T09:52:09Z    INFO    listener-app.worker.kubernetesworker    Compare {"original": "{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"replicas\":-1,\"patchID\":-1,\"ephemeralRunnerSpec\":{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"containers\":null}}},\"status\":{\"currentReplicas\":0,\"pendingEphemeralRunners\":0,\"runningEphemeralRunners\":0,\"failedEphemeralRunners\":0}}", "patch": "{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"replicas\":1,\"patchID\":1,\"ephemeralRunnerSpec\":{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"containers\":null}}},\"status\":{\"currentReplicas\":0,\"pendingEphemeralRunners\":0,\"runningEphemeralRunners\":0,\"failedEphemeralRunners\":0}}"}
2024-07-22T09:52:09Z    INFO    listener-app.worker.kubernetesworker    Preparing EphemeralRunnerSet update     {"json": "{\"spec\":{\"patchID\":1,\"replicas\":1}}"}
2024-07-22T09:52:09Z    INFO    listener-app.worker.kubernetesworker    Ephemeral runner set scaled.    {"namespace": "arc-runner-itplat", "name": "arc-runner-itplat-jdf4g", "replicas": 1}
2024-07-22T09:52:09Z    INFO    listener-app.listener   Getting next message    {"lastMessageID": 0}
2024-07-22T09:52:59Z    INFO    listener-app.worker.kubernetesworker    Calculated target runner count  {"assigned job": 0, "decision": 1, "min": 0, "max": 2147483647, "currentRunnerCount": 1, "jobsCompleted": 0}
2024-07-22T09:52:59Z    INFO    listener-app.worker.kubernetesworker    Compare {"original": "{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"replicas\":-1,\"patchID\":-1,\"ephemeralRunnerSpec\":{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"containers\":null}}},\"status\":{\"currentReplicas\":0,\"pendingEphemeralRunners\":0,\"runningEphemeralRunners\":0,\"failedEphemeralRunners\":0}}", "patch": "{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"replicas\":1,\"patchID\":2,\"ephemeralRunnerSpec\":{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"containers\":null}}},\"status\":{\"currentReplicas\":0,\"pendingEphemeralRunners\":0,\"runningEphemeralRunners\":0,\"failedEphemeralRunners\":0}}"}
2024-07-22T09:52:59Z    INFO    listener-app.worker.kubernetesworker    Preparing EphemeralRunnerSet update     {"json": "{\"spec\":{\"patchID\":2,\"replicas\":1}}"}
2024-07-22T09:52:59Z    INFO    listener-app.worker.kubernetesworker    Ephemeral runner set scaled.    {"namespace": "arc-runner-itplat", "name": "arc-runner-itplat-jdf4g", "replicas": 1}
2024-07-22T09:52:59Z    INFO    listener-app.listener   Getting next message    {"lastMessageID": 0}
2024-07-22T09:53:49Z    INFO    listener-app.worker.kubernetesworker    Calculated target runner count  {"assigned job": 0, "decision": 1, "min": 0, "max": 2147483647, "currentRunnerCount": 1, "jobsCompleted": 0}
2024-07-22T09:53:49Z    INFO    listener-app.worker.kubernetesworker    Compare {"original": "{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"replicas\":-1,\"patchID\":-1,\"ephemeralRunnerSpec\":{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"containers\":null}}},\"status\":{\"currentReplicas\":0,\"pendingEphemeralRunners\":0,\"runningEphemeralRunners\":0,\"failedEphemeralRunners\":0}}", "patch": "{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"replicas\":1,\"patchID\":3,\"ephemeralRunnerSpec\":{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"containers\":null}}},\"status\":{\"currentReplicas\":0,\"pendingEphemeralRunners\":0,\"runningEphemeralRunners\":0,\"failedEphemeralRunners\":0}}"}
2024-07-22T09:53:49Z    INFO    listener-app.worker.kubernetesworker    Preparing EphemeralRunnerSet update     {"json": "{\"spec\":{\"patchID\":3,\"replicas\":1}}"}
2024-07-22T09:53:49Z    INFO    listener-app.worker.kubernetesworker    Ephemeral runner set scaled.    {"namespace": "arc-runner-itplat", "name": "arc-runner-itplat-jdf4g", "replicas": 1}
2024-07-22T09:53:49Z    INFO    listener-app.listener   Getting next message    {"lastMessageID": 0}
2024-07-22T09:54:40Z    INFO    listener-app.worker.kubernetesworker    Calculated target runner count  {"assigned job": 0, "decision": 1, "min": 0, "max": 2147483647, "currentRunnerCount": 1, "jobsCompleted": 0}
2024-07-22T09:54:40Z    INFO    listener-app.worker.kubernetesworker    Compare {"original": "{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"replicas\":-1,\"patchID\":-1,\"ephemeralRunnerSpec\":{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"containers\":null}}},\"status\":{\"currentReplicas\":0,\"pendingEphemeralRunners\":0,\"runningEphemeralRunners\":0,\"failedEphemeralRunners\":0}}", "patch": "{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"replicas\":1,\"patchID\":4,\"ephemeralRunnerSpec\":{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"containers\":null}}},\"status\":{\"currentReplicas\":0,\"pendingEphemeralRunners\":0,\"runningEphemeralRunners\":0,\"failedEphemeralRunners\":0}}"}
2024-07-22T09:54:40Z    INFO    listener-app.worker.kubernetesworker    Preparing EphemeralRunnerSet update     {"json": "{\"spec\":{\"patchID\":4,\"replicas\":1}}"}
2024-07-22T09:54:40Z    INFO    listener-app.worker.kubernetesworker    Ephemeral runner set scaled.    {"namespace": "arc-runner-itplat", "name": "arc-runner-itplat-jdf4g", "replicas": 1}
2024-07-22T09:54:40Z    INFO    listener-app.listener   Getting next message    {"lastMessageID": 0}
2024-07-22T09:55:30Z    INFO    listener-app.worker.kubernetesworker    Calculated target runner count  {"assigned job": 0, "decision": 1, "min": 0, "max": 2147483647, "currentRunnerCount": 1, "jobsCompleted": 0}
2024-07-22T09:55:30Z    INFO    listener-app.worker.kubernetesworker    Compare {"original": "{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"replicas\":-1,\"patchID\":-1,\"ephemeralRunnerSpec\":{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"containers\":null}}},\"status\":{\"currentReplicas\":0,\"pendingEphemeralRunners\":0,\"runningEphemeralRunners\":0,\"failedEphemeralRunners\":0}}", "patch": "{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"replicas\":1,\"patchID\":5,\"ephemeralRunnerSpec\":{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"containers\":null}}},\"status\":{\"currentReplicas\":0,\"pendingEphemeralRunners\":0,\"runningEphemeralRunners\":0,\"failedEphemeralRunners\":0}}"}
2024-07-22T09:55:30Z    INFO    listener-app.worker.kubernetesworker    Preparing EphemeralRunnerSet update     {"json": "{\"spec\":{\"patchID\":5,\"replicas\":1}}"}
2024-07-22T09:55:30Z    INFO    listener-app.worker.kubernetesworker    Ephemeral runner set scaled.    {"namespace": "arc-runner-itplat", "name": "arc-runner-itplat-jdf4g", "replicas": 1}
2024-07-22T09:55:30Z    INFO    listener-app.listener   Getting next message    {"lastMessageID": 0}

Totally blocked.
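
For anyone hitting this loop, a quick first check is whether the listener pod itself is healthy, and bouncing it if not (a sketch; the namespace is a placeholder for wherever your ARC controller and listener pods run, and the label matches the default chart labels):

# Find the listener pod for the scale set, then delete it;
# the controller recreates it automatically.
kubectl get pods -n arc-systems -l app.kubernetes.io/component=runner-scale-set-listener
kubectl delete pod -n arc-systems -l app.kubernetes.io/component=runner-scale-set-listener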

@gwynforthewyn
Contributor

We've seen the same issue. There are other issues where this problem was reported on older versions of actions-runner-controller, and one of them has a comment saying the issue is fixed by restarting the listener.

This doesn't truly fix the issue, but as a triage measure you can deploy a CronJob with something like this:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: actions-listener-restart-cron
  namespace: actions
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: your-service-account
          containers:
          - name: actions-listener-restart-pod
            image: bitnami/kubectl:v1.26.14
            imagePullPolicy: Always
            securityContext:
              allowPrivilegeEscalation: false
              capabilities:
                drop: ["ALL"]
              runAsUser: 999
            command:
            - /bin/sh
            - -c
            - kubectl delete pod -n actions -l app.kubernetes.io/component=runner-scale-set-listener
          restartPolicy: Never

You'll also need to define the ServiceAccount and its Role/RoleBinding, but the above should get you started.
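
A minimal way to create that ServiceAccount plus Role/RoleBinding imperatively (a sketch; the names are placeholders matching the manifest above):

# ServiceAccount the CronJob runs as, plus RBAC allowing it to list
# and delete the listener pods in the "actions" namespace.
kubectl create serviceaccount your-service-account -n actions
kubectl create role listener-restart --verb=get,list,delete --resource=pods -n actions
kubectl create rolebinding listener-restart \
  --role=listener-restart \
  --serviceaccount=actions:your-service-account \
  -n actions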

@krzyzakp

On our side, it somehow magically stopped being an issue over 2 weeks ago. Since then, every job has been picked up (except for about an hour when we messed up our AWS networking and the Kubernetes cluster was unreachable, but that was purely a network problem, and one we caused ourselves).
So for now I can't really debug it further or add the recommended restart, as it simply works.
