Pods stuck in Terminating state with `RunnerDeployment` #1575

iochagov · 2022-06-28T11:44:35Z

Controller Version

0.24.1

Helm Chart Version

0.19.1

CertManager Version

1.8.0

Deployment Method

Helm

cert-manager installation

yes
yes

Checks

This isn't a question or user support case (For Q&A and community support, go to Discussions. It might also be a good idea to contract with any of contributors and maintainers if your business is so critical and therefore you need priority support
I've read releasenotes before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
My actions-runner-controller version (v0.x.y) does support the feature
I've already upgraded ARC (including the CRDs, see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest and it didn't fix the issue

Resource Definitions

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  annotations:
    meta.helm.sh/release-name: actions-runner-deployment
    meta.helm.sh/release-namespace: actions-runner
  creationTimestamp: "2022-06-19T11:09:09Z"
  generation: 205
  labels:
    app.kubernetes.io/instance: actions-runner-deployment
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: actions-runner-deployment
    app.kubernetes.io/version: 0.1.1
    helm.sh/chart: actions-runner-deployment-0.1.0
  name: fb-staging-runners
  namespace: actions-runner
  resourceVersion: "274043494"
  uid: 6c568e0b-eeb1-431e-8253-eca9058fb783
spec:
  effectiveTime: null
  replicas: 22
  selector:
    matchLabels:
      runner-deployment-name: fb-staging-runners
  template:
    metadata:
      annotations:
        fluentbit.io/exclude: "true"
    spec:
      dockerdContainerResources: {}
      dockerdWithinRunnerContainer: true
      image: ***.dkr.ecr.eu-central-1.amazonaws.com/actions-runner:0.2.4
      imagePullPolicy: IfNotPresent
      labels:
      - fb-staging-runners
      nodeSelector:
        role: github-runners
      organization: AcroCharge
      resources:
        limits:
          cpu: "15"
          memory: 28Gi
        requests:
          cpu: "15"
          memory: 14Gi
      tolerations:
      - effect: NoSchedule
        key: role
        operator: Equal
        value: github-runners
      volumeMounts:
      - mountPath: /mnt/efs
        name: yarn-cache
      volumes:
      - name: yarn-cache
        persistentVolumeClaim:
          claimName: yarn-cache
status:
  availableReplicas: 22
  desiredReplicas: 22
  readyReplicas: 22
  replicas: 29
  updatedReplicas: 29

To Reproduce

This issue is similar to this issue (https://github.com/actions-runner-controller/actions-runner-controller/issues/1369) but for `RunnerDeployment`.

After pod finishes its and is about to vanish it never terminates properly because it is getting stuck in Terminating state. It is impossible to delete this pod even with `--force` argument. The only way to delete it is to remove `finalizers` section form the pod spec.

We also happen to use spot instances, but we use Spot.io autoscaler to control nodes.

Describe the bug

Pods stuck in Terminating state after their job.

Describe the expected behavior

Pods should terminate normally

Controller Logs

Logs of the controller are too long, I will attach them as a file in separate comment.

Runner Pod Logs

**CloudWatch Logs Insights**  
region: eu-central-1  
log-group-names: /aws/containerinsights/staging-2/application  
start-time: -3600s  
end-time: 0s  
query-string:

filter @logStream = 'ip-172-27-131-226.eu-central-1.compute.internal-application.var.log.containers.fb-staging-runners-lx2z8-7mp2l_actions-runner_runner-8f7ab42146ad385cff065bdc843206b0f48990d07595161e9f4ab337786834d7.log'
 | fields @timestamp, @message
 | parse @message '"log":"*"' as log
 | display @timestamp, log

@timestamp	log
2022-06-28 10:59:07.593	2022-06-28 10:59:07Z: Listening for Jobs\n
2022-06-28 10:59:07.592	Current runner version: '2.293.0'\n
2022-06-28 10:59:07.036	\n
2022-06-28 10:59:07.036	\u221a Connected to GitHub\n
2022-06-28 10:59:07.036	\n
2022-06-28 10:59:05.627	Warning: '--once' is going to be deprecated in the future, please consider using '--ephemeral' during runner registration.\n
2022-06-28 10:59:05.627	https://docs.github.com/en/actions/hosting-your-own-runners/autoscaling-with-self-hosted-runners#using-ephemeral-runners-for-autoscaling\n
2022-06-28 10:59:00.038	Runner update process finished.\n
2022-06-28 10:58:59.917	Runner will exit shortly for update, should be back online within 10 seconds.\n
2022-06-28 10:58:59.783	Generate and execute update script.\n
2022-06-28 10:58:59.645	Waiting for current job finish running.\n
2022-06-28 10:58:55.326	Downloading 2.293.0 runner\n
2022-06-28 10:58:55.173	Runner update in progress, do not shutdown runner.\n
2022-06-28 10:57:42.163	Current runner version: '2.290.1'\n
2022-06-28 10:57:42.163	2022-06-28 10:57:42Z: Listening for Jobs\n
2022-06-28 10:57:41.463	\n
2022-06-28 10:57:41.463	\n
2022-06-28 10:57:41.463	\u221a Connected to GitHub\n
2022-06-28 10:57:39.959	}Warning: '--once' is going to be deprecated in the future, please consider using '--ephemeral' during runner registration.\n
2022-06-28 10:57:39.959	https://docs.github.com/en/actions/hosting-your-own-runners/autoscaling-with-self-hosted-runners#using-ephemeral-runners-for-autoscaling\n
2022-06-28 10:57:39.707	\u001b[0;33m2022-06-28 10:57:39.706 WARNING --- Passing --once is deprecated and will be removed as an option from the image and actions-runner-controller at the release of 0.24.0. Upgrade to GHES => 3.3 to continue using actions-runner-controller. If you are using github.com ignore this warning.\u001b[0m\n
2022-06-28 10:57:39.705	mv: cannot stat './externalstmp/*': No such file or directory\n
2022-06-28 10:57:39.703	\u001b[0;37m2022-06-28 10:57:39.702 DEBUG --- Runner successfully configured.\u001b[0m\n
2022-06-28 10:57:39.703	\ufeff{\n \
2022-06-28 10:57:39.679	\n
2022-06-28 10:57:39.679	\u221a Settings Saved.\n
2022-06-28 10:57:39.679	\n
2022-06-28 10:57:39.676	\u221a Runner connection is good\n
2022-06-28 10:57:39.676	\n
2022-06-28 10:57:39.676	# Runner settings\n
2022-06-28 10:57:39.676	\n
2022-06-28 10:57:37.813	\u221a Runner successfully added\n
2022-06-28 10:57:37.436	\n
2022-06-28 10:57:37.436	\n
2022-06-28 10:57:37.435	\n
2022-06-28 10:57:36.966	\n
2022-06-28 10:57:36.966	# Runner Registration\n
2022-06-28 10:57:36.966	\n
2022-06-28 10:57:36.623	\n
2022-06-28 10:57:36.623	\u221a Connected to GitHub\n
2022-06-28 10:57:34.948
2022-06-28 10:57:34.948
2022-06-28 10:57:34.948	--------------------------------------------------------------------------------\n
2022-06-28 10:57:34.948	\n
2022-06-28 10:57:34.948	# Authentication\n
2022-06-28 10:57:34.948	\n
2022-06-28 10:57:34.947	\n
2022-06-28 10:57:34.947	--------------------------------------------------------------------------------\n
2022-06-28 10:57:34.947
2022-06-28 10:57:34.947
2022-06-28 10:57:34.947
2022-06-28 10:57:34.947
2022-06-28 10:57:34.947
2022-06-28 10:57:34.947
2022-06-28 10:57:34.741	\u001b[0;37m2022-06-28 10:57:34.740 DEBUG --- Configuring the runner.\u001b[0m\n
2022-06-28 10:57:34.351	\u001b[0;37m2022-06-28 10:57:34.351 DEBUG --- Github endpoint URL https://github.com/\u001b[0m\n
2022-06-28 10:57:34.348	CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES\n
2022-06-28 10:57:34.329	\u001b[0;37m2022-06-28 10:57:34.329 DEBUG --- Waiting until Docker is available or the timeout is reached\u001b[0m\n
2022-06-28 10:57:34.328	\u001b[0;37m2022-06-28 10:57:34.328 DEBUG --- Docker enabled runner detected and Docker daemon wait is enabled\u001b[0m\n
2022-06-28 10:57:34.325	\u001b[0;37m2022-06-28 10:57:34.325 DEBUG --- dockerd is running\u001b[0m\n
2022-06-28 10:57:33.322	\u001b[0;37m2022-06-28 10:57:33.321 DEBUG --- Waited 1 seconds of 30 seconds\u001b[0m\n
2022-06-28 10:57:33.321	\u001b[0;37m2022-06-28 10:57:33.320 DEBUG --- Process dockerd is not running yet. Retrying in 1 seconds\u001b[0m\n
2022-06-28 10:57:32.317	\u001b[0;37m2022-06-28 10:57:32.316 DEBUG --- Waited 0 seconds of 30 seconds\u001b[0m\n
2022-06-28 10:57:32.315	\u001b[0;37m2022-06-28 10:57:32.315 DEBUG --- Process dockerd is not running yet. Retrying in 1 seconds\u001b[0m\n
2022-06-28 10:57:32.312	\u001b[0;37m2022-06-28 10:57:32.311 DEBUG --- Waiting for processes to be running...\u001b[0m\n
2022-06-28 10:57:32.310	\u001b[0;37m2022-06-28 10:57:32.310 DEBUG --- Starting supervisor daemon\u001b[0m\n
2022-06-28 10:57:32.309	[program:dockerd]\n
2022-06-28 10:57:32.309	command=/usr/local/bin/dockerd\n
2022-06-28 10:57:32.309	autostart=true\n
2022-06-28 10:57:32.309	autorestart=true\n
2022-06-28 10:57:32.309	stderr_logfile=/var/log/dockerd.err.log\n
2022-06-28 10:57:32.309	stdout_logfile=/var/log/dockerd.out.log\n
2022-06-28 10:57:32.309	---\n
2022-06-28 10:57:32.308	{}\n
2022-06-28 10:57:32.308	---\n
2022-06-28 10:57:32.308	Using /etc/supervisor/conf.d/dockerd.conf with the following content:\n
2022-06-28 10:57:32.308	---\n
2022-06-28 10:57:32.307	Using /etc/docker/daemon.json with the following content:\n
2022-06-28 10:57:32.307	---\n



### Additional Context

_No response_

The text was updated successfully, but these errors were encountered:

iochagov · 2022-06-28T11:45:08Z

Controller Logs
arc.log

iochagov · 2022-07-11T14:49:49Z

As a consequence of this bug Runners page is spammed by unused Runners names:

This matters because we cannot collect metrics with such huge number Runners. Prometheus exporter asks API for a list of Runners and it returns a JSON array with ~10000 elements in it.

This is happening because controller tries to terminate Runner, something is not working, the pod hangs indefinitely, Controller thinks the pod is stuck and creates a new Runner with a new name.

toast-gear · 2022-07-11T14:54:54Z

--once was never fully implemented by GitHub nor supported, your logs indicate you're using the --once flag. We recently even fully removed the ability to use it #1196. I would, as a first step, move over to the supported flag --ephemeral as GitHub fully supports it, a lot of race conditions present in the --once flag were fixed and the runners deregister themselves preventing a build up of old runners.

You should also automate bumping your image from our image as the runner self-update process as been shown to cause long delays in jobs being allocated to runners in some unknown circumstances.

Please switch to the --ephemeral switch and report back the behaviour you're seeing.

iochagov · 2022-07-11T17:02:38Z

Thanks @toast-gear, switching on ephemeral and upgrading Runner image to summerwind/actions-runner-dind:v2.294.0-ubuntu-20.04-f661249 helped to solve the issue.

iochagov added the bug Something isn't working label Jun 28, 2022

iochagov closed this as completed Jul 11, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pods stuck in Terminating state with `RunnerDeployment` #1575

Pods stuck in Terminating state with `RunnerDeployment` #1575

iochagov commented Jun 28, 2022

iochagov commented Jun 28, 2022

iochagov commented Jul 11, 2022

toast-gear commented Jul 11, 2022 •

edited

Loading

iochagov commented Jul 11, 2022

Pods stuck in Terminating state with RunnerDeployment #1575

Pods stuck in Terminating state with RunnerDeployment #1575

Comments

iochagov commented Jun 28, 2022

Controller Version

Helm Chart Version

CertManager Version

Deployment Method

cert-manager installation

Checks

Resource Definitions

To Reproduce

Describe the bug

Describe the expected behavior

Controller Logs

Runner Pod Logs

iochagov commented Jun 28, 2022

iochagov commented Jul 11, 2022

toast-gear commented Jul 11, 2022 • edited Loading

iochagov commented Jul 11, 2022

Pods stuck in Terminating state with `RunnerDeployment` #1575

Pods stuck in Terminating state with `RunnerDeployment` #1575

toast-gear commented Jul 11, 2022 •

edited

Loading