Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pods stuck in Terminating state with RunnerDeployment #1575

Closed
4 tasks done
iochagov opened this issue Jun 28, 2022 · 4 comments
Closed
4 tasks done

Pods stuck in Terminating state with RunnerDeployment #1575

iochagov opened this issue Jun 28, 2022 · 4 comments
Labels
bug Something isn't working

Comments

@iochagov
Copy link

Controller Version

0.24.1

Helm Chart Version

0.19.1

CertManager Version

1.8.0

Deployment Method

Helm

cert-manager installation

  • yes
  • yes

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions. It might also be a good idea to contract with any of contributors and maintainers if your business is so critical and therefore you need priority support
  • I've read releasenotes before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
  • My actions-runner-controller version (v0.x.y) does support the feature
  • I've already upgraded ARC (including the CRDs, see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest and it didn't fix the issue

Resource Definitions

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  annotations:
    meta.helm.sh/release-name: actions-runner-deployment
    meta.helm.sh/release-namespace: actions-runner
  creationTimestamp: "2022-06-19T11:09:09Z"
  generation: 205
  labels:
    app.kubernetes.io/instance: actions-runner-deployment
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: actions-runner-deployment
    app.kubernetes.io/version: 0.1.1
    helm.sh/chart: actions-runner-deployment-0.1.0
  name: fb-staging-runners
  namespace: actions-runner
  resourceVersion: "274043494"
  uid: 6c568e0b-eeb1-431e-8253-eca9058fb783
spec:
  effectiveTime: null
  replicas: 22
  selector:
    matchLabels:
      runner-deployment-name: fb-staging-runners
  template:
    metadata:
      annotations:
        fluentbit.io/exclude: "true"
    spec:
      dockerdContainerResources: {}
      dockerdWithinRunnerContainer: true
      image: ***.dkr.ecr.eu-central-1.amazonaws.com/actions-runner:0.2.4
      imagePullPolicy: IfNotPresent
      labels:
      - fb-staging-runners
      nodeSelector:
        role: github-runners
      organization: AcroCharge
      resources:
        limits:
          cpu: "15"
          memory: 28Gi
        requests:
          cpu: "15"
          memory: 14Gi
      tolerations:
      - effect: NoSchedule
        key: role
        operator: Equal
        value: github-runners
      volumeMounts:
      - mountPath: /mnt/efs
        name: yarn-cache
      volumes:
      - name: yarn-cache
        persistentVolumeClaim:
          claimName: yarn-cache
status:
  availableReplicas: 22
  desiredReplicas: 22
  readyReplicas: 22
  replicas: 29
  updatedReplicas: 29

To Reproduce

This issue is similar to this issue (https://github.com/actions-runner-controller/actions-runner-controller/issues/1369) but for `RunnerDeployment`.

After pod finishes its and is about to vanish it never terminates properly because it is getting stuck in Terminating state. It is impossible to delete this pod even with `--force` argument. The only way to delete it is to remove `finalizers` section form the pod spec.

We also happen to use spot instances, but we use Spot.io autoscaler to control nodes.

Describe the bug

Pods stuck in Terminating state after their job.

Describe the expected behavior

Pods should terminate normally

Controller Logs

Logs of the controller are too long, I will attach them as a file in separate comment.

Runner Pod Logs

**CloudWatch Logs Insights**  
region: eu-central-1  
log-group-names: /aws/containerinsights/staging-2/application  
start-time: -3600s  
end-time: 0s  
query-string:

filter @logStream = 'ip-172-27-131-226.eu-central-1.compute.internal-application.var.log.containers.fb-staging-runners-lx2z8-7mp2l_actions-runner_runner-8f7ab42146ad385cff065bdc843206b0f48990d07595161e9f4ab337786834d7.log'
 | fields @timestamp, @message
 | parse @message '"log":"*"' as log
 | display @timestamp, log

@timestamp log
2022-06-28 10:59:07.593 2022-06-28 10:59:07Z: Listening for Jobs\n
2022-06-28 10:59:07.592 Current runner version: '2.293.0'\n
2022-06-28 10:59:07.036 \n
2022-06-28 10:59:07.036 \u221a Connected to GitHub\n
2022-06-28 10:59:07.036 \n
2022-06-28 10:59:05.627 Warning: '--once' is going to be deprecated in the future, please consider using '--ephemeral' during runner registration.\n
2022-06-28 10:59:05.627 https://docs.github.com/en/actions/hosting-your-own-runners/autoscaling-with-self-hosted-runners#using-ephemeral-runners-for-autoscaling\n
2022-06-28 10:59:00.038 Runner update process finished.\n
2022-06-28 10:58:59.917 Runner will exit shortly for update, should be back online within 10 seconds.\n
2022-06-28 10:58:59.783 Generate and execute update script.\n
2022-06-28 10:58:59.645 Waiting for current job finish running.\n
2022-06-28 10:58:55.326 Downloading 2.293.0 runner\n
2022-06-28 10:58:55.173 Runner update in progress, do not shutdown runner.\n
2022-06-28 10:57:42.163 Current runner version: '2.290.1'\n
2022-06-28 10:57:42.163 2022-06-28 10:57:42Z: Listening for Jobs\n
2022-06-28 10:57:41.463 \n
2022-06-28 10:57:41.463 \n
2022-06-28 10:57:41.463 \u221a Connected to GitHub\n
2022-06-28 10:57:39.959 }Warning: '--once' is going to be deprecated in the future, please consider using '--ephemeral' during runner registration.\n
2022-06-28 10:57:39.959 https://docs.github.com/en/actions/hosting-your-own-runners/autoscaling-with-self-hosted-runners#using-ephemeral-runners-for-autoscaling\n
2022-06-28 10:57:39.707 \u001b[0;33m2022-06-28 10:57:39.706 WARNING --- Passing --once is deprecated and will be removed as an option from the image and actions-runner-controller at the release of 0.24.0. Upgrade to GHES => 3.3 to continue using actions-runner-controller. If you are using github.com ignore this warning.\u001b[0m\n
2022-06-28 10:57:39.705 mv: cannot stat './externalstmp/*': No such file or directory\n
2022-06-28 10:57:39.703 \u001b[0;37m2022-06-28 10:57:39.702 DEBUG --- Runner successfully configured.\u001b[0m\n
2022-06-28 10:57:39.703 \ufeff{\n \
2022-06-28 10:57:39.679 \n
2022-06-28 10:57:39.679 \u221a Settings Saved.\n
2022-06-28 10:57:39.679 \n
2022-06-28 10:57:39.676 \u221a Runner connection is good\n
2022-06-28 10:57:39.676 \n
2022-06-28 10:57:39.676 # Runner settings\n
2022-06-28 10:57:39.676 \n
2022-06-28 10:57:37.813 \u221a Runner successfully added\n
2022-06-28 10:57:37.436 \n
2022-06-28 10:57:37.436 \n
2022-06-28 10:57:37.435 \n
2022-06-28 10:57:36.966 \n
2022-06-28 10:57:36.966 # Runner Registration\n
2022-06-28 10:57:36.966 \n
2022-06-28 10:57:36.623 \n
2022-06-28 10:57:36.623 \u221a Connected to GitHub\n
2022-06-28 10:57:34.948
2022-06-28 10:57:34.948
2022-06-28 10:57:34.948 --------------------------------------------------------------------------------\n
2022-06-28 10:57:34.948 \n
2022-06-28 10:57:34.948 # Authentication\n
2022-06-28 10:57:34.948 \n
2022-06-28 10:57:34.947 \n
2022-06-28 10:57:34.947 --------------------------------------------------------------------------------\n
2022-06-28 10:57:34.947
2022-06-28 10:57:34.947
2022-06-28 10:57:34.947
2022-06-28 10:57:34.947
2022-06-28 10:57:34.947
2022-06-28 10:57:34.947
2022-06-28 10:57:34.741 \u001b[0;37m2022-06-28 10:57:34.740 DEBUG --- Configuring the runner.\u001b[0m\n
2022-06-28 10:57:34.351 \u001b[0;37m2022-06-28 10:57:34.351 DEBUG --- Github endpoint URL https://github.com/\u001b[0m\n
2022-06-28 10:57:34.348 CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES\n
2022-06-28 10:57:34.329 \u001b[0;37m2022-06-28 10:57:34.329 DEBUG --- Waiting until Docker is available or the timeout is reached\u001b[0m\n
2022-06-28 10:57:34.328 \u001b[0;37m2022-06-28 10:57:34.328 DEBUG --- Docker enabled runner detected and Docker daemon wait is enabled\u001b[0m\n
2022-06-28 10:57:34.325 \u001b[0;37m2022-06-28 10:57:34.325 DEBUG --- dockerd is running\u001b[0m\n
2022-06-28 10:57:33.322 \u001b[0;37m2022-06-28 10:57:33.321 DEBUG --- Waited 1 seconds of 30 seconds\u001b[0m\n
2022-06-28 10:57:33.321 \u001b[0;37m2022-06-28 10:57:33.320 DEBUG --- Process dockerd is not running yet. Retrying in 1 seconds\u001b[0m\n
2022-06-28 10:57:32.317 \u001b[0;37m2022-06-28 10:57:32.316 DEBUG --- Waited 0 seconds of 30 seconds\u001b[0m\n
2022-06-28 10:57:32.315 \u001b[0;37m2022-06-28 10:57:32.315 DEBUG --- Process dockerd is not running yet. Retrying in 1 seconds\u001b[0m\n
2022-06-28 10:57:32.312 \u001b[0;37m2022-06-28 10:57:32.311 DEBUG --- Waiting for processes to be running...\u001b[0m\n
2022-06-28 10:57:32.310 \u001b[0;37m2022-06-28 10:57:32.310 DEBUG --- Starting supervisor daemon\u001b[0m\n
2022-06-28 10:57:32.309 [program:dockerd]\n
2022-06-28 10:57:32.309 command=/usr/local/bin/dockerd\n
2022-06-28 10:57:32.309 autostart=true\n
2022-06-28 10:57:32.309 autorestart=true\n
2022-06-28 10:57:32.309 stderr_logfile=/var/log/dockerd.err.log\n
2022-06-28 10:57:32.309 stdout_logfile=/var/log/dockerd.out.log\n
2022-06-28 10:57:32.309 ---\n
2022-06-28 10:57:32.308 {}\n
2022-06-28 10:57:32.308 ---\n
2022-06-28 10:57:32.308 Using /etc/supervisor/conf.d/dockerd.conf with the following content:\n
2022-06-28 10:57:32.308 ---\n
2022-06-28 10:57:32.307 Using /etc/docker/daemon.json with the following content:\n
2022-06-28 10:57:32.307 ---\n



### Additional Context

_No response_
@iochagov iochagov added the bug Something isn't working label Jun 28, 2022
@iochagov
Copy link
Author

Controller Logs
arc.log

@iochagov
Copy link
Author

As a consequence of this bug Runners page is spammed by unused Runners names:
image

This matters because we cannot collect metrics with such huge number Runners. Prometheus exporter asks API for a list of Runners and it returns a JSON array with ~10000 elements in it.

This is happening because controller tries to terminate Runner, something is not working, the pod hangs indefinitely, Controller thinks the pod is stuck and creates a new Runner with a new name.

@toast-gear
Copy link
Collaborator

toast-gear commented Jul 11, 2022

--once was never fully implemented by GitHub nor supported, your logs indicate you're using the --once flag. We recently even fully removed the ability to use it #1196. I would, as a first step, move over to the supported flag --ephemeral as GitHub fully supports it, a lot of race conditions present in the --once flag were fixed and the runners deregister themselves preventing a build up of old runners.

You should also automate bumping your image from our image as the runner self-update process as been shown to cause long delays in jobs being allocated to runners in some unknown circumstances.

Please switch to the --ephemeral switch and report back the behaviour you're seeing.

@iochagov
Copy link
Author

Thanks @toast-gear, switching on ephemeral and upgrading Runner image to summerwind/actions-runner-dind:v2.294.0-ubuntu-20.04-f661249 helped to solve the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants