
Kubernetes mode terminating early or not terminating at all. #3578

Open
4 tasks done
Josh-Engle opened this issue Jun 5, 2024 · 9 comments
Labels
bug (Something isn't working) · gha-runner-scale-set (Related to the gha-runner-scale-set mode) · question (Further information is requested)

Comments

@Josh-Engle

Controller Version

0.9.0

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

Create a workflow with a job that sleeps for 7 minutes and uses an ARC runner scale set configured in Kubernetes mode (a minimal sketch follows below).

Run the workflow.
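
For reference, a minimal workflow along those lines might look like the sketch below. The workflow and job names are made up for illustration; "sandbox" is the runnerScaleSetName from the values in Additional Context.

```yaml
# Hypothetical reproduction sketch; names are arbitrary.
name: sleep-repro
on: workflow_dispatch

jobs:
  sleep:
    # Targets the runner scale set ("sandbox") configured in kubernetes mode.
    runs-on: sandbox
    steps:
      - name: Sleep for 7 minutes
        run: sleep 420
```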

Describe the bug

Even though the job should sleep for 7 minutes, the workflow either completes successfully after about 4 minutes or it never completes at all.

Terminating Early:
(screenshot)

Never Terminating:
(screenshot)

Describe the expected behavior

The workflow should complete only after the 7-minute sleep command finishes.

Additional Context

values:
  - containerMode:
      kubernetesModeWorkVolumeClaim:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 5Gi
        storageClassName: gh-runner-class
      type: kubernetes
    controllerServiceAccount:
      name: arc-gha-rs-controller
      namespace: arc-systems
    githubConfigSecret: sandbox
    githubConfigUrl: https://github.com/****
    githubServerTLS:
      certificateFrom:
        configMapKeyRef:
          key: ca.crt
          name: crt
      runnerMountPath: /usr/local/share/ca-certificates/
    maxRunners: 200
    minRunners: 5
    proxy:
      http:
        url: '*******'
      https:
        url: '*******'
      noProxy:
      - '*******'
      - '*******'
      - '*******'
      - '*******'
      - '*******'
    runnerGroup: preprod
    runnerScaleSetName: sandbox
    template:
      spec:
        containers:
        - command:
          - /home/runner/run.sh
          env:
          - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
            value: "false"
          image: ghcr.io/actions/actions-runner:latest
          name: runner
        securityContext:
          fsGroup: 123

Controller Logs

[actions-runner-controller logs](https://gist.github.com/Josh-Engle/ea789d7eae1ca30e472fd90fba86bece)

Runner Pod Logs

[Runner Pod Logs](https://gist.github.com/Josh-Engle/86fcca504a8c3175e87b1a1efbd3f794)

### kubectl describe output on job that never terminated.

Name:             ****************.k8s-qrp6d-runner-9f8xw
Namespace:        arc-runners
Priority:         0
Service Account:  *************.k8s-gha-rs-kube-mode
Node:             aks-runnerpool-35648369-vmss000015/10.139.31.13
Start Time:       Wed, 05 Jun 2024 11:48:26 -0700
Labels:           actions-ephemeral-runner=True
                  actions.github.com/organization=********************
                  actions.github.com/scale-set-name=****************
                  actions.github.com/scale-set-namespace=arc-runners
                  app.kubernetes.io/component=runner
                  app.kubernetes.io/part-of=gha-runner-scale-set
                  app.kubernetes.io/version=0.9.0
                  pod-template-hash=67bb8777b7
Annotations:      actions.github.com/patch-id: 70
                  actions.github.com/runner-group-name: *****************
                  actions.github.com/runner-scale-set-name: ******************
                  actions.github.com/runner-spec-hash: cdff6bcb
Status:           Running
IP:               10.244.0.76
IPs:
  IP:           10.244.0.76
Controlled By:  EphemeralRunner/*******************.k8s-qrp6d-runner-9f8xw
Containers:
  runner:
    Container ID:  containerd://71795364d03d6e29cc0a7d40623120040354ab416e7139528b39841c164d47f1
    Image:         ghcr.io/actions/actions-runner:latest
    Image ID:      ghcr.io/actions/actions-runner@sha256:95db6fbb020b9f734e8a00389291dae766f0e6ad3d1171ae2d68e9ad8ac4a985
    Port:          <none>
    Host Port:     <none>
    Command:
      /home/runner/run.sh
    State:          Running
      Started:      Wed, 05 Jun 2024 11:48:50 -0700
    Ready:          True
    Restart Count:  0
    Environment:
      ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER:    false
      ACTIONS_RUNNER_CONTAINER_HOOKS:          /home/runner/k8s/index.js
      ACTIONS_RUNNER_POD_NAME:                 *********************.k8s-qrp6d-runner-9f8xw (v1:metadata.name)
      NODE_EXTRA_CA_CERTS:                     /usr/local/share/ca-certificates/ca.crt
      RUNNER_UPDATE_CA_CERTS:                  1
      ACTIONS_RUNNER_INPUT_JITCONFIG:          <set to the key 'jitToken' in secret '*************.k8s-qrp6d-runner-9f8xw'>  Optional: false
      GITHUB_ACTIONS_RUNNER_EXTRA_USER_AGENT:  actions-runner-controller/0.9.0
      http_proxy:                              <set to the key 'http_proxy' in secret '*************.k8s-qrp6d-754b578d-runner-proxy'>   Optional: false
      https_proxy:                             <set to the key 'https_proxy' in secret '*************.k8s-qrp6d-754b578d-runner-proxy'>  Optional: false
      no_proxy:                                <set to the key 'no_proxy' in secret '*************.k8s-qrp6d-754b578d-runner-proxy'>     Optional: false
    Mounts:
      /home/runner/_work from work (rw)
      /usr/local/share/ca-certificates/ca.crt from github-server-tls-cert (rw,path="ca.crt")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-c6wc6 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  github-server-tls-cert:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      crt
    Optional:  false
  work:
    Type:          EphemeralVolume (an inline specification for a volume that gets created and deleted with the pod)
    StorageClass:  gh-runner-class
    Volume:        
    Labels:            <none>
    Annotations:       <none>
    Capacity:      
    Access Modes:  
    VolumeMode:    Filesystem
  kube-api-access-c6wc6:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>
@Josh-Engle Josh-Engle added bug Something isn't working gha-runner-scale-set Related to the gha-runner-scale-set mode needs triage Requires review from the maintainers labels Jun 5, 2024
Contributor

github-actions bot commented Jun 5, 2024

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

@nikola-jokic
Contributor

Hey @Josh-Engle,

Can you please upgrade to 0.9.2 and let us know if you are still seeing this issue? Versions 0.9.0 and 0.9.1 can potentially kill the runner before it finishes the job, so I'm wondering if that is the source of the issue.
If you confirm that it is still happening on 0.9.2, can you please provide a runner log for a job that exits prematurely and one for a job that never exits? I failed to reproduce this behavior on 0.9.2, so I'm wondering if the fix we introduced to the controller eliminated this problem.

@nikola-jokic nikola-jokic added question Further information is requested and removed needs triage Requires review from the maintainers labels Jun 17, 2024
@Josh-Engle
Author

@nikola-jokic I reproduced the issue on 0.9.0 and 0.9.1, but it went away on 0.9.2, so you're spot on. Thanks for the help!

@nikola-jokic
Contributor

Thank you for letting us know!

@genesis-jamin

Hi @nikola-jokic, could you link the PR that fixes this issue? I'm just curious to see what it was.

@genesis-jamin

genesis-jamin commented Jul 10, 2024

Hi @nikola-jokic, we are actually seeing this error on 0.9.3, where the job exits prematurely with exit code 0. We first started noticing this in our pytest steps, which would finish with a Completed status before all tests had been run. I was able to reproduce the behavior with a step that runs sleep 6000:

(screenshot)

From the screenshot above, we can see that the job exits after 5m 29s instead.

Here is a gist to the runner log: https://gist.github.com/genesis-jamin/774d115df441c3afdd755f73a3c499dc

You can grep the logs for "Finished process 170 with exit code 0" to see where the sleep 6000 step ends. Let me know if you need anything else to help debug.

@Josh-Engle
Author

I'm re-opening this, as we are actually still seeing the issue on 0.9.2. The affected workflows were working correctly immediately after the upgrade, but they are now exhibiting the same behavior as before.

@Josh-Engle Josh-Engle reopened this Jul 10, 2024
@genesis-jamin

Someone else also reported the same issue in the container hooks repo: actions/runner-container-hooks#165

I'm going to copy my comment to that issue as well, for posterity's sake.

@genesis-jamin

We were able to root-cause this: it turns out it's related to our k8s cluster setup. Our cluster is hosted on GKE, and we noticed that every time a GitHub step terminated early, it happened right after the cluster scaled down and evicted a konnectivity-agent pod. I'm not a k8s networking expert myself, but it seems that this pod is responsible for maintaining the websocket connection used when exec'ing into a pod (which is what the runner does to run commands inside the workflow pod).

We were able to somewhat mitigate this issue by adding taints/tolerations so that konnectivity-agent pods don't run on our pytest node pool (which scales up and down frequently); a rough sketch follows below. This helps us avoid the case where a konnectivity-agent pod is evicted because its node is scaled down, but it does not solve the case where the konnectivity autoscaler reduces the number of replicas.
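
For illustration only (the taint key/value and node-pool label below are hypothetical, not the exact configuration from this cluster), the taint-plus-toleration approach might look roughly like this: taint the frequently autoscaled node pool so konnectivity-agent pods, which carry no matching toleration, schedule elsewhere, and give the runner/workflow pods the matching toleration via the scale set's pod template.

```yaml
# Sketch only. First taint the autoscaled node pool, e.g.:
#   kubectl taint nodes -l cloud.google.com/gke-nodepool=pytest-pool dedicated=pytest:NoSchedule
# (on GKE this is usually set with --node-taints when creating the node pool)
#
# Then add the matching toleration to the gha-runner-scale-set pod template values:
template:
  spec:
    tolerations:
    - key: dedicated
      operator: Equal
      value: pytest
      effect: NoSchedule
    containers:
    - name: runner
      image: ghcr.io/actions/actions-runner:latest
      command:
      - /home/runner/run.sh
```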

Another option for us is to disable autoscaling, but that defeats the purpose of using ARC in the first place 😆
