
Runner Scale Set gets stuck, crash loops every 3 seconds: failed to create session: 409 had issue communicating with Actions backend: The runner scale set gha-rs already has an active session for owner gha-rs-7db9c9f7-listener #3351

Closed
jeffmccune opened this issue Mar 14, 2024 · 12 comments
Labels
gha-runner-scale-set Related to the gha-runner-scale-set mode question Further information is requested

Comments

@jeffmccune

Controller Version

0.8.3

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

1. Install on west coast cluster
2. Install on east coast cluster
3. Create one `AutoscalingRunnerSet` named `gha-rs` in each of the two clusters.
4. Observe one of the listener pods becomes deadlocked with error `Application returned an error: createSession failed: failed to create session: 409 - had issue communicating with Actions backend: The runner scale set gha-rs already has an active session for owner gha-rs-deadbeef-listener.`

Describe the bug

The listener fails to start.

Describe the expected behavior

The listener should start in both clusters.

Additional Context

The values.yaml used is as close to the upstream documentation as possible.  The only customization is:


```cue
values: {
	controllerServiceAccount: name:      "gha-rs-controller"
	controllerServiceAccount: namespace: "arc-system"
	githubConfigSecret: "controller-manager"
	githubConfigUrl:    "https://github.com/myorg"
}
```


Where the `controller-manager` secret contains GitHub App credentials.
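For readers not using CUE, the snippet above corresponds roughly to the following Helm values.yaml (a sketch; the key names follow the upstream gha-runner-scale-set chart layout, and the secret and namespace names are taken from the CUE above):

```yaml
# Sketch of an equivalent values.yaml for the gha-runner-scale-set chart.
# Key names assume the upstream chart; secret/namespace names come from
# the CUE snippet above.
githubConfigUrl: "https://github.com/myorg"
githubConfigSecret: "controller-manager"
controllerServiceAccount:
  name: "gha-rs-controller"
  namespace: "arc-system"
```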

Controller Logs

https://gist.github.com/jeffmccune/e893d4af28727d55979f75fcfddc6536#file-controller-logs-txt

Runner Pod Logs

https://gist.github.com/jeffmccune/e893d4af28727d55979f75fcfddc6536#file-listener-pod-logs-txt
@jeffmccune jeffmccune added bug Something isn't working gha-runner-scale-set Related to the gha-runner-scale-set mode needs triage Requires review from the maintainers labels Mar 14, 2024
Contributor

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

@jeffmccune
Author

This bug may be caused by the hashing method. The pod name is always hashed to the same value regardless of the cluster.

```console
❯ KUBECONFIG=$HOME/.kube/k2 k get pods -n arc-system
NAME                                 READY   STATUS              RESTARTS   AGE
gha-rs-7db9c9f7-listener             0/1     ContainerCreating   0          0s
gha-rs-controller-6897c9bffb-trdc2   1/1     Running             0          2d18h
❯ KUBECONFIG=$HOME/.kube/k3 k get pods -n arc-system
NAME                                 READY   STATUS              RESTARTS   AGE
gha-rs-7db9c9f7-listener             0/1     ContainerCreating   0          1s
gha-rs-controller-6897c9bffb-4jx2m   1/1     Running             0          3d2h
❯ KUBECONFIG=$HOME/.kube/k4 k get pods -n arc-system
NAME                                 READY   STATUS    RESTARTS   AGE
gha-rs-7db9c9f7-listener             1/1     Running   0          25h
gha-rs-controller-6897c9bffb-7gp68   1/1     Running   0          2d19h
```
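A quick way to see why identical names collide (a sketch only; the controller's actual hashing scheme may differ — this merely illustrates that any deterministic hash of the same inputs yields the same suffix on every cluster):

```shell
# Hypothetical illustration: derive a listener-name suffix from a
# deterministic hash of the scale set's namespace/name. Two clusters
# feeding the same inputs produce the same 8-character suffix, hence
# the identical pod names seen above.
suffix() { printf '%s' "$1" | sha256sum | cut -c1-8; }
west=$(suffix 'arc-system/gha-rs')   # "west coast" cluster
east=$(suffix 'arc-system/gha-rs')   # "east coast" cluster
echo "west=$west east=$east"
[ "$west" = "$east" ] && echo 'identical suffix -> name collision'
```

The cluster identity never enters the hash, so every cluster running the same scale set name in the same namespace computes the same listener name.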

@jeffmccune
Author

jeffmccune commented Mar 14, 2024

Bug looks to be here.

A workaround is to create different controller namespaces in different clusters, but this unfortunately isn't a permanent solution because it doesn't square with the position of SIG Multicluster.

@nikola-jokic
Contributor

Hey @jeffmccune,

Since your scale sets are named the same, do they belong to different runner groups? If not, that is the problem.

@nikola-jokic nikola-jokic added question Further information is requested and removed bug Something isn't working needs triage Requires review from the maintainers labels Mar 15, 2024
@jeffmccune
Author

Hi @nikola-jokic thanks for following up.

No, we need to have at least two clusters in the same group for multi-region redundancy. Why is it a problem? There is one scale set spanning N>1 regions.

@jeffmccune
Author

jeffmccune commented Mar 15, 2024

What I'm trying to accomplish is to allow dev teams to have a default `runs-on` target that runs on any available cluster in any available region. We regularly take down entire clusters, so having a dev team target a specific region or cluster isn't ideal; their workflows would fail when we take the cluster down even though other clusters are available to run the workflow.

What's the recommended way to have a workflow target any available scale set in any available cluster (region)?

@nikola-jokic
Contributor

Oh, you can't have two scale sets with the same name belonging to the same runner group. That is the cause of the error in this report: the scale set already has a session open. I hope this document can help you.

@jeffmccune
Author

jeffmccune commented Mar 15, 2024 via email

jeffmccune added a commit to holos-run/holos that referenced this issue Mar 15, 2024
The effect of this patch is limited to refreshing credentials only for
namespaces that exist in the local cluster.  There is structure in place
in the CUE code to allow for namespaces bound to specific clusters, but
this is used only by the optional Vault component.

This patch was an attempt to work around
actions/actions-runner-controller#3351 by
deploying the runner scale sets into unique namespaces.

This effort was a waste of time; only one listener pod successfully
registered for a given scale set name / group combination.

Because we have only one group named Default we can only have one
listener pod globally for a given scale set name.

Because we want our workflows to execute regardless of the availability
of a single cluster, we're going to let this fail for now.  The pod
retries every 3 seconds.  When a cluster is destroyed, another cluster
will quickly register.

A follow-up patch will look at expanding this retry behavior.
jeffmccune added a commit to holos-run/holos that referenced this issue Mar 15, 2024
This patch fixes the problem of the actions runner scale set listener
pod failing every 3 seconds.  See
actions/actions-runner-controller#3351

The solution is not ideal: if the primary cluster is down, workflows will
not execute. The primary cluster shouldn't go down, though, so this is
the trade-off. We accept lower availability when the primary cluster is
unavailable in exchange for less log spam and resource usage, by
eliminating the failing pods on the other clusters.

We could let the pods loop so that, if the primary became unavailable,
another would quickly pick up the role, but it doesn't seem worth it.
@nikola-jokic
Contributor

Sorry if I misunderstood your question, but I think you are asking about one of these two points:

  1. You want to define one scale set and deploy it on, for example, 3 clusters, and have a workflow pick whichever one is available. If this is the case, you name each scale set the same but put each in a different runner group. Then the scale set that is up and fastest to acquire the job will take it and run it, so even if you take one of your clusters down, the two that remain keep taking jobs.
  2. If you want to specify a family of scale sets, for example scale-set-1 and scale-set-2, and have a workflow that can run on either scale-set-1 or scale-set-2, that is not currently possible.
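Point 1, sketched as per-cluster values.yaml fragments for the gha-runner-scale-set chart (a sketch; it assumes the chart's `runnerGroup` and `runnerScaleSetName` keys, and the runner groups `west` and `east` are hypothetical names that would need to be created in the organization first):

```yaml
# West-coast cluster (sketch; runner group "west" is hypothetical and
# must already exist in the GitHub organization)
githubConfigUrl: "https://github.com/myorg"
githubConfigSecret: "controller-manager"
runnerScaleSetName: "gha-rs"
runnerGroup: "west"
---
# East-coast cluster (sketch; runner group "east" is hypothetical)
githubConfigUrl: "https://github.com/myorg"
githubConfigSecret: "controller-manager"
runnerScaleSetName: "gha-rs"
runnerGroup: "east"
```

Because each listener registers against a distinct (name, group) pair, both sessions can be open at once, and workflows targeting `runs-on: gha-rs` land on whichever cluster is up.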

@nikola-jokic
Contributor

Closing this one as answered, but feel free to comment on it ☺️

@jeffmccune
Author

jeffmccune commented Mar 20, 2024 via email

@nikola-jokic
Contributor

No problem, thank you for your feedback! Would you be so kind as to put it in the discussion here? In this discussion, people are expressing their thoughts on our single-label approach, and your feedback would be valuable. Thanks ☺️!
