
Runner Scale Set gets stuck, crash loops every 3 seconds: failed to create session: 409 had issue communicating with Actions backend: The runner scale set gha-rs already has an active session for owner gha-rs-7db9c9f7-listener #3351

Closed
jeffmccune opened this issue Mar 14, 2024 · 12 comments
Labels
gha-runner-scale-set Related to the gha-runner-scale-set mode question Further information is requested

Comments

@jeffmccune

Controller Version

0.8.3

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

1. Install on west coast cluster
2. Install on east coast cluster
3. Create one `AutoscalingRunnerSet` named `gha-rs` in each of the two clusters.
4. Observe one of the listener pods becomes deadlocked with error `Application returned an error: createSession failed: failed to create session: 409 - had issue communicating with Actions backend: The runner scale set gha-rs already has an active session for owner gha-rs-deadbeef-listener.`

Describe the bug

The listener fails to start.

Describe the expected behavior

The listener should start in both clusters.

Additional Context

The values.yaml used is as close to the upstream documentation as possible.  The only customization is:


```cue
values: {
	controllerServiceAccount: name:      "gha-rs-controller"
	controllerServiceAccount: namespace: "arc-system"
	githubConfigSecret: "controller-manager"
	githubConfigUrl:    "https://github.com/myorg"
}
```


Where the `controller-manager` secret contains GitHub App credentials.
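For readers not using CUE, the snippet above corresponds roughly to the following Helm values.yaml (a sketch; the key names follow the upstream gha-runner-scale-set chart layout, and the secret and namespace names are taken from the CUE above):

```yaml
# Sketch of an equivalent values.yaml for the gha-runner-scale-set chart.
# Key names assume the upstream chart; secret/namespace names come from
# the CUE snippet above.
githubConfigUrl: "https://github.com/myorg"
githubConfigSecret: "controller-manager"
controllerServiceAccount:
  name: "gha-rs-controller"
  namespace: "arc-system"
```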

Controller Logs

https://gist.github.com/jeffmccune/e893d4af28727d55979f75fcfddc6536#file-controller-logs-txt

Runner Pod Logs

https://gist.github.com/jeffmccune/e893d4af28727d55979f75fcfddc6536#file-listener-pod-logs-txt
@jeffmccune jeffmccune added bug Something isn't working gha-runner-scale-set Related to the gha-runner-scale-set mode needs triage Requires review from the maintainers labels Mar 14, 2024
Contributor

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

@jeffmccune
Author

This bug may be caused by the hashing method. The pod name is always hashed to the same value regardless of the cluster.

```console
❯ KUBECONFIG=$HOME/.kube/k2 k get pods -n arc-system
NAME                                 READY   STATUS              RESTARTS   AGE
gha-rs-7db9c9f7-listener             0/1     ContainerCreating   0          0s
gha-rs-controller-6897c9bffb-trdc2   1/1     Running             0          2d18h
❯ KUBECONFIG=$HOME/.kube/k3 k get pods -n arc-system
NAME                                 READY   STATUS              RESTARTS   AGE
gha-rs-7db9c9f7-listener             0/1     ContainerCreating   0          1s
gha-rs-controller-6897c9bffb-4jx2m   1/1     Running             0          3d2h
❯ KUBECONFIG=$HOME/.kube/k4 k get pods -n arc-system
NAME                                 READY   STATUS    RESTARTS   AGE
gha-rs-7db9c9f7-listener             1/1     Running   0          25h
gha-rs-controller-6897c9bffb-7gp68   1/1     Running   0          2d19h
```
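A quick way to see why identical names collide (a sketch only; the controller's actual hashing scheme may differ — this merely illustrates that any deterministic hash of the same inputs yields the same suffix on every cluster):

```shell
# Hypothetical illustration: derive a listener-name suffix from a
# deterministic hash of the scale set's namespace/name. Two clusters
# feeding the same inputs produce the same 8-character suffix, hence
# the identical pod names seen above.
suffix() { printf '%s' "$1" | sha256sum | cut -c1-8; }
west=$(suffix 'arc-system/gha-rs')   # "west coast" cluster
east=$(suffix 'arc-system/gha-rs')   # "east coast" cluster
echo "west=$west east=$east"
[ "$west" = "$east" ] && echo 'identical suffix -> name collision'
```

The cluster identity never enters the hash, so every cluster running the same scale set name in the same namespace computes the same listener name.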

@jeffmccune
Author

jeffmccune commented Mar 14, 2024

Bug looks to be here.

A workaround is to create different controller namespaces in different clusters, but this unfortunately isn't a permanent solution because it doesn't square with the position of SIG Multicluster.

@nikola-jokic
Contributor

Hey @jeffmccune,

Since your scale sets are named the same, do they belong to different runner groups? If not, that is the problem.

@nikola-jokic nikola-jokic added question Further information is requested and removed bug Something isn't working needs triage Requires review from the maintainers labels Mar 15, 2024
@jeffmccune
Author

Hi @nikola-jokic thanks for following up.

No, we need to have at least two clusters in the same group for multi-region redundancy. Why is it a problem? There is one scale set spanning N>1 regions.

@jeffmccune
Author

jeffmccune commented Mar 15, 2024

What I'm trying to accomplish is to allow dev teams to have a default `runs-on` target that runs on any available cluster in any available region. We regularly take down entire clusters, so having a dev team target a specific region or cluster isn't ideal; their workflows would fail when we take the cluster down even though other clusters are available to run the workflow.

What's the recommended way to have a workflow target any available scale set in any available cluster (region)?

@nikola-jokic
Contributor

Oh, you can't have two scale sets with the same name belonging to the same runner group. That is the cause of the error in this report: the scale set already has a session open. I hope this document can help you.

@jeffmccune
Author

jeffmccune commented Mar 15, 2024 via email

jeffmccune added a commit to holos-run/holos that referenced this issue Mar 15, 2024
The effect of this patch is limited to refreshing credentials only for
namespaces that exist in the local cluster.  There is structure in place
in the CUE code to allow for namespaces bound to specific clusters, but
this is used only by the optional Vault component.

This patch was an attempt to work around
actions/actions-runner-controller#3351 by
deploying the runner scale sets into unique namespaces.

This effort was a waste of time; only one listener pod successfully
registered for a given scale set name / group combination.

Because we have only one group named Default we can only have one
listener pod globally for a given scale set name.

Because we want our workflows to execute regardless of the availability
of a single cluster, we're going to let this fail for now.  The pod
retries every 3 seconds.  When a cluster is destroyed, another cluster
will quickly register.

A follow-up patch will look at expanding this retry behavior.
jeffmccune added a commit to holos-run/holos that referenced this issue Mar 15, 2024
This patch fixes the problem of the actions runner scale set listener
pod failing every 3 seconds.  See
actions/actions-runner-controller#3351

The solution is not ideal: if the primary cluster is down, workflows will
not execute. The primary cluster shouldn't go down, though, so this is
the trade-off. We accept lower availability when the primary cluster is
unavailable in exchange for less log spam and resource usage, by
eliminating the failing pods on the other clusters.

We could let the pods loop so that, if the primary became unavailable,
another would quickly pick up the role, but it doesn't seem worth it.
@nikola-jokic
Contributor

Sorry if I misunderstood your question, but I think you are asking about one of these two points:

  1. You want to define one scale set and deploy it on, for example, 3 clusters, and have a workflow pick whichever one is available. If this is the case, you name each scale set the same but put each in a different runner group. Then the scale set that is up and fastest to acquire the job will take it and run it, so even if you take one of your clusters down, the two that remain keep taking jobs.
  2. If you want to specify a family of scale sets, for example scale-set-1 and scale-set-2, and have a workflow that can run on either scale-set-1 or scale-set-2, that is not currently possible.
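Point 1, sketched as per-cluster values.yaml fragments for the gha-runner-scale-set chart (a sketch; it assumes the chart's `runnerGroup` and `runnerScaleSetName` keys, and the runner groups `west` and `east` are hypothetical names that would need to be created in the organization first):

```yaml
# West-coast cluster (sketch; runner group "west" is hypothetical and
# must already exist in the GitHub organization)
githubConfigUrl: "https://github.com/myorg"
githubConfigSecret: "controller-manager"
runnerScaleSetName: "gha-rs"
runnerGroup: "west"
---
# East-coast cluster (sketch; runner group "east" is hypothetical)
githubConfigUrl: "https://github.com/myorg"
githubConfigSecret: "controller-manager"
runnerScaleSetName: "gha-rs"
runnerGroup: "east"
```

Because each listener registers against a distinct (name, group) pair, both sessions can be open at once, and workflows targeting `runs-on: gha-rs` land on whichever cluster is up.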

@nikola-jokic
Contributor

Closing this one as answered, but feel free to comment on it ☺️

@jeffmccune
Author

jeffmccune commented Mar 20, 2024 via email

@nikola-jokic
Contributor

No problem, thank you for your feedback! Would you be so kind as to put it in the discussion here? In this discussion, people are expressing their thoughts on our single-label approach, and your feedback would be valuable. Thanks ☺️!
