Runner Scale Set gets stuck, crash loops every 3 seconds: failed to create session: 409 had issue communicating with Actions backend: The runner scale set gha-rs already has an active session for owner gha-rs-7db9c9f7-listener #3351
Hello! Thank you for filing an issue. The maintainers will triage your issue shortly. In the meantime, please take a look at the troubleshooting guide for bug reports. If this is a feature request, please review our contribution guidelines.
This bug may be caused by the hashing method. The pod name is always hashed to the same value regardless of the cluster.
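A minimal sketch of the suspected failure mode, assuming (as the comment above suggests) that the listener's session-owner name is derived by hashing inputs that are identical in every cluster. The `ownerName` function and its FNV choice are hypothetical illustrations, not the controller's actual code: if no cluster-specific input is mixed into the hash, every cluster computes the same owner and the second cluster's session request is rejected with a 409.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// ownerName derives a session-owner suffix from the scale set name alone.
// Hypothetical sketch: because nothing cluster-specific is hashed in,
// two clusters running the same scale set produce identical owners.
func ownerName(scaleSet string) string {
	h := fnv.New32a()
	h.Write([]byte(scaleSet))
	return fmt.Sprintf("%s-%x-listener", scaleSet, h.Sum32())
}

func main() {
	// Both "clusters" compute the same owner, so the second
	// session request collides with the first (the 409 above).
	fmt.Println(ownerName("gha-rs"))
	fmt.Println(ownerName("gha-rs"))
}
```

Mixing a per-cluster value (node identity, a random suffix, a configured cluster name) into the hash input would make the owners diverge.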
The bug looks to be here. A workaround is to create different controller namespaces in different clusters, but this unfortunately isn't a permanent solution because it doesn't square with the position of SIG Multicluster.
Hey @jeffmccune, since your scale sets are named the same, do they belong to different runner groups? If not, that is the problem.
Hi @nikola-jokic, thanks for following up. No, we need to have at least two clusters in the same group for multi-region redundancy. Why is it a problem? There is one scale set spanning N>1 regions.
What I'm trying to accomplish is to allow dev teams to have a default runs-on target that runs on any available cluster in any available region. We regularly take down entire clusters, so having a dev team target a specific region or cluster isn't ideal: their workflows would fail when we take the cluster down even though other clusters are available to run the workflow. What's the recommended way to have a workflow target any available scale set in any available cluster (region)?
Oh, you can't have two scale sets with the same name belonging to the same runner group. That is the reason for this report: that scale set already has a session opened. I hope this document can help you.
Previously it was straightforward to spin up N>1 clusters with self-hosted runners labeled "self-hosted", and jobs would execute globally. There was no unnecessary coordination between teams of "cluster X is going down for maintenance, update your workflows."

How can this similar level of availability be achieved? Is there a way to configure the workflow to run on any one of X, Y, or Z runner sets?
The effect of this patch is limited to refreshing credentials only for namespaces that exist in the local cluster. There is structure in place in the CUE code to allow for namespaces bound to specific clusters, but this is used only by the optional Vault component.

This patch was an attempt to work around actions/actions-runner-controller#3351 by deploying the runner scale sets into unique namespaces. The effort was wasted: only one listener pod successfully registered for a given scale set name / group combination. Because we have only one group, named Default, we can have only one listener pod globally for a given scale set name.

Because we want our workflows to execute regardless of the availability of a single cluster, we're going to let this fail for now. The pod retries every 3 seconds, so when a cluster is destroyed, another cluster will quickly register. A follow-up patch will look to expand this retry behavior.
This patch fixes the problem of the actions runner scale set listener pod failing every 3 seconds. See actions/actions-runner-controller#3351.

The solution is not ideal: if the primary cluster is down, workflows will not execute. The primary cluster shouldn't go down, though, so this is the trade-off: lower log spam and resource usage, by eliminating the failing pods on other clusters, in exchange for lower availability if the primary cluster is unavailable. We could let the pods loop so that if the primary became unavailable another would quickly pick up the role, but it doesn't seem worth it.
Sorry if I misunderstood your question, but these are two points I think you are asking:
Closing this one as answered, but feel free to comment on it.
Thanks for taking the time to answer this. You're correct about my two questions. The only comment I have is that I'm a bit frustrated because with scale sets, groups are required to achieve the same behavior that was previously supported with labels. My frustration is that groups are a paid enterprise feature but labels are not, so this feels like a step backwards. I'm a paying customer, but some of the GitHub orgs I work with are not, so they cannot use groups and as a result cannot deploy highly available scale sets. Please consider adding back some mechanism to have a workflow execute on any available cluster without requiring a paid feature, like was possible previously. Thanks again for your time responding to the question.
No problem, thank you for your feedback! Would you be so kind as to put it in the discussion here? In this discussion, people are expressing their thoughts on our single-label approach, and your feedback would be valuable. Thanks!
Checks
Controller Version
0.8.3
Deployment Method
Helm
To Reproduce
Describe the bug
The listener fails to start.
Describe the expected behavior
The listener should start in both clusters.
Additional Context
Controller Logs
Runner Pod Logs