RunnerScaleSet listener fails on all but one cluster #57

Closed
jeffmccune opened this issue Mar 14, 2024 · 0 comments · Fixed by #62

Comments

@jeffmccune
Contributor

Tracks actions/actions-runner-controller#3351

jeffmccune added a commit that referenced this issue Mar 15, 2024
The effect of this patch is limited to refreshing credentials only for
namespaces that exist in the local cluster.  There is structure in place
in the CUE code to allow for namespaces bound to specific clusters, but
this is used only by the optional Vault component.

This patch was an attempt to work around
actions/actions-runner-controller#3351 by
deploying the runner scale sets into unique namespaces.

This effort was a waste of time: only one listener pod successfully
registered for a given scale set name / group combination.

Because we have only one group, named Default, we can only have one
listener pod globally for a given scale set name.

Because we want our workflows to execute regardless of the availability
of a single cluster, we're going to let this fail for now.  The pod
retries every 3 seconds.  When a cluster is destroyed, another cluster
will quickly register.

A follow-up patch will look at expanding this retry behavior.
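
For illustration, the namespace-to-cluster binding mentioned above might look roughly like the following CUE sketch. The package, definition, and field names (#Namespace, clusters, the example namespace and cluster names) are assumptions made for the example, not the repository's actual schema.

```cue
package example // hypothetical package name

// Each namespace declares which clusters it is rendered on.  Credential
// refresh is then limited to namespaces whose clusters list includes the
// local cluster.
#Namespace: {
	name: string
	// Clusters this namespace is bound to; empty means every cluster.
	clusters: [...string]
}

namespaces: [Name=string]: #Namespace & {name: Name}

// Illustrative per-cluster runner namespace, the kind of layout this patch
// attempted as a workaround for the one-listener-per-scale-set limit.
namespaces: "arc-runners-east": clusters: ["east"]
```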
jeffmccune added a commit that referenced this issue Mar 15, 2024
This patch fixes the problem of the actions runner scale set listener
pod failing every 3 seconds.  See
actions/actions-runner-controller#3351.

The solution is not ideal: if the primary cluster is down, workflows will
not execute.  The primary cluster shouldn't go down, though, so this is
the trade-off: lower log spam and resource usage by eliminating the
failing pods on the other clusters, in exchange for lower availability if
the primary cluster is unavailable.

We could let the pods loop so that if the primary is unavailable another
cluster would quickly pick up the role, but it doesn't seem worth it.
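
A minimal sketch of the approach this patch describes, pinning the runner scale set to a single cluster. It assumes the cluster name is injected per cluster and uses hypothetical field and component names.

```cue
package example // hypothetical package name

// Local cluster name, injected per cluster, e.g. `cue export -t cluster=primary`.
cluster: name: string @tag(cluster)

// Render the runner scale set only on the primary cluster so exactly one
// listener pod registers for the scale set name in the Default group.
if cluster.name == "primary" {
	components: "gha-runner-scale-set": {
		namespace: "arc-runners" // illustrative value
	}
}
```

The other clusters render nothing for this component, which is what eliminates the failing listener pods there.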