RunnerSet Reliability Improvements #1167
Conversation
This eliminates a race condition that resulted in a runner being terminated prematurely when the RunnerSet triggered unregistration of a StatefulSet that had been added just a few seconds earlier.
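To illustrate the kind of guard that avoids this class of race, here is a minimal, hedged sketch (not the actual patch; `registrationGracePeriod` and the `runner` type are made-up names): before unregistering, check the runner's age and requeue instead of deleting when it was added only seconds ago.

```go
// Minimal sketch, assuming a hypothetical runner type and grace period;
// not ARC's actual code.
package main

import (
	"fmt"
	"time"
)

// registrationGracePeriod is an assumed buffer: a runner younger than this
// may not have finished registering with GitHub yet.
const registrationGracePeriod = 60 * time.Second

type runner struct {
	Name      string
	CreatedAt time.Time
}

// shouldDeferUnregistration reports whether unregistering now would race
// with the runner's startup, and if so, how long to wait before retrying.
func shouldDeferUnregistration(r runner, now time.Time) (bool, time.Duration) {
	age := now.Sub(r.CreatedAt)
	if age < registrationGracePeriod {
		return true, registrationGracePeriod - age
	}
	return false, 0
}

func main() {
	r := runner{Name: "runnerset-0", CreatedAt: time.Now().Add(-5 * time.Second)}
	deferIt, wait := shouldDeferUnregistration(r, time.Now())
	fmt.Printf("defer=%v requeueAfter=%s\n", deferIt, wait)
}
```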
…faults to their own source
…e them distinguishable from other annotations
…e more consistent
Merging anyway. But I'll keep testing both RunnerDeployment and RunnerSet, and perhaps submit a PR or two to tweak details and fix anything that could be considered a regression!
Hi @mumoshu, I'd be happy to test these improvements on our live cluster (which has been suffering from the issues you're trying to solve here). Do you provide nightly-build Docker images of the controller, by any chance?
@msessa Hey! Thanks for your help. I thought we'd been building controller images tagged with `canary`.
…er-and-runnerset Refactor Runner and RunnerSet so that they use the same library code that powers RunnerSet. RunnerSet is StatefulSet-based while RunnerDeployment/Runner is Pod-based, so it had been hard to unify the implementations even though they look very similar in many aspects. This change finally resolves that, by first introducing a library that implements the generic logic used to reconcile RunnerSet, then adding an adapter that lets the generic logic manage runner pods via Runner instead of via StatefulSet. Follow-up for #1127, #1167, and #1178
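The adapter idea can be sketched roughly like this (all names here are illustrative assumptions, not the actual library's API): the generic reconcile logic is written against a small interface, and each controller supplies a thin adapter for its own resource kind.

```go
// Rough sketch of the adapter pattern described in the commit message;
// interface and type names are assumptions, not ARC's real code.
package main

import "fmt"

// ownedResource abstracts whatever the generic logic scales:
// StatefulSets for RunnerSet, Pods for Runner.
type ownedResource interface {
	Name() string
	Completed() bool
}

// statefulSetAdapter and podAdapter let the same logic drive both kinds.
type statefulSetAdapter struct {
	name string
	done bool
}

func (s statefulSetAdapter) Name() string    { return s.name }
func (s statefulSetAdapter) Completed() bool { return s.done }

type podAdapter struct {
	name string
	done bool
}

func (p podAdapter) Name() string    { return p.name }
func (p podAdapter) Completed() bool { return p.done }

// reconcileOwned is the shared logic: collect completed resources for
// deletion and compute how many replacements are needed.
func reconcileOwned(desired int, owned []ownedResource) (toDelete []string, toCreate int) {
	live := 0
	for _, o := range owned {
		if o.Completed() {
			toDelete = append(toDelete, o.Name())
			continue
		}
		live++
	}
	if live < desired {
		toCreate = desired - live
	}
	return toDelete, toCreate
}

func main() {
	owned := []ownedResource{
		statefulSetAdapter{name: "runnerset-0", done: true},
		podAdapter{name: "runner-1", done: false},
	}
	del, create := reconcileOwned(2, owned)
	fmt.Println("delete:", del, "create:", create)
}
```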
Hi @mumoshu, I've tried to run the canary images on my cluster (arm64 architecture), but unfortunately I stumbled on this issue when launching the webhook server.
Just checking in on this one: after running the canary build for a few days in our live environment, I can confirm this change has had a tremendously positive impact on the reliability of our runners, especially with long-running jobs, even when running ARC with fairly aggressive scaling settings. The only thing I've encountered that is a bit off is that, when a runner pod is executing a long job and ends up reaching the scale-down delay, the controller spams the log with repeated errors from the failed unregistration attempts.
I understand that this is completely normal behavior at this point, so perhaps the error level of those messages should be lowered.
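For what it's worth, with logr (the logging interface that controller-runtime-based projects like ARC use), the suggestion would roughly mean recording the expected condition at a verbose info level rather than as an error. A hedged sketch with a made-up message:

```go
// Sketch only: the message text and condition are made up; stdr is used
// here just to have a runnable logr backend.
package main

import (
	"log"
	"os"

	"github.com/go-logr/stdr"
)

func main() {
	stdr.SetVerbosity(1) // show V(1) messages in this demo
	logger := stdr.New(log.New(os.Stderr, "", log.LstdFlags))

	runnerBusy := true
	if runnerBusy {
		// Before: logger.Error(err, "failed to unregister runner") on every
		// reconciliation. After: an expected condition, logged verbosely.
		logger.V(1).Info("runner is still busy; waiting before unregistering")
	}
}
```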
@msessa Hey! Thanks for the positive report. That really encourages me, as I was still afraid it might just be a case of "works on my (@mumoshu's) environment" 😄 Re log spam, that's really great feedback!
Looking forward to this new code making it into the next release :) Meanwhile, I will continue running the canary and monitoring for issues. Thanks for your amazing work, mate!
Since #1127 and #1167, we had been retrying the `RemoveRunner` API call on each graceful runner stop attempt while the runner was still busy, and there was no reliable way to throttle the retry attempts. The combination of these resulted in ARC spamming RemoveRunner calls (one call per reconciliation loop, and the loop runs quite often due to how the controller works) once the call failed because the runner was in the middle of running a workflow job.

This fixes that by adding a few short-circuit conditions that work for ephemeral runners. An ephemeral runner can unregister itself on completion, so in most cases ARC can just wait for the runner to stop if it's already running a job. As a RemoveRunner response with status 422 implies that the runner is running a job, we can use that as the trigger to start the runner-stop waiter.

The end result is that 422 errors are observed at most once per graceful termination of an ephemeral runner pod, RemoveRunner API calls are never retried for ephemeral runners, ARC consumes less of the GitHub API rate-limit budget, and logs are much cleaner than before. Ref #1167 (comment)
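As a sketch of that short-circuit (helper names like `removeRunner` and `waitForRunnerToStop` are assumptions standing in for the real GitHub client calls, not ARC's actual functions): call RemoveRunner once, and on a 422 switch to waiting rather than retrying.

```go
// Minimal sketch of the short-circuit logic described above; not ARC's code.
package main

import (
	"context"
	"errors"
	"fmt"
)

// apiError is a stand-in for a GitHub API error carrying an HTTP status.
type apiError struct{ status int }

func (e *apiError) Error() string {
	return fmt.Sprintf("github api error: status %d", e.status)
}

// removeRunner and waitForRunnerToStop are hypothetical stubs.
func removeRunner(ctx context.Context, runnerID int64) error {
	return &apiError{status: 422} // pretend the runner is mid-job
}

func waitForRunnerToStop(ctx context.Context, runnerID int64) error {
	// In the controller this would requeue the reconciliation until the
	// ephemeral runner unregisters itself after finishing its job.
	return nil
}

// stopEphemeralRunner calls RemoveRunner at most once. A 422 means the
// runner is running a job, so we wait for it to stop instead of retrying,
// which keeps API usage and log noise down.
func stopEphemeralRunner(ctx context.Context, runnerID int64) error {
	err := removeRunner(ctx, runnerID)
	if err == nil {
		return nil // unregistered; the pod can be deleted safely
	}
	var ae *apiError
	if errors.As(err, &ae) && ae.status == 422 {
		return waitForRunnerToStop(ctx, runnerID)
	}
	return err // unexpected error: surface it to the reconciler
}

func main() {
	if err := stopEphemeralRunner(context.Background(), 42); err != nil {
		fmt.Println("stop failed:", err)
		return
	}
	fmt.Println("runner stopped gracefully")
}
```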
Based on my experience fixing RunnerDeployment reliability (as a part of #1127), I'm going to add several enhancements to the RunnerSet and RunnerPod controllers to make RunnerSets even more reliable.
- RunnerSet now has an `EffectiveTime` field that is the same as RunnerDeployment's, which is used to prevent unnecessarily recreating completed runner pods when the set is being scaled down quickly by webhook `workflow_job` completion events (see the sketch after this list).
- Runner pods are now stopped gracefully, respecting `terminationGracePeriodSeconds`.

Ref #920
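Here is a hedged sketch of how such an `EffectiveTime` comparison might work (the types, field semantics, and function are illustrative assumptions, not ARC's exact implementation): a completed pod is replaced only if it finished before the latest scale decision, so pods that complete during a rapid webhook-driven scale-down are not pointlessly recreated.

```go
// Illustrative sketch only; not ARC's actual types or logic.
package main

import (
	"fmt"
	"time"
)

type runnerPod struct {
	Name        string
	CompletedAt time.Time
}

// shouldRecreate reports whether a completed pod should be replaced, given
// the set's EffectiveTime (assumed here to be the time of the latest scale
// decision). A pod that completed after that decision is presumed to be
// already accounted for by the scale-down, so recreating it would undo it.
func shouldRecreate(p runnerPod, effectiveTime time.Time) bool {
	return p.CompletedAt.Before(effectiveTime)
}

func main() {
	effective := time.Now()
	old := runnerPod{Name: "runner-a", CompletedAt: effective.Add(-time.Minute)}
	fresh := runnerPod{Name: "runner-b", CompletedAt: effective.Add(time.Second)}
	fmt.Println(old.Name, "recreate:", shouldRecreate(old, effective))     // true
	fmt.Println(fresh.Name, "recreate:", shouldRecreate(fresh, effective)) // false
}
```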