RunnerSet Reliability Improvements #1167

Merged
merged 10 commits into master from runnerset-reliability on Mar 2, 2022

Conversation

@mumoshu (Collaborator) commented Mar 1, 2022

Based on my experience fixing RunnerDeployment reliability (as a part of #1127), I'm going to add several enhancements to the RunnerSet and RunnerPod controllers to make RunnerSets even more reliable.

  • RunnerSet now creates one StatefulSet per runner pod, so that we have more control over when and how to scale down runner pods safely, similar to what we did for RunnerDeployment in #1127 (Add GitHub API cache to avoid rate limit).
  • The RunnerSet spec got a new EffectiveTime field, the same as RunnerDeployment's, which is used to prevent unnecessarily recreating completed runner pods when the RunnerSet is being scaled down quickly by webhook workflow_job completion events (see the sketch below).
  • A runner pod is never marked for deletion before it has been unregistered. This ensures that a runner pod running a long, time-consuming workflow job is not terminated prematurely once the pod's terminationGracePeriodSeconds expires.

Ref #920
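
For readers unfamiliar with the EffectiveTime idea, here is a minimal, hypothetical Go sketch of one plausible reading of that check; the function and parameter names are invented for illustration and are not ARC's actual implementation.

```go
package sketch

import "time"

// shouldRecreate decides whether a completed ephemeral runner pod should be
// replaced. One plausible reading of EffectiveTime: if the pod completed
// after the time of the latest scale-triggering event, recreating it would
// only fight the scale-down requested by a workflow_job completion webhook,
// so the pod is left alone.
func shouldRecreate(podCompletedAt, effectiveTime time.Time) bool {
	return podCompletedAt.Before(effectiveTime)
}
```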

@mumoshu (Collaborator, Author) commented Mar 2, 2022

Merging anyway. But I'll keep testing both RunnerDeployment and RunnerSet and perhaps submit a PR or two to tweak details and fix anything that can be considered a regression!

@mumoshu mumoshu merged commit 1917cf9 into master Mar 2, 2022
@mumoshu mumoshu deleted the runnerset-reliability branch March 2, 2022 10:03
@msessa commented Mar 2, 2022

Hi @mumoshu,

I'd be happy to test these improvements on our live cluster (which has been suffering from the issues you're trying to solve here).

Do you provide nightly-build Docker images of the controller, by any chance?

@mumoshu (Collaborator, Author) commented Mar 4, 2022

@msessa Hey! Thanks for your help. I thought we'd been building controller images tagged with canary on each master build, which should be used for testing.
Would you mind checking it?

mumoshu added a commit that referenced this pull request Mar 5, 2022
mumoshu added a commit that referenced this pull request Mar 5, 2022
mumoshu added a commit that referenced this pull request Mar 6, 2022
…er-and-runnerset

Refactor Runner and RunnerSet so that they use the same library code that powers RunnerSet.

RunnerSet is StatefulSet-based while RunnerDeployment/Runner is Pod-based, so it had been hard to unify the implementations even though they look very similar in many aspects.

This change finally resolves that issue by first introducing a library that implements the generic logic used to reconcile RunnerSet, and then adding an adapter that lets the generic logic manage runner pods via Runner instead of via StatefulSet.

Follow-up for #1127, #1167, and #1178
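
As a rough illustration of the adapter idea described in this commit message, here is a hedged Go sketch; the interface and type names are invented and do not correspond to ARC's real code.

```go
package sketch

// runnerOwner abstracts "the Kubernetes object that owns one ephemeral
// runner", so a single piece of generic reconcile logic can drive either a
// per-runner StatefulSet (RunnerSet) or a Runner resource (RunnerDeployment).
type runnerOwner interface {
	Name() string
	Completed() bool // has the underlying runner pod finished its job?
	Delete() error   // remove the owning object (StatefulSet or Runner)
}

// reconcileOwners is the shared library logic: it tears down owners whose
// runner has completed, without caring which concrete kind backs them.
func reconcileOwners(owners []runnerOwner) error {
	for _, o := range owners {
		if !o.Completed() {
			continue
		}
		if err := o.Delete(); err != nil {
			return err
		}
	}
	return nil
}
```
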
@msessa commented Mar 6, 2022

Hi @mumoshu,

I've tried to run the canary images on my cluster (arm64 architecture), but unfortunately I stumbled on this issue when launching the webhook server.

@msessa commented Mar 10, 2022

@mumoshu

Just checking in on this one: after running the canary build for a few days in our live environment, I can confirm this change has had a tremendously positive impact on the reliability of our runners, especially with long-running jobs.

Even when running ARC with fairly aggressive scaling settings (scaleDownDelaySecondsAfterScaleOut set to 15 minutes) I have not encountered any premature pod termination or client disconnection so far.

The only thing I've encountered that is a bit off is that, when a runner pod is executing a long job and it ends up reaching the scale-down delay, the controller spams the log with several of these messages:

{"level":"error","ts":1646927418.3731344,"logger":"actions-runner-controller.runnerpod","msg":"Failed to unregister runner before deleting the pod.","runnerpod":"runners/runner-amd64-nf79g-0","error":"failed to remove runner: DELETE https://api.github.com/orgs/service-victoria/actions/runners/2728: 422 Bad request - Runner \"runner-amd64-nf79g-0\" is still running a job\" []","stacktrace":"<omitted>"}

I understand this is completely normal behavior at this point, so perhaps the log level for this message should be lowered from error to info.

@mumoshu (Collaborator, Author) commented Mar 11, 2022

@msessa Hey! Thanks for the positive report. That really encourages me, as I was still afraid it might just be a case of "works on my (@mumoshu's) environment" 😄

Re log spam, that's really great feedback!
Let me try fixing it. It probably doesn't need to retry that often in the first place, so I might make it retry only every, say, 30 seconds.
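
For context, a minimal controller-runtime sketch of that kind of throttling (assuming a 30-second interval; this is not ARC's actual code) could look like this:

```go
package sketch

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// retryDelayWhenBusy is the hypothetical back-off interval discussed above.
const retryDelayWhenBusy = 30 * time.Second

// reconcileUnregistration retries unregistration without spamming the GitHub
// API: when the runner is still busy, it asks controller-runtime to requeue
// the object after a fixed delay instead of retrying on every reconciliation.
func reconcileUnregistration(ctx context.Context, runnerBusy bool) (ctrl.Result, error) {
	if runnerBusy {
		return ctrl.Result{RequeueAfter: retryDelayWhenBusy}, nil
	}
	// ...unregister the runner and let the pod be deleted here...
	return ctrl.Result{}, nil
}
```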

@msessa commented Mar 11, 2022

Looking forward to this new code making it into the next release :) Meanwhile, I will continue running the canary and monitoring for issues.

Thanks for your amazing work mate!

@mumoshu (Collaborator, Author) commented Mar 11, 2022

@msessa Thanks for your support! FYI #1204 is my attempt to fix it.

mumoshu added a commit that referenced this pull request Mar 11, 2022
Since #1127 and #1167, we had been retrying the `RemoveRunner` API call on each graceful runner stop attempt whenever the runner was still busy.
There was no reliable way to throttle those retry attempts, and the combination of the two resulted in ARC spamming RemoveRunner calls (one call per reconciliation loop, and the loop runs quite often due to how the controller works) once a call failed because the runner was in the middle of running a workflow job.

This fixes that by adding a few short-circuit conditions that work for ephemeral runners. An ephemeral runner unregisters itself on completion, so in most cases ARC can just wait for the runner to stop if it's already running a job. Since a RemoveRunner response with status 422 implies that the runner is running a job, we can use that as the trigger to start the runner-stop waiter.

The end result is that a 422 error is observed at most once over the whole graceful termination of an ephemeral runner pod, and RemoveRunner API calls are never retried for ephemeral runners. ARC consumes less of the GitHub API rate-limit budget, and the logs are much cleaner than before.

Ref #1167 (comment)
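
A hedged sketch of the 422 short-circuit described in the commit message above; the function and its signature are invented for illustration and are not ARC's real code.

```go
package sketch

import "net/http"

// classifyRemoveRunnerFailure maps a failed RemoveRunner call to the next
// action. A 422 response means the runner is still running a job; for an
// ephemeral runner we then stop retrying RemoveRunner entirely and just wait
// for the runner to unregister itself once the job completes.
func classifyRemoveRunnerFailure(statusCode int, ephemeral bool) (waitForRunnerStop, retryRemove bool) {
	if ephemeral && statusCode == http.StatusUnprocessableEntity {
		return true, false // start the runner-stop waiter; no more RemoveRunner calls
	}
	return false, true // other failures fall back to the normal retry path
}
```
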
@mumoshu (Collaborator, Author) commented Mar 11, 2022

@msessa #1204 has been merged. I'd appreciate it if you could report back whether it fixes the problem and the log no longer gets spammed 😄 (The intended behavior is that the 422 error is emitted at most once over a runner pod's whole lifecycle.)
