Runner not ready after timeout during startup-authentication #288

Closed
hdoinfodation opened this issue Feb 1, 2021 · 8 comments

@hdoinfodation

Currently we are running the actions-runner-controller in a network environment that is a bit unstable, and the following issue has bitten us quite a bit:

The runner pod can get stuck in a loop whenever the runner times out while authenticating to GitHub during start-up. If it times out during this step, the runner is stuck in an infinite loop and is never registered in GitHub.

Github endpoint URL https://github.com/
mv: cannot remove '/runnertmp/patched/runsvc.sh': Permission denied
mv: cannot remove '/runnertmp/patched/RunnerService.js': Permission denied

--------------------------------------------------------------------------------
|        ____ _ _   _   _       _          _        _   _                      |
|       / ___(_) |_| | | |_   _| |__      / \   ___| |_(_) ___  _ __  ___      |
|      | |  _| | __| |_| | | | | '_ \    / _ \ / __| __| |/ _ \| '_ \/ __|     |
|      | |_| | | |_|  _  | |_| | |_) |  / ___ \ (__| |_| | (_) | | | \__ \     |
|       \____|_|\__|_| |_|\__,_|_.__/  /_/   \_\___|\__|_|\___/|_| |_|___/     |
|                                                                              |
|                       Self-hosted runner registration                        |
|                                                                              |
--------------------------------------------------------------------------------

# Authentication

Resource temporarily unavailable
16c16
< ./externals/node12/bin/node ./bin/RunnerService.js &
---
> ./externals/node12/bin/node ./bin/RunnerService.js $* &
20c20
< wait $PID
---
> wait $PID
\ No newline at end of file
27c27
<                 listener = childProcess.spawn(listenerExePath, ['run'], { env: process.env });
---
>                 listener = childProcess.spawn(listenerExePath, ['run'].concat(process.argv.slice(3)), { env: process.env });
30c30
<                 listener = childProcess.spawn(listenerExePath, ['run', '--startuptype', 'service'], { env: process.env });
---
>                 listener = childProcess.spawn(listenerExePath, ['run', '--startuptype', 'service'].concat(process.argv.slice(2)), { env: process.env });
34c34
<         
---
> 
59c59
<                 
---
> 
91c91
< });
---
> });
\ No newline at end of file
.path=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
Starting Runner listener with startup type: service
Started listener process
Started running service
An error occurred: Not configured
Runner listener exited with error code 2
Runner listener exit with retryable error, re-launch runner in 5 seconds.
Starting Runner listener with startup type: service
Started listener process
An error occurred: Not configured
Runner listener exited with error code 2
Runner listener exit with retryable error, re-launch runner in 5 seconds.
Starting Runner listener with startup type: service
Started listener process
An error occurred: Not configured
Runner listener exited with error code 2
Runner listener exit with retryable error, re-launch runner in 5 seconds.
Starting Runner listener with startup type: service

A manual restart/delete action is required to make it work again. This also affects autoscaling, since the pod is stuck in an infinite loop.

We would have expected the runner pod to crash itself whenever it times out during the authentication phase, to force the runner to re-authenticate, or for there to be some kind of retry mechanism after a timeout.

Note that under normal conditions, once a runner has authenticated successfully and is listening for jobs, it properly reconnects/re-registers itself to the GitHub API after a timeout.

@mumoshu (Collaborator) commented Feb 6, 2021

@hdoinfodation Hey! Thanks for reporting. The error surfacing here seems to be the same one we saw before in #169.

AFAICS, we have seen the following variants of this failure so far:

I also believe the other side of this same problem is errors like Failed to check if runner is busy. {"runner": "actions-runner-system/runner-deployment", "error": "runner not found"} from the runner controller, as observed by @callum-tait-pbx et al. in e.g. #183. The runner is not found in GitHub because it failed to start or authenticate successfully.

So how should we deal with it?

To clarify, I believe we have no way to resolve the root causes, like a resource shortage, an expired token, or an API failure on the controller side. The controller should not automatically replace the node to scale up, it has no way to reliably determine whether the runner process within the runner pod is failing, it has no way to replace an already expired token installed onto the pod without a restart, and so on.

But what if we just enhanced the controller to automatically recreate the pod with a new token whenever the runner fails to register itself to GitHub within some period? You would no longer need to manually recreate the runner pod to fix it.

Let's say the period is 10 minutes: the controller can periodically call the ListRunners API to determine whether a specific runner is registered with GitHub. If the runner hasn't appeared in the API 10 minutes after pod creation, the controller recreates the pod with a new token. All you would need to do is detect the issue from the controller logs and then fix the root cause.
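To illustrate what that check might look like, here is a minimal sketch against the GitHub REST API; the owner, repository, token, and runner name are placeholders, and the controller itself would do the equivalent through its Go GitHub client rather than via curl:

# Placeholders: substitute your own owner/repo, token, and runner (pod) name.
OWNER=my-org
REPO=my-repo
RUNNER_NAME=example-runner-pod
# List the self-hosted runners currently registered for the repository and
# check whether the expected runner name is among them.
curl -s -H "Authorization: token ${GITHUB_TOKEN}" \
  "https://api.github.com/repos/${OWNER}/${REPO}/actions/runners" \
  | jq -e --arg name "${RUNNER_NAME}" '.runners[] | select(.name == $name)' >/dev/null \
  || echo "runner ${RUNNER_NAME} not registered yet; candidate for recreation after the timeout"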

If that makes sense, I would definitely like to add a patch to the runner controller for this auto-recreate-runner-pod-on-registration-timeout behavior.

WDYT?

@ZacharyBenamram (Contributor) commented Feb 7, 2021

I've looked into this a bit further. This error happens when the /runner/.runner config file does not exist in the runner container. Could you exec into the runner container and confirm? Error is thrown here: https://github.com/actions/runner/blob/8a4cb7650850a5b74882be714353362e86f74789/src/Runner.Common/ConfigurationStore.cs#L168

If you look in the /runner/_diag folder and cat the first log file, you might see an exception being thrown, which is most likely the root cause. If not, I would dig around the _diag logs to find the exception that is preventing the config file from being written appropriately.
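For anyone who wants to check both of these quickly, here is a minimal sketch using kubectl; the namespace, pod name, and container name are placeholders and will differ per deployment:

# Placeholders: adjust the namespace, pod name, and container name to your deployment.
POD=example-runner-abcde
kubectl exec -n actions-runner-system "${POD}" -c runner -- ls -l /runner/.runner
# If the file is missing, dump the first diagnostic log to look for the underlying exception.
kubectl exec -n actions-runner-system "${POD}" -c runner -- \
  sh -c 'cat "$(ls /runner/_diag/*.log | head -n 1)"'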

I am currently following this thread: actions/runner-images#668, which may be a possible root cause.

mumoshu added a commit that referenced this issue Feb 8, 2021
This enhances the controller to recreate the runner pod if the corresponding runner has failed to register itself to GitHub within 10 minutes (currently hard-coded). This should alleviate #288 in case the root cause is some kind of transient failure (network unreliability, GitHub being down, temporary compute resource shortage, etc.). Formerly you had to manually detect and remove those pods to unblock the controller. With this enhancement, the controller does it automatically 10 minutes after pod creation.

Ref #288
@BrendanGalloway (Contributor)

@ZacharyBenamram In my case the runner config file did not exist, as the step that should generate it at https://github.com/summerwind/actions-runner-controller/blob/ab1c39de5732d449fe129b64b95b317aab11a6bf/runner/entrypoint.sh#L59 was failing. While it might not resolve the root cause, it might be worthwhile to have the container exit if that step fails, since the runner logic will keep trying and failing to register without any error state being visible at the k8s level.

@mumoshu (Collaborator) commented Feb 9, 2021

@BrendanGalloway That sounds like a good idea. So, would it be a matter of just adding set -e before https://github.com/summerwind/actions-runner-controller/blob/master/runner/entrypoint.sh#L59, or adding some if [ $? -ne 0 ]; then exit 1; fi block after it to handle config.sh errors?
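To make the second option concrete, here is a minimal sketch of what the guard around the config.sh call in entrypoint.sh could look like; the flags and environment variable names shown are illustrative only, since the real script builds the invocation from more inputs:

# Illustrative only: the real entrypoint.sh passes additional arguments to config.sh.
./config.sh --unattended --name "${RUNNER_NAME}" --url "${GITHUB_URL}" --token "${RUNNER_TOKEN}"
if [ $? -ne 0 ]; then
  # Exit so the registration failure is visible at the Kubernetes level
  # and the pod can be replaced instead of looping forever.
  echo "config.sh failed: runner could not register with GitHub" >&2
  exit 1
fi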

Anyway, even for your case I believe #297 would help, by detecting the config.sh failure 10 minutes after pod creation and recreating the pod. Could you confirm?

mumoshu added a commit that referenced this issue Feb 9, 2021
This enhances the controller to recreate the runner pod if the corresponding runner has failed to register itself to GitHub within 10 minutes (currently hard-coded).

It should alleviate #288 in case the root cause is some kind of transient failure (network unreliability, GitHub being down, temporary compute resource shortage, etc.).

Formerly you had to manually detect and delete such pods, or even force-delete the corresponding runners, to unblock the controller.

With this enhancement, the controller deletes the pod automatically 10 minutes after pod creation, which results in the controller creating another pod that might work.

Ref #288
@heyvito commented Apr 27, 2021

Hey folks! So glad I found this discussion. I'm having the very same issue and can confirm the runner container lacks a /runner/.runner file. The controller seemed fine, and it also just restarted the runner while I was typing this message.

@mumoshu (Collaborator) commented Apr 27, 2021

@heyvito Hey! You can't 100% avoid this error and the automatic recreation of the pod that it triggers. Are you using actions-runner-controller 0.18.2? The fix for this, #297, is already included there, which is why you saw the pod being restarted.

@mumoshu (Collaborator) commented Apr 27, 2021

Can we close this? This error turned out to be caused by various transient errors, which should be handled automatically in actions-runner-controller 0.18.0 or greater.

@toast-gear (Collaborator)

Not sure why the stale bot isn't picking this up. The error was caused by transient issues, and improvements have been made to the runner to better handle transient errors, so I am closing this.
