Runner not ready after timeout during startup-authentication #288

Closed
hdoinfodation opened this issue Feb 1, 2021 · 8 comments

@hdoinfodation

Currently we are running the actions-runner-controller in a network environment that is a bit unstable, and the following issue has bitten us quite a bit:

The runner pod can get stuck in a loop whenever the runner times out while authenticating to GitHub during start-up. If it times out during this step, the runner is stuck in an infinite loop and is never registered in GitHub.

Github endpoint URL https://github.com/
mv: cannot remove '/runnertmp/patched/runsvc.sh': Permission denied
mv: cannot remove '/runnertmp/patched/RunnerService.js': Permission denied

--------------------------------------------------------------------------------
|        ____ _ _   _   _       _          _        _   _                      |
|       / ___(_) |_| | | |_   _| |__      / \   ___| |_(_) ___  _ __  ___      |
|      | |  _| | __| |_| | | | | '_ \    / _ \ / __| __| |/ _ \| '_ \/ __|     |
|      | |_| | | |_|  _  | |_| | |_) |  / ___ \ (__| |_| | (_) | | | \__ \     |
|       \____|_|\__|_| |_|\__,_|_.__/  /_/   \_\___|\__|_|\___/|_| |_|___/     |
|                                                                              |
|                       Self-hosted runner registration                        |
|                                                                              |
--------------------------------------------------------------------------------

# Authentication

Resource temporarily unavailable
16c16
< ./externals/node12/bin/node ./bin/RunnerService.js &
---
> ./externals/node12/bin/node ./bin/RunnerService.js $* &
20c20
< wait $PID
---
> wait $PID
\ No newline at end of file
27c27
<                 listener = childProcess.spawn(listenerExePath, ['run'], { env: process.env });
---
>                 listener = childProcess.spawn(listenerExePath, ['run'].concat(process.argv.slice(3)), { env: process.env });
30c30
<                 listener = childProcess.spawn(listenerExePath, ['run', '--startuptype', 'service'], { env: process.env });
---
>                 listener = childProcess.spawn(listenerExePath, ['run', '--startuptype', 'service'].concat(process.argv.slice(2)), { env: process.env });
34c34
<         
---
> 
59c59
<                 
---
> 
91c91
< });
---
> });
\ No newline at end of file
.path=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
Starting Runner listener with startup type: service
Started listener process
Started running service
An error occurred: Not configured
Runner listener exited with error code 2
Runner listener exit with retryable error, re-launch runner in 5 seconds.
Starting Runner listener with startup type: service
Started listener process
An error occurred: Not configured
Runner listener exited with error code 2
Runner listener exit with retryable error, re-launch runner in 5 seconds.
Starting Runner listener with startup type: service
Started listener process
An error occurred: Not configured
Runner listener exited with error code 2
Runner listener exit with retryable error, re-launch runner in 5 seconds.
Starting Runner listener with startup type: service

A manual restart/delete action is required to make it work again. This also affects autoscaling, since the pod is stuck in an infinite loop.

We would have expected the runner pod to crash itself whenever it times out during the authentication phase, to force the runner to re-authenticate, or for there to be some kind of retry mechanism after a timeout.

Note that under normal conditions, once a runner has authenticated successfully and is listening for jobs, it properly reconnects/re-registers itself to the GitHub API after a timeout.

@mumoshu (Collaborator) commented Feb 6, 2021

@hdoinfodation Hey! Thanks for reporting. The error surfacing here seems to be the same one we saw before in #169.

AFAICS, we have seen the following variants of this failure so far:

I also believe the other side of this same problem is errors like Failed to check if runner is busy. {"runner": "actions-runner-system/runner-deployment", "error": "runner not found"} from the runner controller, as observed by @callum-tait-pbx et al. in e.g. #183. The runner is not found in GitHub because it failed to start or authenticate successfully.

So how should we deal with it?

To clarify, I believe we have no way to resolve the root causes, like a resource shortage, an expired token, or an API failure on the controller side. The controller should not automatically replace the node to scale up, it has no way to reliably determine whether the runner process within the runner pod is failing, it has no way to replace an already expired token installed onto the pod without a restart, and so on.

But what if we just enhanced the controller to automatically recreate the pod with a new token whenever the runner fails to register itself to GitHub within some period? You would no longer need to manually recreate the runner pod to fix it.

Let's say the period is 10 minutes: the controller can periodically call the ListRunners API to determine whether a specific runner is registered with GitHub. If the runner hasn't appeared in the API 10 minutes after pod creation, the controller recreates the pod with a new token. All you would need to do is detect the issue from the controller logs and then fix the root cause.
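To illustrate what that check might look like, here is a minimal sketch against the GitHub REST API; the owner, repository, token, and runner name are placeholders, and the controller itself would do the equivalent through its Go GitHub client rather than via curl:

# Placeholders: substitute your own owner/repo, token, and runner (pod) name.
OWNER=my-org
REPO=my-repo
RUNNER_NAME=example-runner-pod
# List the self-hosted runners currently registered for the repository and
# check whether the expected runner name is among them.
curl -s -H "Authorization: token ${GITHUB_TOKEN}" \
  "https://api.github.com/repos/${OWNER}/${REPO}/actions/runners" \
  | jq -e --arg name "${RUNNER_NAME}" '.runners[] | select(.name == $name)' >/dev/null \
  || echo "runner ${RUNNER_NAME} not registered yet; candidate for recreation after the timeout"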

If that makes sense, I would definitely like to add a patch to the runner controller for this auto-recreate-runner-pod-on-registration-timeout behavior.

WDYT?

@ZacharyBenamram (Contributor) commented Feb 7, 2021

I've looked into this a bit further. This error happens when the /runner/.runner config file does not exist in the runner container. Could you exec into the runner container and confirm? Error is thrown here: https://github.com/actions/runner/blob/8a4cb7650850a5b74882be714353362e86f74789/src/Runner.Common/ConfigurationStore.cs#L168

If you look in the /runner/_diag folder and cat the first log file, you might see an exception being thrown, which is most likely the root cause. If not, I would dig around the _diag logs to find the exception that is preventing the config file from being written appropriately.
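For anyone who wants to check both of these quickly, here is a minimal sketch using kubectl; the namespace, pod name, and container name are placeholders and will differ per deployment:

# Placeholders: adjust the namespace, pod name, and container name to your deployment.
POD=example-runner-abcde
kubectl exec -n actions-runner-system "${POD}" -c runner -- ls -l /runner/.runner
# If the file is missing, dump the first diagnostic log to look for the underlying exception.
kubectl exec -n actions-runner-system "${POD}" -c runner -- \
  sh -c 'cat "$(ls /runner/_diag/*.log | head -n 1)"'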

I am currently following this thread: actions/runner-images#668, which may be a possible root cause.

mumoshu added a commit that referenced this issue Feb 8, 2021
This enhances the controller to recreate the runner pod if the corresponding runner has failed to register itself to GitHub within 10 minutes (currently hard-coded). This should alleviate #288 in case the root cause is some kind of transient failure (network unreliability, GitHub being down, temporary compute resource shortage, etc.). Formerly you had to manually detect and remove those pods to unblock the controller. With this enhancement, the controller does it automatically 10 minutes after pod creation.

Ref #288
@BrendanGalloway (Contributor)

@ZacharyBenamram In my case the runner config file did not exist, as the step that should generate it at https://github.com/summerwind/actions-runner-controller/blob/ab1c39de5732d449fe129b64b95b317aab11a6bf/runner/entrypoint.sh#L59 was failing. While it might not resolve the root cause, it might be worthwhile to have the container exit if that step fails, since the runner logic will keep trying and failing to register without any error state being visible at the k8s level.

@mumoshu (Collaborator) commented Feb 9, 2021

@BrendanGalloway That sounds like a good idea. So, would it be a matter of just adding set -e before https://github.com/summerwind/actions-runner-controller/blob/master/runner/entrypoint.sh#L59, or adding some if [ $? -ne 0 ]; then exit 1; fi block after it to handle config.sh errors?
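To make the second option concrete, here is a minimal sketch of what the guard around the config.sh call in entrypoint.sh could look like; the flags and environment variable names shown are illustrative only, since the real script builds the invocation from more inputs:

# Illustrative only: the real entrypoint.sh passes additional arguments to config.sh.
./config.sh --unattended --name "${RUNNER_NAME}" --url "${GITHUB_URL}" --token "${RUNNER_TOKEN}"
if [ $? -ne 0 ]; then
  # Exit so the registration failure is visible at the Kubernetes level
  # and the pod can be replaced instead of looping forever.
  echo "config.sh failed: runner could not register with GitHub" >&2
  exit 1
fi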

Anyway, even for your case I believe #297 would help, by detecting the config.sh failure 10 minutes after pod creation and recreating the pod. Could you confirm?

mumoshu added a commit that referenced this issue Feb 9, 2021
This enhances the controller to recreate the runner pod if the corresponding runner has failed to register itself to GitHub within 10 minutes (currently hard-coded).

It should alleviate #288 in case the root cause is some kind of transient failure (network unreliability, GitHub being down, temporary compute resource shortage, etc.).

Formerly you had to manually detect and delete such pods, or even force-delete the corresponding runners, to unblock the controller.

With this enhancement, the controller deletes the pod automatically 10 minutes after pod creation, which results in the controller creating another pod that might work.

Ref #288
@heyvito commented Apr 27, 2021

Hey folks! So glad I found this discussion. I'm having the very same issue and can confirm the runner container lacks a /runner/.runner file. The controller seemed fine, and it also just restarted the runner while I was typing this message.

@mumoshu (Collaborator) commented Apr 27, 2021

@heyvito Hey! You can't 100% avoid this error and the automatic recreation of the pod that it triggers. Are you using actions-runner-controller 0.18.2? The fix for this, #297, is already included there, which is why you saw the pod being restarted.

@mumoshu (Collaborator) commented Apr 27, 2021

Can we close this? This error turned out to be caused by various transient errors, which should be handled automatically in actions-runner-controller 0.18.0 or greater.

@toast-gear (Collaborator)

Not sure why the stale bot isn't picking this up. The error was caused by transient issues, and improvements have been made to the runner to better handle transient errors, so I am closing this.
