
Runner Container fails to terminate when runner fails to register with github #169

Closed
ZacharyBenamram opened this issue Nov 11, 2020 · 20 comments


@ZacharyBenamram
Contributor

Starting Runner listener with startup type: service
Started listener process
An error occurred: Not configured
Runner listener exited with error code 2
Runner listener exit with retryable error, re-launch runner in 5 seconds.

This prevents the pod from ever recycling and re-registering.

The above can happen when the Runner resource has been created but the containers never start, because the scheduler cannot place the pod on a node due to scheduling errors.
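
To confirm that scheduling is the cause, describing the runner pod shows the scheduler events (the namespace and pod name below are placeholders):

# Placeholders: substitute your runner namespace and pod name.
kubectl -n <runner-namespace> describe pod <runner-pod-name>
# The Events section at the bottom shows any FailedScheduling messages.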

@mumoshu
Collaborator

mumoshu commented Nov 11, 2020

@ZacharyBenamram Hey! Thanks for reporting. Which version of the controller are you using? This is likely due to recent changes in the controller (I'm trying to improve the pre-release testing process to prevent things like this in the near future, but it isn't in place yet).

@ZacharyBenamram
Contributor Author

ZacharyBenamram commented Nov 11, 2020

@mumoshu - I am running an older version of the controller, v0.10.0. I think this may be a persistent issue because the runsvc doesn't exit in the above case. And because it doesn't exit, this condition in runner_controller.go isn't triggered:

https://github.com/summerwind/actions-runner-controller/blob/e613219a8980c88977f8231bedbe691b5caa3d9c/controllers/runner_controller.go#L177

I think we need to adjust the Docker container so it exits when the runner is not configured.

@ZacharyBenamram
Contributor Author

Events:
  Type     Reason             Age                    From                Message
  ----     ------             ----                   ----                -------
  Normal   NotTriggerScaleUp  27m (x23 over 53m)     cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 max node group size reached, 3 node(s) had taints that the pod didn't tolerate
  Warning  FailedScheduling   3m7s (x102 over 53m)   default-scheduler   0/23 nodes are available: 1 Insufficient memory, 20 Insufficient cpu, 3 node(s) had taints that the pod didn't tolerate.
  Normal   NotTriggerScaleUp  2m12s (x267 over 53m)  cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 3 node(s) had taints that the pod didn't tolerate, 1 max node group size reached

@ZacharyBenamram
Contributor Author

ZacharyBenamram commented Nov 11, 2020

Since it's treated as a retryable error:
https://github.com/summerwind/actions-runner-controller/blob/e613219a8980c88977f8231bedbe691b5caa3d9c/runner/patched/RunnerService.js#L52

I'm not sure what the best approach is. Do we not want to exit in this case? I think it might be worth reconsidering exiting.

Or even better, let's create a retryable count limit. I will open a PR to address this.
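
For illustration only, the retryable-count-limit idea could look roughly like the following shell loop. This is a sketch, not the actual patched RunnerService.js; the script name, the limit, and the messages are hypothetical.

#!/bin/bash
# Sketch of a capped retry loop around the runner listener.
# MAX_RETRIES and ./run.sh are illustrative placeholders.
MAX_RETRIES=5
attempt=0
while true; do
  ./run.sh                       # launch the runner listener
  code=$?
  [ "$code" -eq 0 ] && break     # clean exit, nothing to retry
  attempt=$((attempt + 1))
  if [ "$attempt" -ge "$MAX_RETRIES" ]; then
    echo "Listener failed $attempt times; exiting so the pod can be recycled"
    exit "$code"                 # non-zero exit lets Kubernetes replace the pod
  fi
  echo "Retryable error (exit $code); re-launching in 5 seconds ($attempt/$MAX_RETRIES)"
  sleep 5
done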

@mumoshu
Collaborator

mumoshu commented Nov 11, 2020

@ZacharyBenamram Thanks! Before settling on a direction, may I ask something: do you have any insight into why your runner is saying "Not configured"?

Is there anything the controller could do (additional validation?) to prevent it from getting into that state?

@ZacharyBenamram
Contributor Author

It may have been a result of an artifact from the previous broken release, since I'm unable to reproduce it when testing for failed pod scheduling.

@eb-trigo

I'm running into the same issue with the latest version of summerwind/actions-runner-dind:v2.274.1

2020-11-22T12:33:11.133608629Z Starting Runner listener with startup type: service
2020-11-22T12:33:11.13571091Z Started listener process
2020-11-22T12:33:11.285920354Z An error occurred: Not configured
2020-11-22T12:33:11.304413342Z Runner listener exited with error code 2
2020-11-22T12:33:11.304438317Z Runner listener exit with retryable error, re-launch runner in 5 seconds.

@mumoshu
Collaborator

mumoshu commented Nov 24, 2020

@eb-trigo Hey! Are you sure you've set up all the Runner config properly? Like setting dockerdWithinRunnerContainer: true

@BrendanGalloway
Contributor

BrendanGalloway commented Jan 6, 2021

> @ZacharyBenamram Thanks! Before settling on a direction, may I ask something: do you have any insight into why your runner is saying "Not configured"?
>
> Is there anything the controller could do (additional validation?) to prevent it from getting into that state?

I've been seeing this failure a lot; it seems to happen due to an error in the ./config.sh step that the service assumes will be corrected with manual intervention. So far I've seen the config fail due to expired GitHub tokens and the GitHub API being unresponsive. There's an upstream request to change the behaviour so that the service eventually exits if it is not corrected (actions/runner#879), but pending that, an option might be to exit the entrypoint script if the config step does not report success.
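
As a rough sketch of that fail-fast idea, assuming the entrypoint registers the runner via config.sh before starting the service (the environment variable names below are placeholders, not necessarily what the real entrypoint uses):

#!/bin/bash
# Sketch only: register the runner and exit immediately if registration fails,
# instead of letting the service retry "Not configured" forever.
cd /runner

if ! ./config.sh --unattended --replace \
    --url "${RUNNER_REPOSITORY_URL}" \
    --token "${RUNNER_TOKEN}" \
    --name "${RUNNER_NAME}" \
    --work "${RUNNER_WORKDIR}"; then
  echo "config.sh failed (expired token, unreachable GitHub API, ...); exiting so the pod can be recreated" >&2
  exit 1
fi

exec ./run.sh   # start the listener only after a successful registration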

@arjunmadan

We've been seeing this happen pretty frequently as well. The issue in our case seems to be expired GitHub tokens. If we leave the runner listening for too long, the tokens expire and are never refreshed, so we're never able to reconnect successfully. I think @ZacharyBenamram had the right idea with retrying a few times to make sure it isn't something temporary, and then stopping.

@ZacharyBenamram - I'm happy to open a similar PR if you're busy with other things.

@mumoshu
Collaborator

mumoshu commented Jan 7, 2021

> It seems to happen due to an error in the ./config.sh step that the service assumes will be corrected with manual intervention. So far I've seen the config fail due to expired GitHub tokens and the GitHub API being unresponsive.

Related: #248 (comment).

If the config.sh failure was due to an expired registration token, it shouldn't happen: the controller should update the token and recreate the pod with the new token once it detects the expiration.

Maybe that reg token update/pod recreation code isn't working as we intended? 🤔

@arjunmadan

Ah yeah, that's probably what's happening if there is something in place to update the token and recreate the pod. Would it still make sense to have a max limit on the number of retries? Spinning up new pods in Kubernetes or OpenShift is pretty low cost, so maybe after some number of failures the controller just kills the pod and lets a new one come up in its place?

@mumoshu
Collaborator

mumoshu commented Jan 8, 2021

> Would it still make sense to have a max limit on the number of retries?

Thanks for the suggestion! If that works, your pod should come up fine after you delete the pod today. Is that the case?

@arjunmadan

Yup, that's exactly what happens. The pod gets stuck in this retry loop, and then when I terminate it the controller creates a new pod which connects successfully. If I leave it unused for about a day, it goes back into the retry loop, but otherwise it executes whatever job it picks up and terminates successfully.

@mumoshu
Collaborator

mumoshu commented Jan 8, 2021

@arjunmadan Thanks for the info! Whoa, interesting. Then the token might have been updated correctly, but the runner-controller might have been unable to recreate the pod with the new token without your manual intervention.

Would you mind sharing the logs from your actions-runner-controller's controller-manager pod? They should show whether the token was successfully updated, and whether any errors prevented the controller from recreating the pod before your intervention.
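
(For anyone else debugging this, the controller logs can typically be pulled with something like the following; the namespace, deployment, and container names assume a default kustomize install of actions-runner-controller and may differ in your setup.)

# Fetch recent controller-manager logs for inspection.
kubectl -n actions-runner-system logs deployment/controller-manager -c manager --tail=200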

@arjunmadan

Did some digging and it's possible we're running into quota issues when the controller tries to get this pod back up after it refreshes the tokens. The logs below show the token being refreshed. We're going to make those quota changes and see if the problem goes away. Will report back early next week.

2021-01-04T16:02:40.909Z	DEBUG	controller-runtime.manager.events	Normal	{"object": {"kind":"Runner","namespace":"numbers-gha","name":"iris-stable-gha-runners-789p7-98f4n","uid":"268ef900-4ea5-11eb-8b5e-001a4aa86607","apiVersion":"actions.summerwind.dev/v1alpha1","resourceVersion":"945796767"}, "reason": "RegistrationTokenUpdated", "message": "Successfully update registration token"}
2021-01-04T16:02:40.909Z	INFO	controllers.Runner	Updated registration token	{"runner": "iris-stable-gha-runners-789p7-98f4n", "repository": "Bandwidth/iris"}
2021-01-04T16:02:40.293Z	INFO	controllers.Runner	Updated registration token	{"runner": "iris-stable-gha-runners-789p7-98f4n", "repository": "Bandwidth/iris"}
2021-01-04T16:02:40.293Z	DEBUG	controller-runtime.manager.events	Normal	{"object": {"kind":"Runner","namespace":"numbers-gha","name":"iris-stable-gha-runners-789p7-98f4n","uid":"268ef900-4ea5-11eb-8b5e-001a4aa86607","apiVersion":"actions.summerwind.dev/v1alpha1","resourceVersion":"945796767"}, "reason": "RegistrationTokenUpdated", "message": "Successfully update registration token"}

@arjunmadan

Yup, so it turns out it was quota issues after all. After we got that sorted out, things seem to be working fine, and I can see the runner reconnects to GitHub after about a day of sitting unused.

@FrederikNJS

I think I might be running into this issue as well. I set up the runners on Tuesday, and today, Thursday, they were all spewing out that "Not Configured" error. I deleted all the pods, and they all recovered.

Which quota did you change, and how, @arjunmadan?

@arjunmadan

Our issue turned out to be with the quotas we gave the namespace. We were maxing out memory, so the pod didn't have enough memory available to cleanly restart. Upping the quotas on the namespace where we were hosting the runner pods helped solve our issue.
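
For anyone checking the same thing, the namespace quota usage can be inspected and adjusted with standard kubectl commands (the namespace and quota names below are placeholders):

# Show current quota usage vs. limits for the runner namespace.
kubectl -n <runner-namespace> describe resourcequota
# Raise the memory limits in the quota if it is exhausted.
kubectl -n <runner-namespace> edit resourcequota <quota-name>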

@ZacharyBenamram
Contributor Author

Hey, circling back here. I ran into this issue again and it turned out to be a different flavor of what @arjunmadan noted above. In my case, the EC2 machine we were running on did not have enough available memory to support the number of runner containers + infrastructure containers I was running on the GitHub Actions nodes. Once I bumped the memory of the EC2 node, the problem disappeared.

It's an interesting error, though... it seems to cover a lot of failure cases. I think we should look back into limiting the number of restarts.
