
Runner Container fails to terminate when runner fails to register with github #169

Closed
ZacharyBenamram opened this issue Nov 11, 2020 · 20 comments


@ZacharyBenamram
Contributor

Starting Runner listener with startup type: service
Started listener process
An error occurred: Not configured
Runner listener exited with error code 2
Runner listener exit with retryable error, re-launch runner in 5 seconds.

This prevents the pod from ever recycling and re-registering.

The above can happen when the Runner resource has been created but the containers never start, because the scheduler cannot place the pod on a node due to scheduling errors.
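
To confirm that scheduling is the cause, describing the runner pod shows the scheduler events (the namespace and pod name below are placeholders):

# Placeholders: substitute your runner namespace and pod name.
kubectl -n <runner-namespace> describe pod <runner-pod-name>
# The Events section at the bottom shows any FailedScheduling messages.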

@mumoshu
Collaborator

mumoshu commented Nov 11, 2020

@ZacharyBenamram Hey! Thanks for reporting. Which version of the controller are you using? This is likely due to recent changes in the controller (I'm trying to improve the pre-release testing process to prevent things like this in the near future, but it isn't in place yet).

@ZacharyBenamram
Contributor Author

ZacharyBenamram commented Nov 11, 2020

@mumoshu - I am running an older version of the controller, v0.10.0. I think this may be a persistent issue because the runsvc doesn't exit in the above case. And because it doesn't exit, this condition in runner_controller.go isn't triggered:

https://github.com/summerwind/actions-runner-controller/blob/e613219a8980c88977f8231bedbe691b5caa3d9c/controllers/runner_controller.go#L177

I think we need to adjust the Docker container so it exits when the runner is not configured.

@ZacharyBenamram
Contributor Author

Events:
  Type     Reason             Age                    From                Message
  ----     ------             ----                   ----                -------
  Normal   NotTriggerScaleUp  27m (x23 over 53m)     cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 max node group size reached, 3 node(s) had taints that the pod didn't tolerate
  Warning  FailedScheduling   3m7s (x102 over 53m)   default-scheduler   0/23 nodes are available: 1 Insufficient memory, 20 Insufficient cpu, 3 node(s) had taints that the pod didn't tolerate.
  Normal   NotTriggerScaleUp  2m12s (x267 over 53m)  cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 3 node(s) had taints that the pod didn't tolerate, 1 max node group size reached

@ZacharyBenamram
Contributor Author

ZacharyBenamram commented Nov 11, 2020

Since it's treated as a retryable error:
https://github.com/summerwind/actions-runner-controller/blob/e613219a8980c88977f8231bedbe691b5caa3d9c/runner/patched/RunnerService.js#L52

I'm not sure what the best approach is. Do we not want to exit in this case? I think it might be worth reconsidering exiting.

Or even better, let's create a retryable count limit. I will open a PR to address this.
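
For illustration only, the retryable-count-limit idea could look roughly like the following shell loop. This is a sketch, not the actual patched RunnerService.js; the script name, the limit, and the messages are hypothetical.

#!/bin/bash
# Sketch of a capped retry loop around the runner listener.
# MAX_RETRIES and ./run.sh are illustrative placeholders.
MAX_RETRIES=5
attempt=0
while true; do
  ./run.sh                       # launch the runner listener
  code=$?
  [ "$code" -eq 0 ] && break     # clean exit, nothing to retry
  attempt=$((attempt + 1))
  if [ "$attempt" -ge "$MAX_RETRIES" ]; then
    echo "Listener failed $attempt times; exiting so the pod can be recycled"
    exit "$code"                 # non-zero exit lets Kubernetes replace the pod
  fi
  echo "Retryable error (exit $code); re-launching in 5 seconds ($attempt/$MAX_RETRIES)"
  sleep 5
done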

@mumoshu
Collaborator

mumoshu commented Nov 11, 2020

@ZacharyBenamram Thanks! Before settling on a direction, may I ask something: do you have any insight into why your runner is saying "Not configured"?

Is there anything the controller could do (additional validation?) to prevent it from getting into that state?

@ZacharyBenamram
Contributor Author

It may have been a result of an artifact from the previous broken release, since I'm unable to reproduce it when testing for failed pod scheduling.

@eb-trigo

I'm running into the same issue with the latest version of summerwind/actions-runner-dind:v2.274.1

2020-11-22T12:33:11.133608629Z Starting Runner listener with startup type: service
2020-11-22T12:33:11.13571091Z Started listener process
2020-11-22T12:33:11.285920354Z An error occurred: Not configured
2020-11-22T12:33:11.304413342Z Runner listener exited with error code 2
2020-11-22T12:33:11.304438317Z Runner listener exit with retryable error, re-launch runner in 5 seconds.

@mumoshu
Collaborator

mumoshu commented Nov 24, 2020

@eb-trigo Hey! Are you sure you've set up all the Runner config properly? Like setting dockerdWithinRunnerContainer: true

@BrendanGalloway
Contributor

BrendanGalloway commented Jan 6, 2021

> @ZacharyBenamram Thanks! Before settling on a direction, may I ask something: do you have any insight into why your runner is saying "Not configured"?
>
> Is there anything the controller could do (additional validation?) to prevent it from getting into that state?

I've been seeing this failure a lot; it seems to happen due to an error in the ./config.sh step that the service assumes will be corrected with manual intervention. So far I've seen the config fail due to expired GitHub tokens and the GitHub API being unresponsive. There's an upstream request to change the behaviour so that the service eventually exits if it is not corrected (actions/runner#879), but pending that, an option might be to exit the entrypoint script if the config step does not report success.
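
As a rough sketch of that fail-fast idea, assuming the entrypoint registers the runner via config.sh before starting the service (the environment variable names below are placeholders, not necessarily what the real entrypoint uses):

#!/bin/bash
# Sketch only: register the runner and exit immediately if registration fails,
# instead of letting the service retry "Not configured" forever.
cd /runner

if ! ./config.sh --unattended --replace \
    --url "${RUNNER_REPOSITORY_URL}" \
    --token "${RUNNER_TOKEN}" \
    --name "${RUNNER_NAME}" \
    --work "${RUNNER_WORKDIR}"; then
  echo "config.sh failed (expired token, unreachable GitHub API, ...); exiting so the pod can be recreated" >&2
  exit 1
fi

exec ./run.sh   # start the listener only after a successful registration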

@arjunmadan

We've been seeing this happen pretty frequently as well. The issue in our case seems to be expired GitHub tokens. If we leave the runner listening for too long, the tokens expire and are never refreshed, so we're never able to reconnect successfully. I think @ZacharyBenamram had the right idea with retrying a few times to make sure it isn't something temporary, and then stopping.

@ZacharyBenamram - I'm happy to open a similar PR if you're busy with other things.

@mumoshu
Collaborator

mumoshu commented Jan 7, 2021

> It seems to happen due to an error in the ./config.sh step that the service assumes will be corrected with manual intervention. So far I've seen the config fail due to expired GitHub tokens and the GitHub API being unresponsive.

Related: #248 (comment).

If the config.sh failure was due to an expired registration token, it shouldn't happen: the controller should update the token and recreate the pod with the new token once it detects the expiration.

Maybe that reg token update/pod recreation code isn't working as we intended? 🤔

@arjunmadan

Ah yeah, that's probably what's happening if there is something in place to update the token and recreate the pod. Would it still make sense to have a max limit on the number of retries? Spinning up new pods in Kubernetes or OpenShift is pretty low cost, so maybe after some number of failures the controller just kills the pod and lets a new one come up in its place?

@mumoshu
Collaborator

mumoshu commented Jan 8, 2021

> Would it still make sense to have a max limit on the number of retries?

Thanks for the suggestion! If that works, your pod should come up fine after you delete the pod today. Is that the case?

@arjunmadan

Yup, that's exactly what happens. The pod gets stuck in this retry loop, and then when I terminate it the controller creates a new pod which connects successfully. If I leave it unused for about a day, it goes back into the retry loop, but otherwise it executes whatever job it picks up and terminates successfully.

@mumoshu
Collaborator

mumoshu commented Jan 8, 2021

@arjunmadan Thanks for the info! Whoa, interesting. Then the token might have been updated correctly, but the runner-controller might have been unable to recreate the pod with the new token without your manual intervention.

Would you mind sharing the logs from your actions-runner-controller's controller-manager pod? They should show whether the token was successfully updated, and whether any errors prevented the controller from recreating the pod before your intervention.
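
(For anyone else debugging this, the controller logs can typically be pulled with something like the following; the namespace, deployment, and container names assume a default kustomize install of actions-runner-controller and may differ in your setup.)

# Fetch recent controller-manager logs for inspection.
kubectl -n actions-runner-system logs deployment/controller-manager -c manager --tail=200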

@arjunmadan

Did some digging and it's possible we're running into quota issues when the controller tries to get this pod back up after it refreshes the tokens. The logs below show the token being refreshed. We're going to make those quota changes and see if the problem goes away. Will report back early next week.

2021-01-04T16:02:40.909Z	DEBUG	controller-runtime.manager.events	Normal	{"object": {"kind":"Runner","namespace":"numbers-gha","name":"iris-stable-gha-runners-789p7-98f4n","uid":"268ef900-4ea5-11eb-8b5e-001a4aa86607","apiVersion":"actions.summerwind.dev/v1alpha1","resourceVersion":"945796767"}, "reason": "RegistrationTokenUpdated", "message": "Successfully update registration token"}
2021-01-04T16:02:40.909Z	INFO	controllers.Runner	Updated registration token	{"runner": "iris-stable-gha-runners-789p7-98f4n", "repository": "Bandwidth/iris"}
2021-01-04T16:02:40.293Z	INFO	controllers.Runner	Updated registration token	{"runner": "iris-stable-gha-runners-789p7-98f4n", "repository": "Bandwidth/iris"}
2021-01-04T16:02:40.293Z	DEBUG	controller-runtime.manager.events	Normal	{"object": {"kind":"Runner","namespace":"numbers-gha","name":"iris-stable-gha-runners-789p7-98f4n","uid":"268ef900-4ea5-11eb-8b5e-001a4aa86607","apiVersion":"actions.summerwind.dev/v1alpha1","resourceVersion":"945796767"}, "reason": "RegistrationTokenUpdated", "message": "Successfully update registration token"}

@arjunmadan

Yup, so it turns out it was quota issues after all. After we got that sorted out, things seem to be working fine, and I can see the runner reconnects to GitHub after about a day of sitting unused.

@FrederikNJS

I think I might be running into this issue as well. I set up the runners on Tuesday, and today, Thursday, they were all spewing out that "Not Configured" error. I deleted all the pods, and they all recovered.

Which quota did you change, and how, @arjunmadan?

@arjunmadan

Our issue turned out to be with the quotas we gave the namespace. We were maxing out memory, so the pod didn't have enough memory available to cleanly restart. Upping the quotas on the namespace where we were hosting the runner pods helped solve our issue.
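
For anyone checking the same thing, the namespace quota usage can be inspected and adjusted with standard kubectl commands (the namespace and quota names below are placeholders):

# Show current quota usage vs. limits for the runner namespace.
kubectl -n <runner-namespace> describe resourcequota
# Raise the memory limits in the quota if it is exhausted.
kubectl -n <runner-namespace> edit resourcequota <quota-name>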

@ZacharyBenamram
Contributor Author

Hey, circling back here. I ran into this issue again and it turned out to be a different flavor of what @arjunmadan noted above. In my case, the EC2 machine we were running on did not have enough available memory to support the number of runner containers + infrastructure containers I was running on the GitHub Actions nodes. Once I bumped the memory of the EC2 node, the problem disappeared.

It's an interesting error, though... it seems to cover a lot of failure cases. I think we should look back into limiting the number of restarts.
