
Autoscaler does not scale down #248

Closed
jolestar opened this issue Jan 5, 2021 · 18 comments

jolestar commented Jan 5, 2021

runner config:

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: starcoin-runner-deployment
spec:
  template:
    spec:
      nodeSelector:
        doks.digitalocean.com/node-pool: ci-pool2
      image: starcoin/starcoin-runner:v2.275.1.20210104
      repository: starcoinorg/starcoin

      resources:
        requests:
          cpu: "24.0"
          memory: "48Gi"
      # If set to false, there is no privileged container and you cannot use docker.
      dockerEnabled: true
      # If set to true, the runner pod contains only one container, which is expected to be able to run docker, too.
      # The image summerwind/actions-runner-dind (or a custom one) should be used with the value true.
      dockerdWithinRunnerContainer: false
      # Valid if dockerdWithinRunnerContainer is not true
      dockerdContainerResources:
        requests:
          cpu: "24.0"
          memory: "48Gi"

---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: starcoin-runner-deployment-autoscaler
spec:
  scaleTargetRef:
    name: starcoin-runner-deployment
  minReplicas: 1
  maxReplicas: 6
  scaleDownDelaySecondsAfterScaleOut: 120
  metrics:
    - type: TotalNumberOfQueuedAndInProgressWorkflowRuns
      repositoryNames:
        - starcoinorg/starcoin
The controller runs with --sync-period=2m.

kubectl  get HorizontalRunnerAutoscaler                                                                                                                                    
NAME                                    MIN   MAX   DESIRED
starcoin-runner-deployment-autoscaler   1     6     1

But once the runner autoscales to 6, it does not scale down. Even if I delete a runner pod manually, a new runner pod is auto-created.
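For reference, the TotalNumberOfQueuedAndInProgressWorkflowRuns metric effectively turns the workflow-run count into a desired replica count clamped into [minReplicas, maxReplicas]. A minimal sketch of that calculation (the function name and exact semantics here are illustrative, not the controller's actual code):

```go
package main

import "fmt"

// desiredReplicas is an illustrative sketch of how an autoscaler using the
// TotalNumberOfQueuedAndInProgressWorkflowRuns metric could derive the
// desired replica count: the number of queued plus in-progress workflow
// runs, clamped into [minReplicas, maxReplicas].
func desiredReplicas(queuedAndInProgress, minReplicas, maxReplicas int) int {
	d := queuedAndInProgress
	if d < minReplicas {
		d = minReplicas
	}
	if d > maxReplicas {
		d = maxReplicas
	}
	return d
}

func main() {
	fmt.Println(desiredReplicas(0, 1, 6))  // no runs queued -> clamped up to minReplicas
	fmt.Println(desiredReplicas(10, 1, 6)) // burst of runs -> capped at maxReplicas
}
```

With the config above (minReplicas: 1, maxReplicas: 6), the DESIRED column should track this value each sync period.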

jolestar (Author) commented Jan 5, 2021

kubectl get RunnerReplicaSet                                                                                           
NAME                               DESIRED   CURRENT   READY
starcoin-runner-deployment-tqxmv   1         6         6

If I delete the RunnerReplicaSet, it will scale down to the desired runner count.

mumoshu (Collaborator) commented Jan 6, 2021

@jolestar Hey! Thanks for reporting.

> If I delete the RunnerReplicaSet, it will scale down to the desired runner count.

Does this mean that the autoscaler did update the desired count for your RunnerDeployment as expected, but it didn't update the RunnerReplicaSet's desired count?

jolestar (Author) commented Jan 6, 2021

> Does this mean that the autoscaler did update the desired count for your RunnerDeployment as expected, but it didn't update the RunnerReplicaSet's desired count?

The RunnerReplicaSet's DESIRED count is right, but CURRENT and READY are always 6. How can I get more info to diagnose it?

mumoshu (Collaborator) commented Jan 6, 2021

@jolestar Thanks! That's helpful.

> How can I get more info to diagnose it?

The runnerreplicaset_controller embedded in the actions-runner-controller pod should react as quickly as possible to delete redundant runner pods, so that CURRENT eventually decreases to the same count as DESIRED.

So perhaps your runnerreplicaset controller isn't working somehow? To diagnose, I suggest grepping the actions-runner-controller logs for lines containing runnerreplicaset_controller.go, e.g. kubectl logs -c manager deploy/controller-manager -n actions-runner-system | grep 'runnerreplicaset_controller.go'.

jolestar (Author) commented Jan 6, 2021

There are some errors:

2021-01-06T00:51:22.882Z        ERROR   controllers.RunnerReplicaSet    Failed to check if runner is busy       {"runner": "default/starcoin-runner-deployment-kgbb6", "error": "runner not found"}
github.com/go-logr/zapr.(*zapLogger).Error
        /go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
github.com/summerwind/actions-runner-controller/controllers.(*RunnerReplicaSetReconciler).Reconcile
        /workspace/controllers/runnerreplicaset_controller.go:107
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:256
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:232
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:211
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
        /go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
        /go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
        /go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:88
2021-01-06T00:51:22.883Z        ERROR   controller-runtime.controller   Reconciler error        {"controller": "runnerreplicaset", "request": "default/starcoin-runner-deployment-kgbb6", "error": "runner not found"}
github.com/go-logr/zapr.(*zapLogger).Error
        /go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:258
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:232
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:211
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
        /go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
        /go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
        /go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:88
kubectl  get runner                                          
NAME                                     ORGANIZATION   REPOSITORY             LABELS   STATUS
starcoin-runner-deployment-kgbb6-78776                  starcoinorg/starcoin            Running
starcoin-runner-deployment-kgbb6-9w2bc                  starcoinorg/starcoin            Running
starcoin-runner-deployment-kgbb6-cltr7                  starcoinorg/starcoin            Pending
starcoin-runner-deployment-kgbb6-ffzbz                  starcoinorg/starcoin            Running
starcoin-runner-deployment-kgbb6-h4bh9                  starcoinorg/starcoin            Running
starcoin-runner-deployment-kgbb6-wr88j                  starcoinorg/starcoin            Running
starcoin-runner-deployment-kgbb6-xfggn                  starcoinorg/starcoin            Running
 kubectl --context do get RunnerReplicaSet 
NAME                               DESIRED   CURRENT   READY
starcoin-runner-deployment-kgbb6   1         5         4

mumoshu (Collaborator) commented Jan 6, 2021

@jolestar Thanks! The error message is definitely misleading - "runner": "default/starcoin-runner-deployment-kgbb6" should be fixed to "runnerreplicaset": "default/starcoin-runner-deployment-kgbb6", and the log message seems not to contain the name of the runner that actually failed.

https://github.com/summerwind/actions-runner-controller/blob/1e466ad3df15d608c403100562fbeb5e8fab4ed2/controllers/runnerreplicaset_controller.go#L55

https://github.com/summerwind/actions-runner-controller/blob/1e466ad3df15d608c403100562fbeb5e8fab4ed2/controllers/runnerreplicaset_controller.go#L107

Maybe the runner named starcoin-runner-deployment-kgbb6-cltr7 was never registered to GitHub, which caused the controller to hang?

Would you mind browsing https://github.com/$USER/$REPO/settings/actions if the runners are for your personal project, or https://github.com/$ORG/$REPO/settings/actions for organizational runners, and confirming that starcoin-runner-deployment-kgbb6-cltr7 is NOT registered, but all the other runners are?

mumoshu (Collaborator) commented Jan 6, 2021

What does running kubectl logs starcoin-runner-deployment-kgbb6-cltr7 show you? Any error while the runner agent within the runner pod tries to register itself to GitHub?

jolestar (Author) commented Jan 6, 2021

The pod starcoin-runner-deployment-kgbb6-cltr7 is pending, so there is no log output. After I deleted a runner, cltr7 started running and output the error "token expired":

Http response code: Unauthorized from 'POST https://api.github.com/actions/runner-registration'
{"message":"Token expired.","documentation_url":"https://docs.github.com/rest"}
Response status code does not indicate success: 401 (Unauthorized).

jolestar (Author) commented Jan 6, 2021

On the settings/actions page, all runners are offline:

[Screenshot (2021-01-06 09:21): all runners shown offline on the Actions settings page]

jolestar (Author) commented Jan 6, 2021

I tried deleting the RunnerReplicaSet:

kubectl delete RunnerReplicaSet starcoin-runner-deployment-kgbb6

and all runners were rebuilt.

 kubectl --context do get runner                                        
NAME                                     ORGANIZATION   REPOSITORY             LABELS   STATUS
starcoin-runner-deployment-vt5bk-nhn4r                  starcoinorg/starcoin            Running
starcoin-runner-deployment-vt5bk-qsg85                  starcoinorg/starcoin            Running
starcoin-runner-deployment-vt5bk-sk8n7                  starcoinorg/starcoin            Running

[Screenshot (2021-01-06 09:34): runner status on the Actions settings page after the rebuild]

mumoshu (Collaborator) commented Jan 6, 2021

@jolestar Thanks for your help.

I might be a bit confused, but the issue could be that our "runner controller" has a bug that results in leaving runner pods with expired registration tokens forever, which prevents the corresponding RunnerReplicaSet from working(?)

It's a bit involved, but it SHOULD NOT happen, as our runner controller checks the registration token every sync period:

https://github.com/summerwind/actions-runner-controller/blob/dfffd3fb6206d00e4ce017fd41a2f449b39d4ea3/controllers/runner_controller.go#L185

and once expired it replaces the token and the pod:

https://github.com/summerwind/actions-runner-controller/blob/dfffd3fb6206d00e4ce017fd41a2f449b39d4ea3/controllers/runner_controller.go#L274-L280

Perhaps it isn't working as expected, or there are edge cases.

Out of curiosity, does creating a new RunnerDeployment and decreasing the desired count immediately work? In other words, does your issue happen only when you scale down after an hour or so?

jolestar (Author) commented Jan 6, 2021

It's strange that it now scales down. Let me watch it for a while longer.

kubectl --context do get RunnerReplicaSet           
NAME                               DESIRED   CURRENT   READY
starcoin-runner-deployment-vt5bk   1         1         1

I removed the old offline runner qfql6 from the GitHub Actions page.

jolestar (Author) commented Jan 6, 2021

Error again:

kubectl --namespace actions-runner-system logs controller-manager-5879594668-wp7mn manager

2021-01-06T04:21:19.494Z	ERROR	controllers.RunnerReplicaSet	Failed to check if runner is busy	{"runner": "default/starcoin-runner-deployment-vt5bk", "error": "runner not found"}
github.com/go-logr/zapr.(*zapLogger).Error
	/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
github.com/summerwind/actions-runner-controller/controllers.(*RunnerReplicaSetReconciler).Reconcile
	/workspace/controllers/runnerreplicaset_controller.go:107
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:256
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:232
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:211
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
	/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
	/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
	/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:88
2021-01-06T04:21:19.494Z	ERROR	controller-runtime.controller	Reconciler error	{"controller": "runnerreplicaset", "request": "default/starcoin-runner-deployment-vt5bk", "error": "runner not found"}
github.com/go-logr/zapr.(*zapLogger).Error
	/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:258
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:232
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:211
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
	/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
	/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
	/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:88
2021-01-06T04:21:19.693Z	DEBUG	controller-runtime.controller	Successfully Reconciled	{"controller": "runner", "request": "default/starcoin-runner-deployment-vt5bk-2tvk8"}
kubectl logs starcoin-runner-deployment-vt5bk-2tvk8 runner

Http response code: Unauthorized from 'POST https://api.github.com/actions/runner-registration'
{"message":"Token expired.","documentation_url":"https://docs.github.com/rest"}
Response status code does not indicate success: 401 (Unauthorized).

mumoshu (Collaborator) commented Jan 6, 2021

@jolestar Thanks. I think we're close. Would you mind sharing the result of kubectl get po -o yaml starcoin-runner-deployment-vt5bk-2tvk8 (or that of any runner pod that failed like it), along with the output of the date command on your machine or the controller pod?

jolestar (Author) commented Jan 7, 2021

There's another error:

Starting Runner listener with startup type: service
Started listener process
Started running service
An error occurred: Not configured
Runner listener exited with error code 2
Runner listener exit with retryable error, re-launch runner in 5 seconds.

kubectl get po -o yaml starcoin-runner-deployment-68gdh-wwgzm

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2021-01-06T18:05:55Z"
  labels:
    pod-template-hash: 749fd4569f
    runner-template-hash: 6d59d7cd4b

mumoshu (Collaborator) commented Feb 9, 2021

@jolestar Hey! Are you still using actions-runner-controller?

FYI, I've recently summarized how our controller can get stuck due to runners being unable to be registered for various reasons.

We're far from "fixing" all the root causes because they vary a lot, but the universal fix could be #297, which I'm currently working on.

jolestar (Author) commented Apr 8, 2021

I tried actions-runner-controller v0.18.2, and this bug is resolved.

jolestar closed this as completed Apr 8, 2021
mumoshu (Collaborator) commented Apr 8, 2021

@jolestar Thanks for reporting! Glad to hear it worked.
