Metric TotalNumberOfQueuedAndInProgressWorkflowRuns doesn't seem to line up with Current workflow/job counts #1057

Closed
jbkc85 opened this issue Jan 18, 2022 · 27 comments

Comments

@jbkc85

jbkc85 commented Jan 18, 2022

Describe the bug
We are still battling the 'cancellation' bug in our organization, often running into pod reconciles at times when it doesn't make sense. Typically the scale-up works (though since 0.20.1 it's been scaling up tremendously slowly, if it even scales up to the proper number of runners needed - more in the details), but scale-down happens prematurely and cancels active workflow jobs. I have been able to test the 0.20.4 version, but cancellations occur even more often with that release, and that includes being capped at 24 runners for some odd reason. I started monitoring the suggested replicas and am seeing a lot of oddities, which I am not yet able to map to any specific part of the code...

Example Log from Controller:

2022-01-18T19:15:05.548Z	DEBUG	actions-runner-controller.horizontalrunnerautoscaler	Suggested desired replicas of 8 by TotalNumberOfQueuedAndInProgressWorkflowRuns	{"workflow_runs_completed": 0, "workflow_runs_in_progress": 8, "workflow_runs_queued": 0, "workflow_runs_unknown": 0, "namespace": "default", "kind": "runnerdeployment", "name": "runners", "horizontal_runner_autoscaler": "runners-autoscaler"}

Example from a CLI tool written with the same code to grab workflow runs/jobs:

./airbin/checkctl github list_jobs
2022/01/18 13:15:08 [DEBUG] Total Count: 1 for Perform release (queued)
-----workflow stats for 1713612691-----
Jobs Completed: 65
Jobs Queued: 18
Jobs In Progress: 46
Jobs Unknown: 0

To Reproduce
Steps to reproduce the behavior:

  1. Use TotalNumberOfQueuedAndInProgressWorkflowRuns
  2. Use deployment from screenshots below
  3. Launch workflow with 100+ jobs (in matrix if possible)

Expected behavior
The suggested replica count should track the actual number of queued and in-progress jobs reported by the GitHub API, and runners should not be scaled down (cancelling their jobs) while they are still busy.

Screenshots

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  namespace: default
  name: runners
spec:
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
    spec:
      dockerdWithinRunnerContainer: true
      organization: myOrganization
      nodeSelector:
        com.example/capacity-type: SPOT
        com.example/purpose: github-runners
        com.example/environment: ops
      labels:
        - java
        - linux
        - eks
        - self-hosted
      resources:
        requests:
          cpu: "1.0"
          memory: "10Gi"
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  namespace: default
  name: runners-autoscaler
spec:
  scaleTargetRef:
    name: runners
  scaleDownDelaySecondsAfterScaleOut: 7200
  minReplicas: 1
  maxReplicas: 150
  metrics:
  - type: TotalNumberOfQueuedAndInProgressWorkflowRuns
    repositoryNames:
    - my_repository

Environment (please complete the following information):

  • Controller Version: 0.20.1 → 0.20.4
  • Deployment Method: Helm

Additional context

This has been an ongoing issue, but since the upgrade to the 0.20.x branch I believe it's gotten worse. I am not sure if it's the GitHub API or something in the calculations used to determine the suggested replicas. Either way, is there anything I am missing here? I am not sure why I get such different results from my own GitHub API calls on the command line versus the controller's autoscaler suggested replicas.

scaleDownDelaySecondsAfterScaleOut: 7200 was put in to battle some of the unnecessary kills. It isn't working that well though: after a scale-up, once the runners have stayed at scale past the delay, runners still end up interrupted (by controller kills) in the middle of tests.

@jbkc85
Author

jbkc85 commented Jan 18, 2022

I might be caught by the whole "this is fixed but not released" situation - I see the pagination support was added recently but isn't in release 0.20.4 yet. I built it locally and will test it out tonight!

@mumoshu
Collaborator

mumoshu commented Jan 18, 2022

@jbkc85 Just curious, but how many repositories are intended to use the same set of organizational runners managed by your runners RunnerDeployment?

@jbkc85
Author

jbkc85 commented Jan 19, 2022

@jbkc85 Just curious, but how many repositories are intended to use the same set of organizational runners managed by your runners RunnerDeployment?

Only one. However, this one generates a matrix that can have anywhere between 75 and 200+ tests associated with it. Furthermore, during a release it actually feeds one matrix into a second matrix... meaning you generally have 125+ tests running in parallel.

Very expensive testing suite and lots of dynamics :-)

@toast-gear
Collaborator

@jbkc85 please do let us know how your testing goes; we're currently testing HEAD for a release, so it would be great to know if real-world use also shows no issues.

@jbkc85
Author

jbkc85 commented Jan 19, 2022

@jbkc85 please do let us know how your testing goes; we're currently testing HEAD for a release, so it would be great to know if real-world use also shows no issues.

So far so good! The only issues we have run into with tests today have been memory-related - which I wish were easier to detect (as the runner controller will delete the pod before I can investigate and find out about the memory eviction). Will update after a full day of runs.

@jbkc85
Author

jbkc85 commented Jan 19, 2022

So during a scale down event, we had the following scenario:

1.6426248285364957e+09	DEBUG	actions-runner-controller.runner	Rechecking the runner registration in 1m10.927349149s	{"runner": "default/runners-lsf87-qbmtw"}
1.6426248285458472e+09	INFO	actions-runner-controller.runner	Skipped registration check because it's deferred until 2022-01-19 20:41:28 +0000 UTC. Retrying in 58.454161511s at latest	{"runner": "default/runners-lsf87-qbmtw", "lastRegistrationCheckTime": "2022-01-19 20:40:28 +0000 UTC", "registrationCheckInterval": "1m0s"}
1.6426248294468365e+09	INFO	actions-runner-controller.runner	Skipped registration check because it's deferred until 2022-01-19 20:41:28 +0000 UTC. Retrying in 57.55317141s at latest	{"runner": "default/runners-lsf87-qbmtw", "lastRegistrationCheckTime": "2022-01-19 20:40:28 +0000 UTC", "registrationCheckInterval": "1m0s"}
1.6426248452326217e+09	INFO	actions-runner-controller.runner	Skipped registration check because it's deferred until 2022-01-19 20:41:28 +0000 UTC. Retrying in 41.767390462s at latest	{"runner": "default/runners-lsf87-qbmtw", "lastRegistrationCheckTime": "2022-01-19 20:40:28 +0000 UTC", "registrationCheckInterval": "1m0s"}
1.6426249621067107e+09	INFO	actions-runner-controller.runner	Updated registration token	{"runner": "runners-lsf87-qbmtw", "repository": ""}
1.6426249621073768e+09	DEBUG	events	Normal	{"object": {"kind":"Runner","namespace":"default","name":"runners-lsf87-qbmtw","uid":"8b20e268-d9e2-44db-b9ae-9f6c79174a67","apiVersion":"actions.summerwind.dev/v1alpha1","resourceVersion":"64949884"}, "reason": "RegistrationTokenUpdated", "message": "Successfully update registration token"}
1.6426250256327832e+09	DEBUG	events	Normal	{"object": {"kind":"RunnerReplicaSet","namespace":"default","name":"runners-lsf87","uid":"16cba2f6-1005-43cf-9949-e7a81816e153","apiVersion":"actions.summerwind.dev/v1alpha1","resourceVersion":"64952993"}, "reason": "RunnerDeleted", "message": "Deleted runner 'runners-lsf87-qbmtw'"}
1.6426250268427994e+09	INFO	runner-resource	validate resource to be updated	{"name": "runners-lsf87-qbmtw"}
1.6426250268519974e+09	INFO	runner-resource	validate resource to be updated	{"name": "runners-lsf87-qbmtw"}
1.6426250268677664e+09	INFO	actions-runner-controller.runner	Removed runner from GitHub	{"runner": "default/runners-lsf87-qbmtw", "repository": "", "organization": "my_org"}
1.6426250274022748e+09	DEBUG	actions-runner-controller.runnerreplicaset	Failed to check if runner is busy. Either this runner has never been successfully registered to GitHub or it still needs more time.	{"runnerreplicaset": "default/runners-lsf87", "runnerName": "runners-lsf87-qbmtw"}
1.6426250298669991e+09	DEBUG	events	Normal	{"object": {"kind":"RunnerReplicaSet","namespace":"default","name":"runners-lsf87","uid":"16cba2f6-1005-43cf-9949-e7a81816e153","apiVersion":"actions.summerwind.dev/v1alpha1","resourceVersion":"64952993"}, "reason": "RunnerDeleted", "message": "Deleted runner 'runners-lsf87-qbmtw'"}

I am a little confused, because this job was in the middle of running a test, and it also looked like a registration token had just been updated... yet it deleted the runner anyway. Any thoughts on, or explanation of, the logic used here?

@mumoshu
Collaborator

mumoshu commented Jan 19, 2022

Failed to check if runner is busy. Either this runner has never been successfully registered to GitHub or it still needs more time.

This means that it was unable to find the runner in the results of the "list self-hosted runners" API calls.
I thought you were now testing an unreleased version of ARC that does have pagination support for that API call, right?
Perhaps it's not working as intended, or you're using a wrong/outdated ARC image?

@jbkc85
Author

jbkc85 commented Jan 20, 2022

Failed to check if runner is busy. Either this runner has never been successfully registered to GitHub or it still needs more time.

This means that it was unable to find the runner in the results of the "list self-hosted runners" API calls. I thought you were now testing an unreleased version of ARC that does have pagination support for that API call, right? Perhaps it's not working as intended, or you're using a wrong/outdated ARC image?

I created my own image based off of what is currently in HEAD of master. So it does have the pagination support for the runners now.

Honestly I am not sure where the foul play is coming from just yet - though the following lines (https://github.com/actions-runner-controller/actions-runner-controller/blob/master/github/github.go#L147-L148) have me somewhat concerned:

	// we like to give runners a chance that are just starting up and may miss the expiration date by a bit
	runnerStartupTimeout := 3 * time.Minute

Basically, in our environment a brand-new node is spun up to take care of a pod... this is generally pretty fast, but it can still take 2-3 minutes from the runner being 'scheduled' to the actual runner creation. This means the registration token could already time out, right? If that's so, I think I would like to extend this timeout to at least 5 minutes, hoping to give runners a bit more time to spin up.

That being said, I am also looking into creating an EvictionTimer struct around those items that are marked for deletion or marked as 'safe to delete'. The reason being: if we look at a project like the cluster autoscaler, nodes marked as 'able to delete' have to stay marked as such for at least 10 minutes before removal. I think this project would benefit from a similar approach, giving pods a timer that ultimately gives them a chance to avoid eviction/deletion if a second/third call to the GitHub API reveals that the pods are in fact still active.
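
Roughly what I'm picturing - every name here (EvictionTimer, MarkSafe, the grace period) is hypothetical and modeled on the cluster-autoscaler rule, not existing ARC code:

package eviction

import (
	"sync"
	"time"
)

// EvictionTimer (hypothetical) only treats a runner as a real deletion
// candidate after it has stayed marked safe-to-delete for a full grace
// period, similar to cluster-autoscaler's 10-minute rule.
type EvictionTimer struct {
	mu          sync.Mutex
	gracePeriod time.Duration
	markedAt    map[string]time.Time
}

func NewEvictionTimer(grace time.Duration) *EvictionTimer {
	return &EvictionTimer{gracePeriod: grace, markedAt: map[string]time.Time{}}
}

// MarkSafe records when a runner was first seen as safe to delete and
// reports whether the grace period has elapsed since that first sighting.
func (e *EvictionTimer) MarkSafe(runner string) bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	first, ok := e.markedAt[runner]
	if !ok {
		e.markedAt[runner] = time.Now()
		return false
	}
	return time.Since(first) >= e.gracePeriod
}

// Unmark resets the timer when a later GitHub API call shows the runner
// is in fact still busy, letting it dodge eviction.
func (e *EvictionTimer) Unmark(runner string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	delete(e.markedAt, runner)
}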

Let me know your thoughts, and I apologize if it's a lot of babbling!

@mumoshu
Collaborator

mumoshu commented Jan 20, 2022

Basically in our environment, a brand new node is spun up to take care of a pod...this is generally pretty fast to pick up, but can still take 2-3 minutes from the runner 'scheduling' to the actual runner creation. This means that the registration token could already timeout right?

That's correct, but the 3-minute timeout there takes effect only when the previously created runner token (not the PAT or installation token) is about to expire, so that ARC refreshes the runner token 3 minutes before it actually expires.
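
In other words, the check is roughly like this - a minimal sketch, where the registrationToken type is just a stand-in for what ARC caches, not the actual github.go code:

package main

import (
	"fmt"
	"time"
)

// Stand-in for the cached runner registration token.
type registrationToken struct{ ExpiresAt time.Time }

func main() {
	// The grace window from github.go: runners that are just starting up
	// may otherwise come up holding a token that is about to lapse.
	runnerStartupTimeout := 3 * time.Minute

	token := registrationToken{ExpiresAt: time.Now().Add(2 * time.Minute)}

	// Refresh the token if it would expire before a slow node can bring
	// the runner up, instead of letting registration fail mid-startup.
	needsRefresh := token.ExpiresAt.Before(time.Now().Add(runnerStartupTimeout))
	fmt.Println("refresh registration token:", needsRefresh) // true here
}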

The thing closest to a "runner startup timeout" for the general case is the 10-minute registration timeout: https://github.com/actions-runner-controller/actions-runner-controller/blob/9ae83dfff54c074af7cbc44df812078183d559b5/controllers/runner_controller.go#L296

@mumoshu
Collaborator

mumoshu commented Jan 20, 2022

giving pods a timer which ultimately gives them a chance to avoid eviction/deletion if a second/third call to the GitHub API reveals that the pods are in fact still active.

I think this is already handled by ARC (as it never tries to delete busy runners) and by actions/runner (as it delays the runner agent's termination until the in-progress workflow job, if any, completes).

Back to your original issue:

Suggested desired replicas of 8 by TotalNumberOfQueuedAndInProgressWorkflowRuns

If you think this specific number 8 is too low and doesn't reflect the actual number of queued and in-progress workflow runs, it might be due to pagination not working correctly, or to the recalculation not happening frequently enough.

Regarding the latter, ARC does the recalculation only on each sync interval, which is configurable via the --sync-period flag of ARC's controller-manager. If it's set to 10 seconds or so, it will be able to keep up with demand pretty quickly, but you'll hit the GitHub API rate limit sooner depending on your scale.

If you've set it to 10 minutes or so, scale in/out can be delayed by approximately 10 minutes, but in exchange the GitHub API rate limit is less of an issue.

For #1057 (comment), there might be a chance that the 10-minute registration timeout is not long enough for you?
Unfortunately that timeout is hard-coded, so you'd need to update the code and give it a longer timeout, like registrationTimeout := 20 * time.Minute, to try it out.

@jbkc85
Author

jbkc85 commented Jan 20, 2022

If you think this specific number 8 is too low and doesn't reflect the actual number of queued and in-progress workflow runs, it might be due to pagination not working correctly, or to the recalculation not happening frequently enough.

This was resolved because the code was in master but not in the latest 'release', so when I built an image off of master it worked as expected.

For #1057 (comment), there might be a chance that the 10-minute registration timeout is not long enough for you?
Unfortunately that timeout is hard-coded, so you'd need to update the code and give it a longer timeout, like registrationTimeout := 20 * time.Minute, to try it out.

The weird thing about this - and I can certainly jump over to that issue and comment, given the original title of this issue is no longer a true 'issue' - is that there was no further log about registrationDidTimeout. So theoretically, the pod that can't be checked for busyness should have never been thrown into the deletion-candidates pile, right?

@mumoshu
Collaborator

mumoshu commented Jan 20, 2022

there was no further log about registrationDidTimeout - so theoretically the pod that can't be checked for busyness should have never been thrown into the deletion-candidates pile, right?

I think it's the opposite? Currently, ARC deletes a runner once it has failed to report its status for registrationTimeout, which is equivalent to registrationDidTimeout = true.

registrationTimeout is there for a good reason, so I don't think disabling or removing it entirely is the way to go. It's there to prevent runners that fail in the middle of the initialization process from hanging forever in the K8s cluster. We did see such failure scenarios before (a year ago or so), so we added it as a safeguard.

That said, I guess it could be useful if registrationTimeout were configurable, maybe per ARC installation or per RunnerDeployment. WDYT?

@jbkc85
Author

jbkc85 commented Jan 21, 2022

That said, I guess it can be useful if registrationTimeout was configurable, via maybe per ARC installation or per runnerdeployment. WDYT?

I would like that personally. I do think having it is a good idea, but now that I am spending more time dissecting this code, I am also wondering: is it smart to base the registrationTimeout on the creation time of the pod? There are places in ARC's controllers/runner_controller.go that force a call to update the registration token (https://github.com/actions-runner-controller/actions-runner-controller/blob/9ae83dfff54c074af7cbc44df812078183d559b5/controllers/runner_controller.go#L562) - which implies the restart/recreation of the pod is not actually triggered - yet the reconciliation of pods uses the runner.CreationTimestamp to determine whether or not registration has timed out (https://github.com/actions-runner-controller/actions-runner-controller/blob/9ae83dfff54c074af7cbc44df812078183d559b5/controllers/runnerreplicaset_controller.go#L197). Doesn't this imply that a pod that has been up for 15+ minutes, through whatever means, is subject to immediate reconciliation if it recently re-registered and has not yet updated its status on the GitHub API?
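
A contrived illustration of how the two clocks diverge - hypothetical values, not ARC code:

package main

import (
	"fmt"
	"time"
)

const registrationTimeout = 10 * time.Minute

func main() {
	podCreated := time.Now().Add(-15 * time.Minute)  // pod has been up 15 minutes
	tokenUpdated := time.Now().Add(-1 * time.Minute) // token refreshed a minute ago

	// What the reconciler checks today: elapsed time since pod creation.
	fmt.Println("timed out by CreationTimestamp:", time.Since(podCreated) > registrationTimeout) // true
	// What would be safer after a forced token update: time since the refresh.
	fmt.Println("timed out by last token update:", time.Since(tokenUpdated) > registrationTimeout) // false
}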

@stale

stale bot commented Feb 20, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Feb 20, 2022
@stale stale bot closed this as completed Mar 6, 2022
@mumoshu mumoshu reopened this Mar 6, 2022
@stale stale bot removed the stale label Mar 6, 2022
@toast-gear toast-gear added the pinned Misc issues / PRs we want to keep around label Mar 8, 2022
@mumoshu
Collaborator

mumoshu commented Mar 9, 2022

@jbkc85 Since v0.22 (the next, unreleased version of ARC), ARC will never proactively remove any runner pod whose runner isn't registered to GitHub yet.

It will first wait for the runner to register (or time out) anyway, call the DeleteRunner API to prevent any further job from being scheduled, and only after that remove the runner pod. This way, there is NO chance of job cancellation due to premature runner deletion.
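
Roughly, the order of operations is the following - a simplified sketch, with go-github's RemoveOrganizationRunner standing in for ARC's internal DeleteRunner call and deletePod standing in for the Kubernetes pod deletion:

package scaledown

import (
	"context"

	"github.com/google/go-github/v47/github"
)

// removeRunner sketches the v0.22 scale-down order: deregister the runner
// from GitHub first, delete its pod second.
func removeRunner(ctx context.Context, gh *github.Client, org string, runnerID int64, deletePod func() error) error {
	// Removing the runner from GitHub means no new job can be scheduled
	// onto it (GitHub rejects removal of a runner that is mid-job).
	if _, err := gh.Actions.RemoveOrganizationRunner(ctx, org, runnerID); err != nil {
		return err
	}
	// Only once the runner is gone from GitHub does the pod get deleted,
	// so a running job can no longer be cancelled by premature deletion.
	return deletePod()
}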

@mumoshu
Collaborator

mumoshu commented Mar 9, 2022

@jbkc85 It would work uniformly for both pull-based scaling (e.g. TotalNumberOfQueuedAndInProgressWorkflowRuns) and webhook-based scaling.

I'd appreciate it if you could test our canary version of unreleased ARC (it's tagged canary in the container image registry)!

@toast-gear toast-gear removed the pinned Misc issues / PRs we want to keep around label Apr 6, 2022
@github-actions
Contributor

github-actions bot commented May 7, 2022

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the stale label May 7, 2022
@dennybaa

Hello @mumoshu, please hold this one open, it's really unclear.

Bumped into this as well on v0.23.0 (org-wide):

spec:
  metrics:
  - type: PercentageRunnersBusy
    scaleDownAdjustment: 1
    scaleDownThreshold: "0.3"
    scaleUpAdjustment: 2
    scaleUpThreshold: "0.75"
    repositoryNames:
    - repo1
    - repo2
  - type: TotalNumberOfQueuedAndInProgressWorkflowRuns
    repositoryNames:
    - repo1
    - repo2

I have specifically validated via gh api /repos/{owner}/{repo}/actions/runs that all runs are in status == completed; however, TotalNumberOfQueuedAndInProgressWorkflowRuns is showing a wrong number of desired pods!

NAME                 MIN   MAX   DESIRED   SCHEDULE
gha-ci-runners       0     15    1

@mumoshu
Collaborator

mumoshu commented May 20, 2022

@dennybaa Hey! I have never tested enabling both PercentageRunnersBusy and TotalNumberOfQueuedAndInProgressWorkflowRuns at the same time. How would it work? 🤔

What's your goal? Are you trying to scale to/from zero, and is that why you're saying DESIRED=1 isn't what you want when all the workflow runs are in the completed status? In that case, you should have only TotalNumberOfQueuedAndInProgressWorkflowRuns, as PercentageRunnersBusy does not support scale-from-zero by its nature.

Also, could you share the full RunnerDeployment and HorizontalRunnerAutoscaler YAMLs?

@github-actions github-actions bot removed the stale label May 20, 2022
@dennybaa

dennybaa commented May 20, 2022

Hey @mumoshu,

Are you trying to scale to/from zero, and is that why you're saying DESIRED=1 isn't what you want when all the workflow runs are in the completed status?

First of all, very right: I do not expect DESIRED=1.

Elaborating further, initially there was only TotalNumberOfQueuedAndInProgressWorkflowRuns, but it had the same effect. As documented, when there are two metrics the secondary only kicks in if the primary rule PercentageRunnersBusy suggests 0, and it does indeed look that way.

My goal is to scale from 0, and evidently I do not expect a scale-up if there are no jobs in progress or queued, so the desired state should be 0. Am I missing anything here?

Obfuscated details:

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  annotations:
    meta.helm.sh/release-name: gha-ci-runners
    meta.helm.sh/release-namespace: actions-runner-system
  creationTimestamp: "2022-05-19T19:04:07Z"
  generation: 32
  labels:
    app: raw
    app.kubernetes.io/managed-by: Helm
    chart: raw-v0.3.1
    heritage: Helm
    release: gha-ci-runners
  name: gha-ci-runners
  namespace: actions-runner-system
  resourceVersion: "483095083"
  uid: 7d83e488-53d7-4c9d-b0a1-7d4e3eecd47e
spec:
  replicas: 7
  selector: null
  template:
    metadata: {}
    spec:
      dockerMTU: 1460
      dockerdContainerResources: {}
      dockerdWithinRunnerContainer: true
      env:
      - name: DISABLE_RUNNER_UPDATE
        value: "true"
      image: foo/github-actions-runner:dind
      imagePullPolicy: Always
      labels:
      - ci-runner
      nodeSelector:
        node_pool: github-runner-pool
      organization: foo
      resources:
        requests:
          cpu: "28"
          memory: 64Gi
      serviceAccountName: gha-ci-runners
      tolerations:
      - effect: NoSchedule
        key: github-runner
        operator: Exists
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  annotations:
    meta.helm.sh/release-name: gha-ci-runners
    meta.helm.sh/release-namespace: actions-runner-system
  creationTimestamp: "2022-05-19T19:04:07Z"
  generation: 2
  labels:
    app: raw
    app.kubernetes.io/managed-by: Helm
    chart: raw-v0.3.1
    heritage: Helm
    release: gha-ci-runners
  name: gha-ci-runners
  namespace: actions-runner-system
  resourceVersion: "483094998"
  uid: 8ec8cf42-344b-47c9-9f65-4b86ef11671c
spec:
  maxReplicas: 15
  metrics:
  - repositoryNames:
    - repo1
    - repo2
    scaleDownAdjustment: 1
    scaleDownThreshold: "0.3"
    scaleUpAdjustment: 2
    scaleUpThreshold: "0.75"
    type: PercentageRunnersBusy
  - repositoryNames:
    - repo1
    - repo2
    type: TotalNumberOfQueuedAndInProgressWorkflowRuns
  minReplicas: 0
  scaleTargetRef:
    name: gha-ci-runners

@dennybaa

And it really does look that way through the night, when no CI jobs are scheduled at all; note the node that has been up 7h53m - a runner is always up:

test--github-runner-po-8cba8405-282w   Ready    <none>   7h53m   v1.21.10-gke.2000
test--github-runner-po-8cba8405-b4tx   Ready    <none>   18m     v1.21.10-gke.2000
test--github-runner-po-8cba8405-b5pr   Ready    <none>   20m     v1.21.10-gke.2000
test--github-runner-po-8cba8405-h8qd   Ready    <none>   24m     v1.21.10-gke.2000

@mumoshu
Collaborator

mumoshu commented May 20, 2022

@dennybaa Thanks! Would you mind sharing the full manifests, without removing e.g. metadata.name, so that I can be more confident there are no mistakes in your config?

Also, it would be a good idea to review your controller logs, as ARC usually logs the desired replicas it calculated, and why.

// At least, I can't figure it out from the info provided so far - all I can say at the moment is: if everything is correctly configured - and that includes your other settings (the cloud provider you're using, the kind of your K8s cluster: EKS, AKS, GKE, self-managed, on-prem, the auth method of either GitHub App or token, and so on) - it should work.

@dennybaa

dennybaa commented May 20, 2022

@mumoshu Sure, edited the above post. Yup, exactly these two:

2022-05-20T08:53:00Z	DEBUG	actions-runner-controller.horizontalrunnerautoscaler	Suggested desired replicas of 0 by PercentageRunnersBusy	{"replicas_desired_before": 1, "replicas_desired": 0, "num_runners": 1, "num_runners_registered": 1, "num_runners_busy": 0, "namespace": "actions-runner-system", "kind": "runnerdeployment", "name": "gha-ci-runners", "horizontal_runner_autoscaler": "gha-ci-runners", "enterprise": "", "organization": "xxx", "repository": ""}
2022-05-20T08:53:01Z	DEBUG	actions-runner-controller.horizontalrunnerautoscaler	Suggested desired replicas of 1 by TotalNumberOfQueuedAndInProgressWorkflowRuns	{"workflow_runs_completed": 0, "workflow_runs_in_progress": 0, "workflow_runs_queued": 1, "workflow_runs_unknown": 0, "namespace": "actions-runner-system", "kind": "runnerdeployment", "name": "gha-ci-runners", "horizontal_runner_autoscaler": "gha-ci-runners"}

@mumoshu
Collaborator

mumoshu commented May 20, 2022

Thanks. Seems good so far. How about controller logs?

@dennybaa

Here's a larger slice:

2022-05-20T08:54:11Z	DEBUG	actions-runner-controller.runner	Runner appears to have been registered and running.	{"runner": "actions-runner-system/gha-ci-runners-9bq8b-lrfm2", "podCreationTimestamp": "2022-05-20 08:38:33 +0000 UTC"}
2022-05-20T08:54:53Z	DEBUG	actions-runner-controller.horizontalrunnerautoscaler	Suggested desired replicas of 2 by PercentageRunnersBusy	{"replicas_desired_before": 3, "replicas_desired": 2, "num_runners": 3, "num_runners_registered": 3, "num_runners_busy": 0, "namespace": "actions-runner-system", "kind": "runnerdeployment", "name": "gha-default-stage", "horizontal_runner_autoscaler": "gha-default-stage", "enterprise": "", "organization": "xxx", "repository": ""}
2022-05-20T08:54:53Z	DEBUG	actions-runner-controller.horizontalrunnerautoscaler	Calculated desired replicas of 3	{"horizontalrunnerautoscaler": "actions-runner-system/gha-default-stage", "suggested": 2, "reserved": 0, "min": 3, "max": 9}
2022-05-20T08:54:53Z	DEBUG	actions-runner-controller.horizontalrunnerautoscaler	Suggested desired replicas of 2 by PercentageRunnersBusy	{"replicas_desired_before": 3, "replicas_desired": 2, "num_runners": 3, "num_runners_registered": 3, "num_runners_busy": 0, "namespace": "actions-runner-system", "kind": "runnerdeployment", "name": "gha-deployer-stage", "horizontal_runner_autoscaler": "gha-deployer-stage", "enterprise": "", "organization": "xxx", "repository": ""}
2022-05-20T08:54:53Z	DEBUG	actions-runner-controller.horizontalrunnerautoscaler	Calculated desired replicas of 3	{"horizontalrunnerautoscaler": "actions-runner-system/gha-deployer-stage", "suggested": 2, "reserved": 0, "min": 3, "max": 6}
2022-05-20T08:54:53Z	DEBUG	actions-runner-controller.horizontalrunnerautoscaler	Suggested desired replicas of 0 by PercentageRunnersBusy	{"replicas_desired_before": 1, "replicas_desired": 0, "num_runners": 1, "num_runners_registered": 1, "num_runners_busy": 0, "namespace": "actions-runner-system", "kind": "runnerdeployment", "name": "gha-ci-runners", "horizontal_runner_autoscaler": "gha-ci-runners", "enterprise": "", "organization": "xxx", "repository": ""}
2022-05-20T08:54:54Z	DEBUG	actions-runner-controller.horizontalrunnerautoscaler	Suggested desired replicas of 1 by TotalNumberOfQueuedAndInProgressWorkflowRuns	{"workflow_runs_completed": 0, "workflow_runs_in_progress": 0, "workflow_runs_queued": 1, "workflow_runs_unknown": 0, "namespace": "actions-runner-system", "kind": "runnerdeployment", "name": "gha-ci-runners", "horizontal_runner_autoscaler": "gha-ci-runners"}
2022-05-20T08:54:54Z	DEBUG	actions-runner-controller.horizontalrunnerautoscaler	Calculated desired replicas of 1	{"horizontalrunnerautoscaler": "actions-runner-system/gha-ci-runners", "suggested": 1, "reserved": 0, "min": 0, "max": 15}
2022-05-20T08:55:48Z	DEBUG	actions-runner-controller.runner	Runner appears to have been registered and running.	{"runner": "actions-runner-system/gha-default-stage-2fr8n-l57r9", "podCreationTimestamp": "2022-05-19 14:58:09 +0000 UTC"}
2022-05-20T08:55:48Z	DEBUG	actions-runner-controller.runner	Runner appears to have been registered and running.	{"runner": "actions-runner-system/gha-default-stage-2fr8n-8pkkp", "podCreationTimestamp": "2022-05-19 23:24:48 +0000 UTC"}
2022-05-20T08:55:48Z	DEBUG	actions-runner-controller.runner	Runner appears to have been registered and running.	{"runner": "actions-runner-system/gha-deployer-stage-plcwj-kqlnf", "podCreationTimestamp": "2022-05-19 15:54:34 +0000 UTC"}
2022-05-20T08:55:48Z	DEBUG	actions-runner-controller.runner	Runner appears to have been registered and running.	{"runner": "actions-runner-system/gha-default-stage-2fr8n-4fklb", "podCreationTimestamp": "2022-05-19 14:32:26 +0000 UTC"}
2022-05-20T08:55:48Z	DEBUG	actions-runner-controller.runner	Runner appears to have been registered and running.	{"runner": "actions-runner-system/gha-deployer-stage-plcwj-spf6l", "podCreationTimestamp": "2022-05-19 15:53:39 +0000 UTC"}
2022-05-20T08:55:48Z	DEBUG	actions-runner-controller.runner	Runner appears to have been registered and running.	{"runner": "actions-runner-system/gha-ci-runners-9bq8b-lrfm2", "podCreationTimestamp": "2022-05-20 08:38:33 +0000 UTC"}
2022-05-20T08:55:48Z	DEBUG	actions-runner-controller.runner	Runner appears to have been registered and running.	{"runner": "actions-runner-system/gha-deployer-stage-plcwj-r6vpv", "podCreationTimestamp": "2022-05-19 15:56:04 +0000 UTC"}
2022-05-20T08:56:08Z	DEBUG	actions-runner-controller.runner	Runner appears to have been registered and running.	{"runner": "actions-runner-system/gha-default-stage-2fr8n-4fklb", "podCreationTimestamp": "2022-05-19 14:32:26 +0000 UTC"}
2022-05-20T08:56:08Z	DEBUG	actions-runner-controller.runner	Runner appears to have been registered and running.	{"runner": "actions-runner-system/gha-deployer-stage-plcwj-r6vpv", "podCreationTimestamp": "2022-05-19 15:56:04 +0000 UTC"}
2022-05-20T08:56:08Z	DEBUG	actions-runner-controller.runner	Runner appears to have been registered and running.	{"runner": "actions-runner-system/gha-ci-runners-9bq8b-lrfm2", "podCreationTimestamp": "2022-05-20 08:38:33 +0000 UTC"}
2022-05-20T08:56:08Z	DEBUG	actions-runner-controller.runner	Runner appears to have been registered and running.	{"runner": "actions-runner-system/gha-deployer-stage-plcwj-spf6l", "podCreationTimestamp": "2022-05-19 15:53:39 +0000 UTC"}
2022-05-20T08:56:08Z	DEBUG	actions-runner-controller.runner	Runner appears to have been registered and running.	{"runner": "actions-runner-system/gha-default-stage-2fr8n-l57r9", "podCreationTimestamp": "2022-05-19 14:58:09 +0000 UTC"}
2022-05-20T08:56:08Z	DEBUG	actions-runner-controller.runner	Runner appears to have been registered and running.	{"runner": "actions-runner-system/gha-deployer-stage-plcwj-kqlnf", "podCreationTimestamp": "2022-05-19 15:54:34 +0000 UTC"}
2022-05-20T08:56:08Z	DEBUG	actions-runner-controller.runner	Runner appears to have been registered and running.	{"runner": "actions-runner-system/gha-default-stage-2fr8n-8pkkp", "podCreationTimestamp": "2022-05-19 23:24:48 +0000 UTC"}
2022-05-20T08:56:46Z	DEBUG	actions-runner-controller.horizontalrunnerautoscaler	Suggested desired replicas of 2 by PercentageRunnersBusy	{"replicas_desired_before": 3, "replicas_desired": 2, "num_runners": 3, "num_runners_registered": 3, "num_runners_busy": 0, "namespace": "actions-runner-system", "kind": "runnerdeployment", "name": "gha-default-stage", "horizontal_runner_autoscaler": "gha-default-stage", "enterprise": "", "organization": "xxx", "repository": ""}
2022-05-20T08:56:46Z	DEBUG	actions-runner-controller.horizontalrunnerautoscaler	Calculated desired replicas of 3	{"horizontalrunnerautoscaler": "actions-runner-system/gha-default-stage", "suggested": 2, "reserved": 0, "min": 3, "max": 9}
2022-05-20T08:56:47Z	DEBUG	actions-runner-controller.horizontalrunnerautoscaler	Suggested desired replicas of 2 by PercentageRunnersBusy	{"replicas_desired_before": 3, "replicas_desired": 2, "num_runners": 3, "num_runners_registered": 3, "num_runners_busy": 0, "namespace": "actions-runner-system", "kind": "runnerdeployment", "name": "gha-deployer-stage", "horizontal_runner_autoscaler": "gha-deployer-stage", "enterprise": "", "organization": "xxx", "repository": ""}
2022-05-20T08:56:47Z	DEBUG	actions-runner-controller.horizontalrunnerautoscaler	Calculated desired replicas of 3	{"horizontalrunnerautoscaler": "actions-runner-system/gha-deployer-stage", "suggested": 2, "reserved": 0, "min": 3, "max": 6}
2022-05-20T08:56:47Z	DEBUG	actions-runner-controller.horizontalrunnerautoscaler	Suggested desired replicas of 0 by PercentageRunnersBusy	{"replicas_desired_before": 1, "replicas_desired": 0, "num_runners": 1, "num_runners_registered": 1, "num_runners_busy": 0, "namespace": "actions-runner-system", "kind": "runnerdeployment", "name": "gha-ci-runners", "horizontal_runner_autoscaler": "gha-ci-runners", "enterprise": "", "organization": "xxx", "repository": ""}
2022-05-20T08:56:48Z	DEBUG	actions-runner-controller.horizontalrunnerautoscaler	Suggested desired replicas of 1 by TotalNumberOfQueuedAndInProgressWorkflowRuns	{"workflow_runs_completed": 0, "workflow_runs_in_progress": 0, "workflow_runs_queued": 1, "workflow_runs_unknown": 0, "namespace": "actions-runner-system", "kind": "runnerdeployment", "name": "gha-ci-runners", "horizontal_runner_autoscaler": "gha-ci-runners"}
2022-05-20T08:56:48Z	DEBUG	actions-runner-controller.horizontalrunnerautoscaler	Calculated desired replicas of 1	{"horizontalrunnerautoscaler": "actions-runner-system/gha-ci-runners", "suggested": 1, "reserved": 0, "min": 0, "max": 15}
2022-05-20T08:57:55Z	DEBUG	actions-runner-controller.runner	Runner appears to have been registered and running.	{"runner": "actions-runner-system/gha-default-stage-2fr8n-4fklb", "podCreationTimestamp": "2022-05-19 14:32:26 +0000 UTC"}
2022-05-20T08:57:55Z	DEBUG	actions-runner-controller.runner	Runner appears to have been registered and running.	{"runner": "actions-runner-system/gha-deployer-stage-plcwj-spf6l", "podCreationTimestamp": "2022-05-19 15:53:39 +0000 UTC"}
2022-05-20T08:57:55Z	DEBUG	actions-runner-controller.runner	Runner appears to have been registered and running.	{"runner": "actions-runner-system/gha-ci-runners-9bq8b-lrfm2", "podCreationTimestamp": "2022-05-20 08:38:33 +0000 UTC"}
2022-05-20T08:57:55Z	DEBUG	actions-runner-controller.runner	Runner appears to have been registered and running.	{"runner": "actions-runner-system/gha-deployer-stage-plcwj-r6vpv", "podCreationTimestamp": "2022-05-19 15:56:04 +0000 UTC"}
2022-05-20T08:57:55Z	DEBUG	actions-runner-controller.runner	Runner appears to have been registered and running.	{"runner": "actions-runner-system/gha-default-stage-2fr8n-l57r9", "podCreationTimestamp": "2022-05-19 14:58:09 +0000 UTC"}
2022-05-20T08:57:55Z	DEBUG	actions-runner-controller.runner	Runner appears to have been registered and running.	{"runner": "actions-runner-system/gha-default-stage-2fr8n-8pkkp", "podCreationTimestamp": "2022-05-19 23:24:48 +0000 UTC"}
2022-05-20T08:57:55Z	DEBUG	actions-runner-controller.runner	Runner appears to have been registered and running.	{"runner": "actions-runner-system/gha-deployer-stage-plcwj-kqlnf", "podCreationTimestamp": "2022-05-19 15:54:34 +0000 UTC"}
2022-05-20T08:58:05Z	DEBUG	actions-runner-controller.runner	Runner appears to have been registered and running.	{"runner": "actions-runner-system/gha-default-stage-2fr8n-4fklb", "podCreationTimestamp": "2022-05-19 14:32:26 +0000 UTC"}
2022-05-20T08:58:05Z	DEBUG	actions-runner-controller.runner	Runner appears to have been registered and running.	{"runner": "actions-runner-system/gha-deployer-stage-plcwj-r6vpv", "podCreationTimestamp": "2022-05-19 15:56:04 +0000 UTC"}
2022-05-20T08:58:05Z	DEBUG	actions-runner-controller.runner	Runner appears to have been registered and running.	{"runner": "actions-runner-system/gha-ci-runners-9bq8b-lrfm2", "podCreationTimestamp": "2022-05-20 08:38:33 +0000 UTC"}
2022-05-20T08:58:05Z	DEBUG	actions-runner-controller.runner	Runner appears to have been registered and running.	{"runner": "actions-runner-system/gha-default-stage-2fr8n-8pkkp", "podCreationTimestamp": "2022-05-19 23:24:48 +0000 UTC"}
2022-05-20T08:58:05Z	DEBUG	actions-runner-controller.runner	Runner appears to have been registered and running.	{"runner": "actions-runner-system/gha-deployer-stage-plcwj-spf6l", "podCreationTimestamp": "2022-05-19 15:53:39 +0000 UTC"}
2022-05-20T08:58:05Z	DEBUG	actions-runner-controller.runner	Runner appears to have been registered and running.	{"runner": "actions-runner-system/gha-default-stage-2fr8n-l57r9", "podCreationTimestamp": "2022-05-19 14:58:09 +0000 UTC"}
2022-05-20T08:58:05Z	DEBUG	actions-runner-controller.runner	Runner appears to have been registered and running.	{"runner": "actions-runner-system/gha-deployer-stage-plcwj-kqlnf", "podCreationTimestamp": "2022-05-19 15:54:34 +0000 UTC"}
2022-05-20T08:58:40Z	DEBUG	actions-runner-controller.horizontalrunnerautoscaler	Suggested desired replicas of 2 by PercentageRunnersBusy	{"replicas_desired_before": 3, "replicas_desired": 2, "num_runners": 3, "num_runners_registered": 3, "num_runners_busy": 0, "namespace": "actions-runner-system", "kind": "runnerdeployment", "name": "gha-default-stage", "horizontal_runner_autoscaler": "gha-default-stage", "enterprise": "", "organization": "xxx", "repository": ""}
2022-05-20T08:58:40Z	DEBUG	actions-runner-controller.horizontalrunnerautoscaler	Calculated desired replicas of 3	{"horizontalrunnerautoscaler": "actions-runner-system/gha-default-stage", "suggested": 2, "reserved": 0, "min": 3, "max": 9}
2022-05-20T08:58:40Z	DEBUG	actions-runner-controller.horizontalrunnerautoscaler	Suggested desired replicas of 2 by PercentageRunnersBusy	{"replicas_desired_before": 3, "replicas_desired": 2, "num_runners": 3, "num_runners_registered": 3, "num_runners_busy": 0, "namespace": "actions-runner-system", "kind": "runnerdeployment", "name": "gha-deployer-stage", "horizontal_runner_autoscaler": "gha-deployer-stage", "enterprise": "", "organization": "xxx", "repository": ""}
2022-05-20T08:58:40Z	DEBUG	actions-runner-controller.horizontalrunnerautoscaler	Calculated desired replicas of 3	{"horizontalrunnerautoscaler": "actions-runner-system/gha-deployer-stage", "suggested": 2, "reserved": 0, "min": 3, "max": 6}
2022-05-20T08:58:41Z	DEBUG	actions-runner-controller.horizontalrunnerautoscaler	Suggested desired replicas of 0 by PercentageRunnersBusy	{"replicas_desired_before": 1, "replicas_desired": 0, "num_runners": 1, "num_runners_registered": 1, "num_runners_busy": 0, "namespace": "actions-runner-system", "kind": "runnerdeployment", "name": "gha-ci-runners", "horizontal_runner_autoscaler": "gha-ci-runners", "enterprise": "", "organization": "xxx", "repository": ""}
2022-05-20T08:58:41Z	DEBUG	actions-runner-controller.horizontalrunnerautoscaler	Suggested desired replicas of 1 by TotalNumberOfQueuedAndInProgressWorkflowRuns	{"workflow_runs_completed": 0, "workflow_runs_in_progress": 0, "workflow_runs_queued": 1, "workflow_runs_unknown": 0, "namespace": "actions-runner-system", "kind": "runnerdeployment", "name": "gha-ci-runners", "horizontal_runner_autoscaler": "gha-ci-runners"}
2022-05-20T08:58:41Z	DEBUG	actions-runner-controller.horizontalrunnerautoscaler	Calculated desired replicas of 1	{"horizontalrunnerautoscaler": "actions-runner-system/gha-ci-runners", "suggested": 1, "reserved": 0, "min": 0, "max": 15}

@mumoshu
Collaborator

mumoshu commented May 20, 2022

@dennybaa Thanks! So Suggested desired replicas of 1 by TotalNumberOfQueuedAndInProgressWorkflowRuns {"workflow_runs_completed": 0, "workflow_runs_in_progress": 0, "workflow_runs_queued": 1, does seem to say that you still have a run that is stuck in queued.

Have you ever tried checking the responses from the GitHub API related to it? I'd guess either the list workflow runs or the list workflow jobs API will return the single run that is stuck in queued, if you paginate through.
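
For example, a minimal sketch of the paginated check using go-github (the library ARC itself uses; "owner" and "repo1" are placeholders, and you'd wire up oauth2 auth for private repos):

package main

import (
	"context"
	"fmt"

	"github.com/google/go-github/v47/github"
)

func main() {
	ctx := context.Background()
	gh := github.NewClient(nil) // unauthenticated; fine for public repos

	opts := &github.ListWorkflowRunsOptions{
		Status:      "queued",
		ListOptions: github.ListOptions{PerPage: 100},
	}
	for {
		runs, resp, err := gh.Actions.ListRepositoryWorkflowRuns(ctx, "owner", "repo1", opts)
		if err != nil {
			panic(err)
		}
		for _, run := range runs.WorkflowRuns {
			fmt.Printf("stuck run: %d %s\n", run.GetID(), run.GetName())
		}
		if resp.NextPage == 0 {
			break // paginated through every page
		}
		opts.Page = resp.NextPage
	}
}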

@dennybaa

Thank you @mumoshu! I have indeed checked thoroughly, paginating through all of the pages, and there was one dangling workflow run; I deleted it and:

NAME                 MIN   MAX   DESIRED   SCHEDULE
gha-ci-runners       0     15    0 

Thank you again! 🙏
