Metric TotalNumberOfQueuedAndInProgressWorkflowRuns doesn't seem to line up with Current workflow/job counts #1057
Comments
I might be caught on the whole "this is fixed but not released" - as I see the pagination was added recently but isn't in release 0.20.4 yet. I built it locally and will test it out tonight!
@jbkc85 Just curious, but how many repositories are intended to use the same set of organizational runners managed by your
Only one. However, this one has a generated matrix that can have anywhere between 75 and 200+ tests associated with it. Furthermore, during a release it actually has a matrix that is used on a second matrix... meaning that you generally have 125+ tests running in parallel. Very expensive testing suite and lots of dynamics :-)
@jbkc85 please do let us know how your testing goes; we're currently testing HEAD for a release, so it would be great to know if real-world use also shows no issues.
so far so good! The only issues we have run into with tests today have been memory related - which I wish was easier to detect (as the runner controller will delete the pod before I can investigate and find out about the memory eviction). Will update after a full day of runs.
So during a
I am a little confused because this job was in the middle of running a test, and it also looked like a registration token had just been updated... yet it deleted the runner anyway. Any thoughts or an explanation of the logic used here?
This means that it was unable to find the runner in the results of the "list self-hosted runners" API calls.
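For context, finding a runner in those results means paging through the whole list; a runner sitting on a later page can otherwise look unregistered. Below is a rough sketch of such paginated listing using google/go-github - the client version, authentication, and repository names are assumptions for illustration, not ARC's exact code:

```go
package main

import (
	"context"
	"fmt"

	"github.com/google/go-github/v39/github"
)

// listAllRunners pages through the "list self-hosted runners" results until no
// pages remain, so a runner sitting on page 2+ is not mistaken for missing.
// (An analogous org-level listing call exists for organizational runners.)
func listAllRunners(ctx context.Context, client *github.Client, owner, repo string) ([]*github.Runner, error) {
	var all []*github.Runner
	opts := &github.ListOptions{PerPage: 100}
	for {
		runners, resp, err := client.Actions.ListRunners(ctx, owner, repo, opts)
		if err != nil {
			return nil, err
		}
		all = append(all, runners.Runners...)
		if resp.NextPage == 0 {
			return all, nil
		}
		opts.Page = resp.NextPage
	}
}

func main() {
	ctx := context.Background()
	client := github.NewClient(nil) // add authentication for real use
	runners, err := listAllRunners(ctx, client, "my-org", "my-repo")
	if err != nil {
		panic(err)
	}
	fmt.Println("registered runners:", len(runners))
}
```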
I created my own image based off of what is currently in master. Honestly, I am not sure where the foul play is coming from just yet - though the following lines (https://github.com/actions-runner-controller/actions-runner-controller/blob/master/github/github.go#L147-L148) have me somewhat concerned:
Basically, in our environment a brand new node is spun up to take care of a pod... this is generally pretty fast, but it can still take 2-3 minutes from the runner 'scheduling' to the actual runner creation. This means that the registration token could already time out, right? If that's so, I think I would like to extend this timeout to at least 5 minutes, hoping to give runners a bit more time to spin up. That being said, I am also looking into creating a struct creating an
Let me know your thoughts, and I apologize if it's a lot of babbling on!
That's correct, but the 3-minute timeout there only takes place when the previously created runner token (not a PAT or installation token) is about to expire, so that it refreshes the runner token 3 minutes before it actually expires. The thing that is closest to a "runner startup timeout" for general cases is the 10 minutes here: https://github.com/actions-runner-controller/actions-runner-controller/blob/9ae83dfff54c074af7cbc44df812078183d559b5/controllers/runner_controller.go#L296
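For illustration, the kind of check being discussed looks roughly like this: reuse a cached registration token unless it is within a few minutes of expiry, and only then request a new one. The struct, helper, and go-github version below are illustrative assumptions, not ARC's actual implementation:

```go
package runnertoken

import (
	"context"
	"time"

	"github.com/google/go-github/v39/github"
)

// refreshWindow is how long before expiry a cached registration token is
// treated as stale; the thread discusses a 3-minute window and a proposal
// to extend it so slow-starting nodes still receive a valid token.
const refreshWindow = 3 * time.Minute

type tokenCache struct {
	token     string
	expiresAt time.Time
}

// getRegistrationToken returns the cached runner registration token while it
// is still comfortably valid, and otherwise asks GitHub for a fresh one.
func (c *tokenCache) getRegistrationToken(ctx context.Context, client *github.Client, owner, repo string) (string, error) {
	if c.token != "" && time.Now().Before(c.expiresAt.Add(-refreshWindow)) {
		return c.token, nil
	}
	rt, _, err := client.Actions.CreateRegistrationToken(ctx, owner, repo)
	if err != nil {
		return "", err
	}
	c.token = rt.GetToken()
	c.expiresAt = rt.GetExpiresAt().Time
	return c.token, nil
}
```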
I think this is already handled by ARC (as it never tries to delete busy runners) and actions/runner (as it delays the runner agent's termination until the in-progress workflow job completes, if any). Back to your original issue:
If you think this specific number
Regarding the latter, ARC does the recalculation only on each sync interval, which is configurable via
If you've set it to 10 minutes or so, scale in/out can be delayed by approximately 10 minutes, but in exchange the GitHub API rate limit is less of an issue.
For #1057 (comment), there might be a chance that the 10-minute registration timeout is not long enough for you?
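For illustration, "recalculate on each sync interval" corresponds to the common controller-runtime pattern of requeueing the object after a configurable period; the following is a minimal sketch of that pattern under assumed names, not ARC's actual reconciler:

```go
package autoscaler

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// Reconciler recomputes the desired replica count once per sync period.
// A longer period means fewer GitHub API calls (kinder to rate limits)
// but slower reaction when scaling in or out.
type Reconciler struct {
	SyncPeriod time.Duration // e.g. 10 * time.Minute
}

func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// ...fetch the HorizontalRunnerAutoscaler, query GitHub, update desired replicas...

	// Come back after the sync period instead of relying on watch events alone.
	return ctrl.Result{RequeueAfter: r.SyncPeriod}, nil
}
```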
This was resolved because the code was in master but not in the latest 'release', so when I built an image off of master it worked as expected.
The weird thing about this - and I can certainly jump over to that issue and comment, given the original title of this issue is no longer a true 'issue' - is that there was no further log about the
I think it's the opposite? Currently, ARC deletes a runner once it has failed to report its status for
That said, I guess it can be useful if
I would like that personally. I do think having it is a good idea, but now that I am spending more time dissecting this code, I am also wondering... is it smart to base the registrationTimeout on the creation of the pod? There are instances in ARC's controllers/runner_controller.go that force a call to update the registration token (https://github.com/actions-runner-controller/actions-runner-controller/blob/9ae83dfff54c074af7cbc44df812078183d559b5/controllers/runner_controller.go#L562) - which implies that a restart/recreation of the pod is not actually triggered... yet the reconciliation of pods uses the
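To make the concern concrete, here is a hedged sketch of a registration-timeout check keyed off the pod's creation timestamp rather than the last token refresh; the names and constant are illustrative, not ARC's actual code:

```go
package autoscaler

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

const registrationTimeout = 10 * time.Minute

// registrationTimedOut reports whether a pod has existed longer than the
// registration timeout without its runner ever appearing in the
// "list self-hosted runners" results. If the clock were instead keyed off the
// last registration-token refresh, a token update alone would reset it even
// though the pod was never recreated - which is the concern raised above.
func registrationTimedOut(pod *corev1.Pod, runnerRegistered bool) bool {
	if runnerRegistered {
		return false
	}
	return time.Since(pod.CreationTimestamp.Time) > registrationTimeout
}
```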
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@jbkc85 Since v0.22 (the next, unreleased version of ARC), ARC will never proactively remove any runner pods whose runners aren't registered to GitHub yet. It will first wait for the runner to register (or time out), then call the DeleteRunner API to prevent any further jobs from being scheduled, and only after that will it remove the runner pod. This way, there is NO chance of job cancellation due to premature runner deletion.
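In other words, the teardown order is: wait for registration (or timeout), unregister the runner via the API so no new job can land on it, and delete the pod last. A simplified sketch of that ordering follows; the function shape, parameters, and go-github version are illustrative assumptions, not the actual ARC code:

```go
package autoscaler

import (
	"context"

	"github.com/google/go-github/v39/github"
	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// removeRunner tears a runner down without cancelling in-flight jobs: it does
// nothing until the runner has registered (or timed out), unregisters it on
// the GitHub side first, and deletes the pod only after that. A repository-
// level call is shown; an analogous org-level call exists in go-github.
func removeRunner(ctx context.Context, gh *github.Client, kube client.Client,
	owner, repo string, runnerID int64, registered bool, pod *corev1.Pod) error {
	if !registered {
		// Not registered yet: requeue and try again later rather than deleting
		// the pod, so a runner that is still starting up is never killed early.
		return nil
	}
	// Unregister first so GitHub stops scheduling new jobs onto this runner.
	if _, err := gh.Actions.RemoveRunner(ctx, owner, repo, runnerID); err != nil {
		return err
	}
	// Only now is it safe to delete the pod.
	return kube.Delete(ctx, pod)
}
```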
@jbkc85 It would work uniformly for both pull-based scale (e.g. TotalNumberOfQueuedAndInProgressWorkflowRuns) and webhook-based scale. I'd appreciate it if you could test our canary version of unreleased ARC (it's tagged with
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Hello @mumoshu, please hold this one, it's really unclear.

Bumped into this as well on v0.23.0 (org-wide):

```yaml
spec:
  - type: PercentageRunnersBusy
    scaleDownAdjustment: 1
    scaleDownThreshold: "0.3"
    scaleUpAdjustment: 2
    scaleUpThreshold: "0.75"
    repositoryNames:
      - repo1
      - repo2
  - type: TotalNumberOfQueuedAndInProgressWorkflowRuns
    repositoryNames:
      - repo1
      - repo2
```

I have specifically validated
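As discussed further down in the thread, when two metric types are configured the secondary one only kicks in when the primary suggests zero, which is what enables scaling back up from zero. A minimal sketch of that kind of fallback, with illustrative names rather than ARC's implementation:

```go
package autoscaler

// desiredReplicas mirrors the two-metric behavior described in this thread:
// the secondary metric (e.g. TotalNumberOfQueuedAndInProgressWorkflowRuns) is
// only consulted when the primary metric (e.g. PercentageRunnersBusy) yields
// zero, which is what allows scaling up again from zero runners.
func desiredReplicas(primary, secondary func() int) int {
	if n := primary(); n > 0 {
		return n
	}
	return secondary()
}
```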
@dennybaa Hey! I have never tested enabling both PercentageRunnersBusy and TotalNumberOfQueuedAndInProgressWorkflowRuns at the same time. How would it work? 🤔 What's your goal? Are you trying to scale to/from zero, and is that why you're saying
Also, could you share the full RunnerDeployment and HorizontalRunnerAutoscaler YAMLs?
Hey @mumoshu,
First of all, very right, I do not expect
Elaborating further: initially it was only TotalNumberOfQueuedAndInProgressWorkflowRuns, but it had the same effect. As it's stated, when there are two rules, the secondary only kicks in if the primary rule PercentageRunnersBusy gives 0, and it indeed looks so. My goal is to scale from 0, and evidently I do not expect to scale up if there are no jobs in progress or in the queue, so the desired state should be 0. Am I missing anything here? Obfuscated details:
And it really looks so through the night, when no CI jobs are scheduled at all; note
@dennybaa Thanks! Would you mind sharing the full manifest, without removing e.g. metadata.name, so that I'm more confident there are no mistakes in your config? Also, it would be a good idea to review your controller logs, as the controller usually prints the desired replicas ARC calculated, and why. At least, I can't figure it out with all the info you've provided so far - all I can say at the moment is "if everything is correctly configured, and assuming your other settings and configurations (including the cloud provider you're using, the kind of your k8s cluster - EKS, AKS, GKE, self-managed, on-prem - and the auth method of either a GitHub App or a token, and so on), it should work".
@mumoshu Sure, edited the above post. Yup, exactly these two:
Thanks. Seems good so far. How about controller logs?
Here's a larger slice:
@dennybaa Thanks! So
Have you ever tried checking the responses from the GitHub API related to it? I guess either of
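One way to cross-check the number is to list workflow runs with status=queued and status=in_progress directly and page through all results. A rough sketch using google/go-github follows; the client version, authentication, and repository names are assumptions for illustration, not the code ARC uses:

```go
package main

import (
	"context"
	"fmt"

	"github.com/google/go-github/v39/github"
)

// countRuns pages through all workflow runs in the given status ("queued" or
// "in_progress") so dangling runs on later pages are not missed.
func countRuns(ctx context.Context, client *github.Client, owner, repo, status string) (int, error) {
	total := 0
	opts := &github.ListWorkflowRunsOptions{
		Status:      status,
		ListOptions: github.ListOptions{PerPage: 100},
	}
	for {
		runs, resp, err := client.Actions.ListRepositoryWorkflowRuns(ctx, owner, repo, opts)
		if err != nil {
			return 0, err
		}
		total += len(runs.WorkflowRuns)
		if resp.NextPage == 0 {
			return total, nil
		}
		opts.Page = resp.NextPage
	}
}

func main() {
	ctx := context.Background()
	client := github.NewClient(nil) // add authentication for real use
	for _, status := range []string{"queued", "in_progress"} {
		n, err := countRuns(ctx, client, "my-org", "my-repo", status)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s: %d\n", status, n)
	}
}
```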
Thank you @mumoshu! I have indeed checked, thoroughly paginating through all of the pages, and there was one dangling workflow; I deleted it and:

Thank you again! 🙏
Describe the bug
We are still battling the 'cancellation' bug in our organization, many times running into pod reconciles at times where it doesn't make sense. Typically the scale-up works (though since 0.20.1 it has been scaling up tremendously slowly, if it even scales up to the proper number of runners needed - more in the details), but scaling down happens prematurely and cancels active workflow jobs. I have been able to test the 0.20.4 version, but cancellations occur even more often with that release, and that includes being capped at 24 runners for some odd reason. I started monitoring the suggested replicas and am seeing a lot of oddities, which I have not yet been able to map to any specific part of the code...
Checks
Example Log from Controller:
Example from CLI written w/ same code to grab workflow runs/jobs:
To Reproduce
Steps to reproduce the behavior:
Expected behavior
A clear and concise description of what you expected to happen.
Screenshots
Environment (please complete the following information):
Additional context
This has been an ongoing issue, but since the upgrade to the 0.20.x branch I believe it's gotten worse. I am not sure if it's the GitHub API or something in the calculations that determine the suggested replicas. Either way, is there anything I am missing here? I am not sure why I can get such different results from my own GitHub API calls on the command line versus the controller autoscaler's suggested replicas.

scaleDownDelaySecondsAfterScaleOut: 7200 was put in to battle some of the unnecessary kills; it isn't working all that well though, because after a scale-up, when the runners stay at scale, it eventually causes runners to get interrupted (by controller kills) in the middle of tests as well.
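For clarity, scaleDownDelaySecondsAfterScaleOut is meant to suppress scale-down for a period after a scale-up; conceptually it boils down to a check like the sketch below, with illustrative names rather than ARC's actual code:

```go
package autoscaler

import "time"

// scaleDownAllowed captures the intent of scaleDownDelaySecondsAfterScaleOut:
// after the most recent scale-up, refuse any scale-down until the configured
// delay has elapsed, so freshly added runners are not torn down mid-job.
func scaleDownAllowed(lastScaleUp time.Time, delay time.Duration, now time.Time) bool {
	return now.Sub(lastScaleUp) >= delay
}
```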