Metric TotalNumberOfQueuedAndInProgressWorkflowRuns doesn't seem to line up with Current workflow/job counts #1057
Comments
I might be caught on the whole "this is fixed but not released" - as I see the pagination was added recently but isn't in release 0.20.4 yet. I built it locally and will test it out tonight!
@jbkc85 Just curious, but how many repositories are intended to use the same set of organizational runners managed by your
Only one. However, this one has a generated matrix that can have anywhere between 75 and 200+ tests associated with it. Furthermore, during a release it actually has a matrix that is used on a second matrix... meaning that you generally have 125+ tests running in parallel. Very expensive testing suite and lots of dynamics :-)
@jbkc85 please do let us know how your testing goes; we're currently testing HEAD for a release, so it would be great to know if real-world use also shows no issues.
so far so good! The only issues we have run into with tests today have been memory related - which I wish was easier to detect (as the runner controller will delete the pod before I can investigate and find out about the memory eviction). Will update after a full day of runs.
So during a
I am a little confused because this job was in the middle of running a test, and it also looked like a registration token had just been updated... yet it deleted the runner anyway. Any thoughts or an explanation of the logic used here?
This means that it was unable to find the runner in the results of the "list self-hosted runners" API calls.
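For context, finding a runner in those results means paging through the whole list; a runner sitting on a later page can otherwise look unregistered. Below is a rough sketch of such paginated listing using google/go-github - the client version, authentication, and repository names are assumptions for illustration, not ARC's exact code:

```go
package main

import (
	"context"
	"fmt"

	"github.com/google/go-github/v39/github"
)

// listAllRunners pages through the "list self-hosted runners" results until no
// pages remain, so a runner sitting on page 2+ is not mistaken for missing.
// (An analogous org-level listing call exists for organizational runners.)
func listAllRunners(ctx context.Context, client *github.Client, owner, repo string) ([]*github.Runner, error) {
	var all []*github.Runner
	opts := &github.ListOptions{PerPage: 100}
	for {
		runners, resp, err := client.Actions.ListRunners(ctx, owner, repo, opts)
		if err != nil {
			return nil, err
		}
		all = append(all, runners.Runners...)
		if resp.NextPage == 0 {
			return all, nil
		}
		opts.Page = resp.NextPage
	}
}

func main() {
	ctx := context.Background()
	client := github.NewClient(nil) // add authentication for real use
	runners, err := listAllRunners(ctx, client, "my-org", "my-repo")
	if err != nil {
		panic(err)
	}
	fmt.Println("registered runners:", len(runners))
}
```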
I created my own image based off of what is currently in master. Honestly, I am not sure where the foul play is coming from just yet - though the following lines (https://github.com/actions-runner-controller/actions-runner-controller/blob/master/github/github.go#L147-L148) have me somewhat concerned:
Basically, in our environment a brand new node is spun up to take care of a pod... this is generally pretty fast, but it can still take 2-3 minutes from the runner 'scheduling' to the actual runner creation. This means that the registration token could already time out, right? If that's so, I think I would like to extend this timeout to at least 5 minutes, hoping to give runners a bit more time to spin up. That being said, I am also looking into creating a struct creating an
Let me know your thoughts, and I apologize if it's a lot of babbling on!
That's correct, but the 3-minute timeout there only takes place when the previously created runner token (not a PAT or installation token) is about to expire, so that it refreshes the runner token 3 minutes before it actually expires. The thing that is closest to a "runner startup timeout" for general cases is the 10 minutes here: https://github.com/actions-runner-controller/actions-runner-controller/blob/9ae83dfff54c074af7cbc44df812078183d559b5/controllers/runner_controller.go#L296
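For illustration, the kind of check being discussed looks roughly like this: reuse a cached registration token unless it is within a few minutes of expiry, and only then request a new one. The struct, helper, and go-github version below are illustrative assumptions, not ARC's actual implementation:

```go
package runnertoken

import (
	"context"
	"time"

	"github.com/google/go-github/v39/github"
)

// refreshWindow is how long before expiry a cached registration token is
// treated as stale; the thread discusses a 3-minute window and a proposal
// to extend it so slow-starting nodes still receive a valid token.
const refreshWindow = 3 * time.Minute

type tokenCache struct {
	token     string
	expiresAt time.Time
}

// getRegistrationToken returns the cached runner registration token while it
// is still comfortably valid, and otherwise asks GitHub for a fresh one.
func (c *tokenCache) getRegistrationToken(ctx context.Context, client *github.Client, owner, repo string) (string, error) {
	if c.token != "" && time.Now().Before(c.expiresAt.Add(-refreshWindow)) {
		return c.token, nil
	}
	rt, _, err := client.Actions.CreateRegistrationToken(ctx, owner, repo)
	if err != nil {
		return "", err
	}
	c.token = rt.GetToken()
	c.expiresAt = rt.GetExpiresAt().Time
	return c.token, nil
}
```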
I think this is already handled by ARC (as it never tries to delete busy runners) and actions/runner (as it delays the runner agent's termination until the in-progress workflow job completes, if any). Back to your original issue:
If you think this specific number
Regarding the latter, ARC does the recalculation only on each sync interval, which is configurable via
If you've set it to 10 minutes or so, scale in/out can be delayed by approximately 10 minutes, but in exchange the GitHub API rate limit is less of an issue.
For #1057 (comment), there might be a chance that the 10-minute registration timeout is not long enough for you?
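For illustration, "recalculate on each sync interval" corresponds to the common controller-runtime pattern of requeueing the object after a configurable period; the following is a minimal sketch of that pattern under assumed names, not ARC's actual reconciler:

```go
package autoscaler

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// Reconciler recomputes the desired replica count once per sync period.
// A longer period means fewer GitHub API calls (kinder to rate limits)
// but slower reaction when scaling in or out.
type Reconciler struct {
	SyncPeriod time.Duration // e.g. 10 * time.Minute
}

func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// ...fetch the HorizontalRunnerAutoscaler, query GitHub, update desired replicas...

	// Come back after the sync period instead of relying on watch events alone.
	return ctrl.Result{RequeueAfter: r.SyncPeriod}, nil
}
```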
This was resolved because the code was in master but not in the latest 'release', so when I built an image off of master it worked as expected.
The weird thing about this - and I can certainly jump over to that issue and comment, given the original title of this issue is no longer a true 'issue' - is that there was no further log about the
I think it's the opposite? Currently, ARC deletes a runner once it has failed to report its status for
That said, I guess it can be useful if
I would like that personally. I do think having it is a good idea, but now that I am spending more time dissecting this code, I am also wondering... is it smart to base the registrationTimeout on the creation of the pod? There are instances in ARC's controllers/runner_controller.go that force a call to update the registration token (https://github.com/actions-runner-controller/actions-runner-controller/blob/9ae83dfff54c074af7cbc44df812078183d559b5/controllers/runner_controller.go#L562) - which implies that a restart/recreation of the pod is not actually triggered... yet the reconciliation of pods uses the
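To make the concern concrete, here is a hedged sketch of a registration-timeout check keyed off the pod's creation timestamp rather than the last token refresh; the names and constant are illustrative, not ARC's actual code:

```go
package autoscaler

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

const registrationTimeout = 10 * time.Minute

// registrationTimedOut reports whether a pod has existed longer than the
// registration timeout without its runner ever appearing in the
// "list self-hosted runners" results. If the clock were instead keyed off the
// last registration-token refresh, a token update alone would reset it even
// though the pod was never recreated - which is the concern raised above.
func registrationTimedOut(pod *corev1.Pod, runnerRegistered bool) bool {
	if runnerRegistered {
		return false
	}
	return time.Since(pod.CreationTimestamp.Time) > registrationTimeout
}
```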
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@jbkc85 Since v0.22 (the next, unreleased version of ARC), ARC will never proactively remove any runner pods whose runners aren't registered to GitHub yet. It will first wait for the runner to register (or time out), then call the DeleteRunner API to prevent any further jobs from being scheduled, and only after that will it remove the runner pod. This way, there is NO chance of job cancellation due to premature runner deletion.
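In other words, the teardown order is: wait for registration (or timeout), unregister the runner via the API so no new job can land on it, and delete the pod last. A simplified sketch of that ordering follows; the function shape, parameters, and go-github version are illustrative assumptions, not the actual ARC code:

```go
package autoscaler

import (
	"context"

	"github.com/google/go-github/v39/github"
	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// removeRunner tears a runner down without cancelling in-flight jobs: it does
// nothing until the runner has registered (or timed out), unregisters it on
// the GitHub side first, and deletes the pod only after that. A repository-
// level call is shown; an analogous org-level call exists in go-github.
func removeRunner(ctx context.Context, gh *github.Client, kube client.Client,
	owner, repo string, runnerID int64, registered bool, pod *corev1.Pod) error {
	if !registered {
		// Not registered yet: requeue and try again later rather than deleting
		// the pod, so a runner that is still starting up is never killed early.
		return nil
	}
	// Unregister first so GitHub stops scheduling new jobs onto this runner.
	if _, err := gh.Actions.RemoveRunner(ctx, owner, repo, runnerID); err != nil {
		return err
	}
	// Only now is it safe to delete the pod.
	return kube.Delete(ctx, pod)
}
```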
@jbkc85 It would work uniformly for both pull-based scale (e.g. TotalNumberOfQueuedAndInProgressWorkflowRuns) and webhook-based scale. I'd appreciate it if you could test our canary version of unreleased ARC (it's tagged with
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Hello @mumoshu, please hold this one, it's really unclear.

Bumped into this as well on v0.23.0 (org-wide):

```yaml
spec:
  - type: PercentageRunnersBusy
    scaleDownAdjustment: 1
    scaleDownThreshold: "0.3"
    scaleUpAdjustment: 2
    scaleUpThreshold: "0.75"
    repositoryNames:
      - repo1
      - repo2
  - type: TotalNumberOfQueuedAndInProgressWorkflowRuns
    repositoryNames:
      - repo1
      - repo2
```

I have specifically validated
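As discussed further down in the thread, when two metric types are configured the secondary one only kicks in when the primary suggests zero, which is what enables scaling back up from zero. A minimal sketch of that kind of fallback, with illustrative names rather than ARC's implementation:

```go
package autoscaler

// desiredReplicas mirrors the two-metric behavior described in this thread:
// the secondary metric (e.g. TotalNumberOfQueuedAndInProgressWorkflowRuns) is
// only consulted when the primary metric (e.g. PercentageRunnersBusy) yields
// zero, which is what allows scaling up again from zero runners.
func desiredReplicas(primary, secondary func() int) int {
	if n := primary(); n > 0 {
		return n
	}
	return secondary()
}
```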
@dennybaa Hey! I have never tested enabling both PercentageRunnersBusy and TotalNumberOfQueuedAndInProgressWorkflowRuns at the same time. How would it work? 🤔 What's your goal? Are you trying to scale to/from zero, and is that why you're saying
Also, could you share the full RunnerDeployment and HorizontalRunnerAutoscaler YAMLs?
Hey @mumoshu,
First of all, very right, I do not expect
Elaborating further: initially it was only TotalNumberOfQueuedAndInProgressWorkflowRuns, but it had the same effect. As it's stated, when there are two rules, the secondary only kicks in if the primary rule PercentageRunnersBusy gives 0, and it indeed looks so. My goal is to scale from 0, and evidently I do not expect to scale up if there are no jobs in progress or in the queue, so the desired state should be 0. Am I missing anything here? Obfuscated details:
And it really looks so through the night, when no CI jobs are scheduled at all; note
@dennybaa Thanks! Would you mind sharing the full manifest, without removing e.g. metadata.name, so that I'm more confident there are no mistakes in your config? Also, it would be a good idea to review your controller logs, as the controller usually prints the desired replicas ARC calculated, and why. At least, I can't figure it out with all the info you've provided so far - all I can say at the moment is "if everything is correctly configured, and assuming your other settings and configurations (including the cloud provider you're using, the kind of your k8s cluster - EKS, AKS, GKE, self-managed, on-prem - and the auth method of either a GitHub App or a token, and so on), it should work".
@mumoshu Sure, edited the above post. Yup, exactly these two:
Thanks. Seems good so far. How about controller logs?
Here's a larger slice:
@dennybaa Thanks! So
Have you ever tried checking the responses from the GitHub API related to it? I guess either of
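One way to cross-check the number is to list workflow runs with status=queued and status=in_progress directly and page through all results. A rough sketch using google/go-github follows; the client version, authentication, and repository names are assumptions for illustration, not the code ARC uses:

```go
package main

import (
	"context"
	"fmt"

	"github.com/google/go-github/v39/github"
)

// countRuns pages through all workflow runs in the given status ("queued" or
// "in_progress") so dangling runs on later pages are not missed.
func countRuns(ctx context.Context, client *github.Client, owner, repo, status string) (int, error) {
	total := 0
	opts := &github.ListWorkflowRunsOptions{
		Status:      status,
		ListOptions: github.ListOptions{PerPage: 100},
	}
	for {
		runs, resp, err := client.Actions.ListRepositoryWorkflowRuns(ctx, owner, repo, opts)
		if err != nil {
			return 0, err
		}
		total += len(runs.WorkflowRuns)
		if resp.NextPage == 0 {
			return total, nil
		}
		opts.Page = resp.NextPage
	}
}

func main() {
	ctx := context.Background()
	client := github.NewClient(nil) // add authentication for real use
	for _, status := range []string{"queued", "in_progress"} {
		n, err := countRuns(ctx, client, "my-org", "my-repo", status)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s: %d\n", status, n)
	}
}
```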
Thank you @mumoshu! I have indeed checked, thoroughly paginating through all of the pages, and there was one dangling workflow; I deleted it and:

Thank you again! 🙏
Describe the bug
We are still battling the 'cancellation' bug in our organization, many times running into pod reconciles at times where it doesn't make sense. Typically the scale-up works (though since 0.20.1 it has been scaling up tremendously slowly, if it even scales up to the proper number of runners needed - more in the details), but scaling down happens prematurely and cancels active workflow jobs. I have been able to test the 0.20.4 version, but cancellations occur even more often with that release, and that includes being capped at 24 runners for some odd reason. I started monitoring the suggested replicas and am seeing a lot of oddities, which I have not yet been able to map to any specific part of the code...
Checks
Example Log from Controller:
Example from CLI written w/ same code to grab workflow runs/jobs:
To Reproduce
Steps to reproduce the behavior:
Expected behavior
A clear and concise description of what you expected to happen.
Screenshots
Environment (please complete the following information):
Additional context
This has been an ongoing issue, but since the upgrade to the 0.20.x branch I believe it's gotten worse. I am not sure if it's the GitHub API or something in the calculations that determine the suggested replicas. Either way, is there anything I am missing here? I am not sure why I can get such different results from my own GitHub API calls on the command line versus the controller autoscaler's suggested replicas.

scaleDownDelaySecondsAfterScaleOut: 7200 was put in to battle some of the unnecessary kills; it isn't working all that well though, because after a scale-up, when the runners stay at scale, it eventually causes runners to get interrupted (by controller kills) in the middle of tests as well.
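For clarity, scaleDownDelaySecondsAfterScaleOut is meant to suppress scale-down for a period after a scale-up; conceptually it boils down to a check like the sketch below, with illustrative names rather than ARC's actual code:

```go
package autoscaler

import "time"

// scaleDownAllowed captures the intent of scaleDownDelaySecondsAfterScaleOut:
// after the most recent scale-up, refuse any scale-down until the configured
// delay has elapsed, so freshly added runners are not torn down mid-job.
func scaleDownAllowed(lastScaleUp time.Time, delay time.Duration, now time.Time) bool {
	return now.Sub(lastScaleUp) >= delay
}
```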