
Add GitHub API cache to avoid rate limit #1127

Merged: 12 commits merged into master on Feb 27, 2022

Conversation

@mumoshu (Collaborator) commented Feb 17, 2022

Enhances ARC (both the controller-manager and the github-webhook-server) to cache GitHub API responses that are requested via HTTP GET and carry an appropriate Cache-Control header.

Ref #920

Cache Implementation

gregjones/httpcache has been chosen as the library to implement this feature, as it is recommended in go-github's documentation:

https://github.com/google/go-github#conditional-requests

gregjones/httpcache supports a number of cache backends like diskcache, s3cache, and so on:

https://github.com/gregjones/httpcache#cache-backends

We stick with the built-in in-memory cache as a starter. This will probably never become an issue as long as the HTTP responses for all the GitHub API calls that ARC makes (list-runners, list-workflow-jobs, list-runner-groups, etc.) don't overflow the in-memory cache.
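As a rough, minimal sketch of how such a cache layer can be wired up (assuming the upstream go-github and gregjones/httpcache packages; the v47 import path, the org/repo names, and the error handling are illustrative and not ARC's actual code):

```
package main

import (
	"context"
	"fmt"
	"net/http"

	"github.com/google/go-github/v47/github"
	"github.com/gregjones/httpcache"
)

func main() {
	// NewMemoryCacheTransport returns an *httpcache.Transport backed by the
	// built-in in-memory cache; conditional requests and Cache-Control
	// max-age handling are done transparently.
	cachingTransport := httpcache.NewMemoryCacheTransport()

	httpClient := &http.Client{Transport: cachingTransport}
	client := github.NewClient(httpClient)

	// Repeated GET calls such as ListRunners within the Cache-Control window
	// are served from the cache instead of consuming rate limit budget.
	runners, _, err := client.Actions.ListRunners(context.Background(), "my-org", "my-repo", nil)
	if err != nil {
		panic(err)
	}
	fmt.Println(runners.TotalCount)
}
```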

httpcache has a known, unfixed issue where it doesn't update the cache on chunked responses, but we assume that the APIs we call don't use chunked responses. See #1503 for more information.

Ephemeral runner pods are no longer recreated

The addition of the cache layer slowed down the scale-down process and introduced a trade-off between making the runner pod termination process fragile to various race conditions (a shorter grace period before runner deletion) and delaying runner pod deletion (a longer grace period). The grace period needs to be longer than 60s (the cache duration of the ListRunners API) so that a runner pod that was just created is not deleted prematurely.

But once I disabled automatic recreation of ephemeral runner pods, this turned out to no longer be an issue when scaling via the workflow_job webhook.

Ephemeral runner resources are still automatically added on demand by RunnerDeployment via RunnerReplicaSet (I've added EffectiveTime fields to our CRDs, but that's an implementation detail so let's omit it). A good side effect of disabling ephemeral runner pod recreation is that ARC will no longer create redundant ephemeral runners when used with the webhook-based autoscaler.

Basically, autoscaling still works as everyone might expect. It's just better than before overall.

github/github.go Outdated
httpClient := &http.Client{Transport: transport}
cached := httpcache.NewTransport(httpcache.NewMemoryCache())
cached.Transport = transport
metricsTransport := metrics.Transport{Transport: cached}
mumoshu (Collaborator Author):

Perhaps this ordering results in the metrics not distinguishing between cached and non-cached requests. But that's another story.
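To illustrate why the ordering matters, here is a hedged sketch; `countingTransport` is a hypothetical stand-in for ARC's `metrics.Transport`, not the real implementation:

```
package main

import (
	"net/http"

	"github.com/gregjones/httpcache"
)

// countingTransport is a hypothetical stand-in for ARC's metrics.Transport:
// it counts every request that passes through it.
type countingTransport struct {
	next  http.RoundTripper
	count int
}

func (t *countingTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	t.count++
	return t.next.RoundTrip(req)
}

func main() {
	base := http.DefaultTransport

	// Ordering in this diff: metrics wrap the cache, so cache hits and real
	// network calls are counted alike and cannot be told apart.
	cached := httpcache.NewTransport(httpcache.NewMemoryCache())
	cached.Transport = base
	_ = &http.Client{Transport: &countingTransport{next: cached}}

	// Alternative ordering: metrics sit between the cache and the network,
	// so only requests that actually reach GitHub are counted.
	inner := &countingTransport{next: base}
	cached2 := httpcache.NewTransport(httpcache.NewMemoryCache())
	cached2.Transport = inner
	_ = &http.Client{Transport: cached2}
}
```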

The log level -3 is the minimum log level supported today, smaller than debug (-1) and -2 (which is used for some HRA-related logs).

This commit adds a logging HTTP transport that logs HTTP requests and responses at that log level.

It implements http.RoundTripper so that it can log each HTTP request with useful metadata like `from_cache` and `ratelimit_remaining`.
The former is set to `true` only when the logged request's response was served from ARC's in-memory cache.
The latter is set to the X-RateLimit-Remaining response header value if and only if the response was served by GitHub, not by ARC's cache.
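A minimal sketch of such a logging round tripper (assuming a logr.Logger and httpcache's exported X-From-Cache marker header; the field names and the V(3) level mapping are illustrative, not ARC's exact code):

```
package logging

import (
	"net/http"

	"github.com/go-logr/logr"
	"github.com/gregjones/httpcache"
)

// Transport logs every outgoing GitHub API request together with whether the
// response came from the in-memory cache and, if not, the remaining rate limit.
type Transport struct {
	Transport http.RoundTripper
	Log       logr.Logger
}

func (t Transport) RoundTrip(req *http.Request) (*http.Response, error) {
	resp, err := t.Transport.RoundTrip(req)
	if err != nil {
		return nil, err
	}

	// httpcache marks responses served from its cache with the X-From-Cache header.
	fromCache := resp.Header.Get(httpcache.XFromCache) != ""

	kvs := []interface{}{"method", req.Method, "url", req.URL.String(), "from_cache", fromCache}
	if !fromCache {
		// Only responses that actually reached GitHub carry a meaningful rate limit header.
		kvs = append(kvs, "ratelimit_remaining", resp.Header.Get("X-RateLimit-Remaining"))
	}

	t.Log.V(3).Info("GitHub API call", kvs...)

	return resp, nil
}
```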
…w test id

So that one does not need to manually recreate ARC pods frequently.
So that you can run `kubectl logs` on controller pods without specifying the container name.

It is especially useful when you want to run kubectl logs on all ARC pods across the controller-manager and the github-webhook-server, like:

```
kubectl -n actions-runner-system logs -l app.kubernetes.io/name=actions-runner-controller
```

That was previously impossible because the selector matches pods from both the controller-manager and the github-webhook-server, and kubectl does not provide a way to specify different container names for the respective pods.
@mumoshu (Collaborator Author) commented Feb 19, 2022

ListRunners API calls are now cached for 60 seconds, which makes our IsRunnerBusy function and the pre-pod-deletion check almost useless, as it can take up to 60 seconds until you can see your runner's busy status.

Therefore, I've added 41e2a70. We no longer check the runner's busy status before deletion. Instead, ARC makes a best-effort call to the RemoveRunner API before deleting the pod. A successful RemoveRunner call guarantees that the targeted runner will not run any more jobs, so even without the ListRunners calls we can still ensure that ARC doesn't remove runners in the middle of running workflow jobs.

It also implements a grace period so that the race issue discussed in #1085 is very unlikely to occur.
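The pre-deletion step could look roughly like the following (a sketch only, assuming go-github and controller-runtime; the function name, the owner/repo placeholders, and the simplified error handling are not ARC's actual code):

```
package controllers

import (
	"context"
	"fmt"

	"github.com/google/go-github/v47/github"
	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteRunnerPod unregisters the runner from GitHub first, and deletes the
// pod only once GitHub has confirmed the removal.
func deleteRunnerPod(ctx context.Context, gh *github.Client, k8s client.Client,
	owner, repo string, runnerID int64, pod *corev1.Pod) error {
	if _, err := gh.Actions.RemoveRunner(ctx, owner, repo, runnerID); err != nil {
		// The runner may still be busy; leave the pod alone so the next
		// reconciliation can retry instead of killing a running job.
		return fmt.Errorf("unregistering runner %d: %w", runnerID, err)
	}
	// GitHub will no longer schedule jobs onto this runner, so the pod is safe to delete.
	return k8s.Delete(ctx, pod)
}
```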

@mumoshu force-pushed the github-api-cache branch 2 times, most recently from 626de35 to 989d449 on February 20, 2022 02:55
…d deletion

Enhances the runner controller and the runner pod controller to have consistent timeouts for runner unregistration and runner pod deletion,
so that we are very unlikely to terminate pods that are running any jobs.
@mumoshu (Collaborator Author) commented Feb 20, 2022

> It also implements a grace period so that the race issue discussed in #1085 is very unlikely to occur.

This either made the runner pod termination process fragile to various race conditions or delayed runner pod deletion, depending on how long the grace period was.

But those issues turned out to disappear for ephemeral runners once we stopped recreating ephemeral runner pods (without recreating the whole Runner resources on k8s). So I'm going ahead that way.

Previously, the desired number of runner pods was maintained by Runner recreating the completed runner pod, which was itself subject to the related race condition.
It's now maintained by RunnerDeployment adding replacement Runner resources for the Runner resources deleted due to ephemeral runner completion.

It works very well so far, and it also looks like a very natural behavior to me 😃

@mumoshu force-pushed the github-api-cache branch 3 times, most recently from f1e4287 to 16791d9 on February 20, 2022 13:40
if [ "${tool}" == "helm" ]; then
set -v

Should this be kept?

mumoshu (Collaborator Author):

Yep! I wanted to log the helm-upgrade command so that I can be sure I have deployed it correctly, and that helps to isolate the cause of an e2e failure.

Comment on lines -50 to -54
logLevelDebug = "debug"
logLevelInfo = "info"
logLevelWarn = "warn"
logLevelError = "error"


nice cleanup :)

mumoshu (Collaborator Author):

Thank you 😄

@@ -82,8 +87,11 @@ func (c *Config) NewClient() (*Client, error) {
transport = tr
}

transport = metrics.Transport{Transport: transport}
httpClient := &http.Client{Transport: transport}
cached := httpcache.NewTransport(httpcache.NewMemoryCache())
Member:

Do we know how many requests will benefit from the cache? e.g. how many requests will end up as 304s?

mumoshu (Collaborator Author) replied on Feb 21, 2022:

@TingluoHuang Not exactly. But I'd say 90% of requests are cached :)

I've added a new log level, -3, to see whether each request to the GitHub API is cached or not. To enable log level -3 you would write values.yaml like...

```
logLevel: "-3"
githubWebhookServer:
  logLevel: "-3"
```

Tail the logs from ARC controllers...

```
kubectl -n actions-runner-system logs -l app.kubernetes.io/name=actions-runner-controller -f | tee arc.log
```

And extract logged HTTP requests:

```
tail -f -n +1 arc.log | jq -rRc 'fromjson? | select(.from_cache != null) | "\(.ratelimit_remaining) \(.method) \(.url)"'
```

You'll see:

```
null GET https://api.github.com/orgs/$ORG/$REPO/actions/runners?per_page=100
null GET https://api.github.com/orgs/$ORG/$REPO/actions/runners?per_page=100
null GET https://api.github.com/orgs/$ORG/$REPO/actions/runners?per_page=100
3548 DELETE https://api.github.com/orgs/$ORG/$REPO/actions/runners/591
null GET https://api.github.com/orgs/$ORG/$REPO/actions/runners?per_page=100
null GET https://api.github.com/orgs/$ORG/$REPO/actions/runners?per_page=100
null GET https://api.github.com/orgs/$ORG/$REPO/actions/runners?per_page=100
null GET https://api.github.com/orgs/$ORG/$REPO/actions/runners?per_page=100
null GET https://api.github.com/orgs/$ORG/$REPO/actions/runners?per_page=100
3546 DELETE https://api.github.com/orgs/$ORG/$REPO/actions/runners/595
null GET https://api.github.com/orgs/$ORG/$REPO/actions/runners?per_page=100
null GET https://api.github.com/orgs/$ORG/$REPO/actions/runners?per_page=100
null GET https://api.github.com/orgs/$ORG/$REPO/actions/runners?per_page=100
3544 DELETE https://api.github.com/orgs/$ORG/$REPO/actions/runners/589
```

The first column is the x-ratelimit-remaining header value returned by the GitHub API, which is logged only when the response was NOT served from cache.

You can see that the list-repository-runners API calls are always cached (the Cache-Control header for that response says max-age=60, so httpcache caches it for 60 seconds; all the runners for a repository share the same cache and an actual call happens only once per 60s, hence the first column is always null), and that ARC only occasionally calls the Delete Runner API.

Member:

🆒
Will 90% still hold when a large customer uses ARC to scale hundreds of ephemeral runners, since the response from the ListRunners endpoint should change every time on a busy repo/org/enterprise?

mumoshu (Collaborator Author) replied on Feb 22, 2022:

I guess it would scale well, as this results in only one ListRunners call per 60s per RunnerDeployment.
So if a company has 1 enterprise RunnerDeployment, 2 organizational RunnerDeployments, and 3 repository RunnerDeployments, that might result in 6 calls per 60s.

So the best part of this change is that ListRunners API calls will no longer grow proportionally to the number of runners. Instead, they are now proportional to the number of RunnerDeployments, which can be 10x or 100x smaller in large deployments.

Note that you see a change in runner status (offline vs online, busy or not, runner existence, etc.) only after a ListRunners API call that follows cache invalidation, i.e. it can take up to 60s or so until you see changes in runner statuses.

Contributor:

FWIW I think this is acceptable. The default cache control for GitHub is 60s as you said, and it sounds acceptable to have a bit of leeway of one minute to detect an idle runner. The duration parameter protects the runners from being decommissioned too quickly, assuming it's greater than the cache duration, so I can't see any downside in this approach AFAICT.

mumoshu (Collaborator Author) replied on Feb 24, 2022:

> The duration parameter protects the runners from being decommissioned too quickly, assuming it's greater than the cache duration, so I can't see any downside in this approach AFAICT.

Definitely! You should have an HRA.spec.scaleUpTriggers[].duration that is long enough to cover your longest-running workflow jobs anyway. That's crucial so that ARC does not scale down the added runner too early in case it failed to receive the workflow_job completion webhook event. So 60s of cache won't have any real downside.

mumoshu (Collaborator Author):

@tumoa and I have been discussing a potential enhancement to how the scale trigger duration works at:

#911 (comment)

cached.Transport = transport
loggingTransport := logging.Transport{Transport: cached, Log: c.Log}
metricsTransport := metrics.Transport{Transport: loggingTransport}
httpClient := &http.Client{Transport: metricsTransport}
Contributor:

@mumoshu Am I assuming right that we're reading the TTL for each query and honouring the cache-control header?

@sledigabel (Contributor) commented:

@mumoshu awesome work, this is really exciting 😎

@mumoshu (Collaborator Author) commented Feb 27, 2022

I'll shortly submit a pull request to add similar reliability enhancements to RunnerSet as well. Stay tuned!

@mumoshu merged commit 686d40c into master on Feb 27, 2022
@mumoshu deleted the github-api-cache branch on February 27, 2022 23:37
mumoshu added a commit that referenced this pull request Mar 6, 2022
…er-and-runnerset

Refactor Runner and RunnerSet so that they use the same library code that powers RunnerSet.

RunnerSet is StatefulSet-based and Runner is Pod-based, so it had been hard to unify the implementations even though they look very similar in many aspects.

This change finally resolves that, by first introducing a library that implements the generic logic used to reconcile RunnerSet, and then adding an adapter that lets the generic logic manage runner pods via Runner instead of via StatefulSet.

Follow-up to #1127, #1167, and #1178
mumoshu added a commit that referenced this pull request Mar 8, 2022
We migrated to the transport-level cache introduced in #1127, so not only is this useless, it also makes it harder to deduce which cache produced the desired replicas number calculated by HRA.
Just remove the legacy cache to keep things simple and easy to understand.
mumoshu added a commit that referenced this pull request Mar 8, 2022
* Remove legacy GitHub API cache of HRA.Status.CachedEntries

We migrated to the transport-level cache introduced in #1127, so not only is this useless, it also makes it harder to deduce which cache produced the desired replicas number calculated by HRA.
Just remove the legacy cache to keep things simple and easy to understand.

* Deprecate githubAPICacheDuration helm chart value and the --github-api-cache-duration as well

* Fix integration test
mumoshu added a commit that referenced this pull request Mar 11, 2022
Since #1127 and #1167, we had been retrying the `RemoveRunner` API call on each graceful runner stop attempt while the runner was still busy.
There was no reliable way to throttle the retry attempts. The combination of these resulted in ARC spamming RemoveRunner calls (one call per reconciliation loop, and the loop runs quite often due to how the controller works) whenever the call failed once because the runner was in the middle of running a workflow job.

This fixes that by adding a few short-circuit conditions that work for ephemeral runners. An ephemeral runner can unregister itself on completion, so in most cases ARC can just wait for the runner to stop if it's already running a job. As a RemoveRunner response of status 422 implies that the runner is running a job, we can use that as a trigger to start the runner-stop waiter.

The end result is that 422 errors are observed at most once per graceful termination of an ephemeral runner pod. RemoveRunner API calls are never retried for ephemeral runners. ARC consumes less GitHub API rate limit budget, and the logs are much cleaner than before.

Ref #1167 (comment)
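The 422 short-circuit described above could look roughly like this under go-github (the function name and signature are illustrative, not ARC's actual code; the 422-means-busy behavior is as stated in the commit message):

```
package controllers

import (
	"context"
	"errors"
	"net/http"

	"github.com/google/go-github/v47/github"
)

// unregisterEphemeralRunner returns wait=true when the runner is still busy:
// instead of retrying RemoveRunner on every reconciliation, ARC can simply
// wait for the ephemeral runner to finish its job and unregister itself.
func unregisterEphemeralRunner(ctx context.Context, gh *github.Client, owner, repo string, runnerID int64) (wait bool, err error) {
	_, err = gh.Actions.RemoveRunner(ctx, owner, repo, runnerID)
	if err == nil {
		return false, nil // unregistered; the pod can be deleted right away
	}

	var ghErr *github.ErrorResponse
	if errors.As(err, &ghErr) && ghErr.Response != nil && ghErr.Response.StatusCode == http.StatusUnprocessableEntity {
		// 422: the runner is running a job; do not retry RemoveRunner.
		return true, nil
	}
	return false, err
}
```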
mumoshu added a commit that referenced this pull request Mar 12, 2022
The version of `bradleyfalzon/ghinstallation` that is used to enable GitHub App authentication turned out to add an extra header value, `application/vnd.github.machine-man-preview+json`, to every HTTP request. That revealed an edge case in our HTTP cache layer, `gregjones/httpcache`, that caused it not to serve responses from the cache when it should.

There were two problems. One was that it did not support multi-valued headers and only looked at the first value of each header; the other was that it did not support http.RoundTripper implementations that modify HTTP request headers within their RoundTrip call.

I fixed it in my fork of httpcache, which is hosted at https://github.com/actions-runner-controller/httpcache.

The relevant commits are:

- actions-runner-controller/httpcache@70d975e
- actions-runner-controller/httpcache@197a8a3

This can be considered a follow-up to #1127, which turned out to have enabled the cache only when ARC uses a PAT for authentication.
Since this fix, the cache is also enabled when ARC authenticates as a GitHub App.
mumoshu added a commit that referenced this pull request Jul 12, 2022
This removes the flag and code for the legacy GitHub API cache. We have already migrated to fully use the new HTTP-cache-based API cache functionality, which was added via #1127 and has been available since ARC 0.22.0. Since then, the legacy one has been a no-op, so removing it is safe.

Ref #1412
@mumoshu mentioned this pull request Sep 22, 2022