
Add GitHub API cache to avoid rate limit #1127

Merged: 12 commits merged into master on Feb 27, 2022

Conversation

@mumoshu (Collaborator) commented Feb 17, 2022

Enhances ARC (both the controller-manager and the github-webhook-server) to cache GitHub API responses that are requested via HTTP GET and carry an appropriate Cache-Control header.

Ref #920

Cache Implementation

gregjones/httpcache has been chosen as the library to implement this feature, as it is recommended in go-github's documentation:

https://github.com/google/go-github#conditional-requests

gregjones/httpcache supports a number of cache backends like diskcache, s3cache, and so on:

https://github.com/gregjones/httpcache#cache-backends

We stick with the built-in in-memory cache as a starter. This will probably never become an issue as long as the HTTP responses for all the GitHub API calls that ARC makes (list-runners, list-workflow-jobs, list-runner-groups, etc.) don't overflow the in-memory cache.
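As a rough, minimal sketch of how such a cache layer can be wired up (assuming the upstream go-github and gregjones/httpcache packages; the v47 import path, the org/repo names, and the error handling are illustrative and not ARC's actual code):

```
package main

import (
	"context"
	"fmt"
	"net/http"

	"github.com/google/go-github/v47/github"
	"github.com/gregjones/httpcache"
)

func main() {
	// NewMemoryCacheTransport returns an *httpcache.Transport backed by the
	// built-in in-memory cache; conditional requests and Cache-Control
	// max-age handling are done transparently.
	cachingTransport := httpcache.NewMemoryCacheTransport()

	httpClient := &http.Client{Transport: cachingTransport}
	client := github.NewClient(httpClient)

	// Repeated GET calls such as ListRunners within the Cache-Control window
	// are served from the cache instead of consuming rate limit budget.
	runners, _, err := client.Actions.ListRunners(context.Background(), "my-org", "my-repo", nil)
	if err != nil {
		panic(err)
	}
	fmt.Println(runners.TotalCount)
}
```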

httpcache has a known, unfixed issue where it doesn't update the cache on chunked responses, but we assume that the APIs we call don't use chunked responses. See #1503 for more information.

Ephemeral runner pods are no longer recreated

The addition of the cache layer slowed down the scale-down process and introduced a trade-off between making the runner pod termination process fragile to various race conditions (a shorter grace period before runner deletion) and delaying runner pod deletion (a longer grace period). The grace period needs to be longer than 60s (the cache duration of the ListRunners API) so that a runner pod that was just created is not deleted prematurely.

But once I disabled automatic recreation of ephemeral runner pods, this turned out to no longer be an issue when scaling via the workflow_job webhook.

Ephemeral runner resources are still automatically added on demand by RunnerDeployment via RunnerReplicaSet (I've added EffectiveTime fields to our CRDs, but that's an implementation detail so let's omit it). A good side effect of disabling ephemeral runner pod recreation is that ARC will no longer create redundant ephemeral runners when used with the webhook-based autoscaler.

Basically, autoscaling still works as everyone might expect. It's just better than before overall.

github/github.go Outdated
httpClient := &http.Client{Transport: transport}
cached := httpcache.NewTransport(httpcache.NewMemoryCache())
cached.Transport = transport
metricsTransport := metrics.Transport{Transport: cached}
mumoshu (Collaborator Author):

Perhaps this ordering results in the metrics not distinguishing between cached and non-cached requests. But that's another story.
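To illustrate why the ordering matters, here is a hedged sketch; `countingTransport` is a hypothetical stand-in for ARC's `metrics.Transport`, not the real implementation:

```
package main

import (
	"net/http"

	"github.com/gregjones/httpcache"
)

// countingTransport is a hypothetical stand-in for ARC's metrics.Transport:
// it counts every request that passes through it.
type countingTransport struct {
	next  http.RoundTripper
	count int
}

func (t *countingTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	t.count++
	return t.next.RoundTrip(req)
}

func main() {
	base := http.DefaultTransport

	// Ordering in this diff: metrics wrap the cache, so cache hits and real
	// network calls are counted alike and cannot be told apart.
	cached := httpcache.NewTransport(httpcache.NewMemoryCache())
	cached.Transport = base
	_ = &http.Client{Transport: &countingTransport{next: cached}}

	// Alternative ordering: metrics sit between the cache and the network,
	// so only requests that actually reach GitHub are counted.
	inner := &countingTransport{next: base}
	cached2 := httpcache.NewTransport(httpcache.NewMemoryCache())
	cached2.Transport = inner
	_ = &http.Client{Transport: cached2}
}
```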

The log level -3 is the minimum log level supported today, smaller than debug (-1) and -2 (which is used for some HRA-related logs).

This commit adds a logging HTTP transport that logs HTTP requests and responses at that log level.

It implements http.RoundTripper so that it can log each HTTP request with useful metadata like `from_cache` and `ratelimit_remaining`.
The former is set to `true` only when the logged request's response was served from ARC's in-memory cache.
The latter is set to the X-RateLimit-Remaining response header value if and only if the response was served by GitHub, not by ARC's cache.
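A minimal sketch of such a logging round tripper (assuming a logr.Logger and httpcache's exported X-From-Cache marker header; the field names and the V(3) level mapping are illustrative, not ARC's exact code):

```
package logging

import (
	"net/http"

	"github.com/go-logr/logr"
	"github.com/gregjones/httpcache"
)

// Transport logs every outgoing GitHub API request together with whether the
// response came from the in-memory cache and, if not, the remaining rate limit.
type Transport struct {
	Transport http.RoundTripper
	Log       logr.Logger
}

func (t Transport) RoundTrip(req *http.Request) (*http.Response, error) {
	resp, err := t.Transport.RoundTrip(req)
	if err != nil {
		return nil, err
	}

	// httpcache marks responses served from its cache with the X-From-Cache header.
	fromCache := resp.Header.Get(httpcache.XFromCache) != ""

	kvs := []interface{}{"method", req.Method, "url", req.URL.String(), "from_cache", fromCache}
	if !fromCache {
		// Only responses that actually reached GitHub carry a meaningful rate limit header.
		kvs = append(kvs, "ratelimit_remaining", resp.Header.Get("X-RateLimit-Remaining"))
	}

	t.Log.V(3).Info("GitHub API call", kvs...)

	return resp, nil
}
```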
…w test id

So that one does not need to manually recreate ARC pods frequently.
So that you can run `kubectl logs` on controller pods without specifying the container name.

It is especially useful when you want to run kubectl logs on all ARC pods across the controller-manager and the github-webhook-server, like:

```
kubectl -n actions-runner-system logs -l app.kubernetes.io/name=actions-runner-controller
```

That was previously impossible because the selector matches pods from both the controller-manager and the github-webhook-server, and kubectl does not provide a way to specify different container names for the respective pods.
@mumoshu (Collaborator Author) commented Feb 19, 2022

ListRunners API calls are now cached for 60 seconds, which makes our IsRunnerBusy function and the pre-pod-deletion check almost useless, as it can take up to 60 seconds until you can see your runner's busy status.

Therefore, I've added 41e2a70. We no longer check the runner's busy status before deletion. Instead, ARC makes a best-effort call to the RemoveRunner API before deleting the pod. A successful RemoveRunner call guarantees that the targeted runner will not run any more jobs, so even without the ListRunners calls we can still ensure that ARC doesn't remove runners in the middle of running workflow jobs.

It also implements a grace period so that the race issue discussed in #1085 is very unlikely to occur.
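The pre-deletion step could look roughly like the following (a sketch only, assuming go-github and controller-runtime; the function name, the owner/repo placeholders, and the simplified error handling are not ARC's actual code):

```
package controllers

import (
	"context"
	"fmt"

	"github.com/google/go-github/v47/github"
	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteRunnerPod unregisters the runner from GitHub first, and deletes the
// pod only once GitHub has confirmed the removal.
func deleteRunnerPod(ctx context.Context, gh *github.Client, k8s client.Client,
	owner, repo string, runnerID int64, pod *corev1.Pod) error {
	if _, err := gh.Actions.RemoveRunner(ctx, owner, repo, runnerID); err != nil {
		// The runner may still be busy; leave the pod alone so the next
		// reconciliation can retry instead of killing a running job.
		return fmt.Errorf("unregistering runner %d: %w", runnerID, err)
	}
	// GitHub will no longer schedule jobs onto this runner, so the pod is safe to delete.
	return k8s.Delete(ctx, pod)
}
```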

@mumoshu force-pushed the github-api-cache branch 2 times, most recently from 626de35 to 989d449 on February 20, 2022 02:55
…d deletion

Enhances the runner controller and the runner pod controller to have consistent timeouts for runner unregistration and runner pod deletion,
so that we are very unlikely to terminate pods that are running any jobs.
@mumoshu (Collaborator Author) commented Feb 20, 2022

> It also implements a grace period so that the race issue discussed in #1085 is very unlikely to occur.

This either made the runner pod termination process fragile to various race conditions or delayed runner pod deletion, depending on how long the grace period was.

But those issues turned out to disappear for ephemeral runners once we stopped recreating ephemeral runner pods (without recreating the whole Runner resources on k8s). So I'm going ahead that way.

Previously, the desired number of runner pods was maintained by Runner recreating the completed runner pod, which was itself subject to the related race condition.
It's now maintained by RunnerDeployment adding replacement Runner resources for the Runner resources deleted due to ephemeral runner completion.

It works very well so far, and it also looks like a very natural behavior to me 😃

@mumoshu force-pushed the github-api-cache branch 3 times, most recently from f1e4287 to 16791d9 on February 20, 2022 13:40
if [ "${tool}" == "helm" ]; then
set -v

Should this be kept?

mumoshu (Collaborator Author):

Yep! I wanted to log the helm-upgrade command so that I can be sure I have deployed it correctly, and that helps to isolate the cause of an e2e failure.

Comment on lines -50 to -54
logLevelDebug = "debug"
logLevelInfo = "info"
logLevelWarn = "warn"
logLevelError = "error"


nice cleanup :)

mumoshu (Collaborator Author):

Thank you 😄

@@ -82,8 +87,11 @@ func (c *Config) NewClient() (*Client, error) {
transport = tr
}

transport = metrics.Transport{Transport: transport}
httpClient := &http.Client{Transport: transport}
cached := httpcache.NewTransport(httpcache.NewMemoryCache())
Member:

Do we know how many requests will benefit from the cache? e.g. how many requests will end up as 304s?

mumoshu (Collaborator Author) replied on Feb 21, 2022:

@TingluoHuang Not exactly. But I'd say 90% of requests are cached :)

I've added a new log level, -3, to see whether each request to the GitHub API is cached or not. To enable log level -3 you would write values.yaml like...

```
logLevel: "-3"
githubWebhookServer:
  logLevel: "-3"
```

Tail the logs from ARC controllers...

```
kubectl -n actions-runner-system logs -l app.kubernetes.io/name=actions-runner-controller -f | tee arc.log
```

And extract logged HTTP requests:

```
tail -f -n +1 arc.log | jq -rRc 'fromjson? | select(.from_cache != null) | "\(.ratelimit_remaining) \(.method) \(.url)"'
```

You'll see:

```
null GET https://api.github.com/orgs/$ORG/$REPO/actions/runners?per_page=100
null GET https://api.github.com/orgs/$ORG/$REPO/actions/runners?per_page=100
null GET https://api.github.com/orgs/$ORG/$REPO/actions/runners?per_page=100
3548 DELETE https://api.github.com/orgs/$ORG/$REPO/actions/runners/591
null GET https://api.github.com/orgs/$ORG/$REPO/actions/runners?per_page=100
null GET https://api.github.com/orgs/$ORG/$REPO/actions/runners?per_page=100
null GET https://api.github.com/orgs/$ORG/$REPO/actions/runners?per_page=100
null GET https://api.github.com/orgs/$ORG/$REPO/actions/runners?per_page=100
null GET https://api.github.com/orgs/$ORG/$REPO/actions/runners?per_page=100
3546 DELETE https://api.github.com/orgs/$ORG/$REPO/actions/runners/595
null GET https://api.github.com/orgs/$ORG/$REPO/actions/runners?per_page=100
null GET https://api.github.com/orgs/$ORG/$REPO/actions/runners?per_page=100
null GET https://api.github.com/orgs/$ORG/$REPO/actions/runners?per_page=100
3544 DELETE https://api.github.com/orgs/$ORG/$REPO/actions/runners/589
```

The first column is the x-ratelimit-remaining header value returned by the GitHub API, which is logged only when the response was NOT served from cache.

You can see that the list-repository-runners API calls are always cached (the Cache-Control header for that response says max-age=60, so httpcache caches it for 60 seconds; all the runners for a repository share the same cache and an actual call happens only once per 60s, hence the first column is always null), and that ARC only occasionally calls the Delete Runner API.

Member:

🆒
Will 90% still hold when a large customer uses ARC to scale hundreds of ephemeral runners, since the response from the ListRunners endpoint should change every time on a busy repo/org/enterprise?

mumoshu (Collaborator Author) replied on Feb 22, 2022:

I guess it would scale well, as this results in only one ListRunners call per 60s per RunnerDeployment.
So if a company has 1 enterprise RunnerDeployment, 2 organizational RunnerDeployments, and 3 repository RunnerDeployments, that might result in 6 calls per 60s.

So the best part of this change is that ListRunners API calls will no longer grow proportionally to the number of runners. Instead, they are now proportional to the number of RunnerDeployments, which can be 10x or 100x smaller in large deployments.

Note that you see a change in runner status (offline vs online, busy or not, runner existence, etc.) only after a ListRunners API call that follows cache invalidation, i.e. it can take up to 60s or so until you see changes in runner statuses.

Contributor:

FWIW I think this is acceptable. The default cache control for GitHub is 60s as you said, and it sounds acceptable to have a bit of leeway of one minute to detect an idle runner. The duration parameter protects the runners from being decommissioned too quickly, assuming it's greater than the cache duration, so I can't see any downside in this approach AFAICT.

mumoshu (Collaborator Author) replied on Feb 24, 2022:

> The duration parameter protects the runners from being decommissioned too quickly, assuming it's greater than the cache duration, so I can't see any downside in this approach AFAICT.

Definitely! You should have an HRA.spec.scaleUpTriggers[].duration that is long enough to cover your longest-running workflow jobs anyway. That's crucial so that ARC does not scale down the added runner too early in case it failed to receive the workflow_job completion webhook event. So 60s of cache won't have any real downside.

mumoshu (Collaborator Author):

@tumoa and I have been discussing a potential enhancement to how the scale trigger duration works at:

#911 (comment)

cached.Transport = transport
loggingTransport := logging.Transport{Transport: cached, Log: c.Log}
metricsTransport := metrics.Transport{Transport: loggingTransport}
httpClient := &http.Client{Transport: metricsTransport}
Contributor:

@mumoshu Am I assuming right that we're reading the TTL for each query and honouring the cache-control header?

@sledigabel (Contributor) commented:

@mumoshu awesome work, this is really exciting 😎

@mumoshu (Collaborator Author) commented Feb 27, 2022

I'll shortly submit a pull request to add similar reliability enhancements to RunnerSet as well. Stay tuned!

@mumoshu merged commit 686d40c into master on Feb 27, 2022
@mumoshu deleted the github-api-cache branch on February 27, 2022 23:37
mumoshu added a commit that referenced this pull request Mar 6, 2022
…er-and-runnerset

Refactor Runner and RunnerSet so that they use the same library code that powers RunnerSet.

RunnerSet is StatefulSet-based and Runner is Pod-based, so it had been hard to unify the implementations even though they look very similar in many aspects.

This change finally resolves that, by first introducing a library that implements the generic logic used to reconcile RunnerSet, and then adding an adapter that lets the generic logic manage runner pods via Runner instead of via StatefulSet.

Follow-up to #1127, #1167, and #1178
mumoshu added a commit that referenced this pull request Mar 8, 2022
We migrated to the transport-level cache introduced in #1127, so not only is this useless, it also makes it harder to deduce which cache produced the desired replicas number calculated by HRA.
Just remove the legacy cache to keep things simple and easy to understand.
mumoshu added a commit that referenced this pull request Mar 8, 2022
* Remove legacy GitHub API cache of HRA.Status.CachedEntries

We migrated to the transport-level cache introduced in #1127, so not only is this useless, it also makes it harder to deduce which cache produced the desired replicas number calculated by HRA.
Just remove the legacy cache to keep things simple and easy to understand.

* Deprecate githubAPICacheDuration helm chart value and the --github-api-cache-duration as well

* Fix integration test
mumoshu added a commit that referenced this pull request Mar 11, 2022
Since #1127 and #1167, we had been retrying the `RemoveRunner` API call on each graceful runner stop attempt while the runner was still busy.
There was no reliable way to throttle the retry attempts. The combination of these resulted in ARC spamming RemoveRunner calls (one call per reconciliation loop, and the loop runs quite often due to how the controller works) whenever the call failed once because the runner was in the middle of running a workflow job.

This fixes that by adding a few short-circuit conditions that work for ephemeral runners. An ephemeral runner can unregister itself on completion, so in most cases ARC can just wait for the runner to stop if it's already running a job. As a RemoveRunner response of status 422 implies that the runner is running a job, we can use that as a trigger to start the runner-stop waiter.

The end result is that 422 errors are observed at most once per graceful termination of an ephemeral runner pod. RemoveRunner API calls are never retried for ephemeral runners. ARC consumes less GitHub API rate limit budget, and the logs are much cleaner than before.

Ref #1167 (comment)
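The 422 short-circuit described above could look roughly like this under go-github (the function name and signature are illustrative, not ARC's actual code; the 422-means-busy behavior is as stated in the commit message):

```
package controllers

import (
	"context"
	"errors"
	"net/http"

	"github.com/google/go-github/v47/github"
)

// unregisterEphemeralRunner returns wait=true when the runner is still busy:
// instead of retrying RemoveRunner on every reconciliation, ARC can simply
// wait for the ephemeral runner to finish its job and unregister itself.
func unregisterEphemeralRunner(ctx context.Context, gh *github.Client, owner, repo string, runnerID int64) (wait bool, err error) {
	_, err = gh.Actions.RemoveRunner(ctx, owner, repo, runnerID)
	if err == nil {
		return false, nil // unregistered; the pod can be deleted right away
	}

	var ghErr *github.ErrorResponse
	if errors.As(err, &ghErr) && ghErr.Response != nil && ghErr.Response.StatusCode == http.StatusUnprocessableEntity {
		// 422: the runner is running a job; do not retry RemoveRunner.
		return true, nil
	}
	return false, err
}
```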
mumoshu added a commit that referenced this pull request Mar 12, 2022
The version of `bradleyfalzon/ghinstallation` that is used to enable GitHub App authentication turned out to add an extra header value, `application/vnd.github.machine-man-preview+json`, to every HTTP request. That revealed an edge case in our HTTP cache layer, `gregjones/httpcache`, that caused it not to serve responses from the cache when it should.

There were two problems. One was that it did not support multi-valued headers and only looked at the first value of each header; the other was that it did not support http.RoundTripper implementations that modify HTTP request headers within their RoundTrip call.

I fixed it in my fork of httpcache, which is hosted at https://github.com/actions-runner-controller/httpcache.

The relevant commits are:

- actions-runner-controller/httpcache@70d975e
- actions-runner-controller/httpcache@197a8a3

This can be considered a follow-up to #1127, which turned out to have enabled the cache only when ARC uses a PAT for authentication.
Since this fix, the cache is also enabled when ARC authenticates as a GitHub App.
mumoshu added a commit that referenced this pull request Jul 12, 2022
This removes the flag and code for the legacy GitHub API cache. We have already migrated to fully use the new HTTP-cache-based API cache functionality, which was added via #1127 and has been available since ARC 0.22.0. Since then, the legacy one has been a no-op, so removing it is safe.

Ref #1412
@mumoshu mentioned this pull request Sep 22, 2022