Lots of discovery cache throttling with many CRDs #2084

Closed
jonnylangefeld opened this issue Dec 7, 2022 · 6 comments

Comments

@jonnylangefeld

jonnylangefeld commented Dec 7, 2022

When using controller-runtime on a cluster with many CRDs installed, I see many client-side throttling log lines such as these:

I1207 20:56:06.276269       1 request.go:601] Waited for 1.192751471s due to client-side throttling, not priority and fairness, request: GET:https://172.16.192.1:443/apis/memcache.cnrm.cloud.google.com/v1beta1?timeout=32s
I1207 20:56:16.276933       1 request.go:601] Waited for 11.193341929s due to client-side throttling, not priority and fairness, request: GET:https://172.16.192.1:443/apis/essentialcontacts.gcp.upbound.io/v1beta1?timeout=32s
I1207 20:56:26.476575       1 request.go:601] Waited for 21.392897152s due to client-side throttling, not priority and fairness, request: GET:https://172.16.192.1:443/apis/servicedirectory.cnrm.cloud.google.com/v1beta1?timeout=32s
I1207 20:56:36.476594       1 request.go:601] Waited for 31.392687318s due to client-side throttling, not priority and fairness, request: GET:https://172.16.192.1:443/apis/iam.aws.crossplane.io/v1beta1?timeout=32s
I1207 20:56:46.604938       1 request.go:601] Waited for 4.592238125s due to client-side throttling, not priority and fairness, request: GET:https://172.16.192.1:443/apis/iam.aws.crossplane.io/v1beta1?timeout=32s
I1207 20:56:56.605021       1 request.go:601] Waited for 14.592156935s due to client-side throttling, not priority and fairness, request: GET:https://172.16.192.1:443/apis/microsoft.compute.azure.com/v1alpha1api20200930storage?timeout=32s
I1207 20:57:06.805227       1 request.go:601] Waited for 24.79219262s due to client-side throttling, not priority and fairness, request: GET:https://172.16.192.1:443/apis/servicenetworking.cnrm.cloud.google.com/v1beta1?timeout=32s
I1207 20:57:16.805454       1 request.go:601] Waited for 34.791425163s due to client-side throttling, not priority and fairness, request: GET:https://172.16.192.1:443/apis/sourcerepo.gcp.upbound.io/v1beta1?timeout=32s
I1207 20:57:26.827615       1 request.go:601] Waited for 9.194122277s due to client-side throttling, not priority and fairness, request: GET:https://172.16.192.1:443/apis/helm.crossplane.io/v1alpha1?timeout=32s
I1207 20:57:37.026258       1 request.go:601] Waited for 19.392665018s due to client-side throttling, not priority and fairness, request: GET:https://172.16.192.1:443/apis/bigquery.cnrm.cloud.google.com/v1beta1?timeout=32s
I1207 20:57:47.026976       1 request.go:601] Waited for 29.393315758s due to client-side throttling, not priority and fairness, request: GET:https://172.16.192.1:443/apis/storage.cnrm.cloud.google.com/v1beta1?timeout=32s

This has been a larger issue in the Kubernetes community, one that also affected kubectl and therefore even more end users. It's arguably more annoying to get slowed down on every kubectl command than to have a slowed-down controller running in the background. In kubectl it has been fixed via .WithDiscoveryBurst(300).WithDiscoveryQPS(50.0) (source) and by extending the TTL of the discovery cache.

However, controller-runtime uses the same config options for regular API calls and the discovery client. We should implement the same fix as in kubectl to allow for higher rate limits during discovery to speed up controller reconciliation.
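
In the meantime, a minimal sketch of an interim workaround (not the proposed built-in fix): since controller-runtime hands the same rest.Config to both the regular clients and the discovery client, raising the shared client-side limits to the values kubectl now uses also speeds up the discovery burst at startup.

```go
package main

import ctrl "sigs.k8s.io/controller-runtime"

func main() {
	// Raise the shared client-side rate limits; the same config is used
	// for discovery, so the startup discovery walk is throttled less.
	cfg := ctrl.GetConfigOrDie()
	cfg.QPS = 50.0  // client-go default is 5
	cfg.Burst = 300 // client-go default is 10

	mgr, err := ctrl.NewManager(cfg, ctrl.Options{})
	if err != nil {
		panic(err)
	}
	_ = mgr // set up controllers and call mgr.Start(...) as usual
}
```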

cc @negz

@lobziik
Member

lobziik commented Dec 9, 2022

> When using controller-runtime on a cluster with many CRDs installed, I see many client-side throttling log lines such as these

We recently ran into similar (though not identical) issues, because discovery is quite bursty in terms of network requests.

I wonder what people think about revisiting the discovery process and making it lazy? Or, alternatively, providing some knobs to configure which groups are included in the initial discovery and ignoring groups that are not needed.

> We should implement the same fix as in kubectl to allow for higher rate limits during discovery to speed up controller reconciliation.

I understand this as a proposal to increase the number of requests per second by default for the discovery procedure, am I right?

@alvaroaleman
Member

> We recently ran into similar (though not identical) issues, because discovery is quite bursty in terms of network requests.

And that is an issue? So you are saying the original ask here (a higher rate limit for discovery) would be a problem for you?

> I wonder what people think about revisiting the discovery process and making it lazy?

There is an open issue for this, #1603, but someone has to actually do the work, and that hasn't happened yet. Since we get the group versions from the scheme, I think it should be possible (see the rough sketch below), but maybe I am missing something and it is not actually possible.
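
To illustrate the idea, a hypothetical helper (not existing controller-runtime API): the scheme already lists every GVK a controller uses, so discovery could in principle be limited to just those API groups.

```go
package main

import "k8s.io/apimachinery/pkg/runtime"

// groupsInScheme collects the API groups registered in the scheme, which is
// the set a lazy or filtered discovery would actually need to resolve.
func groupsInScheme(s *runtime.Scheme) map[string]struct{} {
	groups := map[string]struct{}{}
	for gvk := range s.AllKnownTypes() {
		groups[gvk.Group] = struct{}{}
	}
	return groups
}
```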

IIRC there was also some work going on upstream to provide a single discovery document rather than the current hierarchy, which would also alleviate this issue.

@alvaroaleman
Member

alvaroaleman commented Dec 10, 2022

This is the upstream SIG API Machinery KEP I was referring to: https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/3352-aggregated-discovery
It actually looks like this made it into 1.26 (but requires a 1.26 kube-apiserver to work): kubernetes/kubernetes#113599. It requires us to make changes to use the new GroupsAndMaybeResources and fall back if we don't get the resources.
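
A rough sketch of what that fallback could look like, using only client-go calls that exist today (the helper name is illustrative, not controller-runtime code): prefer the aggregated endpoint when the discovery client supports it and the server returns resources, otherwise fall back to the per-group walk.

```go
package main

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/rest"
)

// discoverResources is an illustrative helper: with client-go >= 0.26 and a
// 1.26+ kube-apiserver, a single aggregated request returns groups and
// resources together; otherwise it falls back to the legacy per-group walk.
func discoverResources(cfg *rest.Config) ([]*metav1.APIResourceList, error) {
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		return nil, err
	}
	var di discovery.DiscoveryInterface = dc
	if ad, ok := di.(discovery.AggregatedDiscoveryInterface); ok {
		_, resources, _, err := ad.GroupsAndMaybeResources()
		if err == nil && resources != nil {
			out := make([]*metav1.APIResourceList, 0, len(resources))
			for _, list := range resources {
				out = append(out, list)
			}
			return out, nil
		}
	}
	// Legacy path: one request per group/version, which is what triggers
	// the client-side throttling on CRD-heavy clusters.
	_, lists, err := di.ServerGroupsAndResources()
	return lists, err
}
```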

@lobziik
Member

lobziik commented Dec 15, 2022

> And that is an issue? So you are saying the original ask here (a higher rate limit for discovery) would be a problem for you?

We saw controller crashes during discovery because of network issues. Presumably, in our case, this was caused by the load balancer configuration in front of the API server.
A higher rate limit for discovery would produce even more requests per second and might cause problems for setups with less reliable networking.

I'm thinking of clusters on top of on-prem-ish platforms such as vSphere, and use cases that involve plenty of controller restarts (OpenShift upgrades :P).

Anyway, @alvaroaleman, many thanks for the info about the prior work! Let's move the conversation about lazy discovery into the respective issue.

@vincepri
Member

Looking through the upstream codebase, I think we actually don't need to change anything in the current code for GroupsAndMaybeResources to work with 1.26 clusters (although I might be wrong, I only gave it a quick look).

It looks like we use

func NewDiscoveryRESTMapper(c *rest.Config) (meta.RESTMapper, error) {

and the same goes for the dynamic one:
client, err := discovery.NewDiscoveryClientForConfig(cfg)

Both internally call GetAPIGroupResources, which in turn calls ServerGroupsAndResources; in there is https://github.com/kubernetes/kubernetes/blob/a93eda9db305611cacd8b6ee930ab3149a08f9b0/staging/src/k8s.io/client-go/discovery/discovery_client.go#L388-L398.

A check is made for AggregatedDiscoveryInterface, which then tries to call GroupsAndMaybeResources. Given that we're using the client-go discovery client, that should work seamlessly?
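
For context, the chain described above boils down to roughly this sketch built from plain client-go/restmapper calls (not the exact controller-runtime code): GetAPIGroupResources calls ServerGroupsAndResources internally, and on client-go >= 0.26 that is where the AggregatedDiscoveryInterface check happens, so the caller does not need to change.

```go
package main

import (
	"k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/restmapper"
)

// buildMapper mirrors the call chain quoted above: the aggregated-discovery
// check lives inside ServerGroupsAndResources, so a mapper built this way
// picks up the single-request path for free against a 1.26+ API server.
func buildMapper(cfg *rest.Config) (meta.RESTMapper, error) {
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		return nil, err
	}
	// GetAPIGroupResources -> ServerGroupsAndResources -> aggregated path
	// when both client and server support it.
	groupResources, err := restmapper.GetAPIGroupResources(dc)
	if err != nil {
		return nil, err
	}
	return restmapper.NewDiscoveryRESTMapper(groupResources), nil
}
```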

@alvaroaleman
Member

Closing this in favor of #1603; this issue is a symptom of that not being fixed. There is currently an in-flight PR for it: #2116
