Lots of discovery cache throttling with many CRDs #2084

Closed
jonnylangefeld opened this issue Dec 7, 2022 · 6 comments

Comments

@jonnylangefeld

jonnylangefeld commented Dec 7, 2022

When using controller-runtime on a cluster with many CRDs installed, I see many client-side throttling log lines such as these:

I1207 20:56:06.276269       1 request.go:601] Waited for 1.192751471s due to client-side throttling, not priority and fairness, request: GET:https://172.16.192.1:443/apis/memcache.cnrm.cloud.google.com/v1beta1?timeout=32s
I1207 20:56:16.276933       1 request.go:601] Waited for 11.193341929s due to client-side throttling, not priority and fairness, request: GET:https://172.16.192.1:443/apis/essentialcontacts.gcp.upbound.io/v1beta1?timeout=32s
I1207 20:56:26.476575       1 request.go:601] Waited for 21.392897152s due to client-side throttling, not priority and fairness, request: GET:https://172.16.192.1:443/apis/servicedirectory.cnrm.cloud.google.com/v1beta1?timeout=32s
I1207 20:56:36.476594       1 request.go:601] Waited for 31.392687318s due to client-side throttling, not priority and fairness, request: GET:https://172.16.192.1:443/apis/iam.aws.crossplane.io/v1beta1?timeout=32s
I1207 20:56:46.604938       1 request.go:601] Waited for 4.592238125s due to client-side throttling, not priority and fairness, request: GET:https://172.16.192.1:443/apis/iam.aws.crossplane.io/v1beta1?timeout=32s
I1207 20:56:56.605021       1 request.go:601] Waited for 14.592156935s due to client-side throttling, not priority and fairness, request: GET:https://172.16.192.1:443/apis/microsoft.compute.azure.com/v1alpha1api20200930storage?timeout=32s
I1207 20:57:06.805227       1 request.go:601] Waited for 24.79219262s due to client-side throttling, not priority and fairness, request: GET:https://172.16.192.1:443/apis/servicenetworking.cnrm.cloud.google.com/v1beta1?timeout=32s
I1207 20:57:16.805454       1 request.go:601] Waited for 34.791425163s due to client-side throttling, not priority and fairness, request: GET:https://172.16.192.1:443/apis/sourcerepo.gcp.upbound.io/v1beta1?timeout=32s
I1207 20:57:26.827615       1 request.go:601] Waited for 9.194122277s due to client-side throttling, not priority and fairness, request: GET:https://172.16.192.1:443/apis/helm.crossplane.io/v1alpha1?timeout=32s
I1207 20:57:37.026258       1 request.go:601] Waited for 19.392665018s due to client-side throttling, not priority and fairness, request: GET:https://172.16.192.1:443/apis/bigquery.cnrm.cloud.google.com/v1beta1?timeout=32s
I1207 20:57:47.026976       1 request.go:601] Waited for 29.393315758s due to client-side throttling, not priority and fairness, request: GET:https://172.16.192.1:443/apis/storage.cnrm.cloud.google.com/v1beta1?timeout=32s

This has been a larger issue in the Kubernetes community, one that also affected kubectl and therefore even more end users. It's arguably more annoying to get slowed down on every kubectl command than to have a slowed-down controller running in the background. In kubectl it has been fixed via .WithDiscoveryBurst(300).WithDiscoveryQPS(50.0) (source) and by extending the TTL of the discovery cache.

However, controller-runtime uses the same config options for regular API calls and the discovery client. We should implement the same fix as in kubectl to allow for higher rate limits during discovery to speed up controller reconciliation.
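
In the meantime, a minimal sketch of an interim workaround (not the proposed built-in fix): since controller-runtime hands the same rest.Config to both the regular clients and the discovery client, raising the shared client-side limits to the values kubectl now uses also speeds up the discovery burst at startup.

```go
package main

import ctrl "sigs.k8s.io/controller-runtime"

func main() {
	// Raise the shared client-side rate limits; the same config is used
	// for discovery, so the startup discovery walk is throttled less.
	cfg := ctrl.GetConfigOrDie()
	cfg.QPS = 50.0  // client-go default is 5
	cfg.Burst = 300 // client-go default is 10

	mgr, err := ctrl.NewManager(cfg, ctrl.Options{})
	if err != nil {
		panic(err)
	}
	_ = mgr // set up controllers and call mgr.Start(...) as usual
}
```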

cc @negz

@lobziik
Member

lobziik commented Dec 9, 2022

> When using controller-runtime on a cluster with many CRDs installed, I see many client-side throttling log lines such as these

We recently ran into similar (though not identical) issues, because discovery is quite bursty in terms of network requests.

I wonder what people think about revisiting the discovery process and making it lazy? Or, alternatively, providing some knobs to configure which groups are included in the initial discovery and ignoring groups that are not needed.

> We should implement the same fix as in kubectl to allow for higher rate limits during discovery to speed up controller reconciliation.

I understand this as a proposal to increase the number of requests per second by default for the discovery procedure, am I right?

@alvaroaleman
Member

> We recently ran into similar (though not identical) issues, because discovery is quite bursty in terms of network requests.

And that is an issue? So you are saying the original ask here (a higher rate limit for discovery) would be a problem for you?

> I wonder what people think about revisiting the discovery process and making it lazy?

There is an open issue for this, #1603, but someone has to actually do the work, and that hasn't happened yet. Since we get the group versions from the scheme, I think it should be possible (see the rough sketch below), but maybe I am missing something and it is not actually possible.
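
To illustrate the idea, a hypothetical helper (not existing controller-runtime API): the scheme already lists every GVK a controller uses, so discovery could in principle be limited to just those API groups.

```go
package main

import "k8s.io/apimachinery/pkg/runtime"

// groupsInScheme collects the API groups registered in the scheme, which is
// the set a lazy or filtered discovery would actually need to resolve.
func groupsInScheme(s *runtime.Scheme) map[string]struct{} {
	groups := map[string]struct{}{}
	for gvk := range s.AllKnownTypes() {
		groups[gvk.Group] = struct{}{}
	}
	return groups
}
```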

IIRC there was also some work going on upstream to provide a single discovery document rather than the current hierarchy, which would also alleviate this issue.

@alvaroaleman
Member

alvaroaleman commented Dec 10, 2022

This is the upstream SIG API Machinery KEP I was referring to: https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/3352-aggregated-discovery
It actually looks like this made it into 1.26 (but requires a 1.26 kube-apiserver to work): kubernetes/kubernetes#113599. It requires us to make changes to use the new GroupsAndMaybeResources and fall back if we don't get the resources.
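
A rough sketch of what that fallback could look like, using only client-go calls that exist today (the helper name is illustrative, not controller-runtime code): prefer the aggregated endpoint when the discovery client supports it and the server returns resources, otherwise fall back to the per-group walk.

```go
package main

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/rest"
)

// discoverResources is an illustrative helper: with client-go >= 0.26 and a
// 1.26+ kube-apiserver, a single aggregated request returns groups and
// resources together; otherwise it falls back to the legacy per-group walk.
func discoverResources(cfg *rest.Config) ([]*metav1.APIResourceList, error) {
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		return nil, err
	}
	var di discovery.DiscoveryInterface = dc
	if ad, ok := di.(discovery.AggregatedDiscoveryInterface); ok {
		_, resources, _, err := ad.GroupsAndMaybeResources()
		if err == nil && resources != nil {
			out := make([]*metav1.APIResourceList, 0, len(resources))
			for _, list := range resources {
				out = append(out, list)
			}
			return out, nil
		}
	}
	// Legacy path: one request per group/version, which is what triggers
	// the client-side throttling on CRD-heavy clusters.
	_, lists, err := di.ServerGroupsAndResources()
	return lists, err
}
```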

@lobziik
Member

lobziik commented Dec 15, 2022

> And that is an issue? So you are saying the original ask here (a higher rate limit for discovery) would be a problem for you?

We saw controller crashes during discovery because of network issues. Presumably, in our case, this was caused by the load balancer configuration in front of the API server.
A higher rate limit for discovery would produce even more requests per second and might cause problems for setups with less reliable networking.

I'm thinking of clusters on top of on-prem-ish platforms such as vSphere, and use cases that involve plenty of controller restarts (OpenShift upgrades :P).

Anyway, @alvaroaleman, many thanks for the info about the prior work! Let's move the conversation about lazy discovery into the respective issue.

@vincepri
Member

Looking through the upstream codebase, I think we actually don't need to change anything in the current code for GroupsAndMaybeResources to work with 1.26 clusters (although I might be wrong, I only gave it a quick look).

It looks like we use

func NewDiscoveryRESTMapper(c *rest.Config) (meta.RESTMapper, error) {

and the same goes for the dynamic one:
client, err := discovery.NewDiscoveryClientForConfig(cfg)

Both internally call GetAPIGroupResources, which in turn calls ServerGroupsAndResources; in there is https://github.com/kubernetes/kubernetes/blob/a93eda9db305611cacd8b6ee930ab3149a08f9b0/staging/src/k8s.io/client-go/discovery/discovery_client.go#L388-L398.

A check is made for AggregatedDiscoveryInterface, which then tries to call GroupsAndMaybeResources. Given that we're using the client-go discovery client, that should work seamlessly?
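
For context, the chain described above boils down to roughly this sketch built from plain client-go/restmapper calls (not the exact controller-runtime code): GetAPIGroupResources calls ServerGroupsAndResources internally, and on client-go >= 0.26 that is where the AggregatedDiscoveryInterface check happens, so the caller does not need to change.

```go
package main

import (
	"k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/restmapper"
)

// buildMapper mirrors the call chain quoted above: the aggregated-discovery
// check lives inside ServerGroupsAndResources, so a mapper built this way
// picks up the single-request path for free against a 1.26+ API server.
func buildMapper(cfg *rest.Config) (meta.RESTMapper, error) {
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		return nil, err
	}
	// GetAPIGroupResources -> ServerGroupsAndResources -> aggregated path
	// when both client and server support it.
	groupResources, err := restmapper.GetAPIGroupResources(dc)
	if err != nil {
		return nil, err
	}
	return restmapper.NewDiscoveryRESTMapper(groupResources), nil
}
```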

@alvaroaleman
Member

Closing this in favor of #1603; this issue is a symptom of that not being fixed. There is currently an in-flight PR for it: #2116
