Adjust Discovery throttling #3368
Conversation
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up-to-date status, view the checks section at the bottom of the pull request.
Thanks @jashandeep-sohi for the PR. @karlkfi, can you please take a look at this?
🤔 This should probably just be part of UpdateQPS, so that if server-side throttling is enabled the client-side discovery throttling is disabled. I had assumed that disabling the QPS would also disable the discovery QPS, but I think the discovery QPS has its own separate default, and that's why you're seeing a delay. How many CRDs are you applying in one package, though? We might need to build a smarter discovery mapper to really solve this problem. Right now the cache invalidation is all or nothing and preemptively loads all the CRDs after each reset.
It's not even that many CRDs at the moment. It's about 26 right now (all the cluster-api ones), but that's certainly expected to increase as we start using more crossplane ones. I followed your lead, and my understanding is that it's not primarily the number of CRDs you're applying that matters, but rather the total number of different API Groups already installed on the cluster:
Perhaps the real fix is better cache invalidation. Wouldn't it be enough to invalidate just a single API Group, i.e. the one being applied? Maybe I should also open an issue upstream, but that seems like a much bigger change that'll take some time to make its way through. Regardless, I think adjusting these parameters (or providing a way to customize them at runtime) would be helpful.
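For context, here is a minimal sketch of why the current invalidation is so expensive, using the stock client-go discovery stack. The mapper and cache calls are real client-go APIs; the apply loop itself is schematic (the CRD names and the print stand-in for a real applier are illustrative only):

```go
package main

import (
	"fmt"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/discovery/cached/memory"
	"k8s.io/client-go/restmapper"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	dc, err := discovery.NewDiscoveryClientForConfig(config)
	if err != nil {
		panic(err)
	}
	// The standard stack: a memory-backed discovery cache behind a deferred mapper.
	mapper := restmapper.NewDeferredDiscoveryRESTMapper(memory.NewMemCacheClient(dc))

	crds := []string{"widgets.a.example.com", "gadgets.b.example.com"} // stand-ins
	for _, crd := range crds {
		fmt.Println("applying", crd) // a real applier would create the CRD here
		// Reset is all-or-nothing: it drops the entire cache, and the next
		// mapping lookup re-discovers every API group on the cluster. The
		// cost therefore scales with the total number of installed groups,
		// not with the number of CRDs in the package being applied.
		mapper.Reset()
	}
}
```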
Yeah, the upstream mapper is the root problem here. It doesn't provide an API for invalidating a single API group, and it's too aggressive about pre-loading because it's designed to assume that a cache miss means the resource isn't registered on the server. There have been a number of proposals to fix this, but none of them have been implemented yet. On the easier side, we could expose the QPS config to the user. That feels like lazy UX, hoisting a solvable problem onto the user to configure. But maybe that's preferable to not having a good solution in the short term. I know some Config Sync users who also want to be able to decrease the QPS to reduce server load without needing to configure QoS on every cluster.
I think this is a good idea; client-side throttling is just not the right way for the apiserver to protect against overload, and we have Priority and Fairness in newer Kubernetes versions - I believe all OSS-supported Kubernetes versions. In kubebuilder-declarative-pattern we are experimenting with an alternative RESTMapper that doesn't fetch things we don't need, and I'd also be in favor of reusing that: https://github.com/kubernetes-sigs/kubebuilder-declarative-pattern/tree/master/pkg/restmapper

As you point out @jashandeep-sohi, it's hard to get this accepted upstream without demonstrating it first, so the strategy we seem to be converging on is copy-and-pasting the code between the various projects and letting them evolve independently (with occasional sync-ups), and then we can try to get this upstreamed into client-go or wherever it belongs best.

Thanks for the contribution @jashandeep-sohi :-) One problem is that it looks like you haven't signed the project's CLA ( https://github.com/GoogleContainerTools/kpt/pull/3368/checks?check_run_id=7372699029 ) - are you able to sign it?
lgtm, but we can't merge without the CLA
Actually, I don't think we want to disable QPS here. If you move it down into UpdateQPS, though, that'd be OK. Then it only disables on newer clusters with P&F enabled.
We still support v1.21, where P&F is alpha - GKE's stable channel still uses it. That's why UpdateQPS checks whether P&F is enabled before disabling QPS client-side.
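For reference, a minimal sketch of one way such a check can be done, modeled on the header-based probe that kubectl and cli-utils use: ping the server and look for the Priority & Fairness flow-schema response header. The function name, ping path constant, and package layout here are assumptions, not kpt's actual code:

```go
package flowcontrol

import (
	"context"
	"net/http"

	"k8s.io/client-go/rest"
)

const pingPath = "/livez/ping" // assumed probe endpoint; any handled request works

// IsPFEnabled reports whether the API server stamps Priority & Fairness
// flow-schema headers on responses, which indicates server-side throttling
// is active.
func IsPFEnabled(ctx context.Context, config *rest.Config) (bool, error) {
	client, err := rest.HTTPClientFor(config)
	if err != nil {
		return false, err
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, config.Host+pingPath, nil)
	if err != nil {
		return false, err
	}
	resp, err := client.Do(req)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	// P&F-enabled servers add this header to every response they handle.
	return resp.Header.Get("X-Kubernetes-PF-FlowSchema-UID") != "", nil
}
```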
Oh, also, I've found the P&F docs to be inscrutable. Nobody knows how to configure it and there aren't very many examples available. I've been meaning to interrogate the api machinery team and get some recommendations published but haven't gotten around to it yet.
Right, forgot about the CLA. Should be signed now. And thanks for the pointers about getting things merged upstream. I also took a quick crack at improving the cache invalidation in forks:
It's certainly faster in my not-so-real-world test:
I tried to do that as well, but as far as I can tell https://pkg.go.dev/k8s.io/client-go@v0.24.3/rest#Config doesn't expose anything related to discovery. It's being handled here: https://github.com/kubernetes/cli-runtime/blob/4fdf49ae46a0caa7fafdfe97825c6129d5153f06/pkg/genericclioptions/config_flags.go#L269-L289 (which is also where, I presume, the values set by the UpdateQPS callback are overwritten).
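Paraphrasing the linked cli-runtime code (a sketch of the flow, not the verbatim upstream source): the discovery-specific QPS/Burst are applied after WrapConfigFn has already run, which is why values set in the callback get overwritten:

```go
package discoverysketch

import (
	"k8s.io/cli-runtime/pkg/genericclioptions"
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/discovery/cached/memory"
)

// toDiscoveryClient mirrors the flow of ConfigFlags.ToDiscoveryClient:
// build the (possibly wrapped) REST config first, then clobber its
// QPS/Burst with the discovery-specific values.
func toDiscoveryClient(f *genericclioptions.ConfigFlags, discoveryQPS float32, discoveryBurst int) (discovery.CachedDiscoveryInterface, error) {
	config, err := f.ToRESTConfig() // WrapConfigFn runs in here
	if err != nil {
		return nil, err
	}
	config.QPS = discoveryQPS     // overwrites whatever WrapConfigFn set
	config.Burst = discoveryBurst // likewise
	dc, err := discovery.NewDiscoveryClientForConfig(config)
	if err != nil {
		return nil, err
	}
	// Upstream wraps this in an on-disk cache; a memory cache keeps the sketch short.
	return memory.NewMemCacheClient(dc), nil
}
```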
No. It's not safe to disable discoveryQPS and discoveryBurst unless we know server-side throttling is enabled. That's what the code in UpdateQPS is doing. Unfortunately, it means that lookup logic needs to be pulled out and done earlier, so it can influence QPS in both use cases, because
Please modify UpdateQPS to take *ConfigFlags and set WrapConfigFn, DiscoveryBurst, and DiscoveryQPS after looking up whether server-side throttling is enabled.
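A sketch of that suggested shape, under a few assumptions: `IsPFEnabled` is the probe sketched above, the default constants and package name are illustrative, and the vendored cli-runtime is assumed to ship both the `WithDiscoveryQPS` and `WithDiscoveryBurst` setters:

```go
package live

import (
	"context"

	"k8s.io/cli-runtime/pkg/genericclioptions"
	"k8s.io/client-go/rest"
)

const (
	defaultQPS   = float32(5) // illustrative defaults, not kpt's actual values
	defaultBurst = 10
)

// UpdateQPS probes the server once, then configures both the wrapped REST
// config and the discovery-specific limits on the same ConfigFlags.
func UpdateQPS(ctx context.Context, f *genericclioptions.ConfigFlags) error {
	config, err := f.ToRESTConfig()
	if err != nil {
		return err
	}
	enabled, err := IsPFEnabled(ctx, config) // the probe sketched earlier
	if err != nil {
		return err
	}
	qps, burst := defaultQPS, defaultBurst
	if enabled {
		// client-go treats a negative QPS/Burst as "no client-side limit".
		qps, burst = -1, -1
	}
	f.WithDiscoveryQPS(qps).WithDiscoveryBurst(burst)
	f.WrapConfigFn = func(c *rest.Config) *rest.Config {
		c.QPS = qps
		c.Burst = burst
		return c
	}
	return nil
}
```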
Looks like this is now causing test failures in TestFnRender and TestFnEval. Lots of logs like:
It's possible this is now trying to query the server before the connection is configured. I guess because WrapConfigFn used to delay the call until later in the process. I wonder if
Ok, wrapped everything in a closure and that fixed the tests. And I don't think it's too late because
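Roughly the shape such a closure-based fix could take (a sketch under the same assumptions as the earlier snippets, not the actual diff): the server probe moves inside WrapConfigFn, so nothing touches the network until the config is first built, and since cli-runtime's ToDiscoveryClient builds the REST config before applying the discovery limits, relaxing the flags inside the closure reaches discovery too:

```go
package live

import (
	"context"

	"k8s.io/cli-runtime/pkg/genericclioptions"
	"k8s.io/client-go/rest"
)

// updateQPSLazily defers the server probe until the REST config is first
// built. IsPFEnabled is the probe sketched earlier; the function name here
// is hypothetical.
func updateQPSLazily(f *genericclioptions.ConfigFlags) {
	f.WrapConfigFn = func(c *rest.Config) *rest.Config {
		enabled, err := IsPFEnabled(context.Background(), c)
		if err != nil || !enabled {
			return c // keep client-side throttling on error or when P&F is off
		}
		c.QPS = -1
		c.Burst = -1
		// ToDiscoveryClient reads these flags after running this closure,
		// so discovery gets the relaxed limits too.
		f.WithDiscoveryQPS(-1).WithDiscoveryBurst(-1)
		return c
	}
}
```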
/lgtm
/approve
Thanks! This seems like a great improvement.
We'll probably want to update kpt-config-sync too. And we JUST migrated the source of truth to GitHub yesterday for that, so we can accept PRs now. Is that something you're interested in doing, now that we know the change to make? I think the code is very similar there, if it's not already using this function (which would mean needing a kpt dep bump after the next kpt release).
Looks like CS has its own copy: https://github.com/GoogleContainerTools/kpt-config-sync/blob/main/pkg/client/restconfig/restconfig.go#L64
* Disable client-side throttling for resource discovery if server-side throttling is enabled
I'm using `kpt live apply` for a bunch of CRDs. However, as the number of CRDs increases, I'm finding it near impossible to use because I'm getting throttled after every CRD is applied. Turns out the client is doing a discovery process for every single CRD that's present on the api-server after every new CRD is applied. This is not a problem exclusive to `kpt`, and there are some issues in the `kubectl` project to address this: kubernetes/kubectl#1126, kubernetes/kubernetes#105520, kubernetes/kubernetes#107131.

Disabling it completely is probably not a good idea, but I just wanted to get the ball rolling to address this issue and adjust these parameters.