
🐛 make clusterctl REST client throttling configurable #8411

Closed · wants to merge 1 commit

Conversation

Contributor

@nojnhuh nojnhuh commented Mar 28, 2023

What this PR does / why we need it:

When running commands like clusterctl init against a cluster with many CRDs installed, the API discovery process frequently gets throttled client-side, which makes operations take significantly longer. On a fresh kind cluster, clusterctl init took about 30s for me, but with ~100 CRDs installed it took 5-6m. In a blog post from Upbound (maintainers of Crossplane), they mention running into the same problem with kubectl and contributing a fix upstream to bump the rest.Config's Burst from 100 to 300. With the same change applied here, the same clusterctl init command goes from 5-6m to under 1m, with minimal log messages indicating client-side throttling.
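For context, the change itself is tiny; a minimal sketch (helper name illustrative, not the actual clusterctl code) of raising the client-side limits on a client-go rest.Config:

```go
package client

import "k8s.io/client-go/rest"

// adjustThrottling raises the client-side rate limits before any clients are
// built from cfg. client-go falls back to QPS=5 and Burst=10 when these are
// left at zero; Burst=300 matches the kubectl fix referenced above
// (kubernetes/kubernetes#105520). The QPS value here is illustrative.
func adjustThrottling(cfg *rest.Config) *rest.Config {
	cfg.QPS = 50
	cfg.Burst = 300
	return cfg
}
```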

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

@k8s-ci-robot k8s-ci-robot added the area/clusterctl Issues or PRs related to clusterctl label Mar 28, 2023
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Mar 28, 2023
Contributor

@killianmuldoon killianmuldoon left a comment

This seems like a good idea to me - one question about the use case though - as long as you don't mind sharing 🙂

Is there a reason there are so many additional CRDs installed before clusterctl init?

@nojnhuh
Contributor Author

nojnhuh commented Mar 28, 2023

This seems like a good idea to me - one question about the use case though - as long as you don't mind sharing 🙂

Is there a reason there are so many additional CRDs installed before clusterctl init?

Sure thing. In CAPZ we're considering adopting Azure Service Operator (ASO) to create/update/delete individual resources in Azure (CAPZ proposal), and ASO defines one CRD per Azure resource type. Right now there's no great way to install only the ASO CRDs we need.

@killianmuldoon
Contributor

Thanks for explaining - definitely seems reasonable. The only question is whether to default the burst higher or make it configurable with a flag / config file. I think clusterctl maintainers and reviewers will have clearer opinions on that.

@CecileRobertMichon
Contributor

Should this be classified as a bug fix and considered for cherry-pick?

@CecileRobertMichon
Contributor

CecileRobertMichon commented Mar 29, 2023

The only question is whether to default the burst higher or make it configurable with a flag / config file. I think clusterctl maintainers and reviewers will have clearer opinions on that.

We could do both? I think a higher default is sane, given that a user installing several infra providers could quickly run into these limits (and the current value is lower than the kubectl default, kubernetes/kubernetes#105520).

@killianmuldoon
Contributor

Should this be classified as a bug fix and considered for cherry-pick?

I think it's a good idea to cherry-pick for sure. @nojnhuh could you add a command line flag making this configurable?

@nojnhuh
Contributor Author

nojnhuh commented Mar 30, 2023

I think it's a good idea to cherry-pick for sure. @nojnhuh could you add a command line flag making this configurable?

Yep, will do. Should I do the same for the rest config's QPS since that's closely related?

@killianmuldoon
Contributor

killianmuldoon commented Mar 30, 2023

Yep, will do. Should I do the same for the rest config's QPS since that's closely related?

Sounds good to me.

@nojnhuh
Contributor Author

nojnhuh commented Mar 30, 2023

Had to add most of the plumbing to expose these as CLI flags, which kinda blew up this PR. Definitely took a hit to my keyboard's Cmd, C, and V keys, but it works at least.

/retitle 🐛 make clusterctl REST client throttling configurable
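A rough sketch of the kind of flag plumbing involved, with hypothetical flag names, not necessarily what this PR adds:

```go
package cmd

// Sketch only: initCmd stands in for clusterctl's existing `init` cobra
// command, and the flag names below are hypothetical.
var (
	restQPS   float32
	restBurst int
)

func init() {
	initCmd.Flags().Float32Var(&restQPS, "kube-api-qps", 50,
		"Maximum queries per second from clusterctl to the management cluster API server")
	initCmd.Flags().IntVar(&restBurst, "kube-api-burst", 300,
		"Maximum burst of queries from clusterctl to the management cluster API server")
}
```

The values then have to be threaded from the command layer down to wherever the rest.Config is built, which is what inflates the diff.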

@k8s-ci-robot k8s-ci-robot changed the title 🌱 increase clusterctl client throttle burst 🐛 make clusterctl REST client throttling configurable Mar 30, 2023
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Mar 30, 2023
@nojnhuh
Contributor Author

nojnhuh commented Apr 3, 2023

@killianmuldoon Thanks for taking a look so far! Could you help assign folks you think might be good to review this?

Contributor

@CecileRobertMichon CecileRobertMichon left a comment

/cc @Jont828

Contributor

@CecileRobertMichon CecileRobertMichon left a comment

lgtm

Is there anywhere in the book these options should be documented?

@nojnhuh
Contributor Author

nojnhuh commented Apr 4, 2023

I don't see anything like generated documentation for every flag on every command, so maybe the built-in --help is enough here?

Contributor

@ykakarap ykakarap left a comment

Increasing the defaults to match upstream (kubectl) sounds reasonable.

The overall approach looks good. I will test it later this week.

I don't see anything like generated documentation for every flag on every command, so maybe the built-in --help is enough here?

Correct. We do not have any explicit documentation for the flags of all the commands. The flag descriptions included in the built-in --help are enough.

@nojnhuh
Contributor Author

nojnhuh commented Apr 11, 2023

/assign @ykakarap

Friendly ping on this if you wouldn't mind taking a look.

Resolved review threads: cmd/clusterctl/cmd/delete.go (×3, outdated), cmd/clusterctl/cmd/move.go (outdated), cmd/clusterctl/cmd/topology_plan.go
@fabriziopandini
Member

/lgtm
Happy to approve as soon as @ykakarap, @Jont828, or other maintainers/reviewers in this area give a +1

@nojnhuh
Contributor Author

nojnhuh commented Apr 19, 2023

Test flake already has an issue open: #8478

@sbueringer
Member

sbueringer commented Apr 22, 2023

I wonder if an alternative to the current implementation in this PR is to use the lazy rest mapper that CR introduced as an experimental feature (and which we exposed as an experimental feature on our controllers).

The lazy restmapper only discovers CRDs on demand, when they are actually used. This way we could simply reduce the number of requests instead of changing the throttling.

xref:
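For reference, a sketch of what opting in looked like at the time, assuming controller-runtime v0.14's experimental apiutil.WithExperimentalLazyMapper option (the lazy behavior became the default in v0.15):

```go
package client

import (
	"k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/client-go/rest"
	"sigs.k8s.io/controller-runtime/pkg/client/apiutil"
)

// newLazyMapper builds a RESTMapper that performs discovery per GroupVersion
// on first use instead of walking every API group up front, which is the
// request pattern that trips client-side throttling with ~100 CRDs installed.
func newLazyMapper(cfg *rest.Config) (meta.RESTMapper, error) {
	return apiutil.NewDynamicRESTMapper(cfg, apiutil.WithExperimentalLazyMapper)
}
```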

@nojnhuh
Contributor Author

nojnhuh commented May 1, 2023

@sbueringer Do you think it might make sense to integrate the lazy rest mapper separately so users can take advantage of the increased throttling thresholds without having to opt in to experimental functionality?

@sbueringer
Member

sbueringer commented May 2, 2023

I'm fine with it but I would like to leave the decision to clusterctl maintainers.

I just thought it's better to make configurable throttling unnecessary by optimizing the calls. While the lazy restmapper is experimental, we've been using it for ~1-2 months in all our e2e test jobs (1.4 + main, IIRC) without any problems.

It also looks like it would help us avoid making changes in a lot of places just to make throttling configurable.

To be honest, given what clusterctl is doing (deploying providers), this doesn't feel like something we should need to make configurable.

@sbueringer
Member

^^ @vincepri WDYT?

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 2, 2023
@k8s-ci-robot
Contributor

New changes are detected. LGTM label has been removed.

@sbueringer
Member

sbueringer commented May 2, 2023

Thought experiment:

Let's assume the lazy restmapper becomes the default/only restmapper in CR v0.15 (to which we will bump in 2 weeks). Would we still want the flags on top of that?

@nojnhuh
Contributor Author

nojnhuh commented May 2, 2023

@ykakarap @Jont828 Thoughts on using the lazy restmapper instead of or in addition to these changes?

@sbueringer
Member

Context: I'm asking because we're about to make the lazy restmapper the default restmapper: kubernetes-sigs/controller-runtime#2296

@fabriziopandini
Member

Given that it's becoming the default in two weeks, the lazy rest mapper makes more and more sense...

@sbueringer
Member

We'll have it as soon as #8007 is merged, so we can try out how it changes the situation for clusterctl.

@nojnhuh
Contributor Author

nojnhuh commented May 9, 2023

@sbueringer Do you expect that as soon as controller-runtime is bumped, then CAPI will automatically start using the lazy restmapper and this issue I was originally running into might go away entirely? Or will there be additional work to take advantage of the lazy restmapper?

@ykakarap
Contributor

ykakarap commented May 9, 2023

@sbueringer Do you expect that as soon as controller-runtime is bumped, then CAPI will automatically start using the lazy restmapper and this issue I was originally running into might go away entirely? Or will there be additional work to take advantage of the lazy restmapper?

Looks like after bumping controller-runtime (#8007), the clusterctl client should default to using the lazy rest mapper. It would be better to perform some tests (similar to how you got the initial metrics of 5-6m) to see if it addresses the need.
That, combined with increasing the defaults to match the upstream values, should be a good place to start.
Are there any strong needs to make them configurable?
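As a rough way to time the discovery walk that gets throttled, a sketch (assumes a kubeconfig at the default location; the limit values are illustrative):

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	// Pin the limits explicitly so the comparison isn't skewed by defaults.
	cfg.QPS = 50
	cfg.Burst = 300

	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		panic(err)
	}

	start := time.Now()
	// Full discovery walk: roughly one request per group/version, the hot
	// path when many CRDs are installed.
	_, resources, err := dc.ServerGroupsAndResources()
	if err != nil {
		panic(err)
	}
	fmt.Printf("discovered %d group/versions in %s\n", len(resources), time.Since(start))
}
```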

@nojnhuh
Contributor Author

nojnhuh commented May 9, 2023

Thanks @ykakarap, I'll retest once the controller-runtime version bump merges. AFAICT kubectl doesn't make these values configurable, so I think we can reasonably leave the values in clusterctl hardcoded to match kubectl's, at least for the short to medium term. If increasing the default burst still makes things significantly faster with the lazy restmapper, I'd be in favor of making that change by itself and making the values configurable, if need be, in a separate PR. I'll follow up here once I've done that testing with controller-runtime 0.15.

@ykakarap
Contributor

ykakarap commented May 9, 2023

Sounds good. Thank you :)

@sbueringer
Member

Sounds good. As of controller-runtime v0.15 the lazy restmapper will be the only restmapper, so it will be used out of the box.

@k8s-ci-robot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 17, 2023
@sbueringer
Member

@nojnhuh Just fyi, CR bump has been merged now if you want to test again :)

@nojnhuh
Contributor Author

nojnhuh commented May 25, 2023

@nojnhuh Just fyi, CR bump has been merged now if you want to test again :)

Just tried this out and it's definitely a lot better than before! clusterctl init still seems a few seconds slower overall when lots of CRDs are installed, but I think that's reasonable in our case. Thanks @sbueringer for the CR bump!

/close

@k8s-ci-robot
Contributor

@nojnhuh: Closed this PR.

In response to this:

@nojnhuh Just fyi, CR bump has been merged now if you want to test again :)

Just tried this out and it's definitely a lot better than before! clusterctl init still seems a few seconds slower overall when lots of CRDs are installed, but I think that's reasonable in our case. Thanks @sbueringer for the CR bump!

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@nojnhuh nojnhuh deleted the throttle branch May 25, 2023 19:04
Labels: area/clusterctl · cncf-cla: yes · needs-rebase · size/L