clusterapi scale from zero support #4840

Merged
merged 3 commits
Aug 18, 2022

Conversation

elmiko
Contributor

@elmiko elmiko commented Apr 29, 2022

Which component this PR applies to?

cluster-autoscaler

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR adds scale from zero capability to the clusterapi provider for cluster-autoscaler.

Which issue(s) this PR fixes:

Fixes #3150

Special notes for your reviewer:

The cluster-api-provider-kubemark has implemented the provider part of the opt-in API for scaling from zero. I recommend using this as a way to test with a live cluster.

Does this PR introduce a user-facing change?

Added support for scaling to and from zero nodes for the cluster autoscaler's Cluster API provider. Enabling this feature will require changes by the user; for instructions, please see the Cluster API (clusterapi) provider README file in the autoscaler repository.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

Usage docs are included in the README file.

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Apr 29, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: elmiko

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 29, 2022
@elmiko
Contributor Author

elmiko commented Apr 29, 2022

i am placing a pre-emptive hold on this to allow time for reviews.
/hold for reviews

@enxebre ptal

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 29, 2022
```diff
@@ -226,7 +234,64 @@ func (ng *nodegroup) Nodes() ([]cloudprovider.Instance, error) {
 // node by default, using manifest (most likely only kube-proxy).
 // Implementation optional.
 func (ng *nodegroup) TemplateNodeInfo() (*schedulerframework.NodeInfo, error) {
-	return nil, cloudprovider.ErrNotImplemented
+	if !ng.scalableResource.CanScaleFromZero() {
```
Contributor

💥

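For context, here is a minimal, hypothetical sketch of the guard shape introduced in the diff above. The stand-in names (`capacitySource`, `errNotImplemented`, `templateNodeCapacity`) are illustrative only; the real method is `TemplateNodeInfo` on the provider's `nodegroup` and builds a full `schedulerframework.NodeInfo` rather than a bare resource list.

```go
package scalefromzero

import (
	"errors"

	corev1 "k8s.io/api/core/v1"
)

// errNotImplemented stands in for cloudprovider.ErrNotImplemented in this sketch.
var errNotImplemented = errors.New("not implemented")

// capacitySource is a hypothetical slice of the scalable resource surface:
// it reports whether enough data exists to scale from zero and, if so, the
// capacity a new node would have.
type capacitySource interface {
	CanScaleFromZero() bool
	InstanceCapacity() (corev1.ResourceList, error)
}

// templateNodeCapacity mirrors the guard shown in the diff: when the scalable
// resource cannot provide capacity information, keep the previous behaviour
// of reporting the template as unimplemented.
func templateNodeCapacity(sr capacitySource) (corev1.ResourceList, error) {
	if !sr.CanScaleFromZero() {
		return nil, errNotImplemented
	}
	return sr.InstanceCapacity()
}
```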
@elmiko
Contributor Author

elmiko commented May 2, 2022

updated to remove redundant checks for cpu and memory data

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 31, 2022
This allows a Machine{Set,Deployment} to scale up/down from 0,
provided the following annotations are set:

```yaml
apiVersion: v1
items:
- apiVersion: machine.openshift.io/v1beta1
  kind: MachineSet
  metadata:
    annotations:
      machine.openshift.io/cluster-api-autoscaler-node-group-min-size: "0"
      machine.openshift.io/cluster-api-autoscaler-node-group-max-size: "6"
      machine.openshift.io/vCPU: "2"
      machine.openshift.io/memoryMb: 8G
      machine.openshift.io/GPU: "1"
      machine.openshift.io/maxPods: "100"
```

Note that `machine.openshift.io/GPU` and `machine.openshift.io/maxPods`
are optional.

For autoscaling from zero, the autoscaler should convert the memory value
received in the appropriate annotation to bytes using powers of two,
consistent with other providers, and fail if the format received is not the
expected one. This gives robust behaviour consistent with cloud provider
APIs and provider implementations (see the sketch after this commit message).

https://cloud.google.com/compute/all-pricing
https://www.iec.ch/si/binary.htm
https://github.com/openshift/kubernetes-autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/aws_manager.go#L366

Co-authored-by:  Enxebre <alberto.garcial@hotmail.com>
Co-authored-by:  Joel Speed <joel.speed@hotmail.co.uk>
Co-authored-by:  Michael McCune <elmiko@redhat.com>
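As noted above, the sketch below shows one way the described conversion could look, using the standard `k8s.io/apimachinery/pkg/api/resource` quantity parser. The helper name `parseMemoryAnnotation` is hypothetical, and treating suffixed values with ordinary Quantity semantics (binary for `Gi`, decimal for `G`) is an assumption, not necessarily what was merged.

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

// parseMemoryAnnotation is a hypothetical helper: it parses the annotation
// value as a Kubernetes resource quantity and returns the amount in bytes,
// rejecting anything that does not parse rather than guessing at units.
func parseMemoryAnnotation(value string) (int64, error) {
	q, err := resource.ParseQuantity(value)
	if err != nil {
		return 0, fmt.Errorf("unexpected memory annotation %q: %v", value, err)
	}
	// Quantity.Value() is the amount in base units (bytes for memory):
	// "8Gi" becomes 8*1024^3 bytes, while "8G" becomes 8*1000^3 bytes.
	return q.Value(), nil
}

func main() {
	for _, v := range []string{"8Gi", "8G", "8192Mi", "not-a-quantity"} {
		b, err := parseMemoryAnnotation(v)
		if err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Printf("%s -> %d bytes\n", v, b)
	}
}
```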
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 18, 2022
@elmiko
Contributor Author

elmiko commented Jul 18, 2022

update

  • rebased
  • removed infrastructure machine template cache
  • improved unit tests based on review comments

i am doing some manual testing now to make sure nothing broke, but i think this is ready for re-review

@elmiko
Contributor Author

elmiko commented Jul 18, 2022

manual testing is working as expected against capi 1.0.5 and kubemark 0.3.0

@JoelSpeed
Contributor

I've given this a review again and it LGTM. Do we want to give others a few days to review and then try to get this merged? Is there anyone specific we want to get to review this before we label it?

@enxebre
Member

enxebre commented Jul 20, 2022

> manual testing is working as expected against capi 1.0.5 and kubemark 0.3.0

> I've given this a review again and it LGTM. Do we want to give others a few days to review and then try to get this merged? Is there anyone specific we want to get to review this before we label it?

sgtm, though we definitely need to add some sort of smoke testing to validate PRs here.

@enxebre
Member

enxebre commented Jul 20, 2022

thanks a lot @elmiko, this looks great to me other than #4840 (comment)

@elmiko
Contributor Author

elmiko commented Jul 20, 2022

fyi, i'm leaving this on hold so we don't merge early since it's already approved. i will remove the hold once we work out the last discussion item.

This commit is a combination of several commits. Significant details are
preserved below.

* update functions for resource annotations
  This change renames some of the functions that read resource usage from
  annotations so that the annotation lookup is reflected in the function
  name. This helps to make room for allowing the infrastructure reference
  as an alternate source for the capacity information.

* migrate capacity logic into a single function
  This change moves the logic to collect the instance capacity from the
  TemplateNodeInfo function into a method of the
  unstructuredScalableResource named InstanceCapacity. This new function
  is created to house the logic that will decide between annotations and
  the infrastructure reference when calculating the capacity for the node.

* add ability to look up infrastructure references
  This change supplements the annotation lookups by adding the logic to
  read the infrastructure reference if it exists. This is done to
  determine if the machine template exposes a capacity field in its
  status (a sketch of this lookup follows this commit message). For more
  information on how this mechanism works, please see the cluster-api
  enhancement[0].

* add documentation for capi scaling from zero

* improve tests for clusterapi scale from zero
  this change adds functionality to test the dynamic client behavior of
  getting the infrastructure machine templates.

* update README with information about rbac changes
  this adds more information about the rbac changes necessary for the
  scale from zero support to work.

* remove extra check for scaling from zero
  since the CanScaleFromZero function checks to see if both CPU and
  memory are present, there is no need to check a second time. This also
  adds some documentation to the CanScaleFromZero function to make it
  clearer what is happening.

* update unit test for capi scale from zero
  adding a few more cases and details to the scale from zero unit tests,
  including ensuring that the int based annotations do not accept other
  unit types.

[0] https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20210310-opt-in-autoscaling-from-zero.md
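As referenced by the enhancement link [0], the opt-in mechanism has the infrastructure machine template publish a capacity field in its status. A plausible sketch of that lookup over an unstructured template follows; the helper name `capacityFromTemplate` and the exact field path handling are assumptions rather than the merged implementation.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// capacityFromTemplate is a hypothetical helper: if the infrastructure
// machine template exposes status.capacity (per the opt-in autoscaling-from-
// zero proposal), convert it into a ResourceList the autoscaler could use
// when building a template node.
func capacityFromTemplate(template *unstructured.Unstructured) (corev1.ResourceList, error) {
	raw, found, err := unstructured.NestedStringMap(template.Object, "status", "capacity")
	if err != nil || !found {
		// No capacity published; the caller can fall back to annotations.
		return nil, err
	}
	capacity := corev1.ResourceList{}
	for name, value := range raw {
		q, err := resource.ParseQuantity(value)
		if err != nil {
			return nil, fmt.Errorf("unexpected capacity value %q for %q: %v", value, name, err)
		}
		capacity[corev1.ResourceName(name)] = q
	}
	return capacity, nil
}

func main() {
	template := &unstructured.Unstructured{Object: map[string]interface{}{
		"status": map[string]interface{}{
			"capacity": map[string]interface{}{"cpu": "2", "memory": "8Gi"},
		},
	}}
	capacity, err := capacityFromTemplate(template)
	fmt.Println(capacity, err)
}
```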
@elmiko
Contributor Author

elmiko commented Jul 23, 2022

update

  • removed all references to the custom cache
  • squashed down to 2 commits

@elmiko
Contributor Author

elmiko commented Jul 26, 2022

i'm doing some digging around client-go and have found https://github.com/kubernetes/client-go/tree/master/tools/cache, but i think we need to add a little more code to make it work. i was under the false impression that the cache was automatic with client-go. hopefully i'll have another patch soon with the cache added.

@elmiko
Contributor Author

elmiko commented Jul 26, 2022

i talked with @JoelSpeed a little about client-go internals, i think what i need to add here is the ability to add new informers to the client when we observe new infrastructure machine templates. i'm doing a little more digging to see how we can do that with our current setup.

this change adds logic to create informers for the infrastructure
machine templates that are discovered during the scale from zero checks.
it also adds tests and a slight change to the controller structure to
account for the dynamic informer creation.
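A minimal sketch of the informer wiring described in this commit message, assuming client-go's `dynamicinformer` package is used to create an informer for a newly observed infrastructure machine template GVR; the function name, resync period, and cluster-wide namespace scope are illustrative choices, not the merged code.

```go
package scalefromzero

import (
	"time"

	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/dynamic/dynamicinformer"
	"k8s.io/client-go/tools/cache"
)

// startTemplateInformer sketches the wiring described above: when a new
// infrastructure machine template GVR is observed during a scale-from-zero
// check, create a dynamic informer for it so later lookups are served from
// the shared cache instead of direct API calls.
func startTemplateInformer(client dynamic.Interface, gvr schema.GroupVersionResource, stop <-chan struct{}) cache.SharedIndexInformer {
	// Namespace "" watches all namespaces; the 10 minute resync is arbitrary.
	factory := dynamicinformer.NewFilteredDynamicSharedInformerFactory(client, 10*time.Minute, "", nil)
	informer := factory.ForResource(gvr).Informer()
	factory.Start(stop)
	// Block until the informer's local cache has synced before serving reads.
	cache.WaitForCacheSync(stop, informer.HasSynced)
	return informer
}
```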
@elmiko
Contributor Author

elmiko commented Aug 17, 2022

ok, i have finally understood the magical incantations for client-go to get the dynamic informers created and running for the infrastructure templates. i think this is good to merge whenever folks are happy with the code quality.

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 17, 2022
@JoelSpeed
Contributor

This looks good and I think we've addressed everyone's comments. I know Mike has been testing this and has seen it working. I think it's good to merge now.
/lgtm

Labels
  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/cluster-autoscaler
  • cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
  • kind/feature: Categorizes issue or PR as related to a new feature.
  • lgtm: "Looks good to me", indicates that a PR is ready to be merged.
  • size/XXL: Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cluster Autoscaler CAPI provider should support scaling to and from zero nodes
7 participants