Cluster Autoscaler CAPI provider should support scaling to and from zero nodes #3150

Closed
elmiko opened this issue May 21, 2020 · 42 comments · Fixed by #4840
Labels
area/cluster-autoscaler, area/provider/cluster-api

Comments

elmiko commented May 21, 2020

As a user I would like the ability to have my MachineSets and MachineDeployments scale to and from zero replicas. I should be able to set a minimum size of 0 for a Machine[Set|Deployment] and have the autoscaler take the appropriate actions.

This issue is CAPI provider specific, and will require some modifications to the individual CAPI providers before the feature can be merged into the autoscaler code.
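
As a rough sketch of the desired user experience, a MachineDeployment opted in to autoscaling might look like the following. The API versions, resource names, and referenced templates are placeholders, and the min/max size annotations are the node-group size annotations the CAPI provider already recognizes:

```yaml
apiVersion: cluster.x-k8s.io/v1alpha3
kind: MachineDeployment
metadata:
  name: workers                        # placeholder name
  namespace: default
  annotations:
    # node-group size bounds read by the cluster-autoscaler CAPI provider;
    # a minimum of zero is what this issue asks to support
    cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size: "0"
    cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size: "5"
spec:
  clusterName: my-cluster              # placeholder cluster
  selector:
    matchLabels: {}                    # placeholder selector
  template:
    spec:
      clusterName: my-cluster
      version: v1.18.3                 # placeholder Kubernetes version
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1alpha3
          kind: KubeadmConfigTemplate
          name: workers-bootstrap      # placeholder
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
        kind: DockerMachineTemplate    # placeholder infrastructure kind
        name: workers-infra            # placeholder
```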

elmiko commented May 21, 2020

/area provider/cluster-api

k8s-ci-robot added the area/provider/cluster-api label May 21, 2020
seh commented May 27, 2020

How will the autoscaler determine which labels and taints to expect on nodes for its scheduling simulation? I see the taints may be available in the kubeadm NodeRegistrationOptions type.

elmiko commented May 28, 2020

@seh, if i understand your question correctly, this information is handled through the labels and taints on the MachineSets and MachineDeployments. when these resources are set to a minimum size of 0 and the autoscaler has removed all the Machines and Nodes, the MachineSets and MachineDeployments contain labels and taints which are used during the scale up process. the labels and taints will be applied to the new Node resources as they are created.

seh commented May 28, 2020

Before asking that question, when I went looking at the newest MachineSet type definition, I didn't see anything there about taints. Drilling down further into MachineSpec, it's not there either.

The only place I could find them was in the kubeadm NodeRegistrationOptions type. That's why I asked where you'll find the taints. Did I miss a pertinent field here?

elmiko commented May 28, 2020

Did I miss a pertinent field here?

no, i don't think you missed something, i think i may have missed something ;)

i have been working from a branch of the cluster-api code to test this behavior locally and with openshift. to make this work in our branch, we have the Taints persisted at the MachineSpec level. i think there will need to be some work done in the cluster-api project to expose this functionality, or at least a little deeper research.

there are other changes that will need to happen in CAPI as well, mainly around saving information about cpu/memory/gpu. your point about the taints is well placed though, i will add this to the list of changes.

seh commented May 28, 2020

For the machine resources, I figured that we'd do something like dive down to figure out the cloud provider and machine/instance type, and then consult the catalogs available elsewhere within the cluster autoscaler. I'm most familiar with AWS, and for that provider there used to be a static (generated) catalog, but now we fetch it dynamically via the AWS API when the program starts. With that catalog, you can learn of the machine's promised capabilities.

Perhaps, though, in the interest of eliminating dependencies among providers, the Cluster API provider would be blind to that information, which would be an unfortunate loss.

elmiko commented May 28, 2020

for the machine/instance resources, the solution i am working from currently is that the individual providers on the CAPI side will populate annotations in the Machine[Set|Deployment] that describe the cpu, memory, gpu, etc.

the method i am currently using has lookup tables for each provider (contained within the provider code) to assist in creating the resource requirements. i think having these values be dynamically populated by the CAPI side of things would certainly be worth looking into. ultimately though, the idea would be for each provider to own their implementation of the resource requirements, with a group of standard annotations that the autoscaler can use to assist in creating the machines for that group.

the information does come from the CAPI providers though, not from the autoscaler providers.
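
to sketch the shape of that, here is a metadata-only fragment of an annotated MachineSet. the annotation keys below are just illustrative placeholders, not a settled api:

```yaml
apiVersion: cluster.x-k8s.io/v1alpha3
kind: MachineSet
metadata:
  name: workers                 # placeholder name
  annotations:
    # illustrative capacity hints so the autoscaler can build a node
    # template for this group while it has zero machines
    capacity.cluster-autoscaler.kubernetes.io/cpu: "4"
    capacity.cluster-autoscaler.kubernetes.io/memory: "16G"
    capacity.cluster-autoscaler.kubernetes.io/gpu-count: "1"
    capacity.cluster-autoscaler.kubernetes.io/gpu-type: "nvidia.com/gpu"
```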

seh commented May 28, 2020

Understood. So long as it's all accurate and not too hard to maintain, that sounds fine.

What we ran into with the AWS provider for the autoscaler was that the catalog would fall out of step, which required generating fresh code, releasing a new autoscaler version, and then deploying that new container image version into clusters. AWS was coming out with new instance types often enough that that whole process felt too onerous. It seems that these new instance types come in waves. It's hard to balance the threat of falling out of date with the threat of the catalog fetching and parsing failing at run time.

elmiko commented May 28, 2020

that's an excellent point about the catalog falling out of step. if i understand the provider implementations through CAPI properly, and i might not ;) , we are using values for cpu, memory, etc, that the individual CAPI providers then turn into actual instance information at the cloud provider layer. so, in theory, this could be a call to the CAPI provider at creation time, eg. "give me a Machine that has X cpu slices, Y ram, and Z gpus", then the CAPI provider could either use a lookup table if appropriate or make some dynamic call to the cloud provider api.

edit: added some context to the overloaded "provider" terms

seh commented Jun 10, 2020

to make this work in our branch, we have the Taints persisted at the MachineSpec level. i think there will need to be some work done in the cluster-api project to expose this functionality, or at least a little deeper research.

Are there any open CAPI issues about this gap? Do you know if anyone is working on exposing the node taints and labels there? (Perhaps we can already get the labels.)

elmiko commented Jun 10, 2020

Are there any open CAPI issues about this gap? Do you know if anyone is working on exposing the node taints and labels there? (Perhaps we can already get the labels.)

i do not think issues have been opened on the CAPI side yet; there will need to be some discussion there about passing information about the node sizes through the CAPI resources. i am working from a proof of concept that has this working for aws, gcp, and azure, in which we use annotations for passing this information.

ideally i would like to contribute these patches back to the CAPI project, and bring the associated changes here as well, but i think we need to have a discussion on the CAPI side about this as it will require changes to several repos and some agreement about the method for passing information.

and we haven't even touched on the taints yet ;)

seh commented Jun 10, 2020

i think we need to have a discussion on the CAPI side about this as it will require changes to several repos and some agreement about the method for passing information.

Would you mind if I bring this up for discussion in the "cluster-api" Slack channel? I'd like to get a feel for how much work and resistance lies ahead, as I don't think we can adopt the cluster autoscaler with CAPI until we close this gap.

elmiko commented Jun 10, 2020

Would you mind if I bring this up for discussion in the "cluster-api" Slack channel? I'd like to get a feel for how much work and resistance lies ahead, as I don't think we can adopt the cluster autoscaler with CAPI until we close this gap.

please do!

if you'd like, we can bring this up during the weekly meeting today as well?

elmiko commented Jun 10, 2020

@seh just wanted to let you know that we talked about this at the CAPI meeting today, i don't think we have consensus yet but i didn't hear any hard objections. i think the next steps will be to do a little research around some other approaches to gather the cpu/mem/gpu requirements, and then create an enhancement proposal to discuss with the CAPI team.

CAPI meeting minutes 2020-06-10

seh commented Jun 10, 2020

That's great to hear. I'm sorry I wasn't able to attend the meeting today. I do see the topic covered in the agenda/minutes, though, so thank you for bringing it up.

I don't know yet what I can do to help make progress on this front. I have experience with kubeadm and the cluster autoscaler, but little with CAPI and CAPA so far. If you'd like review or help with the KEP, please let me know.

elmiko commented Jun 10, 2020

I don't know yet what I can do to help make progress on this front. I have experience with kubeadm and the cluster autoscaler, but little with CAPI and CAPA so far. If you'd like review or help with the KEP, please let me know.

i think the next steps will be to make a formal proposal to the CAPI group for getting this change into their releases, and then coordinating the autoscaler changes. i'm happy to CC you on any issues that come up around this, and perhaps we can work to get them merged. if you are interested in getting more involved with the CAPI provider code, i'm sure we could collaborate on getting the necessary changes in place.

seh commented Jun 17, 2020

I brought up some of these questions in the "cluster-api" Slack channel. See kubernetes-sigs/cluster-api#2461 for an overlapping request.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label Sep 15, 2020
elmiko commented Sep 15, 2020

/remove-lifecycle stale

k8s-ci-robot removed the lifecycle/stale label Sep 15, 2020
unixfox commented Dec 3, 2020

Hello,

Sorry for the noise, but I just wanted to say that I'm also interested in this issue, mostly for deploying temporary workloads like Minecraft servers, coding environments (like GitHub Codespaces) and more.

This would also further close the feature gap between a self-hosted autoscaler and the autoscalers offered by managed Kubernetes solutions like DigitalOcean.
For instance, thanks to projects like machine-controller that implement cluster-api, it's possible to run our own autoscaler on DigitalOcean and even on unsupported cloud providers like Scaleway, Hetzner, Linode and more.

elmiko commented Dec 3, 2020

@unixfox just by means of an update, i have been working on a proof of concept for scaling from zero with capi. it's been going slower than i expected, but i feel we have good consensus about the initial implementation and with any luck 🍀 i should have something to show in early january.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label Mar 3, 2021
unixfox commented Mar 3, 2021

/remove-lifecycle stale

k8s-ci-robot removed the lifecycle/stale label Mar 3, 2021
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label Jun 8, 2021
unixfox commented Jun 8, 2021

/remove-lifecycle stale

k8s-ci-robot removed the lifecycle/stale label Jun 8, 2021
elmiko commented Jun 8, 2021

thanks for the bump @unixfox , i continue to hack away on this. the design has changed slightly since the first round of work on the enhancement. i need to update the enhancement and would like to give a demo at an upcoming cluster-api meeting.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label Sep 6, 2021
unixfox commented Sep 6, 2021

Not sure if I would mark this issue as fresh or not. I stopped having the need for cluster autoscaler with Cluster API, but it's still a cool feature that has a lot of potential when trying to use cluster autoscaler on "unsupported" cloud providers.

elmiko commented Sep 7, 2021

i am still working towards this issue. we almost have agreement on the cluster-api enhancement, and i think it will merge in the next few weeks. then i will post a PR for the implementation.

@unixfox sorry to hear that we weren't able to deliver this feature in a time that would be helpful to you. i do appreciate your support though =)

/remove-lifecycle stale

k8s-ci-robot removed the lifecycle/stale label Sep 7, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label Dec 14, 2021
elmiko commented Dec 14, 2021

the upstream cluster-api community has approved the proposal for the scale-from-zero feature. i am in the process of writing a patch that will satisfy the proposal, and also updating the kubemark provider to work with scaling from zero. i imagine this work won't be done till january, hopefully we will have it in for the 1.24 release of the autoscaler.

/remove-lifecycle stale

k8s-ci-robot removed the lifecycle/stale label Dec 14, 2021
@davidspek

@elmiko Do you have a link to a PR we can follow?

elmiko commented Mar 11, 2022

@davidspek i am hoping to have the PR ready next week, you can follow my progress on this branch for now https://github.com/elmiko/kubernetes-autoscaler/tree/capi-scale-from-zero

i have it working, but i need to do some cleanups around the dynamic nature of the client, and also add some unit tests. there is a complicated problem to solve wherein we need the client to become aware of the machine template types after it has started watching machinedeployments/machinesets, so that we can accurately set up the informers to watch the templates. i have the basic mechanism working on my branch, i'm just trying to make the dynamic client better now.

@davidspek

@elmiko Thanks for the info. I hope to have some time to test your changes soon. Do you maybe have a link to the Cluster API docs for infrastructure providers to support scale from 0? I haven’t been able to find that myself.

elmiko commented Apr 7, 2022

@davidspek my hope is that the enhancement[0] has enough details for a provider to implement scale from zero. if you find that there is detail lacking, please ping me as i would like to improve that doc =)

[0] https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20210310-opt-in-autoscaling-from-zero.md
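
as a rough sketch of the provider-side piece, as i read that proposal, an infrastructure machine template advertises the expected capacity in its status; the kind, name, and values below are just placeholders:

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: DockerMachineTemplate        # placeholder infrastructure template kind
metadata:
  name: workers-infra              # placeholder name
status:
  # reported by the infrastructure provider so the autoscaler can build a
  # node template for a group that currently has zero machines
  capacity:
    cpu: "2"
    memory: "4G"
    nvidia.com/gpu: "1"
```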

@davidspek

@elmiko Thanks for the doc, I think that’ll likely answer most of my questions. Has that proposal already been accepted? Or more importantly, can this already be implemented in infrastructure providers without needing to change anything in the cluster api core library?

elmiko commented Apr 7, 2022

@davidspek yes it has been accepted, and no it should not require any changes in the core cluster-api.

i was able to implement scale from zero in the kubemark provider without modifying the core, you can see my PR here kubernetes-sigs/cluster-api-provider-kubemark#30

@davidspek

@elmiko Awesome, thank you very much for all the info and quick responses.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label Jul 6, 2022
elmiko commented Jul 6, 2022

PR is currently under review for this, #4840
/remove-lifecycle stale

k8s-ci-robot removed the lifecycle/stale label Jul 6, 2022