
Resources representing MachinePool Machines #4063

Closed
devigned opened this issue Jan 11, 2021 · 34 comments · Fixed by #6088, #8828, #8836 or #8842
Assignees
Labels
area/api · help wanted · kind/feature · lifecycle/frozen · priority/important-longterm · triage/accepted

Comments

@devigned
Contributor

User Story

As a user, I would like to be able to see individual MachinePool machine resources and manipulate them.

Detailed Description

For example, as a user I would like to be able to delete a specific instance (machine) in a machine pool. Currently, the only way to remove a machine from a machine pool is to decrease the replica count and hope the right machine is deleted.

For example, as a user I would like to be able to see the status of a specific machine in a machine pool. What is the running state of the machine? What conditions are set on the machine?
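
To make the ask concrete, here is a minimal Go sketch (using controller-runtime; the namespace, machine name, import paths, and API version are placeholders/assumptions, and it presumes MachinePool instances are surfaced as Machine resources, which is exactly what this issue requests) of inspecting and then deleting one specific machine:

```go
package main

import (
	"context"
	"fmt"

	"k8s.io/apimachinery/pkg/runtime"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func main() {
	scheme := runtime.NewScheme()
	_ = clusterv1.AddToScheme(scheme)

	c, err := client.New(ctrl.GetConfigOrDie(), client.Options{Scheme: scheme})
	if err != nil {
		panic(err)
	}

	ctx := context.Background()
	m := &clusterv1.Machine{}
	// "worker-pool-abc12" is a placeholder name for one machine in the pool.
	key := client.ObjectKey{Namespace: "default", Name: "worker-pool-abc12"}
	if err := c.Get(ctx, key, m); err != nil {
		panic(err)
	}

	// Inspect the running state and node reference of this specific machine.
	fmt.Printf("phase=%s nodeRef=%v\n", m.Status.Phase, m.Status.NodeRef)

	// Remove just this machine from the pool.
	if err := c.Delete(ctx, m); err != nil {
		panic(err)
	}
}
```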

This sounds like a relatively simple request, but quickly explodes into a larger MachinePool design conversation.

  • Can all MachinePool providers be expected to delete a machine individually?
    • If an instance can be deleted, what guarantees that there are still enough machines in a region or zone to achieve high availability? Who is responsible for zonal balancing: the infrastructure service (auto scale groups, virtual machine scale sets, etc.) or CAPI?
    • Do we need to know about pod disruption budgets?
    • Does each provider need to know how to cordon and drain machine pool nodes, or is there a common facility in CAPI?
  • Does a Machine need to be more flexible about how it references the infrastructure provider? Should a Machine instead reference a corev1.Node and use that as the level of abstraction?
  • Does this have an impact on machine health checks?

/kind feature

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Jan 11, 2021
@vincepri
Member

From the conversation we had on Friday, there were a few ideas on the table. In this post I'm going to lay out the proposal I made during the meeting: reuse the current Machine object to represent a MachinePool's Machine.

Machine controller

The Machine controller today relies on an infrastructureRef to understand when it's in a Ready state and which ProviderID it should work against. A Machine is intended to become a Kubernetes Node.

After a status.NodeRef has been assigned, the controller operates on the Node itself, potentially adding labels to it and reconciling conditions coming from the node.

Upon deletion, the Machine controller cordons and drains the node, then deletes all linked resources (infrastructureRef, bootstrapRef, etc.) before removing its finalizer. We also support pre-drain and pre-delete hooks, where folks can add annotations to delay the deletion of a Machine.
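
A condensed sketch of that ordering, with all types and helpers below as hypothetical stubs standing in for the real reconciler steps rather than the actual controller source (the pre-drain annotation prefix follows the machine deletion phase hook convention):

```go
package main

import (
	"context"
	"fmt"
	"strings"
)

// Machine is a trimmed, hypothetical stand-in for the real Machine type.
type Machine struct {
	Name        string
	Annotations map[string]string
	Finalizers  []string
}

// Pre-drain hooks are expressed as annotations with this prefix; while any
// such annotation is present, deletion does not progress.
const preDrainHookPrefix = "pre-drain.delete.hook.machine.cluster.x-k8s.io/"

func hasPreDrainHooks(m *Machine) bool {
	for k := range m.Annotations {
		if strings.HasPrefix(k, preDrainHookPrefix) {
			return true
		}
	}
	return false
}

// The helpers below are stubs for the real steps.
func cordonAndDrain(ctx context.Context, m *Machine) error {
	fmt.Println("cordon and drain the node backing", m.Name)
	return nil
}

func deleteLinkedResources(ctx context.Context, m *Machine) error {
	fmt.Println("delete infrastructureRef and bootstrapRef for", m.Name)
	return nil
}

func removeFinalizer(m *Machine) { m.Finalizers = nil }

// reconcileDelete sketches the ordering described above.
func reconcileDelete(ctx context.Context, m *Machine) error {
	if hasPreDrainHooks(m) {
		return nil // requeue until the hook owner removes its annotation
	}
	if err := cordonAndDrain(ctx, m); err != nil {
		return err
	}
	if err := deleteLinkedResources(ctx, m); err != nil {
		return err
	}
	// Only once everything above is done does the finalizer come off,
	// letting the Machine object itself be removed.
	removeFinalizer(m)
	return nil
}

func main() {
	m := &Machine{Name: "machinepool-machine-0", Finalizers: []string{"machine.cluster.x-k8s.io"}}
	_ = reconcileDelete(context.Background(), m)
}
```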

Machine in a MachinePool

Using Machine objects in a MachinePool has a few benefits:

  • Reuse all the existing code (which has been tested in production for a while) and behaviors listed above.
  • Add the ability to delete a single Machine from a MachinePool.
  • Adapt Cluster Autoscaler to support MachinePool by reusing the same code paths that today are being used with MachineDeployments.
  • Allow MachineHealthCheck to be used with MachinePool's Machines.

On the flip side:

  • MachinePool infrastructure controllers must now understand how to delete a single Machine from a group; if they cannot, they shouldn't opt into creating Machines, which can be counter-intuitive. For this point, we should seek more information on the current state of MachinePool infrastructure implementations out there and draw some conclusions.

  • When deleting a single machine, we won't rely on a scale-down of the cloud provider's scale-set resource.

    Let's explore this point a bit more (a toy sketch follows this list):

    • If the replicas count hasn't changed
      • And a Machine is deleted
      • The MachinePool backing implementation (usually cloud provider scale-set) should detect that there are fewer instances than desired replicas, and scale back up
      • If the Machine that has been deleted caused an imbalance, we can probably safely assume that the scale-set cloud resource is going to place a newly added replica in the right place.
    • If the replicas count is being decreased
      • And a Machine is deleted
      • The MachinePool backing implementation (usually cloud provider scale-set) shouldn't proceed with any other operations, keeping the count stable in this case.
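
A toy sketch of the bookkeeping described in the two cases above (the function and its names are hypothetical, not provider code):

```go
package sketch

// scaleSetAction illustrates the two cases above: desired is spec.replicas on
// the MachinePool, observed is the instance count reported by the backing
// cloud scale set after a single Machine was deleted, and scalingDown reports
// whether the replica count was just decreased.
func scaleSetAction(desired, observed int32, scalingDown bool) string {
	switch {
	case scalingDown:
		// Replica count was decreased and a Machine was deleted: the scale
		// set keeps the (already reduced) count stable and does nothing else.
		return "no-op: keep count stable"
	case observed < desired:
		// Replica count unchanged but a Machine was deleted: the scale set
		// detects the gap and scales back up, placing the new replica
		// according to its own zonal-balancing rules.
		return "scale back up to desired"
	default:
		return "no action"
	}
}
```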

Alternatives considered

Some folks pointed out that we could build a controller (which could run in the workload cluster) that watches corev1.Node objects, and move most of the logic above to it.

While this is definitely something to explore, I'm a bit more comfortable using the existing and proven resources and controllers instead of building new ones from scratch. A new deployment that needs to run directly in the workload cluster might bring more complications to the overall system, in terms of version management and lockstep behaviors.

Mostly a core-dump, hopefully this helps 😄

@vincepri
Member

/area api
/priority important-longterm
/milestone Next
(hopefully we can address this in v0.4.0, although setting Next for now until we have some consensus)

@k8s-ci-robot k8s-ci-robot added this to the Next milestone Jan 11, 2021
@k8s-ci-robot k8s-ci-robot added area/api Issues or PRs related to the APIs priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels Jan 11, 2021
@dthorsen
Contributor

In our current clusters, we are using cluster-autoscaler with the AWS provider in combination with MachinePools. This greatly simplifies the code in cluster-api and the infra provider, since all the instance management logic is effectively delegated to the cloud provider and cluster-autoscaler. When we wish to delete an instance from an autoscaling group, we simply drain the node. cluster-autoscaler detects that the node is empty and automatically deletes that instance roughly 10 minutes later. The use case I originally saw for MachinePools, and their distinction from MachineDeployments, is that they delegate much of this logic to the cloud provider. If we make the MachinePool logic as similar to MachineDeployments as is proposed above, I wonder why we'd have MachinePools at all; we could simply use MachineDeployments.
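
For reference, the cordon half of that workflow (marking the node unschedulable so cluster-autoscaler can remove it once it is empty) can be done with client-go. This is a minimal sketch with a placeholder node name; evicting the remaining pods (the actual drain) is a separate step, typically done with kubectl drain:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	cs, err := kubernetes.NewForConfig(ctrl.GetConfigOrDie())
	if err != nil {
		panic(err)
	}

	ctx := context.Background()
	// "ip-10-0-1-23.ec2.internal" is a placeholder node name.
	node, err := cs.CoreV1().Nodes().Get(ctx, "ip-10-0-1-23.ec2.internal", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// Cordon: no new pods get scheduled here; existing pods still need to be
	// evicted (drained) before cluster-autoscaler considers the node empty.
	node.Spec.Unschedulable = true
	if _, err := cs.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{}); err != nil {
		panic(err)
	}
	fmt.Println("cordoned", node.Name)
}
```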

@CecileRobertMichon
Contributor

I think @dthorsen makes a really great point and something we should not forget while discussing these changes. The main goal of MachinePools is to allow delegating functionality to the cloud providers directly (zones, scaling, etc.). If we re-implement scale down/draining/etc. via Machines for MachinePools, we lose this completely. Perhaps a better solution is to have cloud specific autoscaler (and k8s upgrade) implementations for MachinePools since the idea is to stay as close to the infra as possible.

@dthorsen
Contributor

@CecileRobertMichon @devigned @h0tbird To this end, we have added an externallyManagedReplicaCount field to the MachinePool spec and we are currently testing this in our development environments. Infrastructure providers could use this field to determine which mode they are running in: CAPI-controlled replicas or externally controlled replicas. We will PR this soon.
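
For illustration only, the shape such a field might take. This sketches the proposal in the comment above; the field name comes from that comment, the type is an assumption, and it is not necessarily the API CAPI ultimately shipped:

```go
package sketch

// MachinePoolSpec shows only the fields relevant to the proposal; this is a
// hypothetical trimmed-down type, not the real CAPI API.
type MachinePoolSpec struct {
	// Replicas is the desired machine count when CAPI manages scaling.
	Replicas *int32 `json:"replicas,omitempty"`

	// ExternallyManagedReplicaCount, when set, signals that the replica count
	// is controlled outside CAPI (e.g. by cluster-autoscaler acting on the
	// cloud scale set), so infrastructure providers should treat it as the
	// source of truth instead of reconciling Replicas themselves.
	// (Type assumed here; the original comment doesn't specify it.)
	ExternallyManagedReplicaCount *int32 `json:"externallyManagedReplicaCount,omitempty"`
}
```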

@vincepri
Member

@dthorsen The logic explained above makes sense, although I'd ask for a small amendment to the MachinePool proposal that describes the field in more detail.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 15, 2021
@vincepri
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 15, 2021
@CecileRobertMichon
Contributor

/assign @devigned @dthorsen

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 14, 2021
@CecileRobertMichon
Contributor

/lifecycle active

@k8s-ci-robot k8s-ci-robot added lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 14, 2021
@devigned
Contributor Author

/assign @mboersma

@devigned devigned removed their assignment Sep 14, 2021
@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 6, 2022
@mboersma
Contributor

mboersma commented Nov 7, 2022

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 7, 2022
@fabriziopandini
Member

/lifecycle frozen
/help

@k8s-ci-robot
Contributor

@fabriziopandini:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/lifecycle frozen
/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. labels Nov 7, 2022
@sbueringer
Member

Removing it from the 1.2 milestone as it wasn't implemented in the 1.2 time frame

@sbueringer
Member

I think we should consider follow-up tasks now that the first iteration is merged (e.g. MHC support, in-line label propagation, ...)

@sbueringer
Member

/reopen

@k8s-ci-robot k8s-ci-robot reopened this Jul 11, 2023
@k8s-ci-robot
Contributor

@sbueringer: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dtzar
Contributor

dtzar commented Jul 17, 2023

@sbueringer - can we close this issue out once #8842 is merged, or close it now? And add new issue(s) based on more details of what else you'd like to see enhanced/added? The issue is very old, and IMO we have delivered on the core of the original ask.

@sbueringer
Member

Sure, fine for me.

It would probably make sense to create an umbrella issue with an overview of the next iteration

@dtzar
Contributor

dtzar commented Jul 17, 2023

Ok, I will look to you for this umbrella issue. Related to #9005

@sbueringer
Member

sbueringer commented Jul 18, 2023

It will probably take considerable time until I find the time to create an umbrella issue for MachinePool Machines (especially to do the necessary research to figure out which parts are missing between the implementation and the proposal, and what a MachinePool Machine doesn't support at the moment compared to a "normal" (KCP/MD/MS) Machine). Limited bandwidth right now.

But feel free to close this issue, I didn't want to block. Just thought we wanted to do some follow-ups based on the discussions we had on the MachinePool Machine PR.

@dtzar
Contributor

dtzar commented Jul 18, 2023

Sounds good, thanks @sbueringer. Jonathan has it set to close automatically when #8842 merges.

@Jont828
Contributor

Jont828 commented Jul 31, 2023

@sbueringer I've also opened #9096 just to track the CAPD implementation PR since this issue has gotten really old, and we've separated the task of delivering in CAPI core, clusterctl, and CAPD. IIRC this issue doesn't seem to be about adding support in the test provider, so I think we could move #9096 under an umbrella issue and close this one out. WDYT?

@sbueringer
Member

Sounds good to me

@Jont828
Contributor

Jont828 commented Aug 9, 2023

/close

@k8s-ci-robot
Contributor

@Jont828: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
