
Resources representing MachinePool Machines #4063

Closed
devigned opened this issue Jan 11, 2021 · 34 comments · Fixed by #6088, #8828, #8836 or #8842
Assignees
Labels
area/api · help wanted · kind/feature · lifecycle/frozen · priority/important-longterm · triage/accepted

Comments

@devigned
Contributor

User Story

As a user, I would like to be able to see individual MachinePool machine resources and manipulate them.

Detailed Description

For example, as a user I would like to be able to delete a specific instance (machine) in a machine pool. Currently, the only way to remove a machine from a machine pool is to decrease the replica count and hope the right machine is deleted.

For example, as a user I would like to be able to see the status of a specific machine in a machine pool. What is the running state of the machine? What conditions are set on the machine?
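
To make the ask concrete, here is a minimal Go sketch (using controller-runtime; the namespace, machine name, import paths, and API version are placeholders/assumptions, and it presumes MachinePool instances are surfaced as Machine resources, which is exactly what this issue requests) of inspecting and then deleting one specific machine:

```go
package main

import (
	"context"
	"fmt"

	"k8s.io/apimachinery/pkg/runtime"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func main() {
	scheme := runtime.NewScheme()
	_ = clusterv1.AddToScheme(scheme)

	c, err := client.New(ctrl.GetConfigOrDie(), client.Options{Scheme: scheme})
	if err != nil {
		panic(err)
	}

	ctx := context.Background()
	m := &clusterv1.Machine{}
	// "worker-pool-abc12" is a placeholder name for one machine in the pool.
	key := client.ObjectKey{Namespace: "default", Name: "worker-pool-abc12"}
	if err := c.Get(ctx, key, m); err != nil {
		panic(err)
	}

	// Inspect the running state and node reference of this specific machine.
	fmt.Printf("phase=%s nodeRef=%v\n", m.Status.Phase, m.Status.NodeRef)

	// Remove just this machine from the pool.
	if err := c.Delete(ctx, m); err != nil {
		panic(err)
	}
}
```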

This sounds like a relatively simple request, but quickly explodes into a larger MachinePool design conversation.

  • Can all MachinePool providers be expected to delete a machine individually?
    • If an instance can be deleted, what guarantees that there are still enough machines in a region or zone to achieve high availability? Who is responsible for zonal balancing: the infrastructure service (auto scale groups, virtual machine scale sets, etc.) or CAPI?
    • Do we need to know about pod disruption budgets?
    • Does each provider need to know how to cordon and drain machine pool nodes, or is there a common facility in CAPI?
  • Does a Machine need to be more flexible about how it references the infrastructure provider? Should a Machine instead reference a corev1.Node and use that as the level of abstraction?
  • Does this have an impact on machine health checks?

/kind feature

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Jan 11, 2021
@vincepri
Member

From the conversation we had on Friday, there were a few ideas on the table. In this post I'm going to lay out the proposal I made during the meeting: reuse the current Machine object to represent a MachinePool's Machine.

Machine controller

The Machine controller today relies on an infrastructureRef to understand when it's in a Ready state and which ProviderID it should work against. A Machine is intended to become a Kubernetes Node.

After a status.NodeRef has been assigned, the controller operates on the Node itself, potentially adding labels to it and reconciling conditions coming from the node.

Upon deletion, the Machine controller cordons and drains the node, then deletes all linked resources (infrastructureRef, bootstrapRef, etc.) before removing its finalizer. We also support pre-drain and pre-delete hooks, where folks can add annotations to delay the deletion of a Machine.
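
A condensed sketch of that ordering, with all types and helpers below as hypothetical stubs standing in for the real reconciler steps rather than the actual controller source (the pre-drain annotation prefix follows the machine deletion phase hook convention):

```go
package main

import (
	"context"
	"fmt"
	"strings"
)

// Machine is a trimmed, hypothetical stand-in for the real Machine type.
type Machine struct {
	Name        string
	Annotations map[string]string
	Finalizers  []string
}

// Pre-drain hooks are expressed as annotations with this prefix; while any
// such annotation is present, deletion does not progress.
const preDrainHookPrefix = "pre-drain.delete.hook.machine.cluster.x-k8s.io/"

func hasPreDrainHooks(m *Machine) bool {
	for k := range m.Annotations {
		if strings.HasPrefix(k, preDrainHookPrefix) {
			return true
		}
	}
	return false
}

// The helpers below are stubs for the real steps.
func cordonAndDrain(ctx context.Context, m *Machine) error {
	fmt.Println("cordon and drain the node backing", m.Name)
	return nil
}

func deleteLinkedResources(ctx context.Context, m *Machine) error {
	fmt.Println("delete infrastructureRef and bootstrapRef for", m.Name)
	return nil
}

func removeFinalizer(m *Machine) { m.Finalizers = nil }

// reconcileDelete sketches the ordering described above.
func reconcileDelete(ctx context.Context, m *Machine) error {
	if hasPreDrainHooks(m) {
		return nil // requeue until the hook owner removes its annotation
	}
	if err := cordonAndDrain(ctx, m); err != nil {
		return err
	}
	if err := deleteLinkedResources(ctx, m); err != nil {
		return err
	}
	// Only once everything above is done does the finalizer come off,
	// letting the Machine object itself be removed.
	removeFinalizer(m)
	return nil
}

func main() {
	m := &Machine{Name: "machinepool-machine-0", Finalizers: []string{"machine.cluster.x-k8s.io"}}
	_ = reconcileDelete(context.Background(), m)
}
```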

Machine in a MachinePool

Using Machine objects in a MachinePool has a few benefits:

  • Reuse all the existing code (which has been tested in production for a while) and behaviors listed above.
  • Add the ability to delete a single Machine from a MachinePool.
  • Adapt Cluster Autoscaler to support MachinePool by reusing the same code paths that today are being used with MachineDeployments.
  • Allow MachineHealthCheck to be used with MachinePool's Machines.

On the flip side:

  • MachinePool infrastructure controllers must now understand how to delete a single Machine from a group; if they cannot, they shouldn't opt into creating Machines, which can be counter-intuitive. For this point, we should seek more information on the current state of MachinePool infrastructure implementations out there and draw some conclusions.

  • When deleting a single machine, we won't rely on a scale-down of the cloud provider's scale-set resource.

    Let's explore this point a bit more (a toy sketch follows this list):

    • If the replicas count hasn't changed
      • And a Machine is deleted
      • The MachinePool backing implementation (usually cloud provider scale-set) should detect that there are fewer instances than desired replicas, and scale back up
      • If the Machine that has been deleted caused an imbalance, we can probably safely assume that the scale-set cloud resource is going to place a newly added replica in the right place.
    • If the replicas count is being decreased
      • And a Machine is deleted
      • The MachinePool backing implementation (usually cloud provider scale-set) shouldn't proceed with any other operations, keeping the count stable in this case.
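
A toy sketch of the bookkeeping described in the two cases above (the function and its names are hypothetical, not provider code):

```go
package sketch

// scaleSetAction illustrates the two cases above: desired is spec.replicas on
// the MachinePool, observed is the instance count reported by the backing
// cloud scale set after a single Machine was deleted, and scalingDown reports
// whether the replica count was just decreased.
func scaleSetAction(desired, observed int32, scalingDown bool) string {
	switch {
	case scalingDown:
		// Replica count was decreased and a Machine was deleted: the scale
		// set keeps the (already reduced) count stable and does nothing else.
		return "no-op: keep count stable"
	case observed < desired:
		// Replica count unchanged but a Machine was deleted: the scale set
		// detects the gap and scales back up, placing the new replica
		// according to its own zonal-balancing rules.
		return "scale back up to desired"
	default:
		return "no action"
	}
}
```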

Alternatives considered

Some folks pointed out that we could build a controller (which could run in the workload cluster) that watches corev1.Node objects, and move most of the logic above to it.

While this is definitely something to explore, I'm a bit more comfortable using the existing and proven resources and controllers instead of building new ones from scratch. A new deployment that needs to run directly in the workload cluster might bring more complications to the overall system, in terms of version management and lockstep behaviors.

Mostly a core-dump, hopefully this helps 😄

@vincepri
Member

/area api
/priority important-longterm
/milestone Next
(hopefully we can address this in v0.4.0, although setting Next for now until we have some consensus)

@k8s-ci-robot k8s-ci-robot added this to the Next milestone Jan 11, 2021
@k8s-ci-robot k8s-ci-robot added area/api Issues or PRs related to the APIs priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels Jan 11, 2021
@dthorsen
Contributor

In our current clusters, we are using cluster-autoscaler with the AWS provider in combination with MachinePools. This greatly simplifies the code in cluster-api and the infra provider, since all the instance management logic is effectively delegated to the cloud provider and cluster-autoscaler. When we wish to delete an instance from an autoscaling group, we simply drain the node. cluster-autoscaler detects that the node is empty and automatically deletes that instance roughly 10 minutes later. The use case I originally saw for MachinePools, and their distinction from MachineDeployments, is that they delegate much of this logic to the cloud provider. If we make the MachinePool logic as similar to MachineDeployments as is proposed above, I wonder why we'd have MachinePools at all; we could simply use MachineDeployments.
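
For reference, the cordon half of that workflow (marking the node unschedulable so cluster-autoscaler can remove it once it is empty) can be done with client-go. This is a minimal sketch with a placeholder node name; evicting the remaining pods (the actual drain) is a separate step, typically done with kubectl drain:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	cs, err := kubernetes.NewForConfig(ctrl.GetConfigOrDie())
	if err != nil {
		panic(err)
	}

	ctx := context.Background()
	// "ip-10-0-1-23.ec2.internal" is a placeholder node name.
	node, err := cs.CoreV1().Nodes().Get(ctx, "ip-10-0-1-23.ec2.internal", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// Cordon: no new pods get scheduled here; existing pods still need to be
	// evicted (drained) before cluster-autoscaler considers the node empty.
	node.Spec.Unschedulable = true
	if _, err := cs.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{}); err != nil {
		panic(err)
	}
	fmt.Println("cordoned", node.Name)
}
```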

@CecileRobertMichon
Contributor

I think @dthorsen makes a really great point and something we should not forget while discussing these changes. The main goal of MachinePools is to allow delegating functionality to the cloud providers directly (zones, scaling, etc.). If we re-implement scale down/draining/etc. via Machines for MachinePools, we lose this completely. Perhaps a better solution is to have cloud specific autoscaler (and k8s upgrade) implementations for MachinePools since the idea is to stay as close to the infra as possible.

@dthorsen
Contributor

@CecileRobertMichon @devigned @h0tbird To this end, we have added an externallyManagedReplicaCount field to the MachinePool spec and we are currently testing this in our development environments. Infrastructure providers could use this field to determine which mode they are running in: CAPI-controlled replicas or externally controlled replicas. We will PR this soon.
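
For illustration only, the shape such a field might take. This sketches the proposal in the comment above; the field name comes from that comment, the type is an assumption, and it is not necessarily the API CAPI ultimately shipped:

```go
package sketch

// MachinePoolSpec shows only the fields relevant to the proposal; this is a
// hypothetical trimmed-down type, not the real CAPI API.
type MachinePoolSpec struct {
	// Replicas is the desired machine count when CAPI manages scaling.
	Replicas *int32 `json:"replicas,omitempty"`

	// ExternallyManagedReplicaCount, when set, signals that the replica count
	// is controlled outside CAPI (e.g. by cluster-autoscaler acting on the
	// cloud scale set), so infrastructure providers should treat it as the
	// source of truth instead of reconciling Replicas themselves.
	// (Type assumed here; the original comment doesn't specify it.)
	ExternallyManagedReplicaCount *int32 `json:"externallyManagedReplicaCount,omitempty"`
}
```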

@vincepri
Member

@dthorsen The logic explained above makes sense, although I'd ask for a small amendment to the MachinePool proposal that describes the field in more detail.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 15, 2021
@vincepri
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 15, 2021
@CecileRobertMichon
Contributor

/assign @devigned @dthorsen

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 14, 2021
@CecileRobertMichon
Contributor

/lifecycle active

@k8s-ci-robot k8s-ci-robot added lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 14, 2021
@devigned
Contributor Author

/assign @mboersma

@devigned devigned removed their assignment Sep 14, 2021
@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 6, 2022
@mboersma
Contributor

mboersma commented Nov 7, 2022

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 7, 2022
@fabriziopandini
Member

/lifecycle frozen
/help

@k8s-ci-robot
Contributor

@fabriziopandini:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/lifecycle frozen
/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. labels Nov 7, 2022
@sbueringer
Member

Removing it from the 1.2 milestone as it wasn't implemented in the 1.2 time frame

@sbueringer
Member

I think we should consider follow-up tasks now that the first iteration is merged (e.g. MHC support, in-line label propagation, ...)

@sbueringer
Member

/reopen

@k8s-ci-robot k8s-ci-robot reopened this Jul 11, 2023
@k8s-ci-robot
Contributor

@sbueringer: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dtzar
Contributor

dtzar commented Jul 17, 2023

@sbueringer - can we close this issue out once #8842 is merged, or close it now? And add new issue(s) based on more details of what else you'd like to see enhanced/added? The issue is very old, and IMO we have delivered on the core of the original ask.

@sbueringer
Member

Sure, fine for me.

It would probably make sense to create an umbrella issue with an overview of the next iteration

@dtzar
Contributor

dtzar commented Jul 17, 2023

Ok, I will look to you for this umbrella issue. Related to #9005

@sbueringer
Member

sbueringer commented Jul 18, 2023

It will probably take considerable time until I find the time to create an umbrella issue for MachinePool Machines (especially to do the necessary research to figure out which parts are missing between the implementation and the proposal, and what a MachinePool Machine doesn't support at the moment compared to a "normal" (KCP/MD/MS) Machine). Limited bandwidth right now.

But feel free to close this issue, I didn't want to block. Just thought we wanted to do some follow-ups based on the discussions we had on the MachinePool Machine PR.

@dtzar
Contributor

dtzar commented Jul 18, 2023

Sounds good, thanks @sbueringer. Jonathan has it set to close automatically when #8842 merges.

@Jont828
Contributor

Jont828 commented Jul 31, 2023

@sbueringer I've also opened #9096 just to track the CAPD implementation PR since this issue has gotten really old, and we've separated the task of delivering in CAPI core, clusterctl, and CAPD. IIRC this issue doesn't seem to be about adding support in the test provider, so I think we could move #9096 under an umbrella issue and close this one out. WDYT?

@sbueringer
Member

Sounds good to me

@Jont828
Contributor

Jont828 commented Aug 9, 2023

/close

@k8s-ci-robot
Contributor

@Jont828: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
