
KCP unhealthy remediation support #2976

Closed
benmoss opened this issue Apr 28, 2020 · 19 comments · Fixed by #3830
Labels: kind/feature, priority/important-soon


benmoss commented Apr 28, 2020

User Story

As an operator, I would like my control plane clusters to automatically remediate unhealthy machines.

Detailed Description

  • The cluster should scale up a replacement machine and then scale down the unhealthy machine. Rinse and repeat.
  • The cluster must contain at least one healthy control plane machine for any remediation to happen (clusters where all the nodes are unhealthy are in the best case difficult to remediate, at worst impossible)
  • Clusters must have etcd quorum in order to do automatic remediation

Anything else you would like to add:

TBD: how do we stop remediation from happening in an infinite loop?

We want to move the calls to reconcileHealth out of scaleUp and scaleDown so that those operations can run against an unhealthy cluster; reconcileHealth should happen earlier in the reconciliation loop. Right now it would (probably?) block remediation if we tried to call scaleUp/scaleDown, since the cluster would not be healthy.
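
A minimal sketch of that reordering, with placeholder types and function names rather than the actual KCP controller signatures:

package main

import "context"

// Placeholder types: the real controller operates on KubeadmControlPlane and
// its Machines; this only illustrates the intended control flow.
type controlPlane struct{ healthy bool }

type reconciler struct{}

// reconcileHealth runs once, up front, and reports the result instead of
// being called from inside scaleUp/scaleDown, where a failed check would
// block the very remediation that is meant to fix it.
func (r *reconciler) reconcileHealth(ctx context.Context, cp *controlPlane) bool {
    return cp.healthy
}

func (r *reconciler) reconcile(ctx context.Context, cp *controlPlane) {
    if !r.reconcileHealth(ctx, cp) {
        r.remediate(ctx, cp)
        return
    }
    r.scaleUpOrDown(ctx, cp) // scaleUp/scaleDown no longer re-check health themselves
}

func (r *reconciler) remediate(ctx context.Context, cp *controlPlane)     {}
func (r *reconciler) scaleUpOrDown(ctx context.Context, cp *controlPlane) {}

func main() {}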

/kind feature

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Apr 28, 2020

benmoss commented Apr 28, 2020

@ncdc @detiber @vincepri for visibility, we spent some time thinking through the infinite remediation problem today and were looking for some help.


benmoss commented Apr 29, 2020

Some tricky scenarios to think this through with:

2 control plane machines:

  • m1 - healthy
  • m2 - unhealthy

Planned remediation:

  • Scale up / create m3
  • Wait for m3 to join the cluster
  • Scale down / delete m2

Problems:

  • What if m3 never joins the cluster?
  • What if m3 joins and then becomes unhealthy?
  • What if m2 is no longer unhealthy by the time m3 joins the cluster?
  • What if m1 becomes unhealthy?

randomvariable (Member) commented:

We would also need to ensure we save status appropriately, so that the controller can crash midway through remediation and not go "oh, why have I got this extra instance, let's kill it".

I am wondering whether there's any benefit to using the raft index to determine how far a new instance has caught up, or even whether an instance is too far behind and needs remediation, at least while we don't have learner mode available.
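
A rough sketch of what comparing raft indexes could look like, using the etcd v3 client (the import path shown is the v3.4-era one; endpoints and the lag threshold are illustrative, TLS configuration is omitted, and this is not something KCP does today):

package main

import (
    "context"
    "fmt"
    "time"

    "go.etcd.io/etcd/clientv3"
)

// laggingMembers compares each member's reported raft index against the
// highest one seen; members more than maxLag entries behind are returned.
func laggingMembers(ctx context.Context, cli *clientv3.Client, maxLag uint64) ([]string, error) {
    indexes := map[string]uint64{}
    var highest uint64
    for _, ep := range cli.Endpoints() {
        st, err := cli.Status(ctx, ep)
        if err != nil {
            return nil, fmt.Errorf("status of %s: %w", ep, err)
        }
        indexes[ep] = st.RaftIndex
        if st.RaftIndex > highest {
            highest = st.RaftIndex
        }
    }
    var lagging []string
    for ep, idx := range indexes {
        if highest-idx > maxLag {
            lagging = append(lagging, ep)
        }
    }
    return lagging, nil
}

func main() {
    // TLS config omitted for brevity; real etcd members in a kubeadm cluster need client certs.
    cli, err := clientv3.New(clientv3.Config{
        Endpoints:   []string{"https://10.0.0.1:2379", "https://10.0.0.2:2379", "https://10.0.0.3:2379"},
        DialTimeout: 5 * time.Second,
    })
    if err != nil {
        panic(err)
    }
    defer cli.Close()

    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()
    lagging, err := laggingMembers(ctx, cli, 1000) // 1000 is an arbitrary illustrative threshold
    if err != nil {
        panic(err)
    }
    fmt.Println("members lagging behind:", lagging)
}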


benmoss commented Apr 29, 2020

We would also need to ensure we save status appropriately, so that the controller can crash midway through remediation and not go "oh, why have I got this extra instance, let's kill it".

I don't think this is a problem since we will still have access to m2 and therefore can safely infer that we have m3 because it is replacing m2.

I am wondering whether there's any benefit to using the raft index to determine how far a new instance has caught up, or even whether an instance is too far behind and needs remediation, at least while we don't have learner mode available.

This gets into a completely different area of messiness, which is "what kind of unhealthy do we have here."

  • MHC can mark a machine as unhealthy based on something as simple as a condition on a Node, which means a user could set up an MHC that triggers when a node has high memory consumption or is running low on disk space. Potential footguns here.
  • Some static pods might be broken, but etcd might be fine
  • etcd might be completely gone
  • We might be experiencing a network partition
  • The host might have disappeared into a black hole

randomvariable (Member) commented:

It might be worth modelling this as a state machine or in some other form.

vincepri (Member) commented:

/milestone v0.3.x
/priority important-soon
/assign @benmoss

@k8s-ci-robot k8s-ci-robot added this to the v0.3.x milestone Apr 29, 2020
@k8s-ci-robot k8s-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Apr 29, 2020

benmoss commented Apr 29, 2020

We discussed this again and came to the idea that we need to start by defining narrow cases that we know we can handle and deferring the general solution until we can nail more of these down.

We came up with three cases that seem tractable to start with:

  • The machine's backing VM is gone
  • The machine's backing infrastructure machine is gone
  • The machine has never successfully booted

We will exclude clusters with fewer than 3 nodes.

I'd propose we start with these and then keep sketching out additional remediation scenarios as we move forward. These are not MHC remediations, since MHC is designed for a generic case of "machine is unhealthy" and we are trying to start as narrowly as possible.
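
A rough sketch of how those three narrow cases could be expressed as a single check (the field names and timeout notion below are stand-ins for illustration, not the actual Cluster API schema):

package main

// Simplified view of a machine for illustration only.
type machineView struct {
    infraMachineExists bool // is the backing infrastructure machine object still there?
    vmExists           bool // does the provider still report a backing VM?
    nodeJoined         bool // has a Node ever registered for this machine?
    bootTimedOut       bool // has it been "booting" longer than some allowed window?
}

// needsNarrowRemediation covers only the three narrow cases listed above,
// deliberately ignoring the generic "machine is unhealthy" signal from MHC.
func needsNarrowRemediation(m machineView) bool {
    switch {
    case !m.infraMachineExists: // backing infrastructure machine is gone
        return true
    case !m.vmExists: // backing VM is gone
        return true
    case !m.nodeJoined && m.bootTimedOut: // machine never successfully booted and joined
        return true
    default:
        return false
    }
}

func main() {}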

Something @detiber pointed out that's relevant here is that with smaller clusters, scaling up etcd can change the number of members required for quorum in a way that could render the cluster inoperable. This table is a useful guide to thinking about this. If we have a 3-member cluster with 1 unhealthy member and we scale up, the cluster grows to 4 members and quorum becomes 3, but only 2 members are healthy until the new one joins, so the cluster can no longer commit writes. If the new member never joins, we get into a situation where we need to recreate the cluster, since you need a majority in order to remove a member.

etcd's docs recommend that unhealthy members are removed before new members are added.
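
For reference, the majority math behind that table, as a standalone sketch rather than KCP code:

package main

import "fmt"

// quorum returns the number of members required for an etcd majority.
func quorum(members int) int { return members/2 + 1 }

// faultTolerance returns how many members can fail while keeping quorum.
func faultTolerance(members int) int { return members - quorum(members) }

func main() {
    // 3-member cluster: quorum 2, tolerates 1 failure, so 1 unhealthy member is survivable.
    fmt.Println(quorum(3), faultTolerance(3)) // 2 1
    // Scaling up to 4 raises quorum to 3 but tolerance stays at 1; if the new
    // member never joins while one existing member is unhealthy, only 2 of 4
    // members are healthy and the cluster loses quorum.
    fmt.Println(quorum(4), faultTolerance(4)) // 3 1
    // Removing the unhealthy member first (3 -> 2) keeps quorum reachable.
    fmt.Println(quorum(2), faultTolerance(2)) // 2 0
}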


vincepri commented May 5, 2020

/milestone v0.3.6

@k8s-ci-robot k8s-ci-robot modified the milestones: v0.3.x, v0.3.6 May 5, 2020

vincepri commented May 5, 2020

/cc @detiber @ncdc


benmoss commented May 5, 2020

We discussed this again today and we think that with a strategy of scaling down first, we can handle the generic case well enough, assuming we restrict ourselves to clusters that have at least 3 nodes.

Goals

  • Remediate HA (3+ replica) control planes

Requirements

  • We only remediate one machine at a time
  • We only remediate machines that have the unhealthy annotation

Cleanup

  • Move reconcileHealth above the main reconciliation logic and out of scaleUp / scaleDown; it will also need to change so that it does not block remediation.

pseudocode:

// Remediate by scaling down first: remove the one unhealthy machine while we
// are still at the desired replica count, then let the normal scale-up path
// create its replacement on the next pass.
if anyUnhealthy && numMachines == desiredReplicas {
    scaleDown()
}
if numMachines < desiredReplicas {
    scaleUp()
}
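
A slightly fuller sketch of the selection side of that pseudocode, assuming machines carry an unhealthy marker of some kind (the annotation key, types, and helper below are invented for illustration):

package main

import "fmt"

// Illustrative only: whatever marker ends up flagging a machine as unhealthy
// (an annotation here, but it could equally be a condition); the key is made up.
const unhealthyAnnotation = "example.cluster.x-k8s.io/unhealthy"

type machine struct {
    name        string
    annotations map[string]string
}

// pickMachineForRemediation returns at most one flagged machine, so that only
// a single machine is remediated per reconcile pass, per the requirements above.
func pickMachineForRemediation(machines []machine) (machine, bool) {
    for _, m := range machines {
        if _, flagged := m.annotations[unhealthyAnnotation]; flagged {
            return m, true
        }
    }
    return machine{}, false
}

func main() {
    machines := []machine{
        {name: "m1"},
        {name: "m2", annotations: map[string]string{unhealthyAnnotation: ""}},
        {name: "m3", annotations: map[string]string{unhealthyAnnotation: ""}},
    }
    if m, ok := pickMachineForRemediation(machines); ok {
        fmt.Println("scale down", m.name) // only m2 this pass; m3 waits for a later loop
    }
}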

vincepri (Member) commented:

/milestone v0.3.x

fabriziopandini (Member) commented:

@benmoss wrote up a doc explaining the design goals of this feature here: https://docs.google.com/document/d/1hJza3X-XbVV_yczB5N6vXbl_97D0bOVQ0OwGovcnth0/edit?usp=sharing


vincepri commented Aug 3, 2020

/milestone v0.3.9

@k8s-ci-robot k8s-ci-robot modified the milestones: v0.3.x, v0.3.9 Aug 3, 2020

vincepri commented Aug 3, 2020

/milestone v0.4.0

@k8s-ci-robot k8s-ci-robot modified the milestones: v0.3.9, v0.4.0 Aug 3, 2020

vincepri commented Aug 3, 2020

We discussed during a grooming session today that this feature might impact existing deployments, and we'd like more time to think through all the bits. For that reason, we'll try to tackle it during the v0.4.x timeframe.

sadysnaat (Contributor) commented:

We have a use case where we are creating a cluster, kubeadm init fails on the first control plane machine, and cluster creation stops. Should we have some option to enable remediation for the first control plane node?

fabriziopandini (Member) commented:

@sadysnaat
This could be an interesting use case for a follow-up iteration on KCP remediation, but before doing that, IMO we should get some clarity on:

vincepri (Member) commented:

Closing as the related PRs have been merged

/close

k8s-ci-robot (Contributor) commented:

@vincepri: Closing this issue.

In response to this:

Closing as the related PRs have been merged

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
