
KCP unhealthy remediation support #2976

Closed
benmoss opened this issue Apr 28, 2020 · 19 comments · Fixed by #3830
Labels: kind/feature, priority/important-soon


benmoss commented Apr 28, 2020

User Story

As an operator, I would like my control plane clusters to automatically remediate unhealthy machines.

Detailed Description

  • The cluster should scale up a replacement machine and then scale down the unhealthy machine. Rinse and repeat.
  • The cluster must contain at least one healthy control plane machine for any remediation to happen (clusters where all the nodes are unhealthy are in the best case difficult to remediate, at worst impossible)
  • Clusters must have etcd quorum in order to do automatic remediation

Anything else you would like to add:

TBD: how do we stop remediation from happening in an infinite loop?

We want to move the calls to reconcileHealth out of scaleUp and scaleDown so that those operations can run against an unhealthy cluster; reconcileHealth should happen earlier in the reconciliation loop. Right now it would (probably?) block remediation if we tried to call scaleUp/scaleDown, since the cluster would not be healthy.
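
A minimal sketch of that reordering, with placeholder types and function names rather than the actual KCP controller signatures:

package main

import "context"

// Placeholder types: the real controller operates on KubeadmControlPlane and
// its Machines; this only illustrates the intended control flow.
type controlPlane struct{ healthy bool }

type reconciler struct{}

// reconcileHealth runs once, up front, and reports the result instead of
// being called from inside scaleUp/scaleDown, where a failed check would
// block the very remediation that is meant to fix it.
func (r *reconciler) reconcileHealth(ctx context.Context, cp *controlPlane) bool {
    return cp.healthy
}

func (r *reconciler) reconcile(ctx context.Context, cp *controlPlane) {
    if !r.reconcileHealth(ctx, cp) {
        r.remediate(ctx, cp)
        return
    }
    r.scaleUpOrDown(ctx, cp) // scaleUp/scaleDown no longer re-check health themselves
}

func (r *reconciler) remediate(ctx context.Context, cp *controlPlane)     {}
func (r *reconciler) scaleUpOrDown(ctx context.Context, cp *controlPlane) {}

func main() {}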

/kind feature

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Apr 28, 2020

benmoss commented Apr 28, 2020

@ncdc @detiber @vincepri for visibility, we spent some time thinking through the infinite remediation problem today and were looking for some help.


benmoss commented Apr 29, 2020

Some tricky scenarios to think this through with:

2 control plane machines:

  • m1 - healthy
  • m2 - unhealthy

Planned remediation:

  • Scale up / create m3
  • Wait for m3 to join the cluster
  • Scale down / delete m2

Problems:

  • What if m3 never joins the cluster?
  • What if m3 joins and then becomes unhealthy?
  • What if m2 is no longer unhealthy by the time m3 joins the cluster?
  • What if m1 becomes unhealthy?

randomvariable (Member) commented:

We would also need to ensure we save status appropriately, so that the controller can crash midway through remediation and not go "oh, why have I got this extra instance, let's kill it".

I am wondering whether there's any benefit to using the raft index to determine how far a new instance has caught up, or even whether an instance is too far behind and needs remediation, at least while we don't have learner mode available.
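
A rough sketch of what comparing raft indexes could look like, using the etcd v3 client (the import path shown is the v3.4-era one; endpoints and the lag threshold are illustrative, TLS configuration is omitted, and this is not something KCP does today):

package main

import (
    "context"
    "fmt"
    "time"

    "go.etcd.io/etcd/clientv3"
)

// laggingMembers compares each member's reported raft index against the
// highest one seen; members more than maxLag entries behind are returned.
func laggingMembers(ctx context.Context, cli *clientv3.Client, maxLag uint64) ([]string, error) {
    indexes := map[string]uint64{}
    var highest uint64
    for _, ep := range cli.Endpoints() {
        st, err := cli.Status(ctx, ep)
        if err != nil {
            return nil, fmt.Errorf("status of %s: %w", ep, err)
        }
        indexes[ep] = st.RaftIndex
        if st.RaftIndex > highest {
            highest = st.RaftIndex
        }
    }
    var lagging []string
    for ep, idx := range indexes {
        if highest-idx > maxLag {
            lagging = append(lagging, ep)
        }
    }
    return lagging, nil
}

func main() {
    // TLS config omitted for brevity; real etcd members in a kubeadm cluster need client certs.
    cli, err := clientv3.New(clientv3.Config{
        Endpoints:   []string{"https://10.0.0.1:2379", "https://10.0.0.2:2379", "https://10.0.0.3:2379"},
        DialTimeout: 5 * time.Second,
    })
    if err != nil {
        panic(err)
    }
    defer cli.Close()

    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()
    lagging, err := laggingMembers(ctx, cli, 1000) // 1000 is an arbitrary illustrative threshold
    if err != nil {
        panic(err)
    }
    fmt.Println("members lagging behind:", lagging)
}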


benmoss commented Apr 29, 2020

We would also need to ensure we save status appropriately, so that the controller can crash midway through remediation and not go "oh, why have I got this extra instance, let's kill it".

I don't think this is a problem since we will still have access to m2 and therefore can safely infer that we have m3 because it is replacing m2.

I am wondering whether there's any benefit to using the raft index to determine how far a new instance has caught up, or even whether an instance is too far behind and needs remediation, at least while we don't have learner mode available.

This gets into a completely different area of messiness, which is "what kind of unhealthy do we have here."

  • MHC can mark a machine as unhealthy based on something as simple as a condition on a Node, which means a user could set up an MHC that triggers when a node has high memory consumption or is running low on disk space. Potential footguns here.
  • Some static pods might be broken, but etcd might be fine
  • etcd might be completely gone
  • We might be experiencing a network partition
  • The host might have disappeared into a black hole

randomvariable (Member) commented:

It might be worth modelling this as a state machine or in some other form.

vincepri (Member) commented:

/milestone v0.3.x
/priority important-soon
/assign @benmoss

@k8s-ci-robot k8s-ci-robot added this to the v0.3.x milestone Apr 29, 2020
@k8s-ci-robot k8s-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Apr 29, 2020

benmoss commented Apr 29, 2020

We discussed this again and came to the idea that we need to start by defining narrow cases that we know we can handle and deferring the general solution until we can nail more of these down.

We came up with three cases that seem tractable to start with:

  • The machine's backing VM is gone
  • The machine's backing infrastructure machine is gone
  • The machine has never successfully booted

We will exclude clusters with fewer than 3 nodes.

I'd propose we start with these and then keep sketching out additional remediation scenarios as we move forward. These are not MHC remediations, since MHC is designed for a generic case of "machine is unhealthy" and we are trying to start as narrowly as possible.
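
A rough sketch of how those three narrow cases could be expressed as a single check (the field names and timeout notion below are stand-ins for illustration, not the actual Cluster API schema):

package main

// Simplified view of a machine for illustration only.
type machineView struct {
    infraMachineExists bool // is the backing infrastructure machine object still there?
    vmExists           bool // does the provider still report a backing VM?
    nodeJoined         bool // has a Node ever registered for this machine?
    bootTimedOut       bool // has it been "booting" longer than some allowed window?
}

// needsNarrowRemediation covers only the three narrow cases listed above,
// deliberately ignoring the generic "machine is unhealthy" signal from MHC.
func needsNarrowRemediation(m machineView) bool {
    switch {
    case !m.infraMachineExists: // backing infrastructure machine is gone
        return true
    case !m.vmExists: // backing VM is gone
        return true
    case !m.nodeJoined && m.bootTimedOut: // machine never successfully booted and joined
        return true
    default:
        return false
    }
}

func main() {}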

Something @detiber pointed out that's relevant here is that with smaller clusters, scaling up etcd can change the number of members required for quorum in a way that could render the cluster inoperable. This table is a useful guide to thinking about this. If we have a 3-member cluster with 1 unhealthy member and we scale up, the cluster grows to 4 members and quorum becomes 3, but only 2 members are healthy until the new one joins, so the cluster can no longer commit writes. If the new member never joins, we get into a situation where we need to recreate the cluster, since you need a majority in order to remove a member.

etcd's docs recommend that unhealthy members are removed before new members are added.
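
For reference, the majority math behind that table, as a standalone sketch rather than KCP code:

package main

import "fmt"

// quorum returns the number of members required for an etcd majority.
func quorum(members int) int { return members/2 + 1 }

// faultTolerance returns how many members can fail while keeping quorum.
func faultTolerance(members int) int { return members - quorum(members) }

func main() {
    // 3-member cluster: quorum 2, tolerates 1 failure, so 1 unhealthy member is survivable.
    fmt.Println(quorum(3), faultTolerance(3)) // 2 1
    // Scaling up to 4 raises quorum to 3 but tolerance stays at 1; if the new
    // member never joins while one existing member is unhealthy, only 2 of 4
    // members are healthy and the cluster loses quorum.
    fmt.Println(quorum(4), faultTolerance(4)) // 3 1
    // Removing the unhealthy member first (3 -> 2) keeps quorum reachable.
    fmt.Println(quorum(2), faultTolerance(2)) // 2 0
}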


vincepri commented May 5, 2020

/milestone v0.3.6

@k8s-ci-robot k8s-ci-robot modified the milestones: v0.3.x, v0.3.6 May 5, 2020

vincepri commented May 5, 2020

/cc @detiber @ncdc


benmoss commented May 5, 2020

We discussed this again today and we think that with a strategy of scaling down first, we can handle the generic case well enough, assuming we restrict ourselves to clusters that have at least 3 nodes.

Goals

  • Remediate HA (3+ replica) control planes

Requirements

  • We only remediate one machine at a time
  • We only remediate machines that have the unhealthy annotation

Cleanup

  • Move reconcileHealth above the main reconciliation logic and out of scaleUp / scaleDown; it will also need to change so that it does not block remediation.

pseudocode:

// Remediate by scaling down first: remove the one unhealthy machine while we
// are still at the desired replica count, then let the normal scale-up path
// create its replacement on the next pass.
if anyUnhealthy && numMachines == desiredReplicas {
    scaleDown()
}
if numMachines < desiredReplicas {
    scaleUp()
}
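
A slightly fuller sketch of the selection side of that pseudocode, assuming machines carry an unhealthy marker of some kind (the annotation key, types, and helper below are invented for illustration):

package main

import "fmt"

// Illustrative only: whatever marker ends up flagging a machine as unhealthy
// (an annotation here, but it could equally be a condition); the key is made up.
const unhealthyAnnotation = "example.cluster.x-k8s.io/unhealthy"

type machine struct {
    name        string
    annotations map[string]string
}

// pickMachineForRemediation returns at most one flagged machine, so that only
// a single machine is remediated per reconcile pass, per the requirements above.
func pickMachineForRemediation(machines []machine) (machine, bool) {
    for _, m := range machines {
        if _, flagged := m.annotations[unhealthyAnnotation]; flagged {
            return m, true
        }
    }
    return machine{}, false
}

func main() {
    machines := []machine{
        {name: "m1"},
        {name: "m2", annotations: map[string]string{unhealthyAnnotation: ""}},
        {name: "m3", annotations: map[string]string{unhealthyAnnotation: ""}},
    }
    if m, ok := pickMachineForRemediation(machines); ok {
        fmt.Println("scale down", m.name) // only m2 this pass; m3 waits for a later loop
    }
}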

vincepri (Member) commented:

/milestone v0.3.x

fabriziopandini (Member) commented:

@benmoss wrote up a doc explaining the design goals of this feature here: https://docs.google.com/document/d/1hJza3X-XbVV_yczB5N6vXbl_97D0bOVQ0OwGovcnth0/edit?usp=sharing


vincepri commented Aug 3, 2020

/milestone v0.3.9

@k8s-ci-robot k8s-ci-robot modified the milestones: v0.3.x, v0.3.9 Aug 3, 2020

vincepri commented Aug 3, 2020

/milestone v0.4.0

@k8s-ci-robot k8s-ci-robot modified the milestones: v0.3.9, v0.4.0 Aug 3, 2020

vincepri commented Aug 3, 2020

We discussed during a grooming session today that this feature might impact existing deployments, and we'd like more time to think through all the bits. For that reason, we'll try to tackle it during the v0.4.x timeframe.

sadysnaat (Contributor) commented:

We have a use case where we are creating a cluster, kubeadm init fails on the first control plane machine, and cluster creation stops. Should we have some option to enable remediation for the first control plane node?

fabriziopandini (Member) commented:

@sadysnaat
This could be an interesting use case for a follow-up iteration on KCP remediation, but before doing that, IMO we should get some clarity on:

vincepri (Member) commented:

Closing as the related PRs have been merged

/close

k8s-ci-robot (Contributor) commented:

@vincepri: Closing this issue.

In response to this:

Closing as the related PRs have been merged

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
