KCP unhealthy remediation support #2976
Some tricky scenarios to think through for 2 control plane machines:
Planned remediation:
Problems:
We would also need to ensure we save status appropriately, such that the controller can crash midway through remediation and not go "oh, why have I got this extra instance, let's kill it". I am wondering if there's any benefit to constructive usage of the raft index to determine how far a new instance has caught up, or even whether an instance is too far behind and needs remediation, at least whilst we don't have learner mode available.
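As an illustration of the raft-index idea above, here is a minimal Go sketch (not anything KCP does today) that compares the raft index reported by each etcd member via the maintenance API to spot a member that is lagging badly; the lag threshold and endpoint wiring are assumptions:

```go
package etcdhealth

import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// laggingMembers returns the endpoints whose raft index trails the highest
// observed index by more than maxLag. This only illustrates using raft
// indexes to judge how far a member has caught up; the threshold is arbitrary.
func laggingMembers(ctx context.Context, cli *clientv3.Client, endpoints []string, maxLag uint64) ([]string, error) {
	var highest uint64
	indexes := make(map[string]uint64, len(endpoints))

	// Collect the raft index reported by each member.
	for _, ep := range endpoints {
		statusCtx, cancel := context.WithTimeout(ctx, 5*time.Second)
		resp, err := cli.Status(statusCtx, ep)
		cancel()
		if err != nil {
			return nil, err
		}
		indexes[ep] = resp.RaftIndex
		if resp.RaftIndex > highest {
			highest = resp.RaftIndex
		}
	}

	// Anything trailing the highest index by more than maxLag is "behind".
	var lagging []string
	for ep, idx := range indexes {
		if highest-idx > maxLag {
			lagging = append(lagging, ep)
		}
	}
	return lagging, nil
}
```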
I don't think this is a problem since we will still have access to m2 and therefore can safely infer that we have m3 because it is replacing m2.
This gets into a completely different area of messiness, which is "what kind of unhealthy do we have here."
Might be worth some state machine or other form of modelling.
/milestone v0.3.x
We discussed this again and concluded that we need to start by defining narrow cases we know we can handle, deferring the general solution until we can nail more of these down. We came up with three cases that seem tractable to start with:
We will exclude clusters with fewer than 3 nodes. I'd propose we start with these and then keep sketching out additional remediation scenarios as we move forward. These are not MHC remediations, since MHC is designed for the generic case of "machine is unhealthy" and we are trying to start as narrowly as possible. Something @detiber pointed out that's relevant to this is that with smaller clusters, scaling up etcd can change the number of members required for quorum in a way that could render the cluster inoperable. This table is a useful guide to thinking about this. If we have a 3 node cluster with 1 unhealthy member and we scale up, the membership becomes 4 and the quorum size becomes 3; until the new member joins we only have 2 healthy members, so the cluster would go into read-only mode. If the new member never joins, we end up needing to recreate the cluster, since you need a majority in order to remove a member. etcd's docs recommend that unhealthy members are removed before new members are added.
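To make the quorum arithmetic concrete, here is a tiny Go sketch (purely illustrative, not cluster-api code) that prints the quorum size and failure tolerance for a few member counts:

```go
package main

import "fmt"

// quorum returns the number of members an etcd cluster of size n needs for a
// majority: floor(n/2) + 1.
func quorum(n int) int {
	return n/2 + 1
}

func main() {
	for _, n := range []int{1, 2, 3, 4, 5} {
		fmt.Printf("members=%d quorum=%d tolerated failures=%d\n", n, quorum(n), n-quorum(n))
	}
	// members=3 quorum=2 tolerated failures=1
	// members=4 quorum=3 tolerated failures=1
	// Scaling 3 -> 4 does not buy extra failure tolerance, which is why
	// removing the unhealthy member first is the safer order.
}
```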
/milestone v0.3.6
We discussed this again today and we think that with a strategy of scaling down first, we can handle the generic case well enough, assuming we restrict ourselves to clusters that have at least 3 nodes.
Goals
Requirements
Cleanup
pseudocode:
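A minimal sketch of what a scale-down-first remediation loop could look like, assuming hypothetical helpers `removeEtcdMemberAndMachine` and `scaleUpControlPlane` (these names and types are illustrative, not actual KCP code):

```go
package remediation

import (
	"context"
	"errors"
	"fmt"
)

// Machine and ControlPlane are illustrative stand-ins, not the real
// cluster-api types.
type Machine struct {
	Name    string
	Healthy bool
}

type ControlPlane struct {
	Machines []Machine
}

func (cp *ControlPlane) unhealthy() []Machine {
	var out []Machine
	for _, m := range cp.Machines {
		if !m.Healthy {
			out = append(out, m)
		}
	}
	return out
}

// remediate sketches the scale-down-first flow: remove the unhealthy member
// (etcd member first, then the Machine), and only then scale back up so the
// quorum is always computed over healthy members.
func remediate(ctx context.Context, cp *ControlPlane) error {
	if len(cp.Machines) < 3 {
		// Below 3 replicas we cannot safely remove a member, so bail out.
		return errors.New("remediation requires at least 3 control plane machines")
	}
	for _, m := range cp.unhealthy() {
		if err := removeEtcdMemberAndMachine(ctx, cp, m); err != nil {
			return err
		}
		if err := scaleUpControlPlane(ctx, cp); err != nil {
			return err
		}
	}
	return nil
}

// Stubs standing in for the real etcd-membership and scale-up logic.
func removeEtcdMemberAndMachine(ctx context.Context, cp *ControlPlane, m Machine) error {
	fmt.Printf("removing etcd member and machine %s\n", m.Name)
	return nil
}

func scaleUpControlPlane(ctx context.Context, cp *ControlPlane) error {
	fmt.Println("scaling control plane back up")
	return nil
}
```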
/milestone v0.3.x
@benmoss wrote up a doc explaining the design goals of this feature here: https://docs.google.com/document/d/1hJza3X-XbVV_yczB5N6vXbl_97D0bOVQ0OwGovcnth0/edit?usp=sharing
/milestone v0.3.9
/milestone v0.4.0
We discussed during a grooming session today that this feature might impact existing deployments, and we'd like to have more time to think through all the bits. For that reason, we'll try to tackle it during the v0.4.x timeframe.
We have a use case where we are creating a cluster, kubeadm init fails for the first control plane machine, and cluster creation is stopped. Should we have some option to enable remediation for the first control plane node?
@sadysnaat
Closing as the related PRs have been merged.
/close
@vincepri: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
User Story
As an operator I would like to have my control plane clusters automatically remediate unhealthy machines.
Detailed Description
Anything else you would like to add:
TBD: how do we stop remediation from happening in an infinite loop?
We want to move the calls to reconcileHealth out of scale up and scale down to make it possible for them to operate on an unhealthy cluster. reconcileHealth should be earlier in the reconciliation loop. Right now it would (probably?) block remediation if we tried to call scaleUp/scaleDown, since the cluster would not be healthy.

/kind feature
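A rough Go sketch of the reordering being described, with reconcileHealth hoisted ahead of remediation and scaling; the types and every method other than reconcileHealth, scaleUp, and scaleDown are placeholders, and the structure is illustrative rather than the actual KCP reconcile loop:

```go
package controlplane

import "context"

// ControlPlane and Reconciler are illustrative stand-ins for the real KCP
// types; only the ordering of the calls is the point here.
type ControlPlane struct {
	Desired, Current int
	Healthy          bool
}

type Reconciler struct{}

func (r *Reconciler) reconcileHealth(ctx context.Context, cp *ControlPlane) (bool, error) {
	return cp.Healthy, nil // placeholder for the real etcd/API server checks
}

func (r *Reconciler) remediate(ctx context.Context, cp *ControlPlane) error { return nil }
func (r *Reconciler) scaleUp(ctx context.Context, cp *ControlPlane) error   { return nil }
func (r *Reconciler) scaleDown(ctx context.Context, cp *ControlPlane) error { return nil }

// reconcile sketches the proposed ordering: evaluate health first so an
// unhealthy cluster can still reach remediation, and only then gate the
// normal scale up/down paths on the outcome.
func (r *Reconciler) reconcile(ctx context.Context, cp *ControlPlane) error {
	healthy, err := r.reconcileHealth(ctx, cp)
	if err != nil {
		return err
	}
	if !healthy {
		return r.remediate(ctx, cp)
	}
	switch {
	case cp.Desired > cp.Current:
		return r.scaleUp(ctx, cp)
	case cp.Desired < cp.Current:
		return r.scaleDown(ctx, cp)
	}
	return nil
}
```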