📖 Amend KCP proposal with remediation while provisioning the CP #7855

Merged
19 changes: 14 additions & 5 deletions docs/proposals/20191017-kubeadm-based-control-plane.md
@@ -472,12 +472,20 @@ When `MaxSurge` is set to 0 the rollout algorithm is as follows:
for additional details. When there are multiple machines that are marked for remediation, the oldest one will be remediated first.

- Following rules should be satisfied in order to start remediation
- The cluster MUST have at least two control plane machines, because this is the smallest cluster size that can be remediated.
- The number of replicas MUST be equal to or greater than the desired replicas. This rule ensures that when the cluster
is missing replicas, we skip remediation and instead perform regular scale up/rollout operations first.
- One of the following apply:
- The cluster MUST not be initialized yet (the failure happens before KCP reaches the initialized state)
- The cluster MUST have at least two control plane machines, because this is the smallest cluster size that can be remediated.
- Previous remediation (delete and re-create) MUST have been completed. This rule prevents KCP to remediate more machines while the
replacement for the previous machine is not yet created.
- The cluster MUST have no machines with a deletion timestamp. This rule prevents KCP taking actions while the cluster is in a transitional state.
- Remediation MUST preserve etcd quorum. This rule ensures that we will not remove a member that would result in etcd
losing a majority of members and thus become unable to field new requests.
losing a majority of members and thus become unable to field new requests (note: this rule applies only to CP with at least replicas)
Member

Suggested change
losing a majority of members and thus become unable to field new requests (note: this rule applies only to CP with at least replicas)
losing a majority of members and thus become unable to field new requests (note: this rule applies only to CP with at least 3 replicas)

?

Member Author

@fabriziopandini fabriziopandini Jan 30, 2023

Thanks for following up, opened #8018


- Additionally following opt-in safeguards will be put in place:
- If we are remediating the same machine (delete, re-create, replacement machine gets unhealthy), it will be possible
to define a maximum number of retries, thus preventing unnecessary load on infrastructure provider e.g. in case of quota problems.
- If we are remediating the same machine (delete, re-create, replacement machine gets unhealthy), it will be possible
to define a delay between each retry, thus allowing the infrastructure provider to stabilize in case of temporary problems.

- When all the conditions for starting remediation are satisfied, KCP temporarily suspend any operation in progress
in order to perform remediation.
@@ -634,4 +642,5 @@ For the purposes of designing upgrades, two existing lifecycle managers were exa
- [x] 12/04/2019: Initial stubbed KubeadmControlPlane controller added [#1826](https://github.com/kubernetes-sigs/cluster-api/pull/1826)
- [x] 07/09/2020: Document updated to reflect changes up to v0.3.9 release
- [x] 22/09/2020: KCP remediation added
- [x] XX/XX/2020: KCP rollout strategies added
- [x] 10/05/2021: Support for remediation of failures while upgrading 1 node CP
- [x] 05/01/2022: Support for remediation while provisioning the CP (both first CP and CP machines while current replica < desired replica); Allow control of remediation retry behavior.