Manual remediation of Machines #10197

fabriziopandini · 2024-02-26T11:24:39Z

What would you like to be added (User Story)?

As an operator, I would like to be able to recreate single machines in a controlled way.

Detailed Description

The traditional way to re-create a single Machine is to delete it manually.

However, directly deleting a machine has some well-known downsides:

- Once the operation is done, it cannot be reverted.
- The operation is "low level": it skips several system safeguards, which can be risky, especially when you delete control plane nodes (see discussion in Restrict KCP Machine deletion #9919). More specifically:
  - With manual deletion, all the checks about max unhealthy are implicitly skipped.
  - With manual deletion, all the checks about whether the machine can be safely remediated are implicitly skipped.

This proposal is about introducing a new annotation (name TBD) to be applied to a machine that the operator wants to safely delete (and thus to leave "low level" deletion only as the ultimate escape hatch).

This annotation will be processed by the MHC controller in healthCheckTargets, and thus the machine will be included in the pool of machines to be considered for remediation.

MHC will then determine whether the conditions to remediate such a machine are met, and the remediation will then be taken over by the owning controller or by the external remediation tool.

The remediation owner will then perform another round of checks, and finally transform the manual remediation into an actual machine deletion.
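
To make the intended flow a bit more concrete, here is a minimal sketch in Python (not the actual controller code). It only illustrates the idea that an annotated machine joins the same candidate pool as unhealthy machines and is therefore subject to the same max unhealthy safeguard. The annotation key, the `Machine` type, and both function names are hypothetical placeholders, since the real annotation name is still TBD, and the max unhealthy check is deliberately simplified.

```python
from dataclasses import dataclass, field

# Hypothetical annotation key: the proposal leaves the real name TBD.
MANUAL_REMEDIATION_ANNOTATION = "example.invalid/remediate-machine"


@dataclass
class Machine:
    """Simplified stand-in for a Machine object (illustration only)."""
    name: str
    healthy: bool = True
    annotations: dict = field(default_factory=dict)


def needs_remediation(machine: Machine) -> bool:
    """A machine is a candidate if it is unhealthy or manually marked."""
    return (not machine.healthy) or MANUAL_REMEDIATION_ANNOTATION in machine.annotations


def select_for_remediation(machines: list, max_unhealthy: int) -> list:
    """Return the machines to hand over to the remediation owner.

    Simplified safeguard: if more machines need remediation than
    max_unhealthy allows, nothing is remediated at all.
    """
    candidates = [m for m in machines if needs_remediation(m)]
    if len(candidates) > max_unhealthy:
        return []
    return candidates


machines = [
    Machine("m1"),
    Machine("m2", annotations={MANUAL_REMEDIATION_ANNOTATION: ""}),
    Machine("m3", healthy=False),
]
allowed = [m.name for m in select_for_remediation(machines, max_unhealthy=2)]
blocked = [m.name for m in select_for_remediation(machines, max_unhealthy=1)]
```

With `max_unhealthy=2` both the annotated machine and the unhealthy one are selected; with `max_unhealthy=1` the safeguard kicks in and nothing is selected. This is the behavior a direct manual deletion would bypass.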

Anything else you would like to add?

Once this is in place, we can eventually think of strategies to give higher deletion priority to machines being remediated/manually remediated, in case several machines are being deleted at the same time.
However, this is less important than providing a safe path to deletion (and thus IMO not blocking for a first iteration).

Label(s) to be applied

/kind feature
/area machinehealthcheck
/triage accepted

chrischdi · 2024-02-27T15:54:06Z

/assign

Levi080513 · 2024-03-01T06:24:35Z

Nice idea!!!
It would be better if we could support this feature in clusterctl, like kubectl drain.

sbueringer · 2024-03-13T12:22:37Z

I think we could consider additionally implementing a clusterctl command. But in any case this should be implemented in our controllers so other clients which are not clusterctl can also use the feature.

k8s-ci-robot added the kind/feature, area/machinehealthcheck, and triage/accepted labels Feb 26, 2024

k8s-ci-robot assigned chrischdi Feb 27, 2024

chrischdi mentioned this issue Feb 27, 2024

✨ MHC: implement annotation to manually mark machines for remediation #10202

Merged

k8s-ci-robot closed this as completed in #10202 Mar 15, 2024
