Manual remediation of Machines #10197

Closed
fabriziopandini opened this issue Feb 26, 2024 · 3 comments · Fixed by #10202
Labels
area/machinehealthcheck Issues or PRs related to machinehealthchecks kind/feature Categorizes issue or PR as related to a new feature. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@fabriziopandini
Member

What would you like to be added (User Story)?

As an operator, I would like to be able to recreate single machines in a controlled way

Detailed Description

The traditional way to re-create a single Machine is to delete it manually.

However, directly deleting a machine has some well-known downsides:

  • Once the operation is done, it cannot be reverted
  • The operation is "low level": it skips several system safeguards, which can be risky, especially when you delete control plane nodes (see discussion in Restrict KCP Machine deletion #9919). More specifically:
    • With manual deletion, all the checks on max unhealthy are implicitly skipped
    • With manual deletion, all the checks on whether the machine can be safely remediated are implicitly skipped

This proposal is about introducing a new annotation (name TBD) to be applied to a machine that the operator wants to safely delete, leaving "low level" deletion only as the ultimate escape hatch.

This annotation will be processed by the MHC controller in healthCheckTargets, and thus the machine will be included in the pool of machines to be considered for remediation.

MHC will then determine whether the conditions to remediate such a machine are met, and the remediation will then be handled by the owning controller or by the external remediation tool.

The remediation owner will then perform another round of checks, and finally transform the manual remediation into an actual machine deletion.
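
To make the flow above more concrete, here is a minimal, self-contained sketch in Go. It is not the Cluster API implementation: the annotation key, the `Machine` struct, and both function names are hypothetical (the annotation name is still TBD in this proposal), and the max unhealthy check is reduced to a plain count. It only illustrates the idea that an annotated machine joins the same candidate pool as unhealthy machines and is therefore subject to the same safeguard.

```go
package main

import "fmt"

// Hypothetical annotation key; the proposal leaves the real name TBD.
const manualRemediationAnnotation = "example.invalid/remediate-machine"

// Machine is a simplified stand-in for a Cluster API Machine (assumption).
type Machine struct {
	Name        string
	Annotations map[string]string
	Healthy     bool
}

// needsRemediation reports whether a machine belongs in the remediation pool,
// either because the operator annotated it or because a health check failed.
func needsRemediation(m Machine) (bool, string) {
	if _, ok := m.Annotations[manualRemediationAnnotation]; ok {
		return true, "manual remediation requested"
	}
	if !m.Healthy {
		return true, "health check failed"
	}
	return false, ""
}

// selectForRemediation applies a simplified max unhealthy safeguard:
// if more machines need remediation than maxUnhealthy allows, none are
// remediated. Annotated machines count toward that limit as well.
func selectForRemediation(machines []Machine, maxUnhealthy int) []string {
	var candidates []string
	for _, m := range machines {
		if ok, _ := needsRemediation(m); ok {
			candidates = append(candidates, m.Name)
		}
	}
	if len(candidates) > maxUnhealthy {
		return nil
	}
	return candidates
}

func main() {
	machines := []Machine{
		{Name: "m1", Healthy: true},
		{Name: "m2", Healthy: true, Annotations: map[string]string{manualRemediationAnnotation: ""}},
		{Name: "m3", Healthy: false},
	}
	// With maxUnhealthy=2 both m2 (annotated) and m3 (unhealthy) qualify.
	fmt.Println(selectForRemediation(machines, 2))
	// With maxUnhealthy=1 the safeguard blocks remediation entirely.
	fmt.Println(selectForRemediation(machines, 1))
}
```

In this sketch the annotation only requests remediation; the decision to actually delete the machine still rests with the checks, which is the difference from a direct manual deletion.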

Anything else you would like to add?

Once this is in place, we can eventually think of strategies to give higher deletion priority to machines being remediated/manually remediated, in case several machines are being deleted at the same time.
However, this is less important than providing a safe path to deletion (and thus IMO not blocking for a first iteration).

Label(s) to be applied

/kind feature
/area machinehealthcheck
/triage accepted

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. area/machinehealthcheck Issues or PRs related to machinehealthchecks triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Feb 26, 2024
@chrischdi
Member

/assign

@Levi080513
Contributor

Nice idea!!!
It would be better if we could support this feature in clusterctl, like kubectl drain.

@sbueringer
Member

I think we could consider additionally implementing a clusterctl command. But in any case this should be implemented in our controllers so other clients which are not clusterctl can also use the feature.
