
commands: add command to wiping cluster #173

Merged

Conversation


@Javlopez (Contributor) commented Sep 12, 2023

Resolves: #131

This PR adds a new command to destroy a Rook cluster, removing the CRDs and the toolbox deployment.

@Javlopez self-assigned this on Sep 12, 2023
@Javlopez force-pushed the feature/add-command-for-wipping-cluster branch from f4c0309 to 9a1368b on September 20, 2023 15:11
@BlaineEXE (Member) left a comment

This is a lot of duplication to remove resources, and Rook adds/modifies resources occasionally. This code will likely be a lot of work to update over time, and my experience has shown that this tool will eventually stop being relevant due to skew from Rook over time. I would encourage you/us to consider how to re-implement this more generically so that we aren't (1) duplicating code and (2) giving ourselves a pretty large amount of ongoing maintenance burden for the foreseeable future.

[Updates]

The k8s "dynamic" API allows working with resources without having to strongly type them:
https://itnext.io/generically-working-with-kubernetes-resources-in-go-53bce678f887

I'd suggest querying Kubernetes for all CRD kinds, then finding the CRDs in the ceph.rook.io group, then deleting all resources of each Kind using the dynamic API. It can also look for OBCs as a special case, but it should do filtering to make sure it isn't deleting non-Rook OBCs. For an initial version of this PR, it'll be cleaner to leave OBCs out and add them later.
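
A minimal sketch of that flow, assuming client-go's dynamic client and the apiextensions clientset; the ceph.rook.io group comes from this discussion, while the package, function, and variable names are illustrative, not this PR's code:

```go
package cleanup

import (
	"context"
	"fmt"

	apiextclient "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

// deleteRookCRs lists every CRD in the cluster, keeps only the ceph.rook.io
// group, and deletes all custom resources of each kind in the given namespace.
func deleteRookCRs(ctx context.Context, cfg *rest.Config, namespace string) error {
	apiext, err := apiextclient.NewForConfig(cfg)
	if err != nil {
		return err
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		return err
	}

	crds, err := apiext.ApiextensionsV1().CustomResourceDefinitions().List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, crd := range crds.Items {
		if crd.Spec.Group != "ceph.rook.io" {
			continue
		}
		gvr := schema.GroupVersionResource{
			Group:    crd.Spec.Group,
			Version:  crd.Spec.Versions[0].Name, // assumes the first served version is acceptable
			Resource: crd.Spec.Names.Plural,
		}
		// The dynamic client addresses resources purely by group/version/resource,
		// so no strongly typed structs are needed.
		if err := dyn.Resource(gvr).Namespace(namespace).DeleteCollection(
			ctx, metav1.DeleteOptions{}, metav1.ListOptions{}); err != nil {
			return fmt.Errorf("deleting all %s: %w", crd.Name, err)
		}
	}
	return nil
}
```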

Likely, the only exception to the dynamic approach is CephCluster, which needs special handling to allow setting the cleanup policy.
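
Even that special handling can go through the dynamic client. A hedged sketch, reusing dyn, ctx, and namespace from the sketch above; the spec.cleanupPolicy.confirmation field and its "yes-really-destroy-data" value come from Rook's CephCluster API, while clusterName is an illustrative placeholder:

```go
// Set the cleanup policy on the CephCluster so the operator wipes host data
// before the cluster resource is deleted.
// (types is k8s.io/apimachinery/pkg/types.)
patch := []byte(`{"spec":{"cleanupPolicy":{"confirmation":"yes-really-destroy-data"}}}`)
gvr := schema.GroupVersionResource{Group: "ceph.rook.io", Version: "v1", Resource: "cephclusters"}
_, err := dyn.Resource(gvr).Namespace(namespace).Patch(
	ctx, clusterName, types.MergePatchType, patch, metav1.PatchOptions{})
```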

@travisn (Member) commented Sep 20, 2023

Agreed, it will help the code maintenance if we can avoid the strongly typed API. If we can use the dynamic API as suggested, I'm thinking we could just iterate over all resources in the Rook namespace, delete them, and remove their finalizers. As suggested here, let's not worry yet about the special case for the cluster cleanup policy.

When I say we should iterate over all resources, we would only iterate over the specific types we expect to delete such as all the ceph.rook.io CRs, the OBC CRs, specific secrets, and specific configmaps.
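
The finalizer-removal step also fits the dynamic approach. A minimal sketch under the same assumptions as the sketches above (gvr and resourceName are illustrative): a merge patch that nulls the finalizer list lets a resource stuck in Terminating finish deleting:

```go
// Clearing metadata.finalizers unblocks resources whose operator is already
// gone and can no longer process the finalizer itself.
removeFinalizers := []byte(`{"metadata":{"finalizers":null}}`)
_, err := dyn.Resource(gvr).Namespace(namespace).Patch(
	ctx, resourceName, types.MergePatchType, removeFinalizers, metav1.PatchOptions{})
```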

> this tool will eventually stop being relevant due to skew from Rook over time

@BlaineEXE Are you referring to the strongly typed code being too much to maintain if we don't change it, or what will stop being relevant?

@Javlopez (Contributor, Author) commented

Thanks for the feedback @BlaineEXE, I totally agree. This PR is actually in draft for that reason; I wasn't sure this approach was the right one.

I'm going to rework this PR based on your suggestion.

@BlaineEXE (Member) commented

> @BlaineEXE Are you referring to the strongly typed code being too much to maintain if we don't change it, or what will stop being relevant?

I think you and I are on the same page. I'll try to clarify what I mean here by "relevant," though in short, I mean "in/out of skew"

Unless the krew plugin version and Rook versions are either strongly coupled or strongly decoupled (which is what the krew plugin assumes today), the implementation of this feature will likely skew slowly over time compared to ongoing changes in Rook. Eventually, the krew implementation will become sufficiently out of date that it will become frustrating for users.

This risk exists even with the "dynamic" API, but there are at least fewer interaction points that are likely to break or become stale.


> When I say we should iterate over all resources, we would only iterate over the specific types we expect to delete such as all the ceph.rook.io CRs, the OBC CRs, specific secrets, and specific configmaps.

IMO, this is also still a little too hardcoded to be resilient to skew. I think the word "specific" is a good indicator.

I think the most resilient design would be to effectively try to generically delete anything and everything in a given namespace. The specific-to-Rook part would be to delete all CRs that exist for ceph.rook.io CRDs. If we know there are problems with configmaps or secrets not getting deleted, then we should also delete those generically.

I would also suggest leaving OBCs out of the implementation at least initially, if not fully. OBCs can exist for other providers like Noobaa, and the implementation should attempt to not delete other providers' OBCs. Consider that this tool likely isn't going to remove user-created applications that are consuming rbd/cephfs PVCs. And likely, this tool won't implement deletion behavior for COSI resources in the future. I think it's easiest for developers to just stop this tool's deletions at the Rook API boundary.


> I'd suggest that this be implemented in two separate PRs... First to clean the cluster without the cleanupPolicy options, then later to also implement the cleanupPolicy.

Having written, re-written, and then re-re-written a personal bash script for uninstalling Rook, I personally think that including cleanupPolicy in the initial design would actually be best overall. It's not much more code or difficulty, and the methodology for watching for Rook resources stuck in deletion will change more than you might expect between cleanupPolicy "on" versus "off" cases.

I also base that on the assumption that this tool does not let users choose whether or not to clean up. I assume it to be for development and proof-of-concept clusters only, not something to use on production clusters. IMO, this should only be for dev/PoC clusters; I would really avoid doing this for production clusters because it would make it wildly easy for someone to destroy their storage.

@travisn (Member) commented Sep 20, 2023

> > @BlaineEXE Are you referring to the strongly typed code being too much to maintain if we don't change it, or what will stop being relevant?
>
> I think you and I are on the same page. I'll try to clarify what I mean here by "relevant," though in short, I mean "in/out of skew"
>
> Unless the krew plugin version and Rook versions are either strongly coupled or strongly decoupled (which is what the krew plugin assumes today), the implementation of this feature will likely skew slowly over time compared to ongoing changes in Rook. Eventually, the krew implementation will become sufficiently out of date that it will become frustrating for users.
>
> This risk exists even with the "dynamic" API, but there are at least fewer interaction points that are likely to break or become stale.

The goal of this command, as discussed in the original issue #131, is really a "best effort" tool to help the admin clean up a cluster where they are absolutely done with Rook. We know it's quite frustrating today to clean up a cluster, thus the desire to improve the admin's chances of cleaning up a cluster. Whether it's a test cluster or a production cluster, the admin can use this destructive command if they choose, but that's why we are proposing the flag to ask if they really really really mean it.

Over time as we add more CRDs or other resources, the tool should also be updated to clean up those resources. This seems to me like normal software maintenance; I don't see the extra concern for skew.

> > When I say we should iterate over all resources, we would only iterate over the specific types we expect to delete such as all the ceph.rook.io CRs, the OBC CRs, specific secrets, and specific configmaps.
>
> IMO, this is also still a little too hardcoded to be resilient to skew. I think the word "specific" is a good indicator.
>
> I think the most resilient design would be to effectively try to generically delete anything and everything in a given namespace. The specific-to-Rook part would be to delete all CRs that exist for ceph.rook.io CRDs. If we know there are problems with configmaps or secrets not getting deleted, then we should also delete those generically.

As long as we only delete resources we know are related to Rook, that is the goal IMO. My point was more that non-rook types such as secrets and configmaps need to be carefully deleted since we only want to delete our own. If the admin is using the namespace for anything else, we don't want to delete those.

> I would also suggest leaving OBCs out of the implementation at least initially, if not fully. OBCs can exist for other providers like Noobaa, and the implementation should attempt to not delete other providers' OBCs. Consider that this tool likely isn't going to remove user-created applications that are consuming rbd/cephfs PVCs. And likely, this tool won't implement deletion behavior for COSI resources in the future. I think it's easiest for developers to just stop this tool's deletions at the Rook API boundary.

Agreed on the OBCs. To make it more clear, I would expect the tool only to delete Rook resources, and not any application resources. OBCs, PVCs, etc all belong to the application. But the tool would be deleting the object store and OBs in the namespace, so the OBCs would be effectively broken just like any rbd or cephfs PVCs.

> I'd suggest that this be implemented in two separate PRs... First to clean the cluster without the cleanupPolicy options, then later to also implement the cleanupPolicy.

> Having written, re-written, and then re-re-written a personal bash script for uninstalling Rook, I personally think that including cleanupPolicy in the initial design would actually be best overall. It's not much more code or difficulty, and the methodology for watching for Rook resources stuck in deletion will change more than you might expect between cleanupPolicy "on" versus "off" cases.

My assumption is that implementing the cleanupPolicy will complicate the first pass of the implementation. But if this is the main use case anyway (see next paragraph), then we could certainly do it in this PR.

> I also base that on the assumption that this tool does not let users choose whether or not to clean up. I assume it to be for development and proof-of-concept clusters only, not something to use on production clusters. IMO, this should only be for dev/PoC clusters; I would really avoid doing this for production clusters because it would make it wildly easy for someone to destroy their storage.

While it's certainly not expected to use this on production clusters where data is still present, the admin could decide they are done with a production cluster and want to destroy all the data. While it really is recommended that they go through the uninstall guide and remove applications and clean up Rook resources in the correct order, I don't think we should prevent the tool from being used in that scenario either.

If this command is most useful for development scenarios so we can avoid scripts like the ones you've rewritten several times, that will be the best scenario for maintaining this tool anyway. It's the perfect scenario to maintain because if the developers see that it doesn't clean up properly, we can fix it.

@BlaineEXE (Member) commented

> Over time as we add more CRDs or other resources, the tool should also be updated to clean up those resources. This seems to me like normal software maintenance; I don't see the extra concern for skew.

It is true that skew is a risk and software must be maintained; we can't escape that. But I want to recommend that we consider the maintenance burden of the implementation, and to offer my experience that the more we code for "specific" resources or resource types, the more maintenance it will take to keep this tool working over time.

> As long as we only delete resources we know are related to Rook, that is the goal IMO. My point was more that non-rook types such as secrets and configmaps need to be carefully deleted since we only want to delete our own. If the admin is using the namespace for anything else, we don't want to delete those.

This is a good point. Still, I think we could look for configmaps/secrets that reference the CephCluster or some other CephXYZ kind as a resource owner rather than hardcoding resource names.
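
A sketch of what that owner-based filtering could look like, assuming a standard client-go clientset; the ownedByRook and deleteOwnedConfigMaps helper names are hypothetical:

```go
package cleanup

import (
	"context"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// ownedByRook reports whether any ownerReference points at a ceph.rook.io
// kind, matching configmaps/secrets by ownership instead of hardcoded names.
func ownedByRook(obj metav1.Object) bool {
	for _, ref := range obj.GetOwnerReferences() {
		if strings.HasPrefix(ref.APIVersion, "ceph.rook.io/") {
			return true
		}
	}
	return false
}

// deleteOwnedConfigMaps deletes only the Rook-owned configmaps in a namespace;
// the same loop would work for secrets via clientset.CoreV1().Secrets(...).
func deleteOwnedConfigMaps(ctx context.Context, clientset kubernetes.Interface, namespace string) error {
	cms, err := clientset.CoreV1().ConfigMaps(namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for i := range cms.Items {
		if ownedByRook(&cms.Items[i]) {
			if err := clientset.CoreV1().ConfigMaps(namespace).Delete(ctx, cms.Items[i].Name, metav1.DeleteOptions{}); err != nil {
				return err
			}
		}
	}
	return nil
}
```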

> If this command is most useful for development scenarios so we can avoid scripts like the ones you've rewritten several times, that will be the best scenario for maintaining this tool anyway. It's the perfect scenario to maintain because if the developers see that it doesn't clean up properly, we can fix it.

You are right that using this for dev scenarios is important because developers will regularly see if the script is not working. I would also like to offer that this is a double-edged sword. If the script is not well implemented, it will be a burden on developers who are regularly faced with a broken tool and who get exhausted trying to maintain it as it breaks and needs fixing. I have seen development tools be abandoned over time because they just end up being too much of a constant burden compared to their benefit, and if that happens with this tool, users will lose out as well.

I'm not at all saying "don't do this." What I am trying to say is that, based on my experience, planning the design and implementation of this from the onset will save quite a lot of developer headache in the long run. In my experience, the most robust design here will minimize reliance on Rook-specific APIs as much as possible. I would also encourage considering resource and label names to be part of Rook's implied API. A robust design will center around Rook's most core and unchanging API points. That will be the ceph.rook.io Resource/Kind naming, which I would guess is the least likely thing to change about Rook's implementation in the next 10+ years.

@subhamkrai (Collaborator) left a comment

@Javlopez please add CI tests for this. Thanks

@Javlopez force-pushed the feature/add-command-for-wipping-cluster branch 3 times, most recently from 5ffce6a to e976eca on September 27, 2023 03:11
@subhamkrai (Collaborator) left a comment

@Javlopez Thanks for the PR!

@Javlopez force-pushed the feature/add-command-for-wipping-cluster branch from 3796dd2 to cd7ebc7 on October 1, 2023 00:33
@Javlopez marked this pull request as ready for review on October 1, 2023 00:34
@Javlopez force-pushed the feature/add-command-for-wipping-cluster branch from cd7ebc7 to abb3f42 on October 1, 2023 00:38
@Javlopez force-pushed the feature/add-command-for-wipping-cluster branch 4 times, most recently from d76512c to 3a10c07 on October 3, 2023 02:29
@Javlopez force-pushed the feature/add-command-for-wipping-cluster branch from f5cddd8 to 07e2903 on November 16, 2023 15:42
@travisn (Member) left a comment

This works great now, thanks! Just a couple final suggestions on the messages.

@Javlopez force-pushed the feature/add-command-for-wipping-cluster branch 2 times, most recently from 87163d2 to 8691f05 on November 29, 2023 20:52
@Javlopez force-pushed the feature/add-command-for-wipping-cluster branch 5 times, most recently from ccc7962 to 3542613 on December 12, 2023 02:13
@Javlopez force-pushed the feature/add-command-for-wipping-cluster branch from 51ea48f to 4965b8b on December 13, 2023 17:07
@Javlopez force-pushed the feature/add-command-for-wipping-cluster branch from 5fbca4b to 064d879 on December 13, 2023 23:16
@Javlopez removed the debug-ci label on Dec 13, 2023
@Javlopez force-pushed the feature/add-command-for-wipping-cluster branch 2 times, most recently from 8b532ae to 651d087 on December 14, 2023 16:29
@Javlopez force-pushed the feature/add-command-for-wipping-cluster branch 2 times, most recently from 5c1b3b6 to 9facebd on December 14, 2023 19:59
@travisn (Member) left a comment

Looks good! Just a minor suggestion on the logging.

Added command destroy-cluster to remove all CRDs
Added documentation for destroy-cluster
Added unit tests and integration tests
Added deletion of the toolbox deployment

Signed-off-by: Javier <sjavierlopez@gmail.com>
@Javlopez force-pushed the feature/add-command-for-wipping-cluster branch from 9facebd to f067748 on December 15, 2023 16:56
@travisn merged commit b572c39 into rook:master on Dec 19, 2023
5 checks passed

Successfully merging this pull request may close these issues.

Add a command for wiping a cluster