
commands: add command to wiping cluster #173

Merged

Conversation


@Javlopez (Contributor) commented Sep 12, 2023

Resolves: #131

This PR adds a new command to destroy a Rook cluster, removing the CRDs and the toolbox deployment.

@Javlopez self-assigned this on Sep 12, 2023
@Javlopez force-pushed the feature/add-command-for-wipping-cluster branch from f4c0309 to 9a1368b on September 20, 2023 15:11
@BlaineEXE (Member) left a comment

This is a lot of duplication to remove resources, and Rook adds/modifies resources occasionally. This code will likely be a lot of work to update over time, and my experience has shown that this tool will eventually stop being relevant due to skew from Rook over time. I would encourage you/us to consider how to re-implement this more generically so that we aren't (1) duplicating code and (2) giving ourselves a pretty large amount of ongoing maintenance burden for the foreseeable future.

[Updates]

The k8s "dynamic" API allows working with resources without having to strongly type them:
https://itnext.io/generically-working-with-kubernetes-resources-in-go-53bce678f887

I'd suggest querying Kubernetes for all CRD kinds, then finding the CRDs in the ceph.rook.io group, then deleting all resources of each Kind using the dynamic API. It can also look for OBCs as a special case, but it should do filtering to make sure it isn't deleting non-Rook OBCs. For an initial version of this PR, it'll be cleaner to leave OBCs out and add them later.
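
A minimal sketch of that flow, assuming client-go's dynamic client and the apiextensions clientset; the ceph.rook.io group comes from this discussion, while the package, function, and variable names are illustrative, not this PR's code:

```go
package cleanup

import (
	"context"
	"fmt"

	apiextclient "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

// deleteRookCRs lists every CRD in the cluster, keeps only the ceph.rook.io
// group, and deletes all custom resources of each kind in the given namespace.
func deleteRookCRs(ctx context.Context, cfg *rest.Config, namespace string) error {
	apiext, err := apiextclient.NewForConfig(cfg)
	if err != nil {
		return err
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		return err
	}

	crds, err := apiext.ApiextensionsV1().CustomResourceDefinitions().List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, crd := range crds.Items {
		if crd.Spec.Group != "ceph.rook.io" {
			continue
		}
		gvr := schema.GroupVersionResource{
			Group:    crd.Spec.Group,
			Version:  crd.Spec.Versions[0].Name, // assumes the first served version is acceptable
			Resource: crd.Spec.Names.Plural,
		}
		// The dynamic client addresses resources purely by group/version/resource,
		// so no strongly typed structs are needed.
		if err := dyn.Resource(gvr).Namespace(namespace).DeleteCollection(
			ctx, metav1.DeleteOptions{}, metav1.ListOptions{}); err != nil {
			return fmt.Errorf("deleting all %s: %w", crd.Name, err)
		}
	}
	return nil
}
```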

Likely, the only exception to the dynamic approach is CephCluster, which needs special handling to allow setting the cleanup policy.
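
Even that special handling can go through the dynamic client. A hedged sketch, reusing dyn, ctx, and namespace from the sketch above; the spec.cleanupPolicy.confirmation field and its "yes-really-destroy-data" value come from Rook's CephCluster API, while clusterName is an illustrative placeholder:

```go
// Set the cleanup policy on the CephCluster so the operator wipes host data
// before the cluster resource is deleted.
// (types is k8s.io/apimachinery/pkg/types.)
patch := []byte(`{"spec":{"cleanupPolicy":{"confirmation":"yes-really-destroy-data"}}}`)
gvr := schema.GroupVersionResource{Group: "ceph.rook.io", Version: "v1", Resource: "cephclusters"}
_, err := dyn.Resource(gvr).Namespace(namespace).Patch(
	ctx, clusterName, types.MergePatchType, patch, metav1.PatchOptions{})
```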

@travisn (Member) commented Sep 20, 2023

Agreed, it will help the code maintenance if we can avoid the strongly typed API. If we can use the dynamic API as suggested, I'm thinking we could just iterate over all resources in the Rook namespace, delete them, and remove their finalizers. As suggested here, let's not worry yet about the special case for the cluster cleanup policy.

When I say we should iterate over all resources, we would only iterate over the specific types we expect to delete such as all the ceph.rook.io CRs, the OBC CRs, specific secrets, and specific configmaps.
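
The finalizer-removal step also fits the dynamic approach. A minimal sketch under the same assumptions as the sketches above (gvr and resourceName are illustrative): a merge patch that nulls the finalizer list lets a resource stuck in Terminating finish deleting:

```go
// Clearing metadata.finalizers unblocks resources whose operator is already
// gone and can no longer process the finalizer itself.
removeFinalizers := []byte(`{"metadata":{"finalizers":null}}`)
_, err := dyn.Resource(gvr).Namespace(namespace).Patch(
	ctx, resourceName, types.MergePatchType, removeFinalizers, metav1.PatchOptions{})
```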

> this tool will eventually stop being relevant due to skew from Rook over time

@BlaineEXE Are you referring to the strongly typed code being too much to maintain if we don't change it, or what will stop being relevant?

@Javlopez (Contributor, Author) commented

Thanks for the feedback @BlaineEXE, I totally agree. This PR is actually in draft for that reason; I wasn't sure this approach was the right one.

I'm going to rework this PR based on your suggestion.

@BlaineEXE (Member) commented

> @BlaineEXE Are you referring to the strongly typed code being too much to maintain if we don't change it, or what will stop being relevant?

I think you and I are on the same page. I'll try to clarify what I mean here by "relevant," though in short, I mean "in/out of skew"

Unless the krew plugin version and Rook versions are either strongly coupled or strongly decoupled (which is what the krew plugin assumes today), the implementation of this feature will likely skew slowly over time compared to ongoing changes in Rook. Eventually, the krew implementation will become sufficiently out of date that it will become frustrating for users.

This risk exists even with the "dynamic" API, but there are at least fewer interaction points that are likely to break or become stale.


> When I say we should iterate over all resources, we would only iterate over the specific types we expect to delete such as all the ceph.rook.io CRs, the OBC CRs, specific secrets, and specific configmaps.

IMO, this is also still a little too hardcoded to be resilient to skew. I think the word "specific" is a good indicator.

I think the most resilient design would be to effectively try to generically delete anything and everything in a given namespace. The specific-to-Rook part would be to delete all CRs that exist for ceph.rook.io CRDs. If we know there are problems with configmaps or secrets not getting deleted, then we should also delete those generically.

I would also suggest leaving OBCs out of the implementation at least initially, if not fully. OBCs can exist for other providers like Noobaa, and the implementation should attempt to not delete other providers' OBCs. Consider that this tool likely isn't going to remove user-created applications that are consuming rbd/cephfs PVCs. And likely, this tool won't implement deletion behavior for COSI resources in the future. I think it's easiest for developers to just stop this tool's deletions at the Rook API boundary.


> I'd suggest that this be implemented in two separate PRs... First to clean the cluster without the cleanupPolicy options, then later to also implement the cleanupPolicy.

Having written, re-written, and then re-re-written a personal bash script for uninstalling Rook, I personally think that including cleanupPolicy in the initial design would actually be best overall. It's not much more code or difficulty, and the methodology for watching for Rook resources stuck in deletion will change more than you might expect between cleanupPolicy "on" versus "off" cases.

I also base that on the assumption that this tool does not let users choose whether or not to clean up. I assume it to be for development and proof-of-concept clusters only, not something to use on production clusters. IMO, this should only be for dev/PoC clusters; I would really avoid doing this for production clusters because it would make it wildly easy for someone to destroy their storage.

@travisn (Member) commented Sep 20, 2023

> > @BlaineEXE Are you referring to the strongly typed code being too much to maintain if we don't change it, or what will stop being relevant?
>
> I think you and I are on the same page. I'll try to clarify what I mean here by "relevant," though in short, I mean "in/out of skew"
>
> Unless the krew plugin version and Rook versions are either strongly coupled or strongly decoupled (which is what the krew plugin assumes today), the implementation of this feature will likely skew slowly over time compared to ongoing changes in Rook. Eventually, the krew implementation will become sufficiently out of date that it will become frustrating for users.
>
> This risk exists even with the "dynamic" API, but there are at least fewer interaction points that are likely to break or become stale.

The goal of this command, as discussed in the original issue #131, is really a "best effort" tool to help the admin clean up a cluster where they are absolutely done with Rook. We know it's quite frustrating today to clean up a cluster, thus the desire to improve the admin's chances of cleaning up a cluster. Whether it's a test cluster or a production cluster, the admin can use this destructive command if they choose, but that's why we are proposing the flag to ask if they really really really mean it.

Over time as we add more CRDs or other resources, the tool should also be updated to clean up those resources. This seems to me like normal software maintenance; I don't see the extra concern for skew.

> > When I say we should iterate over all resources, we would only iterate over the specific types we expect to delete such as all the ceph.rook.io CRs, the OBC CRs, specific secrets, and specific configmaps.
>
> IMO, this is also still a little too hardcoded to be resilient to skew. I think the word "specific" is a good indicator.
>
> I think the most resilient design would be to effectively try to generically delete anything and everything in a given namespace. The specific-to-Rook part would be to delete all CRs that exist for ceph.rook.io CRDs. If we know there are problems with configmaps or secrets not getting deleted, then we should also delete those generically.

As long as we only delete resources we know are related to Rook, that is the goal IMO. My point was more that non-rook types such as secrets and configmaps need to be carefully deleted since we only want to delete our own. If the admin is using the namespace for anything else, we don't want to delete those.

> I would also suggest leaving OBCs out of the implementation at least initially, if not fully. OBCs can exist for other providers like Noobaa, and the implementation should attempt to not delete other providers' OBCs. Consider that this tool likely isn't going to remove user-created applications that are consuming rbd/cephfs PVCs. And likely, this tool won't implement deletion behavior for COSI resources in the future. I think it's easiest for developers to just stop this tool's deletions at the Rook API boundary.

Agreed on the OBCs. To make it more clear, I would expect the tool only to delete Rook resources, and not any application resources. OBCs, PVCs, etc all belong to the application. But the tool would be deleting the object store and OBs in the namespace, so the OBCs would be effectively broken just like any rbd or cephfs PVCs.

> I'd suggest that this be implemented in two separate PRs... First to clean the cluster without the cleanupPolicy options, then later to also implement the cleanupPolicy.

> Having written, re-written, and then re-re-written a personal bash script for uninstalling Rook, I personally think that including cleanupPolicy in the initial design would actually be best overall. It's not much more code or difficulty, and the methodology for watching for Rook resources stuck in deletion will change more than you might expect between cleanupPolicy "on" versus "off" cases.

My assumption is that implementing the cleanupPolicy will complicate the first pass of the implementation. But if this is the main use case anyway (see next paragraph), then we could certainly do it in this PR.

> I also base that on the assumption that this tool does not let users choose whether or not to clean up. I assume it to be for development and proof-of-concept clusters only, not something to use on production clusters. IMO, this should only be for dev/PoC clusters; I would really avoid doing this for production clusters because it would make it wildly easy for someone to destroy their storage.

While it's certainly not expected to use this on production clusters where data is still present, the admin could decide they are done with a production cluster and want to destroy all the data. While it really is recommended that they go through the uninstall guide and remove applications and clean up Rook resources in the correct order, I don't think we should prevent the tool from being used in that scenario either.

If this command is most useful for development scenarios so we can avoid scripts like the ones you've rewritten several times, that will be the best scenario for maintaining this tool anyway. It's the perfect scenario to maintain because if the developers see that it doesn't clean up properly, we can fix it.

@BlaineEXE (Member) commented

> Over time as we add more CRDs or other resources, the tool should also be updated to clean up those resources. This seems to me like normal software maintenance; I don't see the extra concern for skew.

It is true that skew is a risk and software must be maintained; we can't escape that. But I want to recommend that we consider the maintenance burden of the implementation, and to offer my experience that the more we code for "specific" resources or resource types, the more maintenance it will take to keep this tool working over time.

> As long as we only delete resources we know are related to Rook, that is the goal IMO. My point was more that non-rook types such as secrets and configmaps need to be carefully deleted since we only want to delete our own. If the admin is using the namespace for anything else, we don't want to delete those.

This is a good point. Still, I think we could look for configmaps/secrets that reference the CephCluster or some other CephXYZ kind as a resource owner rather than hardcoding resource names.
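
A sketch of what that owner-based filtering could look like, assuming a standard client-go clientset; the ownedByRook and deleteOwnedConfigMaps helper names are hypothetical:

```go
package cleanup

import (
	"context"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// ownedByRook reports whether any ownerReference points at a ceph.rook.io
// kind, matching configmaps/secrets by ownership instead of hardcoded names.
func ownedByRook(obj metav1.Object) bool {
	for _, ref := range obj.GetOwnerReferences() {
		if strings.HasPrefix(ref.APIVersion, "ceph.rook.io/") {
			return true
		}
	}
	return false
}

// deleteOwnedConfigMaps deletes only the Rook-owned configmaps in a namespace;
// the same loop would work for secrets via clientset.CoreV1().Secrets(...).
func deleteOwnedConfigMaps(ctx context.Context, clientset kubernetes.Interface, namespace string) error {
	cms, err := clientset.CoreV1().ConfigMaps(namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for i := range cms.Items {
		if ownedByRook(&cms.Items[i]) {
			if err := clientset.CoreV1().ConfigMaps(namespace).Delete(ctx, cms.Items[i].Name, metav1.DeleteOptions{}); err != nil {
				return err
			}
		}
	}
	return nil
}
```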

> If this command is most useful for development scenarios so we can avoid scripts like the ones you've rewritten several times, that will be the best scenario for maintaining this tool anyway. It's the perfect scenario to maintain because if the developers see that it doesn't clean up properly, we can fix it.

You are right that using this for dev scenarios is important because developers will regularly see if the script is not working. I would also like to offer that this is a double-edged sword. If the script is not well implemented, it will be a burden on developers who are regularly faced with a broken tool and who get exhausted trying to maintain it as it breaks and needs fixing. I have seen development tools be abandoned over time because they just end up being too much of a constant burden compared to their benefit, and if that happens with this tool, users will lose out as well.

I'm not at all saying "don't do this." What I am trying to say is that, based on my experience, planning the design and implementation of this from the onset will save quite a lot of developer headache in the long run. In my experience, the most robust design here will minimize reliance on Rook-specific APIs as much as possible. I would also encourage considering resource and label names to be part of Rook's implied API. A robust design will center around Rook's most core and unchanging API points. That will be the ceph.rook.io Resource/Kind naming, which I would guess is the least likely thing to change about Rook's implementation in the next 10+ years.

@subhamkrai (Collaborator) left a comment

@Javlopez please add CI tests for this. Thanks

@Javlopez force-pushed the feature/add-command-for-wipping-cluster branch 3 times, most recently from 5ffce6a to e976eca on September 27, 2023 03:11
@subhamkrai (Collaborator) left a comment

@Javlopez Thanks for the PR!

@Javlopez force-pushed the feature/add-command-for-wipping-cluster branch from 3796dd2 to cd7ebc7 on October 1, 2023 00:33
@Javlopez marked this pull request as ready for review on October 1, 2023 00:34
@Javlopez force-pushed the feature/add-command-for-wipping-cluster branch from cd7ebc7 to abb3f42 on October 1, 2023 00:38
@Javlopez force-pushed the feature/add-command-for-wipping-cluster branch 4 times, most recently from d76512c to 3a10c07 on October 3, 2023 02:29
@Javlopez force-pushed the feature/add-command-for-wipping-cluster branch from f5cddd8 to 07e2903 on November 16, 2023 15:42
@travisn (Member) left a comment

This works great now, thanks! Just a couple final suggestions on the messages.

@Javlopez force-pushed the feature/add-command-for-wipping-cluster branch 2 times, most recently from 87163d2 to 8691f05 on November 29, 2023 20:52
@Javlopez force-pushed the feature/add-command-for-wipping-cluster branch 5 times, most recently from ccc7962 to 3542613 on December 12, 2023 02:13
@Javlopez force-pushed the feature/add-command-for-wipping-cluster branch from 51ea48f to 4965b8b on December 13, 2023 17:07
@Javlopez force-pushed the feature/add-command-for-wipping-cluster branch from 5fbca4b to 064d879 on December 13, 2023 23:16
@Javlopez removed the debug-ci label on Dec 13, 2023
@Javlopez force-pushed the feature/add-command-for-wipping-cluster branch 2 times, most recently from 8b532ae to 651d087 on December 14, 2023 16:29
@Javlopez force-pushed the feature/add-command-for-wipping-cluster branch 2 times, most recently from 5c1b3b6 to 9facebd on December 14, 2023 19:59
@travisn (Member) left a comment

Looks good! Just a minor suggestion on the logging.

Added command destroy-cluster to remove all CRDs
Added documentation for destroy-cluster
Added unit tests and integration tests
Added deletion of the toolbox deployment

Signed-off-by: Javier <sjavierlopez@gmail.com>
@Javlopez force-pushed the feature/add-command-for-wipping-cluster branch from 9facebd to f067748 on December 15, 2023 16:56
@travisn merged commit b572c39 into rook:master on Dec 19, 2023
5 checks passed

Successfully merging this pull request may close these issues.

Add a command for wiping a cluster