commands: add command to wiping cluster #173
Conversation
This is a lot of duplication to remove resources, and Rook adds/modifies resources occasionally. This code will likely be a lot of work to keep updated over time, and my experience has shown that this tool will eventually stop being relevant due to skew from Rook over time. I would encourage you/us to consider how to re-implement this more generically so that we aren't (1) duplicating code and (2) giving ourselves a pretty large ongoing maintenance burden for the foreseeable future.
[Updates]
The k8s "dynamic" API allows working with resources without having to strongly type them:
https://itnext.io/generically-working-with-kubernetes-resources-in-go-53bce678f887
I'd suggest querying Kubernetes for all CRD kinds, finding the CRDs that are in the ceph.rook.io group, then deleting all of each Kind using the dynamic API. It can also look for OBCs as a special case, but it should do filtering to make sure it isn't deleting non-rook OBCs. For an initial version of this PR, it'll be cleaner to leave OBCs out and add them in later.
Likely, the only exception to the dynamic approach is CephCluster, which needs special handling to allow setting the cleanup policy.
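A minimal sketch of that dynamic approach, for discussion only (the function name, parameters, and error handling here are illustrative, not part of this PR): list all CRDs, keep only the ceph.rook.io group, and delete every CR of each kind without compiling in any Rook types.

```go
package main

import (
	"context"
	"strings"

	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
	apiextensionsclient "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

// deleteRookCRs deletes every custom resource belonging to a ceph.rook.io
// CRD in the given namespace, using only the dynamic client.
func deleteRookCRs(ctx context.Context, cfg *rest.Config, namespace string) error {
	crdClient, err := apiextensionsclient.NewForConfig(cfg)
	if err != nil {
		return err
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		return err
	}
	crds, err := crdClient.ApiextensionsV1().CustomResourceDefinitions().List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, crd := range crds.Items {
		// Only touch Rook's own API group; skip every other CRD.
		if !strings.HasSuffix(crd.Spec.Group, "ceph.rook.io") {
			continue
		}
		// Keep the sketch simple: only handle namespaced kinds.
		if crd.Spec.Scope != apiextensionsv1.NamespaceScoped {
			continue
		}
		// Any served version is enough to address the objects.
		for _, ver := range crd.Spec.Versions {
			if !ver.Served {
				continue
			}
			gvr := schema.GroupVersionResource{
				Group:    crd.Spec.Group,
				Version:  ver.Name,
				Resource: crd.Spec.Names.Plural,
			}
			if err := dyn.Resource(gvr).Namespace(namespace).DeleteCollection(
				ctx, metav1.DeleteOptions{}, metav1.ListOptions{}); err != nil {
				return err
			}
			break
		}
	}
	return nil
}
```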
Agreed, it will help code maintenance if we can avoid the strongly typed API. If we can use the dynamic API as suggested, I'm thinking we could iterate over all resources in the rook namespace, delete them, and remove their finalizers (a sketch of the finalizer step follows below). As suggested here, let's not worry about the special case for the cluster cleanup policy yet. When I say we should iterate over all resources, we would only iterate over the specific types we expect to delete, such as all the ceph.rook.io custom resources.
@BlaineEXE Are you referring to the strongly typed code being too much to maintain if we don't change it, or what will stop being relevant?
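For the finalizer-removal step mentioned above, one way it could look with the same dynamic client (a sketch; gvr, namespace, and name are placeholders for whatever the iteration yields):

```go
import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
)

// clearFinalizers blanks the finalizer list on one custom resource so a
// pending delete can complete even if the operator is already gone.
func clearFinalizers(ctx context.Context, dyn dynamic.Interface,
	gvr schema.GroupVersionResource, namespace, name string) error {
	patch := []byte(`{"metadata":{"finalizers":null}}`)
	_, err := dyn.Resource(gvr).Namespace(namespace).Patch(
		ctx, name, types.MergePatchType, patch, metav1.PatchOptions{})
	return err
}
```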
Thanks for the feedback @BlaineEXE, totally agree. This PR is actually in draft for that reason; I wasn't sure this approach was the right one. I'll rework this PR based on your suggestion.
I think you and I are on the same page. I'll try to clarify what I mean here by "relevant," though in short, I mean "in/out of skew." Unless the krew plugin version and Rook versions are either strongly coupled or strongly decoupled (what the krew plugin assumes today), the implementation of this feature will likely skew slowly over time compared to ongoing changes in Rook. Eventually, the krew implementation will become sufficiently out of date that it becomes frustrating for users. This risk exists even with the "dynamic" API, but there are at least fewer interaction points that are likely to break or become stale.
IMO, this is also still a little too hardcoded to be resilient to skew. I think the word "specific" is a good indicator. I think the most resilient design would be to effectively try to generically delete anything and everything in a given namespace. The specific-to-rook part would be to delete all CRs that exist for the ceph.rook.io group. I would also suggest leaving OBCs out of the implementation at least initially, if not fully. OBCs can exist for other providers like Noobaa, and the implementation should attempt to not delete other providers' OBCs. Consider that this tool likely isn't going to remove user-created applications that are consuming rbd/cephfs PVCs. And likely, this tool won't implement deletion behavior for COSI resources in the future. I think it's easiest for developers to just stop deletions with this tool outside of the Rook API boundary.
Having written, re-written, and then re-re-written a personal bash script for uninstalling Rook, I personally think that including the cleanup policy is worth it. I also base that on an assumption that this tool's execution does not allow users to choose whether to clean up or not. I assume it to be for development and proof-of-concept clusters only and not something to use on production clusters. IMO, this should only be for dev/poc clusters. I would really avoid doing this for production clusters because it would make it wildly easy for someone to destroy their storage.
The goal of this command, as discussed in the original issue #131, is really a "best effort" tool to help the admin clean up a cluster when they are absolutely done with Rook. We know it's quite frustrating today to clean up a cluster, thus the desire to improve the admin's odds of cleaning one up. Whether it's a test cluster or a production cluster, the admin could choose to use this destructive command, but that's why we are proposing the flag to ask if they really really really mean it. Over time, as we add more CRDs or other resources, the tool should also be updated to clean up those resources. This seems to me like normal software maintenance; I don't see the extra concern for skew.
As long as we only delete resources we know are related to Rook, that is the goal IMO. My point was more that non-rook types such as secrets and configmaps need to be carefully deleted since we only want to delete our own. If the admin is using the namespace for anything else, we don't want to delete those.
Agreed on the OBCs. To make it more clear, I would expect the tool only to delete Rook resources, and not any application resources. OBCs, PVCs, etc all belong to the application. But the tool would be deleting the object store and OBs in the namespace, so the OBCs would be effectively broken just like any rbd or cephfs PVCs.
My assumption was that implementing the cleanupPolicy would complicate the first pass of the implementation. But if this is the main use case anyway (see next paragraph), then we could certainly do it in this PR (a sketch follows below).
While it's certainly not expected that this is used on production clusters where data is still present, the admin could decide they are done with a production cluster and want to destroy all the data. While it really is recommended that they go through the uninstall guide, removing applications and cleaning up Rook resources in the correct order, I don't think we should prevent the tool from being used in that scenario either. If this command is most useful for development scenarios, so we can avoid scripts like the ones you've rewritten several times, that is the best scenario for maintaining this tool anyway. It's the perfect scenario for maintenance because if the developers see that it doesn't clean up properly, we can fix it.
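If the cleanupPolicy does go into this PR, the CephCluster special case could look roughly like the sketch below: patch spec.cleanupPolicy before issuing the delete, so the operator wipes host data when the cluster CR goes away. The "yes-really-destroy-data" string is Rook's documented confirmation value; the function name and parameters are illustrative.

```go
import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
)

// destroyCephCluster enables Rook's cleanup policy on the CephCluster and
// then deletes the CR, letting the operator handle the data wipe.
func destroyCephCluster(ctx context.Context, dyn dynamic.Interface, namespace, name string) error {
	gvr := schema.GroupVersionResource{Group: "ceph.rook.io", Version: "v1", Resource: "cephclusters"}
	patch := []byte(`{"spec":{"cleanupPolicy":{"confirmation":"yes-really-destroy-data"}}}`)
	if _, err := dyn.Resource(gvr).Namespace(namespace).Patch(
		ctx, name, types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
		return err
	}
	return dyn.Resource(gvr).Namespace(namespace).Delete(ctx, name, metav1.DeleteOptions{})
}
```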
It is true that skew is a risk and software must be maintained. We can't escape from that. But I want to offer a recommendation that we consider the maintenance burden of the implementation and to offer my experience that the more we code for "specific" resources or resource types, the more maintenance it will take to keep this script working over time.
This is a good point. Still, I think we could look for configmaps/secrets that reference the CephCluster (for example, via owner references) and delete only those (sketch below).
You are right that using this for dev scenarios is important, because developers will regularly see if the script is not working. I would also like to offer that this is a double-edged sword. If the script is not well implemented, it will be a burden on developers who are regularly faced with a broken tool and who get exhausted trying to maintain it as it breaks and needs fixing. I have seen development tools be abandoned over time because they end up being too much of a constant burden compared to their benefit, and if that happens with this tool, users will lose out as well. I'm not at all saying "don't do this." What I am trying to say is that, based on my experience, planning the design and implementation of this from the onset will save quite a lot of developer headache in the long run. In my experience, the most robust design here will minimize reliance on Rook-specific APIs as much as possible. I would also encourage considering resource and label names to be part of Rook's implied API. A robust design will center around Rook's most core and unchanging API points. That will be the ceph.rook.io CRDs.
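On the configmaps/secrets point, one hedged way to "delete only our own" is to filter on owner references back to the CephCluster. A sketch with the typed core client (the helper name is hypothetical; configmaps would follow the same pattern):

```go
import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deleteOwnedSecrets removes only the secrets whose owner reference points
// at a CephCluster, leaving unrelated secrets in the namespace alone.
func deleteOwnedSecrets(ctx context.Context, kube kubernetes.Interface, namespace string) error {
	secrets, err := kube.CoreV1().Secrets(namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, s := range secrets.Items {
		for _, ref := range s.OwnerReferences {
			if ref.Kind == "CephCluster" {
				if err := kube.CoreV1().Secrets(namespace).Delete(
					ctx, s.Name, metav1.DeleteOptions{}); err != nil {
					return err
				}
				break
			}
		}
	}
	return nil
}
```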
@Javlopez please add CI tests for this. Thanks
@Javlopez Thanks for the PR!
This works great now, thanks! Just a couple final suggestions on the messages.
Looks good! Just a minor suggestion on the logging.
Added command destroy-cluster to remove all CRDs
Added documentation for destroy-cluster
Added unit tests and integration tests
Added deletion of the toolbox deployment

Signed-off-by: Javier <sjavierlopez@gmail.com>
Resolves: #131
This PR adds a new command to destroy a Rook cluster by removing the CRDs and the toolbox deployment.