
restore: restore crd after accidental deletion #170

Merged: 1 commit merged into rook:master on Oct 17, 2023

Conversation

@subhamkrai (Collaborator):

With these changes we'll be able to restore any CR that is stuck in deletion due to some dependencies after accidental deletion.
example:
kubectl rook-ceph -n <ns> restore <cr> <cr_name>

fixes: #68

```go
// RestoreCmd represents the restore commands
var RestoreCmd = &cobra.Command{
	Use:   "restore",
	Short: "Reads the crd and crd name to restore. Ex: restore cephcluster my-cluster",
```
Member:

How about if the CR name is optional? If they only provide the CRD type name, then we can automatically restore all instances that have been marked for deletion.

Collaborator (author):

Before making changes, I'm wondering whether this is a valid condition to support, since deleting a resource requires passing the resource name.

Thoughts?

Member:

To clarify my suggestion, there are two cases:

  1. If they pass the (optional) resource name, then check if that resource is marked for deletion, and then restore it.
  2. If they don't pass the resource name, query for all resources of that type (the name would also be queried in this step), then restore the instances that are marked for deletion.

For CephCluster, there is most commonly only a single resource anyway, but for pools or other resource types perhaps there could be multiple accidentally marked for deletion.
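
For illustration, a minimal sketch of how the optional CR name could be handled in the cobra command, covering the two cases above (restoreIfDeleting and listDeletingCRs are hypothetical helper names, not the PR's actual code):

```go
package cmd

import "github.com/spf13/cobra"

// restoreIfDeleting and listDeletingCRs are hypothetical helpers standing in
// for the plugin's real restore and query logic.
func restoreIfDeleting(crdType, name string) {}
func listDeletingCRs(crdType string) []string { return nil }

var restoreCmd = &cobra.Command{
	Use:   "restore <crd-type> [cr-name]",
	Short: "Restore a CR that is stuck in deletion",
	Args:  cobra.RangeArgs(1, 2),
	Run: func(cmd *cobra.Command, args []string) {
		crdType := args[0]
		if len(args) == 2 {
			// Case 1: a specific CR name was given; restore it if it is marked for deletion.
			restoreIfDeleting(crdType, args[1])
			return
		}
		// Case 2: no name given; query all CRs of this type and restore every
		// instance that is marked for deletion.
		for _, name := range listDeletingCRs(crdType) {
			restoreIfDeleting(crdType, name)
		}
	},
}
```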

Collaborator (author):

> To clarify my suggestion, there are two cases:
>
>   1. If they pass the (optional) resource name, then check if that resource is marked for deletion, and then restore it.
>   2. If they don't pass the resource name, query for all resources of that type (the name would also be queried in this step), then restore the instances that are marked for deletion.

I'm thinking about option 2: how to get the specific Rook resource, like cephcluster, cephfilesystem, etc., and use the Rook API to fetch them with the right call.

> For CephCluster, there is most commonly only a single resource anyway, but for pools or other resource types perhaps there could be multiple accidentally marked for deletion.

Member:

I'm thinking it may be better not to use the Rook API since then the CRD settings would be tied to the version of the API that this plugin is referencing. For this scenario, we need to make a copy of the CR, then delete it, then create it again. So we need to make sure all settings are preserved, independent of the Rook version. So if we can use more generic queries for the CRs it will be better.
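
As one illustration of such generic queries, a rough sketch using the k8s.io/client-go dynamic client (an assumption for this example; the plugin may instead shell out to kubectl) to fetch a CR as unstructured data so every field is preserved regardless of the Rook version:

```go
package restore

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// getCephCR fetches a Rook CR generically. "resource" is the plural CRD name,
// e.g. "cephclusters" or "cephblockpools"; no typed Rook API is involved, so
// all settings are preserved independent of the Rook version.
func getCephCR(ctx context.Context, client dynamic.Interface, namespace, resource, name string) (map[string]interface{}, error) {
	gvr := schema.GroupVersionResource{Group: "ceph.rook.io", Version: "v1", Resource: resource}
	obj, err := client.Resource(gvr).Namespace(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return nil, fmt.Errorf("failed to get %s/%s: %w", resource, name, err)
	}
	return obj.Object, nil
}
```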

@subhamkrai (Collaborator, author):

Also, I need help getting the right error message rather than just the error code from the code below.

```go
func ExecuteBashCommand(command string) string {
	cmd := exec.Command("/bin/bash",
		"-x", // Print commands and their arguments as they are executed
		"-e", // Exit immediately if a command exits with a non-zero status.
		"-m", // Terminal job control, allows job to be terminated by SIGTERM
		"-c", // Command to run
		command,
	)
	stdout, err := cmd.Output()
	if err != nil {
		logging.Fatal(err)
	}
	return string(stdout)
}
```

For example: when running a kubectl command to create a resource that already exists, the above code throws exit code 1 and then exits, but what I really want is to print the Kubernetes error that it already exists ... and then exit.

I tried a few things but failed. Any suggestions?
@travisn @BlaineEXE

Note: I can use os.Stdout and os.Stderr, but I want to return the output from the method and not just log it to the terminal.

@travisn (Member) left a comment:

> Also, I need help getting the right error message rather than just the error code from the code below [...] what I really want is to print the Kubernetes error that it already exists ... and then exit. [...] I can use os.Stdout, os.Stderr but want to return the output from the method and not just log it to the terminal.

Do you see the "Already exists" error code in the stderr? Do we just need to look for that string since no error code is returned?


@subhamkrai subhamkrai marked this pull request as ready for review September 6, 2023 10:02
@subhamkrai (Collaborator, author):

Testing results

Test: passing arguments cr and crName

 ./bin/kubectl-rook-ceph restore cephblockpool replicapool
Warning: rook version 'rook: v1.12.0-alpha.0.227.g3e248bf9a' is running a pre-release version of Rook.

Info: Restoring crd cephblockpool name replicapool
Info: Scaling down the operator to 0
Info: Backing up kubernetes and crd resources
Info: Backed up crd cephblockpool named replicapool in file cephblockpool.yaml
Info: Backed up secret in file secret.yaml
Info: Backed up configmap in file configmap.yaml
Info: deleting validating webhook rook-ceph-webhook if present
Info: Fetching cephblockpool replicapool uid
Info: Successfully fetched uid 0f8b59df-816d-4cc8-b888-ec128b14c308 from cephblockpool replicapool
Info: removing ownerreferences from resources with matching uid 0f8b59df-816d-4cc8-b888-ec128b14c308
Info: removing finalizers from cephblockpool replicapool
Info: cephblockpool.ceph.rook.io/replicapool patched

Info: re-creating the cr cephblockpool from file replicapool.yaml created above
Info: cephblockpool.ceph.rook.io/replicapool created

Info: Scaling up the operator to 1
Info: CR should be successfully restored. Please watch the operator logs and check the crd
~/go/src/github.com/kubectl-rook-ceph
~/go/src/github.com/kubectl-rook-ceph
srai@fedora ~ (restore-crd) $ kc get cephblockpool
NAME          PHASE
replicapool   Progressing
~/go/src/github.com/kubectl-rook-ceph
srai@fedora ~ (restore-crd) $ kc get cephblockpool
NAME          PHASE
replicapool   Ready

Test: passing only cephcluster

./bin/kubectl-rook-ceph  restore cephcluster
Warning: rook version 'rook: v1.12.0-alpha.0.227.g3e248bf9a' is running a pre-release version of Rook.

Info: Restoring crd cephcluster name my-cluster
Info: Scaling down the operator to 0
Info: Backing up kubernetes and crd resources
Info: Backed up crd cephcluster named my-cluster in file cephcluster.yaml
Info: Backed up secret in file secret.yaml
Info: Backed up configmap in file configmap.yaml
Info: deleting validating webhook rook-ceph-webhook if present
Info: Fetching cephcluster my-cluster uid
Info: Successfully fetched uid a20a0b18-52d5-4844-a542-782e2d224f98 from cephcluster my-cluster
Info: removing ownerreferences from resources with matching uid a20a0b18-52d5-4844-a542-782e2d224f98
Info: removing owner references for service rook-ceph-exporter
Info: Removed ownerReference for service: rook-ceph-exporter

Info: removing finalizers from cephcluster my-cluster
Info: cephcluster.ceph.rook.io/my-cluster patched

Info: re-creating the cr cephcluster from file my-cluster.yaml created above
Info: cephcluster.ceph.rook.io/my-cluster created

Info: Scaling up the operator to 1
Info: CR should be successfully restored. Please watch the operator logs and check the crd
~/go/src/github.com/kubectl-rook-ceph
srai@fedora ~ (restore-crd) $ kc get cephcluster
NAME         DATADIRHOSTPATH   MONCOUNT   AGE     PHASE   MESSAGE                        HEALTH      EXTERNAL   FSID
my-cluster   /var/lib/rook     1          7m44s   Ready   Cluster created successfully   HEALTH_OK              5dc8aca4-cb12-47ee-a83e-a9f773b39a0a
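
For context on the "removing finalizers" step in the output above, here is a rough sketch of clearing finalizers with a JSON merge patch so the pending deletion can complete before the CR is re-created. It uses the dynamic client purely for illustration; the plugin itself appears to patch via kubectl.

```go
package restore

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
)

// clearFinalizers removes all finalizers from a CR with a JSON merge patch,
// allowing a deletion that is stuck on finalizers to complete.
func clearFinalizers(ctx context.Context, client dynamic.Interface, ns string, gvr schema.GroupVersionResource, name string) error {
	patch := []byte(`{"metadata":{"finalizers":null}}`)
	_, err := client.Resource(gvr).Namespace(ns).Patch(ctx, name, types.MergePatchType, patch, metav1.PatchOptions{})
	return err
}
```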

@subhamkrai (Collaborator, author):

> Do you see the "Already exists" error code in the stderr? Do we just need to look for that string since no error code is returned?

Yes, in stderr I saw the right error string, but the error returned from cmd.Output() only had the exit code.
Anyway, everything is working as expected now. I had to add cmd.Stderr = os.Stderr and remove a few args when running the command directly with os/exec.
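
For reference, a minimal standalone sketch of the stderr-capturing approach (not necessarily the exact change that landed): buffer stderr and attach it to the returned error so callers see kubectl's message rather than only the exit status. Note that cmd.Output() also records stderr in the returned *exec.ExitError when no Stderr writer is set.

```go
package util

import (
	"bytes"
	"fmt"
	"os/exec"
	"strings"
)

// executeBashCommand runs the command and, on failure, wraps the exit error
// with whatever the command printed to stderr (e.g. kubectl's "already exists" message).
func executeBashCommand(command string) (string, error) {
	cmd := exec.Command("/bin/bash", "-c", command)
	var stdout, stderr bytes.Buffer
	cmd.Stdout = &stdout
	cmd.Stderr = &stderr
	if err := cmd.Run(); err != nil {
		return stdout.String(), fmt.Errorf("%w: %s", err, strings.TrimSpace(stderr.String()))
	}
	return stdout.String(), nil
}
```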

@subhamkrai (Collaborator, author):

@travisn I was thinking of adding a CI test for this as another PR, or do you think I should include it in this PR only?

@travisn (Member) left a comment:

> @travisn I was thinking of adding a CI test for this as another PR, or do you think I should include it in this PR only?

Let's have the CI test with this PR. It's critical that we test this feature fully before anyone tries to use it.

@subhamkrai (Collaborator, author):

@travisn @Madhu-1 Let's discuss the error in this thread.

So currently the flow is: if we get any error after scaling down the rook operator pod and before scaling it back up, I scale the rook operator back up first and then return the error.

What I'm gathering from the comments now is:

  1. Return the error first and still scale the rook operator back up.
  2. Leave it to the users to scale the rook operator back up.

Let me know your thoughts.

@Madhu-1 (Member) commented Sep 7, 2023:

> So currently the flow is: if we get any error after scaling down the rook operator pod and before scaling it back up, I scale the rook operator back up first and then return the error. [...]

IMHO scaling the operator back up makes sense only if it was scaled down by the tool, not by the user; the user might have scaled it down for some other reason.

@subhamkrai (Collaborator, author):

> IMHO scaling the operator back up makes sense only if it was scaled down by the tool, not by the user; the user might have scaled it down for some other reason.

Yes, for this command the tool scales down the operator.
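
A minimal sketch of that flow, with hypothetical helper signatures rather than the PR's actual code: since the tool itself performs the scale-down, it defers the scale-up so the operator is restored on every exit path while any restore error is still returned to the caller.

```go
package restore

// restoreCR sketches the operator scale-down/scale-up flow. scaleOperator and
// doRestore are hypothetical stand-ins for the plugin's real helpers.
func restoreCR(scaleOperator func(replicas int32) error, doRestore func() error) error {
	if err := scaleOperator(0); err != nil {
		return err
	}
	// The tool (not the user) scaled the operator down, so always scale it
	// back up, even when the restore below fails; the error is still returned.
	defer func() {
		_ = scaleOperator(1)
	}()
	return doRestore()
}
```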

@subhamkrai subhamkrai force-pushed the restore-crd branch 6 times, most recently from 180be3d to 63b6e66 on September 7, 2023 15:29
```go
)

var getCrName = `
kubectl -n %s get %s -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.deletionGracePeriodSeconds}{"\n"}{end}' | awk '$2=="0" {print $1}'
```
Member:

> Currently the tool doesn't have the capability of detecting whether it is kubectl or oc. But I think it would be better if, in the future, the tool were able to detect the difference.

If we can't autodetect it yet, we need a CLI arg that lets them override it to oc.
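
A possible sketch of such autodetection (an assumption, not something this PR implements): prefer an explicit override, otherwise use oc when it is on the PATH, and fall back to kubectl.

```go
package util

import "os/exec"

// detectCLI picks the CLI binary to shell out to. override would come from a
// hypothetical flag such as --cli=oc.
func detectCLI(override string) string {
	if override != "" {
		return override
	}
	if _, err := exec.LookPath("oc"); err == nil {
		return "oc"
	}
	return "kubectl"
}
```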

@subhamkrai (Collaborator, author):

Depends on #180.

@subhamkrai subhamkrai force-pushed the restore-crd branch 3 times, most recently from eb37fb2 to 4cd49d0 on September 27, 2023 04:10

```go
// RestoreCmd represents the restore commands
var RestoreCmd = &cobra.Command{
	Use: "restore-deleted",
```
Collaborator (author):


Any suggestions on the name?

Member:

undelete is another option. restore-deleted still sounds better to me, but we could take a vote. Other thoughts?

Collaborator (author):

I'm okay with restore-deleted



@travisn (Member) left a comment:

Just a few small suggestions

docs/crd.md (outdated):
While the underlying Ceph data and daemons continue to be available, the CRs will be stuck indefinitely in a Deleting state in which the operator will not continue to ensure cluster health. Upgrades will be blocked, further updates to the CRs are prevented, and so on. Since Kubernetes does not allow undeleting resources, the command below will allow repairing the CRs without even necessarily suffering cluster downtime.

> [!NOTE]
> If there are multiple deleted resources in the cluster and no specific resource is mentioned, in that case the first resource will be restored and in order to restore all resources we need to re-run the command.
Member:

Suggested change:
- > If there are multiple deleted resources in the cluster and no specific resource is mentioned, in that case the first resource will be restored and in order to restore all resources we need to re-run the command.
+ > If there are multiple deleted resources in the cluster and no specific resource is mentioned, the first resource will be restored. To restore all deleted resources, re-run the command multiple times.

Member:

Actually, could we just loop through all the resources?

Collaborator (author):

Yes, we could loop through all the resources but IMO it's better to restore one resource at a time. It's better for monitoring and debugging if something goes wrong. Also, users may get overwhelmed with all the log messages.

Also, if we later decide to restore all the resources at once, it will be better to do that after Javier's PR, since he is adding code to list resources via the dynamic API, which will help with looping over the resources. Currently we get the resource from the command and then read it as a string.

Member:

Ok we can do this for now, it's really a corner case anyway.
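
If the loop is revisited later, a rough sketch of what it could look like once resources can be listed through the dynamic API (hypothetical code building on the PR mentioned above):

```go
package restore

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// restoreAllDeleting lists every CR of the given type and calls restore for
// each instance that is marked for deletion (non-nil deletionTimestamp).
func restoreAllDeleting(ctx context.Context, client dynamic.Interface, ns string, gvr schema.GroupVersionResource, restore func(name string) error) error {
	list, err := client.Resource(gvr).Namespace(ns).List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, item := range list.Items {
		if item.GetDeletionTimestamp() != nil {
			if err := restore(item.GetName()); err != nil {
				return err
			}
		}
	}
	return nil
}
```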

With these changes we'll be able to restore any
CR that is stuck in deletion due to some dependencies
after accidental deletion.
example:
kubectl rook-ceph -n <ns> restore <cr> <cr_name>

Signed-off-by: subhamkrai <srai@redhat.com>
@subhamkrai subhamkrai merged commit 7e8ae3e into rook:master Oct 17, 2023
5 checks passed
@subhamkrai subhamkrai deleted the restore-crd branch October 17, 2023 02:58

Successfully merging this pull request may close these issues.

Add Krew command for recovering the cephcluster CR after accidental deletion