
restore: restore crd after accidental deletion #170

Merged: 1 commit merged into rook:master on Oct 17, 2023

Conversation

@subhamkrai (Collaborator):

With these changes we'll be able to restore any CR that is stuck in deletion due to some dependencies after accidental deletion.
example:
kubectl rook-ceph -n <ns> restore <cr> <cr_name>

fixes: #68

```go
// RestoreCmd represents the restore commands
var RestoreCmd = &cobra.Command{
	Use:   "restore",
	Short: "Reads the crd and crd name to restore. Ex: restore cephcluster my-cluster",
```
Member:

How about if the CR name is optional? If they only provide the CRD type name, then we can automatically restore all instances that have been marked for deletion.

Collaborator (author):

Before making changes, I'm wondering whether this is a valid condition to support, since deleting a resource requires passing the resource name.

Thoughts?

Member:

To clarify my suggestion, there are two cases:

  1. If they pass the (optional) resource name, then check if that resource is marked for deletion, and then restore it.
  2. If they don't pass the resource name, query for all resources of that type (the name would also be queried in this step), then restore the instances that are marked for deletion.

For CephCluster, there is most commonly only a single resource anyway, but for pools or other resource types perhaps there could be multiple accidentally marked for deletion.
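
For illustration, a minimal sketch of how the optional CR name could be handled in the cobra command, covering the two cases above (restoreIfDeleting and listDeletingCRs are hypothetical helper names, not the PR's actual code):

```go
package cmd

import "github.com/spf13/cobra"

// restoreIfDeleting and listDeletingCRs are hypothetical helpers standing in
// for the plugin's real restore and query logic.
func restoreIfDeleting(crdType, name string) {}
func listDeletingCRs(crdType string) []string { return nil }

var restoreCmd = &cobra.Command{
	Use:   "restore <crd-type> [cr-name]",
	Short: "Restore a CR that is stuck in deletion",
	Args:  cobra.RangeArgs(1, 2),
	Run: func(cmd *cobra.Command, args []string) {
		crdType := args[0]
		if len(args) == 2 {
			// Case 1: a specific CR name was given; restore it if it is marked for deletion.
			restoreIfDeleting(crdType, args[1])
			return
		}
		// Case 2: no name given; query all CRs of this type and restore every
		// instance that is marked for deletion.
		for _, name := range listDeletingCRs(crdType) {
			restoreIfDeleting(crdType, name)
		}
	},
}
```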

Collaborator (author):

> To clarify my suggestion, there are two cases:
>
>   1. If they pass the (optional) resource name, then check if that resource is marked for deletion, and then restore it.
>   2. If they don't pass the resource name, query for all resources of that type (the name would also be queried in this step), then restore the instances that are marked for deletion.

I'm thinking about option 2: how to get the specific Rook resource, like cephcluster, cephfilesystem, etc., and use the Rook API to fetch them with the right call.

> For CephCluster, there is most commonly only a single resource anyway, but for pools or other resource types perhaps there could be multiple accidentally marked for deletion.

Member:

I'm thinking it may be better not to use the Rook API since then the CRD settings would be tied to the version of the API that this plugin is referencing. For this scenario, we need to make a copy of the CR, then delete it, then create it again. So we need to make sure all settings are preserved, independent of the Rook version. So if we can use more generic queries for the CRs it will be better.
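
As one illustration of such generic queries, a rough sketch using the k8s.io/client-go dynamic client (an assumption for this example; the plugin may instead shell out to kubectl) to fetch a CR as unstructured data so every field is preserved regardless of the Rook version:

```go
package restore

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// getCephCR fetches a Rook CR generically. "resource" is the plural CRD name,
// e.g. "cephclusters" or "cephblockpools"; no typed Rook API is involved, so
// all settings are preserved independent of the Rook version.
func getCephCR(ctx context.Context, client dynamic.Interface, namespace, resource, name string) (map[string]interface{}, error) {
	gvr := schema.GroupVersionResource{Group: "ceph.rook.io", Version: "v1", Resource: resource}
	obj, err := client.Resource(gvr).Namespace(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return nil, fmt.Errorf("failed to get %s/%s: %w", resource, name, err)
	}
	return obj.Object, nil
}
```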

@subhamkrai (Collaborator, author):

Also, I need help getting the right error message rather than just the error code from the code below.

```go
func ExecuteBashCommand(command string) string {
	cmd := exec.Command("/bin/bash",
		"-x", // Print commands and their arguments as they are executed
		"-e", // Exit immediately if a command exits with a non-zero status.
		"-m", // Terminal job control, allows job to be terminated by SIGTERM
		"-c", // Command to run
		command,
	)
	stdout, err := cmd.Output()
	if err != nil {
		logging.Fatal(err)
	}
	return string(stdout)
}
```

For example: when running a kubectl command to create a resource that already exists, the above code throws exit code 1 and then exits, but what I really want is to print the Kubernetes error that it already exists ... and then exit.

I tried a few things but failed. Any suggestions?
@travisn @BlaineEXE

Note: I can use os.Stdout and os.Stderr, but I want to return the output from the method and not just log it to the terminal.

@travisn (Member) left a comment:

> Also, I need help getting the right error message rather than just the error code from the code below [...] what I really want is to print the Kubernetes error that it already exists ... and then exit. [...] I can use os.Stdout, os.Stderr but want to return the output from the method and not just log it to the terminal.

Do you see the "Already exists" error code in the stderr? Do we just need to look for that string since no error code is returned?


@subhamkrai subhamkrai marked this pull request as ready for review September 6, 2023 10:02
@subhamkrai (Collaborator, author):

Testing results

Test: passing arguments cr and crName

 ./bin/kubectl-rook-ceph restore cephblockpool replicapool
Warning: rook version 'rook: v1.12.0-alpha.0.227.g3e248bf9a' is running a pre-release version of Rook.

Info: Restoring crd cephblockpool name replicapool
Info: Scaling down the operator to 0
Info: Backing up kubernetes and crd resources
Info: Backed up crd cephblockpool named replicapool in file cephblockpool.yaml
Info: Backed up secret in file secret.yaml
Info: Backed up configmap in file configmap.yaml
Info: deleting validating webhook rook-ceph-webhook if present
Info: Fetching cephblockpool replicapool uid
Info: Successfully fetched uid 0f8b59df-816d-4cc8-b888-ec128b14c308 from cephblockpool replicapool
Info: removing ownerreferences from resources with matching uid 0f8b59df-816d-4cc8-b888-ec128b14c308
Info: removing finalizers from cephblockpool replicapool
Info: cephblockpool.ceph.rook.io/replicapool patched

Info: re-creating the cr cephblockpool from file replicapool.yaml created above
Info: cephblockpool.ceph.rook.io/replicapool created

Info: Scaling up the operator to 1
Info: CR should be successfully restored. Please watch the operator logs and check the crd
~/go/src/github.com/kubectl-rook-ceph
~/go/src/github.com/kubectl-rook-ceph
srai@fedora ~ (restore-crd) $ kc get cephblockpool
NAME          PHASE
replicapool   Progressing
~/go/src/github.com/kubectl-rook-ceph
srai@fedora ~ (restore-crd) $ kc get cephblockpool
NAME          PHASE
replicapool   Ready

Test: passing only cephcluster

./bin/kubectl-rook-ceph  restore cephcluster
Warning: rook version 'rook: v1.12.0-alpha.0.227.g3e248bf9a' is running a pre-release version of Rook.

Info: Restoring crd cephcluster name my-cluster
Info: Scaling down the operator to 0
Info: Backing up kubernetes and crd resources
Info: Backed up crd cephcluster named my-cluster in file cephcluster.yaml
Info: Backed up secret in file secret.yaml
Info: Backed up configmap in file configmap.yaml
Info: deleting validating webhook rook-ceph-webhook if present
Info: Fetching cephcluster my-cluster uid
Info: Successfully fetched uid a20a0b18-52d5-4844-a542-782e2d224f98 from cephcluster my-cluster
Info: removing ownerreferences from resources with matching uid a20a0b18-52d5-4844-a542-782e2d224f98
Info: removing owner references for service rook-ceph-exporter
Info: Removed ownerReference for service: rook-ceph-exporter

Info: removing finalizers from cephcluster my-cluster
Info: cephcluster.ceph.rook.io/my-cluster patched

Info: re-creating the cr cephcluster from file my-cluster.yaml created above
Info: cephcluster.ceph.rook.io/my-cluster created

Info: Scaling up the operator to 1
Info: CR should be successfully restored. Please watch the operator logs and check the crd
~/go/src/github.com/kubectl-rook-ceph
srai@fedora ~ (restore-crd) $ kc get cephcluster
NAME         DATADIRHOSTPATH   MONCOUNT   AGE     PHASE   MESSAGE                        HEALTH      EXTERNAL   FSID
my-cluster   /var/lib/rook     1          7m44s   Ready   Cluster created successfully   HEALTH_OK              5dc8aca4-cb12-47ee-a83e-a9f773b39a0a
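
For context on the "removing finalizers" step in the output above, here is a rough sketch of clearing finalizers with a JSON merge patch so the pending deletion can complete before the CR is re-created. It uses the dynamic client purely for illustration; the plugin itself appears to patch via kubectl.

```go
package restore

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
)

// clearFinalizers removes all finalizers from a CR with a JSON merge patch,
// allowing a deletion that is stuck on finalizers to complete.
func clearFinalizers(ctx context.Context, client dynamic.Interface, ns string, gvr schema.GroupVersionResource, name string) error {
	patch := []byte(`{"metadata":{"finalizers":null}}`)
	_, err := client.Resource(gvr).Namespace(ns).Patch(ctx, name, types.MergePatchType, patch, metav1.PatchOptions{})
	return err
}
```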

@subhamkrai (Collaborator, author):

> Do you see the "Already exists" error code in the stderr? Do we just need to look for that string since no error code is returned?

Yes, in stderr I saw the right error string, but the error returned from cmd.Output() only had the exit code.
Anyway, everything is working as expected now. I had to add cmd.Stderr = os.Stderr and remove a few args when running the command directly with os/exec.
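
For reference, a minimal standalone sketch of the stderr-capturing approach (not necessarily the exact change that landed): buffer stderr and attach it to the returned error so callers see kubectl's message rather than only the exit status. Note that cmd.Output() also records stderr in the returned *exec.ExitError when no Stderr writer is set.

```go
package util

import (
	"bytes"
	"fmt"
	"os/exec"
	"strings"
)

// executeBashCommand runs the command and, on failure, wraps the exit error
// with whatever the command printed to stderr (e.g. kubectl's "already exists" message).
func executeBashCommand(command string) (string, error) {
	cmd := exec.Command("/bin/bash", "-c", command)
	var stdout, stderr bytes.Buffer
	cmd.Stdout = &stdout
	cmd.Stderr = &stderr
	if err := cmd.Run(); err != nil {
		return stdout.String(), fmt.Errorf("%w: %s", err, strings.TrimSpace(stderr.String()))
	}
	return stdout.String(), nil
}
```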

@subhamkrai (Collaborator, author):

@travisn I was thinking of adding a CI test for this as another PR, or do you think I should include it in this PR only?

@travisn (Member) left a comment:

> @travisn I was thinking of adding a CI test for this as another PR, or do you think I should include it in this PR only?

Let's have the CI test with this PR. It's critical that we test this feature fully before anyone tries to use it.

@subhamkrai (Collaborator, author):

@travisn @Madhu-1 Let's discuss the error in this thread.

So currently the flow is: if we get any error after scaling down the rook operator pod and before scaling it back up, I scale the rook operator back up first and then return the error.

What I'm gathering from the comments now is:

  1. Return the error first and still scale the rook operator back up.
  2. Leave it to the users to scale the rook operator back up.

Let me know your thoughts.

@Madhu-1 (Member) commented Sep 7, 2023:

> So currently the flow is: if we get any error after scaling down the rook operator pod and before scaling it back up, I scale the rook operator back up first and then return the error. [...]

IMHO scaling the operator back up makes sense only if it was scaled down by the tool, not by the user; the user might have scaled it down for some other reason.

@subhamkrai (Collaborator, author):

> IMHO scaling the operator back up makes sense only if it was scaled down by the tool, not by the user; the user might have scaled it down for some other reason.

Yes, for this command the tool scales down the operator.
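
A minimal sketch of that flow, with hypothetical helper signatures rather than the PR's actual code: since the tool itself performs the scale-down, it defers the scale-up so the operator is restored on every exit path while any restore error is still returned to the caller.

```go
package restore

// restoreCR sketches the operator scale-down/scale-up flow. scaleOperator and
// doRestore are hypothetical stand-ins for the plugin's real helpers.
func restoreCR(scaleOperator func(replicas int32) error, doRestore func() error) error {
	if err := scaleOperator(0); err != nil {
		return err
	}
	// The tool (not the user) scaled the operator down, so always scale it
	// back up, even when the restore below fails; the error is still returned.
	defer func() {
		_ = scaleOperator(1)
	}()
	return doRestore()
}
```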

@subhamkrai subhamkrai force-pushed the restore-crd branch 6 times, most recently from 180be3d to 63b6e66 on September 7, 2023 15:29
```go
)

var getCrName = `
kubectl -n %s get %s -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.deletionGracePeriodSeconds}{"\n"}{end}' | awk '$2=="0" {print $1}'
```
Member:

> Currently the tool doesn't have the capability of detecting whether it is kubectl or oc. But I think it would be better if, in the future, the tool were able to detect the difference.

If we can't autodetect it yet, we need a CLI arg that lets them override it to oc.
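
A possible sketch of such autodetection (an assumption, not something this PR implements): prefer an explicit override, otherwise use oc when it is on the PATH, and fall back to kubectl.

```go
package util

import "os/exec"

// detectCLI picks the CLI binary to shell out to. override would come from a
// hypothetical flag such as --cli=oc.
func detectCLI(override string) string {
	if override != "" {
		return override
	}
	if _, err := exec.LookPath("oc"); err == nil {
		return "oc"
	}
	return "kubectl"
}
```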

@subhamkrai (Collaborator, author):

Depends on #180.

@subhamkrai subhamkrai force-pushed the restore-crd branch 3 times, most recently from eb37fb2 to 4cd49d0 on September 27, 2023 04:10

```go
// RestoreCmd represents the restore commands
var RestoreCmd = &cobra.Command{
	Use: "restore-deleted",
```
Collaborator (author):


Any suggestions on the name?

Member:

undelete is another option. restore-deleted still sounds better to me, but we could take a vote. Other thoughts?

Collaborator (author):

I'm okay with restore-deleted



@travisn (Member) left a comment:

Just a few small suggestions

docs/crd.md (outdated):
While the underlying Ceph data and daemons continue to be available, the CRs will be stuck indefinitely in a Deleting state in which the operator will not continue to ensure cluster health. Upgrades will be blocked, further updates to the CRs are prevented, and so on. Since Kubernetes does not allow undeleting resources, the command below will allow repairing the CRs without even necessarily suffering cluster downtime.

> [!NOTE]
> If there are multiple deleted resources in the cluster and no specific resource is mentioned, in that case the first resource will be restored and in order to restore all resources we need to re-run the command.
Member:

Suggested change:
- > If there are multiple deleted resources in the cluster and no specific resource is mentioned, in that case the first resource will be restored and in order to restore all resources we need to re-run the command.
+ > If there are multiple deleted resources in the cluster and no specific resource is mentioned, the first resource will be restored. To restore all deleted resources, re-run the command multiple times.

Member:

Actually, could we just loop through all the resources?

Collaborator (author):

Yes, we could loop through all the resources but IMO it's better to restore one resource at a time. It's better for monitoring and debugging if something goes wrong. Also, users may get overwhelmed with all the log messages.

Also, if we later decide to restore all the resources at once, it will be better to do that after Javier's PR, since he is adding code to list resources via the dynamic API, which will help with looping over the resources. Currently we get the resource from the command and then read it as a string.

Member:

Ok we can do this for now, it's really a corner case anyway.
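
If the loop is revisited later, a rough sketch of what it could look like once resources can be listed through the dynamic API (hypothetical code building on the PR mentioned above):

```go
package restore

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// restoreAllDeleting lists every CR of the given type and calls restore for
// each instance that is marked for deletion (non-nil deletionTimestamp).
func restoreAllDeleting(ctx context.Context, client dynamic.Interface, ns string, gvr schema.GroupVersionResource, restore func(name string) error) error {
	list, err := client.Resource(gvr).Namespace(ns).List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, item := range list.Items {
		if item.GetDeletionTimestamp() != nil {
			if err := restore(item.GetName()); err != nil {
				return err
			}
		}
	}
	return nil
}
```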

With these changes we'll be able to restore any
CR that is stuck in deletion due to some dependencies
after accidental deletion.
example:
kubectl rook-ceph -n <ns> restore <cr> <cr_name>

Signed-off-by: subhamkrai <srai@redhat.com>
@subhamkrai subhamkrai merged commit 7e8ae3e into rook:master Oct 17, 2023
5 checks passed
@subhamkrai subhamkrai deleted the restore-crd branch October 17, 2023 02:58

Successfully merging this pull request may close these issues.

Add Krew command for recovering the cephcluster CR after accidental deletion