Bug 1820472: Finalize namespace when it is stuck in 'Terminating' after migration #633

Closed
wants to merge 1 commit into from

Conversation

pliurh
Contributor

@pliurh pliurh commented May 11, 2020

As the controller-runtime client does not yet support subresources like spec/finalizers,
this PR uses client-go to invoke the Finalize API of Namespace.

@openshift-ci-robot openshift-ci-robot added the bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. label May 11, 2020
@openshift-ci-robot
Contributor

@pliurh: This pull request references Bugzilla bug 1820472, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.5.0) matches configured target release for branch (4.5.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1820472: Finalize namespace when it is stuck in 'Terminating' after migration

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label May 11, 2020
@openshift-ci-robot
Contributor

@pliurh: This pull request references Bugzilla bug 1820472, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.5.0) matches configured target release for branch (4.5.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Finalize namespace when it is stuck in 'Terminating' after migration.
As the controller-runtime client does not yet support subresources,
use client-go to invoke the Finalize API of Namespace.
@pliurh
Contributor Author

pliurh commented May 11, 2020

/test e2e-ovn-hybrid-step-registry

if err != nil {
	return errors.Wrapf(err, "could not create clientset")
}
if _, err := clientset.CoreV1().Namespaces().Finalize(ns); err != nil {
Contributor

@juanluisvaladas juanluisvaladas May 13, 2020

I may be wrong, but I think you can do instead:
if _, err := client.CoreV1().Namespaces().Finalize(ns); err

And avoid instantiating a new client, removing the lines 64-72

Edit: ah no you can't, disregard this comment.

Contributor Author

The client here is not a k8s client-go clientset, but a controller-runtime client which doesn't support the finalize call of the namespace object. kubernetes-sigs/controller-runtime#573

Contributor

That's not going to work. He wants to access the Finalize method, which is not exposed by k8sclient.Client, but there might be a smoother way to do this than instantiating a new clientset like this?

@juanluisvaladas
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 13, 2020
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: juanluisvaladas, pliurh
To complete the pull request process, please assign knobunc
You can assign the PR to them by writing /assign @knobunc in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@alexanderConstantinescu
Contributor

I am not convinced this should be done in ApplyObject. We already have a place for cleaning up old resources that are no longer deployed by the CNO: deleteRelatedObjectsNotRendered in status_manager.go. Maybe adapt that method to do this?

}
if ns.Status.Phase == v1.NamespaceTerminating {
	log.Printf("finalize Namespace %s", objDesc)
	ns.Spec.Finalizers = []v1.FinalizerName{}
Contributor

Hang on a moment - you can't just go and delete all finalizers. You should only delete the finalizer for which you are responsible.

Contributor Author

This logic is only triggered when the user hits a problem with OVN after the migration and node reboot. At that point the OpenShift API is unavailable (because OVN doesn't work), so the finalizer of the openshift-sdn namespace cannot be removed. In that situation the user cannot roll back: the openshift-sdn namespace hangs in Terminating, so new objects cannot be recreated in it.
So either we ask the user to manually invoke the finalize API before rolling back, or we add this logic to the code before recreating the openshift-sdn namespace.

@squeed
Contributor

squeed commented May 13, 2020

/hold
I don't think this is quite right - removing all finalizers is pretty dangerous.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 13, 2020
@pliurh
Contributor Author

pliurh commented May 13, 2020

I am not convinced this should be done in ApplyObject. We already have a place for cleaning up old resources that are no longer deployed by the CNO: deleteRelatedObjectsNotRendered in status_manager.go. Maybe adapt that method to do this?

Normally, it will work. However, if OVN doesn't work after the migration, the finalizers in the namespace cannot be removed as expected, and the namespace will be stuck in Terminating forever, which prevents the user from rolling back to openshift-sdn.

@alexanderConstantinescu
Contributor

alexanderConstantinescu commented May 13, 2020

I am not convinced this should be done in ApplyObject. We already have a place for cleaning up old resources that are no longer deployed by the CNO: deleteRelatedObjectsNotRendered in status_manager.go. Maybe adapt that method to do this?

Normally, it will work. However, if OVN doesn't work after the migration, the finalizers in the namespace cannot be removed as expected, and the namespace will be stuck in Terminating forever, which prevents the user from rolling back to openshift-sdn.

I understand what you mean, I read the BZ. I am not saying "you should not do this", rather: "you might think about not doing it there and instead in deleteRelatedObjectsNotRendered", because that's where we clean up old resources left over from previous deployments (which the openshift-sdn namespace qualifies as, after an upgrade to ovn-kubernetes).

@squeed
Contributor

squeed commented May 13, 2020

Yeah, as written this can't be merged. We cannot just go deleting all namespace finalizers. That will cause all sorts of unpleasantness, like possibly orphaned pods.

@pliurh
Contributor Author

pliurh commented May 13, 2020

Let me clarify the logic a bit. In this patch the finalize call is only invoked when the openshift-sdn namespace is stuck in the Terminating state AND the user has chosen to roll back, so that the openshift-sdn namespace is rendered and created by the CNO again. It therefore affects neither a namespace in a normal state nor a namespace after a successful migration, when it can be deleted as expected.

The reason we should not put this logic in deleteRelatedObjectsNotRendered is that there we cannot tell whether the openshift-sdn namespace is stuck in Terminating forever, which requires a forced deletion, OR is just waiting for all the resources in the namespace to be deleted.
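
The condition described above can be sketched as a small predicate. This is an illustration only: `shouldForceFinalize` and its `renderedAgain` parameter are hypothetical names, with `renderedAgain` standing for "the CNO is rendering this namespace again because of a rollback".

```go
package main

import "fmt"

// Phase values as they appear in a namespace's status.phase.
const (
	phaseActive      = "Active"
	phaseTerminating = "Terminating"
)

// shouldForceFinalize (hypothetical) reports whether the finalize call would
// be made: only when the namespace is in Terminating AND it is being rendered
// again, i.e. the user is rolling back.
func shouldForceFinalize(phase string, renderedAgain bool) bool {
	return phase == phaseTerminating && renderedAgain
}

func main() {
	fmt.Println(shouldForceFinalize(phaseTerminating, true))  // rollback with a stuck namespace: true
	fmt.Println(shouldForceFinalize(phaseTerminating, false)) // migration still deleting the namespace: false
	fmt.Println(shouldForceFinalize(phaseActive, true))       // namespace in a normal state: false
}
```

Note that the predicate cannot distinguish a namespace that is stuck forever from one that is still draining, which is the limitation raised for deleteRelatedObjectsNotRendered.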

@pliurh
Contributor Author

pliurh commented May 13, 2020

Yeah, as written this can't be merged. We cannot just go deleting all namespace finalizers. That will cause all sorts of unpleasantness, like possibly orphaned pods.

There will be no orphaned pods; all the openshift-sdn pods are removed before the reboot, and this finalize call should only be triggered after the reboot, by the rollback operation. However, it could be an issue for the objects under the openshift-sdn namespace that are controlled by the OpenShift API. To solve that, either

  1. we don't delete the namespace during the migration, or
  2. we manipulate the order of object deletion in deleteRelatedObjectsNotRendered, to make sure the openshift-sdn pods are not deleted before all the openshift-api objects are deleted.

But I don't know whether option 2 is practical.

@pliurh
Contributor Author

pliurh commented May 15, 2020

Replaced by #641

Labels
bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm Indicates that a PR is ready to be merged.