-
Notifications
You must be signed in to change notification settings - Fork 256
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
🐛 Remove dataImage finalizer if BMH is missing #1974
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
@eranco74: adding LGTM is restricted to approvers and reviewers in OWNERS files. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
This is an interesting edge case. Is it because of a forced deletion of the BMH? |
Do you mean removing BMH's finalizer ? I tested with both - removing the finalizer and waiting for BMH to deprovision, after triggering BMH delete by running |
At least this case should be handled by the BMH controller, not the DataImage controller, I think. ALso, don't we disconnect the ISO on a "normal" deprovisioning? |
Why does that make more sense, won't a force deletion of BMH ( by removing the finalizer ) leave the dataImage orphaned in that case ? |
Your PR handles forced deletion, but I don't think it handles the normal deprovisioning case? The BMH controller must decide what to do with DataImages it owns. Either remove the finalizer (esp. when the node is detached) or disconnect the ISO first? |
Makes sense, I will work on the detached and normal delete process. What about the forced deletion scenario, should I keep the current DataImage controller fix in place ? |
not forced, BMH deletion isn't blocked on dataImage existence or attachment. |
I think it's a fair to remove the finalizer if the controller sees that the owner BMH is gone. |
c81a3bb
to
c1a83e1
Compare
lgtm, but could we have a couple of e2e tests related to this? |
lgtm, maybe a separate PR with the e2e would be ok (I do think it would be important) |
Right, I can add the e2e tests in a separate PR. |
/approve |
#2053 so they're not forgotten, or at least are tracked. |
@@ -520,6 +526,11 @@ func (r *BareMetalHostReconciler) actionDeleting(prov provisioner.Provisioner, i | |||
"timestamp", info.host.DeletionTimestamp, | |||
) | |||
|
|||
// Deleting DataImage if it exists for the BMH | |||
if err := r.removeFinalizerDataImage(info); err != nil { | |||
info.log.Error(err, "Failed to delete DataImage") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we ignore the error here? I'm fine with ignoring HTTP 404 (in case somebody did a nasty force-deletion), but it seems that ignoring any errors may result with orphaned DataImage.
@@ -593,6 +604,11 @@ func hasCustomDeploy(host *metal3api.BareMetalHost) bool { | |||
|
|||
// detachHost() detaches the host from the Provisioner. | |||
func (r *BareMetalHostReconciler) detachHost(prov provisioner.Provisioner, info *reconcileInfo) actionResult { | |||
// Remove DataImage finalizer if the host has been detached | |||
if err := r.removeFinalizerDataImage(info); err != nil { | |||
info.log.Info("Failed to remove DataImage finalizer, %w", "Error", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question.
Also, %w does nothing, there is no string interpolation here.
@@ -1493,6 +1513,21 @@ func (r *BareMetalHostReconciler) handleDataImageActions(prov provisioner.Provis | |||
return actionContinue{} | |||
} | |||
|
|||
bmhProvState := info.host.Status.Provisioning.State |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: unnecessary variable
if isProvisioned { | ||
// If the host is provisioned/externallyProvisioned or if its in | ||
// the available state after deprovisioning. | ||
if isProvisioned || isAvailable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I'm not sure we can enter this function in any other state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking more about it, I don't believe using Available is the right thing to do. The image won't be disconnected until a reboot, which will happen.. who knows when. I think we should start deprovisioning with deleting a data image.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean just deleting the data image and not detaching it, right ?
@@ -1493,6 +1513,21 @@ func (r *BareMetalHostReconciler) handleDataImageActions(prov provisioner.Provis | |||
return actionContinue{} | |||
} | |||
|
|||
bmhProvState := info.host.Status.Provisioning.State | |||
isBMHAvailable := bmhProvState == metal3api.StateAvailable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I'd prefer this function to have an argument like detachOnly bool
and keep the provision state logic at the caller side (where you already check for the Available state).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, I will make the changes.
@@ -1493,6 +1513,21 @@ func (r *BareMetalHostReconciler) handleDataImageActions(prov provisioner.Provis | |||
return actionContinue{} | |||
} | |||
|
|||
bmhProvState := info.host.Status.Provisioning.State | |||
isBMHAvailable := bmhProvState == metal3api.StateAvailable | |||
hostDeleteTriggered := !info.host.DeletionTimestamp.IsZero() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same
if attachedURL != "" || dirty { | ||
info.log.Info("Detaching DataImage as it was deleted") | ||
info.log.Info("Detaching DataImage, Reason = %w", "Reason", detachReason) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The %w part will be logged verbatim, there is no string interpolation here.
detachReason = "Delete DataImage" | ||
} | ||
if isBMHAvailable { | ||
detachReason = "BMH Deprovision finished, node in available state" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we going to leave a stale DataImage object in this case? If yes, it will behave unpredictably on the next deployment because it will still have a non-empty URL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I will also set the spec.url: ""
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments inline.
On top of that, I think this PR tries to solve at least 3 related but distinct problem (removing DataImage on detachment, on deletion, fixing orphaned DataImage objects). I'd rather have 3 PRs.
Signed-off-by: Himanshu Roy <hroy@redhat.com>
c1a83e1
to
62c7b19
Compare
I have limited this PR to deleting orphaned DataImage ( BMH missing ) and reducing the reconcile retry interval for detached hosts. I will create separate PRs for the other issues. |
Signed-off-by: Himanshu Roy <hroy@redhat.com> (cherry picked from commit be67a8912b5d3a0f0f3968bff02279e6ea24561a)
62c7b19
to
f634fec
Compare
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: dtantsur, elfosardo The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/lgtm |
This fixes the issue of dataImage resource not being deleted, if the corresponding BMH is deleted without detaching the dataImage.