Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛 Remove dataImage finalizer if BMH is missing #1974

Merged
merged 2 commits into from
Nov 28, 2024

Conversation

hroyrh
Copy link
Member

@hroyrh hroyrh commented Sep 27, 2024

This fixes the issue of dataImage resource not being deleted, if the corresponding BMH is deleted without detaching the dataImage.

@metal3-io-bot metal3-io-bot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Sep 27, 2024
Copy link

@eranco74 eranco74 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@metal3-io-bot
Copy link
Contributor

@eranco74: adding LGTM is restricted to approvers and reviewers in OWNERS files.

In response to this:

/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@dtantsur
Copy link
Member

This is an interesting edge case. Is it because of a forced deletion of the BMH?

@hroyrh
Copy link
Member Author

hroyrh commented Sep 30, 2024

This is an interesting edge case. Is it because of a forced deletion of the BMH?

Do you mean removing BMH's finalizer ? I tested with both - removing the finalizer and waiting for BMH to deprovision, after triggering BMH delete by running kubectl delete bmh node-name. In both cases, the DataImage was left orphaned.

@dtantsur
Copy link
Member

waiting for BMH to deprovision, after triggering BMH delete by running kubectl delete bmh node-name

At least this case should be handled by the BMH controller, not the DataImage controller, I think. ALso, don't we disconnect the ISO on a "normal" deprovisioning?

@hroyrh
Copy link
Member Author

hroyrh commented Oct 1, 2024

At least this case should be handled by the BMH controller, not the DataImage controller, I think. ALso, don't we disconnect the ISO on a "normal" deprovisioning?

Why does that make more sense, won't a force deletion of BMH ( by removing the finalizer ) leave the dataImage orphaned in that case ?
I need to verify whether ISO disconnection happens on a "normal" deprovisioning.

@dtantsur
Copy link
Member

dtantsur commented Oct 1, 2024

Why does that make more sense, won't a force deletion of BMH ( by removing the finalizer ) leave the dataImage orphaned in that case ?

Your PR handles forced deletion, but I don't think it handles the normal deprovisioning case? The BMH controller must decide what to do with DataImages it owns. Either remove the finalizer (esp. when the node is detached) or disconnect the ISO first?

@hroyrh
Copy link
Member Author

hroyrh commented Oct 1, 2024

Your PR handles forced deletion, but I don't think it handles the normal deprovisioning case? The BMH controller must decide what to do with DataImages it owns. Either remove the finalizer (esp. when the node is detached) or disconnect the ISO first?

Makes sense, I will work on the detached and normal delete process. What about the forced deletion scenario, should I keep the current DataImage controller fix in place ?

@eranco74
Copy link

eranco74 commented Oct 1, 2024

ested with both - removing the finalizer and waiting for BMH to deprovision, after triggering BMH delete by running kub

not forced, BMH deletion isn't blocked on dataImage existence or attachment.
You can delete a BMH attached to dataImage, but then you end up with a dataImage you can't delete

@dtantsur
Copy link
Member

dtantsur commented Oct 1, 2024

What about the forced deletion scenario, should I keep the current DataImage controller fix in place ?

I think it's a fair to remove the finalizer if the controller sees that the owner BMH is gone.

@hroyrh hroyrh force-pushed the dataimage-delete-fix branch from c81a3bb to c1a83e1 Compare November 6, 2024 07:41
@metal3-io-bot metal3-io-bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Nov 6, 2024
@mquhuy
Copy link
Member

mquhuy commented Nov 6, 2024

lgtm, but could we have a couple of e2e tests related to this?

@iurygregory
Copy link
Member

lgtm, maybe a separate PR with the e2e would be ok (I do think it would be important)

@hroyrh
Copy link
Member Author

hroyrh commented Nov 8, 2024

Right, I can add the e2e tests in a separate PR.

@elfosardo
Copy link
Member

/approve
let's add the e2e tests as soon as this merges

@metal3-io-bot metal3-io-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 12, 2024
@tuminoid
Copy link
Member

Right, I can add the e2e tests in a separate PR.

#2053 so they're not forgotten, or at least are tracked.

@@ -520,6 +526,11 @@ func (r *BareMetalHostReconciler) actionDeleting(prov provisioner.Provisioner, i
"timestamp", info.host.DeletionTimestamp,
)

// Deleting DataImage if it exists for the BMH
if err := r.removeFinalizerDataImage(info); err != nil {
info.log.Error(err, "Failed to delete DataImage")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we ignore the error here? I'm fine with ignoring HTTP 404 (in case somebody did a nasty force-deletion), but it seems that ignoring any errors may result with orphaned DataImage.

@@ -593,6 +604,11 @@ func hasCustomDeploy(host *metal3api.BareMetalHost) bool {

// detachHost() detaches the host from the Provisioner.
func (r *BareMetalHostReconciler) detachHost(prov provisioner.Provisioner, info *reconcileInfo) actionResult {
// Remove DataImage finalizer if the host has been detached
if err := r.removeFinalizerDataImage(info); err != nil {
info.log.Info("Failed to remove DataImage finalizer, %w", "Error", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question.

Also, %w does nothing, there is no string interpolation here.

@@ -1493,6 +1513,21 @@ func (r *BareMetalHostReconciler) handleDataImageActions(prov provisioner.Provis
return actionContinue{}
}

bmhProvState := info.host.Status.Provisioning.State
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: unnecessary variable

if isProvisioned {
// If the host is provisioned/externallyProvisioned or if its in
// the available state after deprovisioning.
if isProvisioned || isAvailable {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I'm not sure we can enter this function in any other state.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking more about it, I don't believe using Available is the right thing to do. The image won't be disconnected until a reboot, which will happen.. who knows when. I think we should start deprovisioning with deleting a data image.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean just deleting the data image and not detaching it, right ?

@@ -1493,6 +1513,21 @@ func (r *BareMetalHostReconciler) handleDataImageActions(prov provisioner.Provis
return actionContinue{}
}

bmhProvState := info.host.Status.Provisioning.State
isBMHAvailable := bmhProvState == metal3api.StateAvailable
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I'd prefer this function to have an argument like detachOnly bool and keep the provision state logic at the caller side (where you already check for the Available state).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, I will make the changes.

@@ -1493,6 +1513,21 @@ func (r *BareMetalHostReconciler) handleDataImageActions(prov provisioner.Provis
return actionContinue{}
}

bmhProvState := info.host.Status.Provisioning.State
isBMHAvailable := bmhProvState == metal3api.StateAvailable
hostDeleteTriggered := !info.host.DeletionTimestamp.IsZero()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same

if attachedURL != "" || dirty {
info.log.Info("Detaching DataImage as it was deleted")
info.log.Info("Detaching DataImage, Reason = %w", "Reason", detachReason)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The %w part will be logged verbatim, there is no string interpolation here.

detachReason = "Delete DataImage"
}
if isBMHAvailable {
detachReason = "BMH Deprovision finished, node in available state"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we going to leave a stale DataImage object in this case? If yes, it will behave unpredictably on the next deployment because it will still have a non-empty URL.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I will also set the spec.url: "".

Copy link
Member

@dtantsur dtantsur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments inline.

On top of that, I think this PR tries to solve at least 3 related but distinct problem (removing DataImage on detachment, on deletion, fixing orphaned DataImage objects). I'd rather have 3 PRs.

Signed-off-by: Himanshu Roy <hroy@redhat.com>
@hroyrh hroyrh force-pushed the dataimage-delete-fix branch from c1a83e1 to 62c7b19 Compare November 22, 2024 11:42
@metal3-io-bot metal3-io-bot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 22, 2024
@hroyrh
Copy link
Member Author

hroyrh commented Nov 22, 2024

I have limited this PR to deleting orphaned DataImage ( BMH missing ) and reducing the reconcile retry interval for detached hosts. I will create separate PRs for the other issues.

Signed-off-by: Himanshu Roy <hroy@redhat.com>
(cherry picked from commit be67a8912b5d3a0f0f3968bff02279e6ea24561a)
@hroyrh hroyrh force-pushed the dataimage-delete-fix branch from 62c7b19 to f634fec Compare November 22, 2024 11:44
@metal3-io-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dtantsur, elfosardo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@dtantsur
Copy link
Member

/lgtm

@metal3-io-bot metal3-io-bot added the lgtm Indicates that a PR is ready to be merged. label Nov 28, 2024
@metal3-io-bot metal3-io-bot merged commit db01dbd into metal3-io:main Nov 28, 2024
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants