-
Notifications
You must be signed in to change notification settings - Fork 228
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix: remove workload finalizer if workload is finished #1523
Fix: remove workload finalizer if workload is finished #1523
Conversation
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
|
@@ -133,8 +134,8 @@ func (r *WorkloadReconciler) Reconcile(ctx context.Context, req ctrl.Request) (c | |||
ctx = ctrl.LoggerInto(ctx, log) | |||
log.V(2).Info("Reconciling Workload") | |||
|
|||
if apimeta.IsStatusConditionTrue(wl.Status.Conditions, kueue.WorkloadFinished) { | |||
return ctrl.Result{}, nil | |||
if apimeta.IsStatusConditionTrue(wl.Status.Conditions, kueue.WorkloadFinished) && len(wl.ObjectMeta.OwnerReferences) == 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to check if the workload has ownerReference?
Will Argo remove ownerReference from the workload?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If workload has an owner reference, there's a possibility that the owner job should be stopped:
err := r.stopJob(ctx, job, wl, StopReasonWorkloadDeleted, "Workload is deleted") |
We will introduce a race condition between the workload and job controller if we don't check for the owner references here.
And the workload is finalized here for the uncommon scenario, when the workload has no owner references, so the reconcile for respective job could not be queued.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should look more like this:
if len(owners) == 0 && deletionTimestamp != nil {
finalize()
}
if finished {
return
}
Note that workloads might not have owners in two cases:
- When using prebuilt workload Prebuilt workload suport #1358
- When using orphan cascade deletion https://kubernetes.io/docs/tasks/administer-cluster/use-cascading-deletion/#set-orphan-deletion-policy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@achernevskii Thank you for the clarifications!
As @alculquicondor mentioned, there are some positive situations when the workload doesn't have
ownerReference. So, I think that we should apply the Aldo's suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed the behavior in f0690dc
Type: "Finished", | ||
Status: "True", | ||
}). | ||
Obj(), | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add a case for the workload with ownerReference?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Obj(), | ||
). | ||
Admitted(true). | ||
Delete(). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we verify in
} |
deletionTimestamp
instead of adding such a deletionTimestamp?
pkg/util/workload/workload.go
Outdated
kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1" | ||
) | ||
|
||
func RemoveWorkloadFinalizer(ctx context.Context, c client.Client, wl *kueue.Workload) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we move this function to the workload
(pkg/workload) package?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in f0690dc
@achernevskii Also, can you fix the CI error? |
/cc |
f0690dc
to
0a44617
Compare
/assign @trasc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall lgtm
Status: "True", | ||
}). | ||
OwnerReference(batchv1.SchemeGroupVersion.String(), "Job", "job", "test-uid", true, true). | ||
Delete(). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comparing the deletionTimestamp
would cause a flaky test?
So, can we verify if the workload has the deletionTImestamp
in Line 427 instead of comparison?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or you can use EquateApproxTime? or a AcyclicTransformer that only considers whether the values are nil.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can probably use EquateApproxTime
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's just provide the desired time to the wrapper
pkg/util/testing/wrappers.go
Outdated
// Delete sets a deletion timestamp for the pod object | ||
func (w *WorkloadWrapper) Delete() *WorkloadWrapper { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Delete sets a deletion timestamp for the pod object | |
func (w *WorkloadWrapper) Delete() *WorkloadWrapper { | |
// DeletionTimestamp sets a deletion timestamp for the pod object | |
func (w *WorkloadWrapper) DeletionTimestamp() *WorkloadWrapper { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
pkg/util/testing/wrappers.go
Outdated
t := metav1.NewTime(time.Now()).Rfc3339Copy() | ||
w.Workload.DeletionTimestamp = &t |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
t := metav1.NewTime(time.Now()).Rfc3339Copy() | |
w.Workload.DeletionTimestamp = &t | |
w.Workload.DeletionTimestamp := ptr.To(metav1.NewTime(time.Now()).Rfc3339Copy()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this Rfc3339Copy
doing? truncating to second?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this
Rfc3339Copy
doing? truncating to second?
Correct, had to add it because of kubernetes/kubernetes#81026
pkg/util/testing/wrappers.go
Outdated
t := metav1.NewTime(time.Now()).Rfc3339Copy() | ||
w.Workload.DeletionTimestamp = &t |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this Rfc3339Copy
doing? truncating to second?
} | ||
|
||
if err == nil && wl != nil { | ||
err = r.removeFinalizer(ctx, wl) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why change this? how will this workload be finalized?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reverted
Status: "True", | ||
}). | ||
OwnerReference(batchv1.SchemeGroupVersion.String(), "Job", "job", "test-uid", true, true). | ||
Delete(). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or you can use EquateApproxTime? or a AcyclicTransformer that only considers whether the values are nil.
160e42f
to
acbed0c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/approve
LGTM label has been added. Git tree hash: 1f5b261905a2aafebc4e6108ac5fc65f9e5d10dc
|
/hold |
Please squash, so that cherry-picking is less cumbersome. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/approve
Please squash, so that cherry-picking is less cumbersome.
@alculquicondor Isn't the default merge method squash?
/cherry-pick release-0.5 |
@tenzen-y: once the present PR merges, I will cherry-pick it on top of release-0.5 in a new PR and assign it to you. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
It is, but the cherry-pick bot still goes by the original commits. Not sure if the individual commits could run into conflicts where the squashed commit wouldn't |
I see. I'm fine with manually squash. |
acbed0c
to
a5181ea
Compare
squashed, but somehow it detected code changes due to rebase. Needs LGTM. /unhold |
/lgtm |
LGTM label has been added. Git tree hash: 352be45fd720ee5036953f279568a014bda84ee7
|
@tenzen-y: #1523 failed to apply on top of branch "release-0.5":
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@tenzen-y do you think it's that important to cherry-pick this? It's a bit of an edge case. |
@alculquicondor I think that cherry-picking might be better since we have a chance to release a patch version (v0.5.2). Are you concerned about manyconflicts? |
@trasc can you try to get a manual cherry-pick? |
Co-authored-by: Lukas Wöhrl <lukas.woehrl@plentymarkets.com>
/release-note-edit
|
/release-note-edit
|
What type of PR is this?
/kind bug
What this PR does / why we need it:
Workload will be finalized if it has a Finished condition and no OwnerReferences.
Which issue(s) this PR fixes:
Fixes #1450
Special notes for your reviewer:
Does this PR introduce a user-facing change?