-
Notifications
You must be signed in to change notification settings - Fork 228
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding toleration to the job doesn't trigger workload change #1304
Adding toleration to the job doesn't trigger workload change #1304
Conversation
Skipping CI for Draft Pull Request. |
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
|
d446ad9
to
1a99a26
Compare
pkg/util/equality/podset.go
Outdated
@@ -26,30 +26,33 @@ import ( | |||
|
|||
// TODO: Revisit this, maybe we should extend the check to everything that could potentially impact | |||
// the workload scheduling (priority, nodeSelectors(when suspended), tolerations and maybe more) | |||
func comparePodTemplate(a, b *corev1.PodSpec) bool { | |||
func comparePodTemplate(a, b *corev1.PodSpec, checkCount, changePodSpecFields bool) bool { | |||
if changePodSpecFields && !equality.Semantic.DeepEqual(a.Tolerations, b.Tolerations) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need this boolean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the tolerations could change when the job is unsuspended
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add a unit test for the job controller where we can see that the Workload object is updated when you change the toleration.
And another one where if you change the toleration for a Job that is admitted, the Workload gets evicted.
Is this a very common use-case, I think is easier to make the tolerations immutable while the job is not suspended. |
Tolerations are already immutable for k8s Jobs when not suspended https://kubernetes.io/docs/concepts/workloads/controllers/job/#mutable-scheduling-directives But that's not true for all CRDs. We might be able to add the restriction as webhooks for jobs we have integrations for, but we can't control other implementations. So worth having the safe guard in the reconciler. |
b3cd54d
to
2a88403
Compare
Done, however the overall experience for the case when the tolerations are changed while admitted is at least strange, the job is suspended and the workload is removed but the tolerations from the workload are restored in the job ... |
d3fb537
to
2a88403
Compare
Obj(), | ||
}, | ||
}, | ||
"the workload is admitted and tolerations change": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there already a case where the workload is admitted, but the job is still suspended?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"suspended job with matching admitted workload is unsuspended" .. why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to cover what I describe above: "a suspended job should match either the running or base specs".
If it doesn't match either, I guess it's pretty much treated as this case "the workload is admitted and tolerations change", correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
PodSets( | ||
*utiltesting.MakePodSet("main", 10). | ||
Toleration(corev1.Toleration{ | ||
Key: "tolarationkey1", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Key: "tolarationkey1", | |
Key: "tolerationkey1", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
/assign |
2a88403
to
536fb42
Compare
if canBePartiallyAdmitted && ps.MinCount != nil { | ||
// update the expected running count | ||
ps.Count = psi.Count | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this just be:
psi.Count = psi.Count
?
If the value is set, admission already determined if partial admission was possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, if we are not using partial admission we need to be strict about the count
// If the workload is admitted but the job is suspended, ignore counts. | ||
// This might allow some violating jobs to pass equivalency checks, but their | ||
// workloads would be invalidated in the next sync after unsuspending. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a version of this comment should be left before return job.IsSuspended() && equality.ComparePodSetSlices(jobPodSets, wl.Spec.PodSets)
I guess we are saying that, if the job is suspended, it can match either the running or the base specs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Obj(), | ||
}, | ||
}, | ||
"the workload is admitted and tolerations change": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to cover what I describe above: "a suspended job should match either the running or base specs".
If it doesn't match either, I guess it's pretty much treated as this case "the workload is admitted and tolerations change", correct?
536fb42
to
df14fd3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overall lgtm
pkg/util/equality/podset_test.go
Outdated
Effect: corev1.TaintEffectNoSchedule, | ||
}).Obj()}, | ||
wantEqual: false, | ||
}, | ||
"different count when checked": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"different count when checked": { | |
"different count": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -607,8 +613,39 @@ func (r *JobReconciler) ensurePrebuiltWorkloadInSync(ctx context.Context, wl *ku | |||
return true, nil | |||
} | |||
|
|||
// get the expected podsets during the job execution, returns nil if the workload has no reservation or |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// get the expected podsets during the job execution, returns nil if the workload has no reservation or | |
// expectedRunningPodSets gets the expected podsets during the job execution, returns nil if the workload has no reservation or |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/approve
/hold
for nits
nits covered |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
/lgtm
/approve
LGTM label has been added. Git tree hash: 480a555650ad54d8cfc4e0105aa398e13eb0b32c
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: alculquicondor, stuton, tenzen-y The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/remove-kind cleanup |
/release-note-edit
|
/release-note-edit
|
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #1264
Special notes for your reviewer:
Does this PR introduce a user-facing change?