Adding toleration to the job doesn't trigger workload change #1304

Merged


stuton
Contributor

@stuton stuton commented Nov 2, 2023

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #1264

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Changing tolerations in an inadmissible job triggers an admission retry with the updated tolerations.

@k8s-ci-robot k8s-ci-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Nov 2, 2023
@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. labels Nov 2, 2023
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Nov 2, 2023

netlify bot commented Nov 2, 2023

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit ed5c81f
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/65a613e39f906b0008178c82

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Nov 2, 2023
@stuton stuton marked this pull request as ready for review November 3, 2023 13:48
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 3, 2023
@stuton stuton changed the title from "Adding toleration to the job doesn't trigger workload change" to "[WIP] Adding toleration to the job doesn't trigger workload change" Nov 3, 2023
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 3, 2023
@stuton stuton force-pushed the update-workload-when-add-toleration branch from d446ad9 to 1a99a26 Compare November 13, 2023 12:03
@stuton stuton changed the title from "[WIP] Adding toleration to the job doesn't trigger workload change" to "Adding toleration to the job doesn't trigger workload change" Nov 13, 2023
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 13, 2023
@@ -26,30 +26,33 @@ import (

// TODO: Revisit this, maybe we should extend the check to everything that could potentially impact
// the workload scheduling (priority, nodeSelectors(when suspended), tolerations and maybe more)
-func comparePodTemplate(a, b *corev1.PodSpec) bool {
+func comparePodTemplate(a, b *corev1.PodSpec, checkCount, changePodSpecFields bool) bool {
+	if changePodSpecFields && !equality.Semantic.DeepEqual(a.Tolerations, b.Tolerations) {
Contributor

why do we need this boolean?

Contributor

the tolerations could change when the job is unsuspended
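
To illustrate the exchange above, here is a minimal sketch of a comparison whose toleration check can be switched off by a boolean. It is not the PR's code: `Toleration`, `PodSpec`, `comparePodSpec` and `checkTolerations` are hypothetical stand-ins that use only the standard library, and treating nil and empty slices as equal is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"reflect"
)

// Toleration is a hypothetical, trimmed-down stand-in for corev1.Toleration.
type Toleration struct {
	Key      string
	Operator string
	Value    string
	Effect   string
}

// PodSpec is a hypothetical stand-in that keeps only the fields compared here.
type PodSpec struct {
	Image       string
	Tolerations []Toleration
}

// tolerationsEqual treats nil and empty slices as equal (an assumption of
// this sketch) and otherwise compares the slices element by element.
func tolerationsEqual(a, b []Toleration) bool {
	if len(a) == 0 && len(b) == 0 {
		return true
	}
	return reflect.DeepEqual(a, b)
}

// comparePodSpec reports whether a and b are equivalent. Tolerations are
// compared only when checkTolerations is true, so a caller can skip the
// check in situations where the tolerations are expected to differ.
func comparePodSpec(a, b PodSpec, checkTolerations bool) bool {
	if checkTolerations && !tolerationsEqual(a.Tolerations, b.Tolerations) {
		return false
	}
	return a.Image == b.Image
}

func main() {
	base := PodSpec{Image: "busybox"}
	changed := PodSpec{
		Image:       "busybox",
		Tolerations: []Toleration{{Key: "spot", Operator: "Exists", Effect: "NoSchedule"}},
	}
	fmt.Println(comparePodSpec(base, changed, true))  // prints false
	fmt.Println(comparePodSpec(base, changed, false)) // prints true
}
```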

Contributor

@alculquicondor alculquicondor left a comment

Let's add a unit test for the job controller where we can see that the Workload object is updated when you change the toleration.

And another one where if you change the toleration for a Job that is admitted, the Workload gets evicted.

@trasc
Contributor

trasc commented Dec 15, 2023

> And another one where if you change the toleration for a Job that is admitted, the Workload gets evicted.

Is this a very common use case? I think it is easier to make the tolerations immutable while the job is not suspended.

@alculquicondor
Contributor

> Is this a very common use case? I think it is easier to make the tolerations immutable while the job is not suspended.

Tolerations are already immutable for k8s Jobs when not suspended: https://kubernetes.io/docs/concepts/workloads/controllers/job/#mutable-scheduling-directives

But that's not true for all CRDs. We might be able to add the restriction as webhooks for jobs we have integrations for, but we can't control other implementations. So it is worth having the safeguard in the reconciler.
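
As a rough illustration of the safeguard discussed here, the sketch below maps a toleration change to the two outcomes requested in the review: update the Workload when it is not admitted, evict it when it is. This is a sketch under assumptions, not the reconciler's actual code: `Toleration`, `Action` and `reconcileTolerations` are hypothetical names, and only the standard library is used.

```go
package main

import (
	"fmt"
	"reflect"
)

// Toleration is a hypothetical, trimmed-down stand-in for corev1.Toleration.
type Toleration struct {
	Key, Operator, Value, Effect string
}

// Action is a hypothetical enum for what the reconciler would do next.
type Action int

const (
	ActionNone Action = iota
	ActionUpdateWorkload
	ActionEvictWorkload
)

func (a Action) String() string {
	switch a {
	case ActionUpdateWorkload:
		return "update workload"
	case ActionEvictWorkload:
		return "evict workload"
	default:
		return "none"
	}
}

// reconcileTolerations compares the job's tolerations with the ones recorded
// in the workload and picks an action based on the admission state.
func reconcileTolerations(jobTols, wlTols []Toleration, admitted bool) Action {
	// nil and empty slices are treated as equal (an assumption of this sketch).
	if (len(jobTols) == 0 && len(wlTols) == 0) || reflect.DeepEqual(jobTols, wlTols) {
		return ActionNone
	}
	if admitted {
		// The running job no longer matches what was admitted.
		return ActionEvictWorkload
	}
	// Not admitted yet: refresh the workload so admission is retried
	// with the updated tolerations.
	return ActionUpdateWorkload
}

func main() {
	newTols := []Toleration{{Key: "spot", Operator: "Exists", Effect: "NoSchedule"}}
	fmt.Println(reconcileTolerations(newTols, nil, false)) // prints update workload
	fmt.Println(reconcileTolerations(newTols, nil, true))  // prints evict workload
}
```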

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 19, 2023
@trasc trasc force-pushed the update-workload-when-add-toleration branch from b3cd54d to 2a88403 Compare December 19, 2023 14:22
@trasc
Contributor

trasc commented Dec 19, 2023

> Let's add a unit test for the job controller where we can see that the Workload object is updated when you change the toleration.
>
> And another one where if you change the toleration for a Job that is admitted, the Workload gets evicted.

Done. However, the overall experience when the tolerations are changed while the job is admitted is strange at best: the job is suspended and the workload is removed, but the tolerations from the workload are restored in the job ...

@trasc trasc force-pushed the update-workload-when-add-toleration branch from d3fb537 to 2a88403 Compare December 19, 2023 15:05
Obj(),
},
},
"the workload is admitted and tolerations change": {
Contributor

Is there already a case where the workload is admitted, but the job is still suspended?

Contributor

"suspended job with matching admitted workload is unsuspended". Why?

Contributor

We need to cover what I describe above: "a suspended job should match either the running or base specs".

If it doesn't match either, I guess it's pretty much treated as this case "the workload is admitted and tolerations change", correct?

Contributor

done

PodSets(
*utiltesting.MakePodSet("main", 10).
Toleration(corev1.Toleration{
Key: "tolarationkey1",
Contributor

Suggested change
- Key: "tolarationkey1",
+ Key: "tolerationkey1",

Contributor

done

pkg/controller/jobs/job/job_controller_test.go (outdated, resolved)
pkg/util/equality/podset.go (resolved)
@trasc
Contributor

trasc commented Jan 11, 2024

/assign

@trasc trasc force-pushed the update-workload-when-add-toleration branch from 2a88403 to 536fb42 Compare January 12, 2024 08:08
pkg/controller/jobframework/reconciler.go (outdated, resolved)
pkg/controller/jobframework/reconciler.go (outdated, resolved)
pkg/controller/jobframework/reconciler.go (outdated, resolved)
Comment on lines +639 to +642
if canBePartiallyAdmitted && ps.MinCount != nil {
// update the expected running count
ps.Count = psi.Count
}
Contributor

Can this just be:

ps.Count = psi.Count

?

If the value is set, admission already determined whether partial admission was possible.

Contributor

no, if we are not using partial admission we need to be strict about the count
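
The point about being strict with the count can be sketched as follows. The guard mirrors the `canBePartiallyAdmitted && ps.MinCount != nil` condition from the snippet under review, but `PodSet`, `PodSetInfo` and `expectedRunningCount` are hypothetical stand-ins written for this illustration only.

```go
package main

import "fmt"

// PodSet is a hypothetical stand-in for the workload's pod set spec.
type PodSet struct {
	Name     string
	Count    int32
	MinCount *int32 // set only when the pod set allows partial admission
}

// PodSetInfo is a hypothetical stand-in for the per-podset data produced
// by admission, including the admitted count.
type PodSetInfo struct {
	Name  string
	Count int32
}

// expectedRunningCount returns the count the running job is expected to have.
// The admitted count is used only when partial admission applies; otherwise
// the original count is kept so that any deviation is detected.
func expectedRunningCount(ps PodSet, psi PodSetInfo, canBePartiallyAdmitted bool) int32 {
	if canBePartiallyAdmitted && ps.MinCount != nil {
		return psi.Count
	}
	return ps.Count
}

func main() {
	minCount := int32(5)
	ps := PodSet{Name: "main", Count: 10, MinCount: &minCount}
	psi := PodSetInfo{Name: "main", Count: 7}

	fmt.Println(expectedRunningCount(ps, psi, true))  // prints 7
	fmt.Println(expectedRunningCount(ps, psi, false)) // prints 10
}
```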

Comment on lines -631 to -633
// If the workload is admitted but the job is suspended, ignore counts.
// This might allow some violating jobs to pass equivalency checks, but their
// workloads would be invalidated in the next sync after unsuspending.
Contributor

a version of this comment should be left before return job.IsSuspended() && equality.ComparePodSetSlices(jobPodSets, wl.Spec.PodSets)

I guess we are saying that, if the job is suspended, it can match either the running or the base specs.
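
The rule stated in this comment (a suspended job may match either the running or the base specs) could be sketched like this. It is an illustration under assumptions, not the reconciler's implementation: `PodSet` and `equivalentToWorkload` are hypothetical, tolerations are reduced to plain strings, and a nil `running` slice stands for a workload without a reservation.

```go
package main

import (
	"fmt"
	"reflect"
)

// PodSet is a hypothetical stand-in; tolerations are reduced to strings.
type PodSet struct {
	Name        string
	Count       int32
	Tolerations []string
}

// equivalentToWorkload reports whether the job's pod sets match the workload.
// base is the workload's spec; running is the spec expected while the job
// executes, or nil when the workload has no reservation.
func equivalentToWorkload(job, base, running []PodSet, suspended bool) bool {
	if running != nil {
		if reflect.DeepEqual(job, running) {
			return true
		}
		// A suspended job may also still carry the base spec.
		return suspended && reflect.DeepEqual(job, base)
	}
	return reflect.DeepEqual(job, base)
}

func main() {
	base := []PodSet{{Name: "main", Count: 10}}
	running := []PodSet{{Name: "main", Count: 10, Tolerations: []string{"spot"}}}

	fmt.Println(equivalentToWorkload(base, base, running, true))  // prints true
	fmt.Println(equivalentToWorkload(base, base, running, false)) // prints false
}
```

A job that matches neither spec falls through to `false`, which corresponds to the "the workload is admitted and tolerations change" case discussed in the thread.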

Contributor

done

pkg/util/equality/podset.go (resolved)
@trasc trasc force-pushed the update-workload-when-add-toleration branch from 536fb42 to df14fd3 Compare January 15, 2024 08:34
Member

@tenzen-y tenzen-y left a comment

overall lgtm

Effect: corev1.TaintEffectNoSchedule,
}).Obj()},
wantEqual: false,
},
"different count when checked": {
Member

Suggested change
- "different count when checked": {
+ "different count": {

Contributor

done

@@ -607,8 +613,39 @@ func (r *JobReconciler) ensurePrebuiltWorkloadInSync(ctx context.Context, wl *ku
return true, nil
}

// get the expected podsets during the job execution, returns nil if the workload has no reservation or
Member

Suggested change
- // get the expected podsets during the job execution, returns nil if the workload has no reservation or
+ // expectedRunningPodSets gets the expected podsets during the job execution, returns nil if the workload has no reservation or

Contributor

done

Contributor

@alculquicondor alculquicondor left a comment

/approve
/hold
for nits

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jan 15, 2024
@trasc
Contributor

trasc commented Jan 16, 2024

nits covered
/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 16, 2024
Member

@tenzen-y tenzen-y left a comment

Thanks!
/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 16, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 480a555650ad54d8cfc4e0105aa398e13eb0b32c

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, stuton, tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 8a6296f into kubernetes-sigs:main Jan 16, 2024
14 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.6 milestone Jan 16, 2024
@alculquicondor
Contributor

/remove-kind cleanup
/kind feature

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. and removed kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. labels Feb 13, 2024
@alculquicondor
Contributor

/release-note-edit

Changing tolerations in an inadmissible job triggers an admission retry.

@alculquicondor
Contributor

/release-note-edit

Changing tolerations in an inadmissible job triggers an admission retry with the updated tolerations.

@trasc trasc deleted the update-workload-when-add-toleration branch March 12, 2024 08:55
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Adding toleration to the job doesn't trigger workload change.
5 participants