Plain Pod gets deleted once admitted via ProvisioningRequest (DWS) #2239

Merged
merged 1 commit into from
May 22, 2024

Conversation

Contributor

@vladikkuzn vladikkuzn commented May 20, 2024

What type of PR is this?

/kind bug

What this PR does / why we need it:

Ignores pods' tolerations during equality check for admitted workloads
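
As a rough illustration of the idea (not the code from this PR), the relaxed check can be sketched as a comparison that skips tolerations when a flag is set. The `Toleration` and `PodSpec` types below are simplified, hypothetical stand-ins for the `corev1` types, `podSpecsEquivalent` is an invented name, and `reflect.DeepEqual` from the standard library is used in place of `equality.Semantic.DeepEqual` so the example runs without Kubernetes dependencies.

```go
package main

import (
	"fmt"
	"reflect"
)

// Toleration and PodSpec are simplified, hypothetical stand-ins for the
// corev1 types; only the fields needed for the illustration are kept.
type Toleration struct {
	Key      string
	Operator string
	Value    string
	Effect   string
}

type PodSpec struct {
	Image       string
	Tolerations []Toleration
}

// podSpecsEquivalent reports whether a and b match. When ignoreTolerations
// is true, differences in Tolerations are not considered.
func podSpecsEquivalent(a, b PodSpec, ignoreTolerations bool) bool {
	if !ignoreTolerations && !reflect.DeepEqual(a.Tolerations, b.Tolerations) {
		return false
	}
	return a.Image == b.Image
}

func main() {
	original := PodSpec{Image: "busybox"}
	// The toleration key is made up for the example.
	updated := PodSpec{
		Image:       "busybox",
		Tolerations: []Toleration{{Key: "example.com/provisioned", Operator: "Exists"}},
	}
	fmt.Println("strict check: ", podSpecsEquivalent(original, updated, false))
	fmt.Println("relaxed check:", podSpecsEquivalent(original, updated, true))
}
```

With the strict check the added toleration makes the two specs differ; with the relaxed check they are treated as equivalent.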

Which issue(s) this PR fixes:

Fixes #2213

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Pod Integration: Prevent Pod from being deleted when admitted via ProvisioningRequest that has pod updates on tolerations

@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels May 20, 2024
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 20, 2024
@vladikkuzn
Contributor Author

/assign

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label May 20, 2024
netlify bot commented May 20, 2024

Deploy Preview for kubernetes-sigs-kueue ready!

Name Link
🔨 Latest commit 39571d0
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/664b74474d08780008c94fb8
😎 Deploy Preview https://deploy-preview-2239--kubernetes-sigs-kueue.netlify.app

@vladikkuzn
Contributor Author

/cc @trasc @alculquicondor

@k8s-ci-robot k8s-ci-robot requested a review from trasc May 20, 2024 16:03
@trasc
Contributor

trasc commented May 20, 2024

/assign

@trasc
Contributor

trasc commented May 20, 2024

/test all

Contributor

@trasc trasc left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 21, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 5e8cbb0f5bedafeb681fe1d3b14d0f5068185dee

@vladikkuzn vladikkuzn marked this pull request as ready for review May 21, 2024 07:03
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 21, 2024
-func comparePodTemplate(a, b *corev1.PodSpec) bool {
-	if !equality.Semantic.DeepEqual(a.Tolerations, b.Tolerations) {
+func comparePodTemplate(a, b *corev1.PodSpec, ignoreTolerations bool) bool {
+	if !ignoreTolerations && !equality.Semantic.DeepEqual(a.Tolerations, b.Tolerations) {
Contributor

@mimowo mimowo May 21, 2024

Currently this PR only ignores the equality check on tolerations, while the comment indicates we should also relax the validation for nodeSelectors. How confident are we that tolerations are enough?
IIUC the autoscaling.gke.io/provisioning-request nodeSelector can be added as well, based on the annotation.

Contributor

The node selectors are not part of the equivalency checks.

Contributor

@mimowo mimowo May 21, 2024

Got it, thanks for clarifying. Do we have this covered in a unit / integration test?

Contributor Author

It's not ignored as such; it simply doesn't participate in the comparison as of now.

Contributor

I would just like to have a test to make sure this is the case, to prevent future regressions, since this is an important scenario.

Contributor

Anyway, we can probably defer this to follow-up manual testing.

@@ -745,17 +745,17 @@ func equivalentToWorkload(ctx context.Context, c client.Client, job GenericJob,
 	jobPodSets := clearMinCountsIfFeatureDisabled(job.PodSets())

 	if runningPodSets := expectedRunningPodSets(ctx, c, wl); runningPodSets != nil {
-		if equality.ComparePodSetSlices(jobPodSets, runningPodSets) {
+		if equality.ComparePodSetSlices(jobPodSets, runningPodSets, workload.IsAdmitted(wl)) {
Contributor

The tolerations are ignored regardless of the framework, but the issue only affects pods, not Jobs. I'm wondering if we should make the fix more specific. WDYT @alculquicondor?

Contributor

On one hand, the relaxation of the check isn't necessary for Jobs.

On the other, scoping the fix would be more involved because we would need to pass this information along. The options I see:

  1. Add a new param to FindMatchingWorkloads
  2. Add a new interface like CustomEquivalenceConfigJob with a function like ignoreTolerations() bool

Neither of these is appealing, especially since we will not need the custom options once the proper fix is implemented.
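
For reference, the second option (which was not adopted in this PR) could look roughly like the optional-interface pattern below. This is only a sketch under assumptions: `GenericJob` is reduced to a single hypothetical method, `batchJob`, `plainPod`, and `shouldIgnoreTolerations` are invented for the example, and only the names `CustomEquivalenceConfigJob` and `ignoreTolerations()` come from the comment above.

```go
package main

import "fmt"

// GenericJob is a pared-down, hypothetical stand-in for the framework's
// job interface.
type GenericJob interface {
	Name() string
}

// CustomEquivalenceConfigJob is the optional interface floated in the
// review comment; integrations would implement it only if they need the
// relaxed check.
type CustomEquivalenceConfigJob interface {
	ignoreTolerations() bool
}

// batchJob does not opt in, so the strict check would apply to it.
type batchJob struct{}

func (batchJob) Name() string { return "batch-job" }

// plainPod opts in to ignoring tolerations.
type plainPod struct{}

func (plainPod) Name() string            { return "plain-pod" }
func (plainPod) ignoreTolerations() bool { return true }

// shouldIgnoreTolerations uses a type assertion to detect the optional
// interface and falls back to the strict behaviour otherwise.
func shouldIgnoreTolerations(job GenericJob) bool {
	if cfg, ok := job.(CustomEquivalenceConfigJob); ok {
		return cfg.ignoreTolerations()
	}
	return false
}

func main() {
	for _, job := range []GenericJob{batchJob{}, plainPod{}} {
		fmt.Printf("%s: ignore tolerations = %v\n", job.Name(), shouldIgnoreTolerations(job))
	}
}
```

The trade-off raised in the thread still applies: this scopes the relaxation to pods, at the cost of an extra interface that would become unnecessary once a proper fix lands.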

Contributor

I'm ok with this for now. But can we use the condition status of QuotaReserved, instead of Admitted?

Contributor

I don't think it makes any difference; the tolerations only change after the pods are unsuspended.

Contributor

sg, thanks for the reminder.

 wantJob: *baseJobWrapper.Clone().Toleration(corev1.Toleration{
-	Key: "tolerationkey1",
+	Key: "tolerationkey2",
Contributor

Please make sure we have an analogous test (either by adding one or finding an existing one) which demonstrates that node selectors are not part of the equality check.

 for name, tc := range cases {
 	t.Run(name, func(t *testing.T) {
-		got := ComparePodSetSlices(tc.a, tc.b)
+		got := ComparePodSetSlices(tc.a, tc.b, false)
Contributor

Needs a unit test for ignoreTolerations = true too.
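
A table of cases covering both values of the flag might be shaped like the sketch below. It is an illustration only: `PodSet` is a simplified, hypothetical type with tolerations reduced to plain strings, and `comparePodSetSlices` is an invented stand-in, not Kueue's `ComparePodSetSlices`. The cases are driven from `main` rather than `go test` so the example is self-contained.

```go
package main

import (
	"fmt"
	"reflect"
)

// PodSet is a simplified, hypothetical stand-in for a Kueue pod set.
type PodSet struct {
	Name        string
	Count       int32
	Tolerations []string
}

// comparePodSetSlices is an illustrative helper, not Kueue's implementation.
// It compares the slices element by element and skips tolerations when
// ignoreTolerations is true.
func comparePodSetSlices(a, b []PodSet, ignoreTolerations bool) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i].Name != b[i].Name || a[i].Count != b[i].Count {
			return false
		}
		if !ignoreTolerations && !reflect.DeepEqual(a[i].Tolerations, b[i].Tolerations) {
			return false
		}
	}
	return true
}

type testCase struct {
	a, b              []PodSet
	ignoreTolerations bool
	want              bool
}

// cases returns the table, exercising the flag with both false and true.
func cases() map[string]testCase {
	base := []PodSet{{Name: "main", Count: 1, Tolerations: []string{"tolerationkey1"}}}
	retolerated := []PodSet{{Name: "main", Count: 1, Tolerations: []string{"tolerationkey2"}}}
	resized := []PodSet{{Name: "main", Count: 2, Tolerations: []string{"tolerationkey1"}}}
	return map[string]testCase{
		"tolerations differ, strict":  {a: base, b: retolerated, ignoreTolerations: false, want: false},
		"tolerations differ, relaxed": {a: base, b: retolerated, ignoreTolerations: true, want: true},
		"count differs, relaxed":      {a: base, b: resized, ignoreTolerations: true, want: false},
		"length differs, relaxed":     {a: base, b: nil, ignoreTolerations: true, want: false},
	}
}

func main() {
	for name, tc := range cases() {
		got := comparePodSetSlices(tc.a, tc.b, tc.ignoreTolerations)
		fmt.Printf("%s: got=%v want=%v\n", name, got, tc.want)
	}
}
```

The "relaxed" cases are the ones the comment asks for: they check that the flag only masks toleration differences and that other mismatches are still reported.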

@alculquicondor
Contributor

/approve
/hold for additional test

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 21, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, vladikkuzn

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 21, 2024
@trasc
Contributor

trasc commented May 22, 2024

/test pull-kueue-test-integration-main

@alculquicondor
Contributor

We can leave the additional test for a follow-up.

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 22, 2024
@alculquicondor
Contributor

/cherry-pick release-0.6

@k8s-infra-cherrypick-robot

@alculquicondor: once the present PR merges, I will cherry-pick it on top of release-0.6 in a new PR and assign it to you.

@k8s-ci-robot k8s-ci-robot merged commit 9c867c9 into kubernetes-sigs:main May 22, 2024
15 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.7 milestone May 22, 2024
@k8s-infra-cherrypick-robot

@alculquicondor: #2239 failed to apply on top of branch "release-0.6":

Applying: Plain Pod gets deleted once admitted via ProvisioningRequest (DWS)
Using index info to reconstruct a base tree...
M	pkg/controller/jobframework/reconciler.go
M	pkg/controller/jobs/job/job_controller_test.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/controller/jobs/job/job_controller_test.go
CONFLICT (content): Merge conflict in pkg/controller/jobs/job/job_controller_test.go
Auto-merging pkg/controller/jobframework/reconciler.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Plain Pod gets deleted once admitted via ProvisioningRequest (DWS)
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

vladikkuzn added a commit to epam/kubernetes-kueue that referenced this pull request May 22, 2024
* Test for ignoreTolerations
@vladikkuzn vladikkuzn mentioned this pull request May 22, 2024
@vladikkuzn vladikkuzn deleted the Plain-Pod-gets-deleted branch May 22, 2024 18:27
vladikkuzn added a commit to epam/kubernetes-kueue that referenced this pull request May 22, 2024
* Test for ignoreTolerations
* Add node selector to test to make sure it's not a part of equality check
vladikkuzn added a commit to epam/kubernetes-kueue that referenced this pull request May 23, 2024
* Test for ignoreTolerations
* Add node selector to test to make sure it's not a part of equality check
vladikkuzn added a commit to epam/kubernetes-kueue that referenced this pull request May 23, 2024
* Add separate unit test for node selectors
vladikkuzn added a commit to epam/kubernetes-kueue that referenced this pull request May 23, 2024
* Rename wantEqual -> wantEquivalent
k8s-ci-robot pushed a commit that referenced this pull request May 23, 2024
* Follow-up of #2239

* Test for ignoreTolerations
* Add node selector to test to make sure it's not a part of equality check

* Follow-up of #2239

* Add separate unit test for node selectors

* Follow-up of #2239

* Rename wantEqual -> wantEquivalent
@alculquicondor
Contributor

/release-note-edit

Prevent Pod from being deleted when admitted via ProvisioningRequest that has pod updates on tolerations

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels May 27, 2024
@alculquicondor
Contributor

/release-note-edit

Pod Integration: Prevent Pod from being deleted when admitted via ProvisioningRequest that has pod updates on tolerations

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Plain Pod gets deleted once admitted via ProvisioningRequest (DWS)
6 participants