Allow to update JobSets on suspend #644

Merged: 3 commits into kubernetes-sigs:main on Aug 9, 2024


@mimowo mimowo commented Aug 9, 2024

What type of PR is this?

/kind bug

What this PR does / why we need it:

This fixes the main scenarios for integration with Kueue, in which Kueue may suspend a workload
and re-admit it in another ResourceFlavor (with different nodeSelectors).

Which issue(s) this PR fixes:

Fixes #624

Special notes for your reviewer:

This is a minimal fix for the main scenarios needed for integration with Kueue. It does not
allow fully "restoring" the PodTemplate, but it does allow overwriting it, which is enough in most cases.
See the previous approach: #640

There are two commits:

  1. an e2e test that demonstrates the issue; see the example failure: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_jobset/644/pull-jobset-test-e2e-main-1-27/1821972758104379392
  2. the change that fixes the issue

Does this PR introduce a user-facing change?

Allow updating PodTemplate on suspend. This fixes the main scenarios for integration with Kueue
to support eviction of workloads and re-admitting in another ResourceFlavor (with other
nodeSelectors).
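
For readers who want a concrete picture of the rule described above, here is a minimal sketch in Go. It is not the JobSet webhook code: `JobSetSpec`, `validateUpdate`, and `errImmutable` are hypothetical names, the spec is reduced to a suspend flag and a single nodeSelector, and the exact condition (the change is accepted if the JobSet is suspended before or after the update) is an assumption made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"reflect"
)

// JobSetSpec is a hypothetical, heavily simplified stand-in for the real
// JobSet spec: one suspend flag and one pod-template nodeSelector.
type JobSetSpec struct {
	Suspend      bool
	NodeSelector map[string]string
}

// errImmutable is returned when the nodeSelector changes on a running JobSet.
var errImmutable = errors.New("nodeSelector is immutable unless the JobSet is suspended")

// validateUpdate models the assumed rule: the nodeSelector may change only
// if the JobSet is suspended before or after the update.
func validateUpdate(oldSpec, newSpec JobSetSpec) error {
	if reflect.DeepEqual(oldSpec.NodeSelector, newSpec.NodeSelector) {
		return nil // nothing changed in the pod template
	}
	if oldSpec.Suspend || newSpec.Suspend {
		return nil // suspended or being suspended: overwrite is allowed
	}
	return errImmutable
}

func main() {
	// Illustrative label values only; real ResourceFlavors define their own.
	flavor1 := map[string]string{"pool": "flavor-1"}
	flavor2 := map[string]string{"pool": "flavor-2"}

	running := JobSetSpec{Suspend: false, NodeSelector: flavor1}
	suspended := JobSetSpec{Suspend: true, NodeSelector: flavor1}
	resumed := JobSetSpec{Suspend: false, NodeSelector: flavor2}

	// Walk through: running in flavor 1 -> suspend -> resume in flavor 2.
	fmt.Println("suspend:", validateUpdate(running, suspended))
	fmt.Println("resume in flavor 2:", validateUpdate(suspended, resumed))
	// Changing the nodeSelector without ever suspending is rejected.
	fmt.Println("change while running:", validateUpdate(running, resumed))
}
```

The sketch only covers the nodeSelector; which other PodTemplate fields can be overwritten is decided by the actual change in this PR.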

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 9, 2024
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Aug 9, 2024

netlify bot commented Aug 9, 2024

Deploy Preview for kubernetes-sigs-jobset canceled.

| Name | Link |
|---|---|
| 🔨 Latest commit | c89e112 |
| 🔍 Latest deploy log | https://app.netlify.com/sites/kubernetes-sigs-jobset/deploys/66b66c0f87fd06000850c57f |

test/e2e/e2e_test.go (outdated, resolved)
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 9, 2024
@mimowo mimowo changed the title from "Fix resuming JobSets after PodTemplate restore" to "Allow to update JobSets on suspend" Aug 9, 2024

@kannon92 kannon92 left a comment


/lgtm

/assign @ahg-g @danielvegamyhre

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 9, 2024
@@ -131,6 +132,70 @@ var _ = ginkgo.Describe("JobSet", func() {
})
})

// This test is added to test the JobSet transitions as Kueue would when:
// doing: resume in RF1 -> suspend -> resume in RF2.

Please use the explicit name ResourceFlavor instead of the abbreviation RF for comments/documentation in JobSet repo, since "RF" will be unfamiliar to developers who don't work with Kueue.

Contributor Author


Makes sense. Done


ginkgo.By("Create a suspended JobSet", func() {
js.Spec.Suspend = ptr.To(true)
js.Spec.TTLSecondsAfterFinished = ptr.To[int32](5)

Why do we need to set TTL for this test?

Contributor Author


Not really needed. I used it just to make sure the Job is deleted at some point, but we don't need that. Reverted.

test/e2e/e2e_test.go (resolved)
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 9, 2024
@danielvegamyhre

/lgtm
/approve
/hold

I'll remove the hold once CI tests pass.

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Aug 9, 2024
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danielvegamyhre, mimowo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 9, 2024
@mimowo

mimowo commented Aug 9, 2024

@kannon92 @danielvegamyhre could we cherry-pick this?

@danielvegamyhre

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 9, 2024
@danielvegamyhre

> @kannon92 @danielvegamyhre could we cherry-pick this?

v0.6.0 will be released next week; I think that would be easier. Is it urgent to cherry-pick this into v0.5.x?

@k8s-ci-robot k8s-ci-robot merged commit 8bade1e into kubernetes-sigs:main Aug 9, 2024
13 checks passed
@mimowo

mimowo commented Aug 9, 2024

I think 0.6 is fine.

@kannon92

kannon92 commented Aug 17, 2024

@danielvegamyhre I think you said there were some issues with kicking off v0.6.0, so we are in a holding pattern for this. ref: #523 (comment)

Is it worth cherry-picking this to unblock Kueue?

@kannon92

/cherry-pick release-0.5.0

@kannon92

/cherry-pick release-0.5

@kannon92

opened up #651 (comment)

k8s-ci-robot pushed a commit that referenced this pull request Aug 17, 2024
* E2e for updating JobSet on suspend

* Allow to mutate the PodTemplate in JobSet on suspend

* Review remarks

---------

Co-authored-by: Michal Wozniak <michalwozniak@google.com>
@danielvegamyhre danielvegamyhre mentioned this pull request Aug 19, 2024
20 tasks
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Allow to mutate PodTemplate when suspending a JobSet and support resuming such JobSet
5 participants