Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Suspend a running Job without requeueing #1252

Merged

Conversation

vicentefb
Copy link
Contributor

@vicentefb vicentefb commented Oct 24, 2023

What type of PR is this?

/kind feature

What this PR does / why we need it:

Some systems need the ability to "terminate" arbitrary jobs (because it's running for too long or other policies) without deleting the Job object, so it can be debugged.

Which issue(s) this PR fixes:

Fixes #1091

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Support for a mechanism to suspend a running Job without requeueing

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. labels Oct 24, 2023
@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Oct 24, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: vicentefb / name: Vicente Ferrara (8f3ef1b)

@netlify
Copy link

netlify bot commented Oct 24, 2023

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit 8f3ef1b
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/65693f9f91ab040008cab7c8

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Oct 24, 2023
@k8s-ci-robot
Copy link
Contributor

Welcome @vicentefb!

It looks like this is your first PR to kubernetes-sigs/kueue 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/kueue has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 24, 2023
@k8s-ci-robot
Copy link
Contributor

Hi @vicentefb. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Oct 24, 2023
apis/kueue/v1beta1/workload_types.go Outdated Show resolved Hide resolved
apis/kueue/v1beta1/workload_types.go Outdated Show resolved Hide resolved
pkg/controller/jobframework/reconciler.go Outdated Show resolved Hide resolved
pkg/workload/workload.go Outdated Show resolved Hide resolved
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Oct 24, 2023
@tenzen-y
Copy link
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 25, 2023
@tenzen-y
Copy link
Member

@vicentefb @andrewsykim I think we should write a KEP before moving this forward since this is a slightly big change.

cc: @alculquicondor

@andrewsykim
Copy link
Member

@tenzen-y we had some back and forth in #1091 about the API design. I wouldn't consider it a big change since it's only introducing a single (optional) field, but I'm fairly new to the code base and I might be missing something.

Is there specific concern or issue you want covered in more details that would warrant a KEP? Happy to write one but I didn't feel this change was that big. Also see discussion in #1091 for more context

@tenzen-y
Copy link
Member

@tenzen-y we had some back and forth in #1091 about the API design. I wouldn't consider it a big change since it's only introducing a single (optional) field, but I'm fairly new to the code base and I might be missing something.

Is there specific concern or issue you want covered in more details that would warrant a KEP? Happy to write one but I didn't feel this change was that big. Also see discussion in #1091 for more context

Let me check the discussion on that issue.

Copy link
Contributor

@alculquicondor alculquicondor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are also missing a change in pkg/controller/jobframework/reconciler.go so that when we find a job that has:

  • Workload.spec.queueingPolicy: Never
  • Job.suspend: true
  • Workload.status.admission not nil and Workload.status.Condition[Admitted]=true)

Then, set Workload.status.Condition[Evicted]=true

Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if we should be able to modify queueingPolicy via Job like suspend
since I'm wondering if we shouldn't provide features that must manually modify workloads.

Manually modifying workloads looks slightly have risks.

Actually, we used to provide all features so that users can be able to propagate via Job to workloads.

@alculquicondor @kerthcet @mimowo WDYT?

apis/kueue/v1beta1/workload_types.go Outdated Show resolved Hide resolved
apis/kueue/v1beta1/workload_types.go Outdated Show resolved Hide resolved
@@ -59,6 +59,15 @@ type WorkloadSpec struct {
// +kubebuilder:default=""
// +kubebuilder:validation:Enum=kueue.x-k8s.io/workloadpriorityclass;scheduling.k8s.io/priorityclass;""
PriorityClassSource string `json:"priorityClassSource,omitempty"`

// QueueingPolicy that will determine if a job needs to be requeued or not after being suspended
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// QueueingPolicy that will determine if a job needs to be requeued or not after being suspended
// ReQueueingPolicy that will determine if a job needs to be requeued or not after being suspended

Also, QueuingPolicy sounds like the queuing method can be changed regardless of whether the job is suspended or not.

So, we should mention as an API name that this policy will apply to the job after jobs are admitted at least once.

WDYT? @alculquicondor @kerthcet

Copy link
Member

@andrewsykim andrewsykim Oct 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, we should mention as an API name that this policy will apply to the job after jobs are admitted at least once.

I think in most cases the use-case will be suspending already admitted jobs, but you can have a job that is not yet admitted (due to no available resources) and then have queueing disabled

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, QueuingPolicy sounds like the queuing method can be changed regardless of whether the job is suspended or not.

Yes, this is true as of the current implementation. If you disable for suspended job, it will be suspended forever. If you disable for a running job, it will NOT suspend the job but disable re-queueing if manually suspended

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but you can have a job that is not yet admitted (due to no available resources) and then have queueing disabled

Currently, we can disable queuing by removing the queue name label from the job.
So I think that it will cause confusion for users to support the same feature in multiple ways.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found the QueueingPolicy=Never might be useful on the workflow use case: #1091 (comment).

However, if we introduce job readiness gates (kubernetes/kubernetes#121681), we have similar features on both the kueue side and the job-controller side.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NVM

Maybe QueueingPolicy=Never and Job readiness gates can co-exist.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so, we still need a representation in the Workload API.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I imagined the following steps:

  1. The external controller adds a gate to the job.
  2. jobframework reconciler creates workload with QueueingPolicy=Never.
  3. The external controller deletes a gate to the job.
  4. jobframework reconciler updates workload with QueueingPolicy=ALways.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I agree with the QueueingPolicy={Always, Never} API instead of my suggestion, ReQueuingPolicy={Never}.

@alculquicondor
Copy link
Contributor

Manually modifying workloads looks slightly have risks.

Actually, we used to provide all features so that users can be able to propagate via Job to workloads.

For now, this is just meant to be used by administrators, which will have access to the Workload object.

Other than that, it's important that we start working on a CLI #487, which would allow administrators to make these kind of changes safely.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 28, 2023
pkg/controller/core/workload_controller.go Outdated Show resolved Hide resolved
pkg/controller/jobframework/reconciler.go Outdated Show resolved Hide resolved
pkg/controller/jobframework/reconciler.go Outdated Show resolved Hide resolved
Copy link
Member

@andrewsykim andrewsykim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please squash / rebase commits, otherwise LGTM :)

Copy link
Contributor

@alculquicondor alculquicondor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve
I'll leave the LGTM to @andrewsykim
/hold
for review comments

Please squash

apis/kueue/v1beta1/workload_types.go Outdated Show resolved Hide resolved
@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Nov 30, 2023
@vicentefb vicentefb force-pushed the SuspendableJobWithoutRequeuing branch from 10b35fb to d3e0d18 Compare November 30, 2023 22:19
@vicentefb
Copy link
Contributor Author

/retest

Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise LGTM

/approve

pkg/util/testing/wrappers.go Show resolved Hide resolved
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, tenzen-y, vicentefb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [alculquicondor,tenzen-y]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@vicentefb vicentefb force-pushed the SuspendableJobWithoutRequeuing branch from d3e0d18 to 7341b38 Compare November 30, 2023 23:40
@vicentefb
Copy link
Contributor Author

/retest

@tenzen-y
Copy link
Member

tenzen-y commented Dec 1, 2023

Flaky test happened: #1389

removes evicted condition status update, removes job_controller unit tests, adds pkg job unit tests

updated workload_controller log

added StrictFIFO queueing strategy in e2e test while creating a second job when the first one is suspended

changed behaviour in reconciler to allow kueue suspend a job directly from the workload signal, updated integration and job test, still missing to update the field name

changing implementation to use active and not suspending the job manually

updated yaml file

addressed comments, removed log lines, used step 6 to handle workload eviction, fixed Active field to be a pointer, fixed e2e tests to use on Consistently method, still missing to make one unit test case work

updated unit tests and integration tests with new implementation to evict a workload based on spec.active field

added crd charts yaml

added workloadspec.go file

updated e2e and unit tests, missing to rebase commits

added zz generated file

addressed comments, moved e2e test to happen inside integration tests

nit comments addressed

nit comments addressed

final nit comments addressed and changed Eviction constant to be WorkloadEvictedByDeactivation

deleted by accident a comment, reverted

updated workload types comment
@vicentefb vicentefb force-pushed the SuspendableJobWithoutRequeuing branch from 7341b38 to 8f3ef1b Compare December 1, 2023 02:06
Copy link
Member

@andrewsykim andrewsykim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update release notes!

Thanks @vicentefb

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 1, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 8dbf7e4995e772630f663b245e3c1011a8a1195c

@tenzen-y
Copy link
Member

tenzen-y commented Dec 1, 2023

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 1, 2023
@k8s-ci-robot k8s-ci-robot merged commit 262c366 into kubernetes-sigs:main Dec 1, 2023
14 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.6 milestone Dec 1, 2023
@tenzen-y
Copy link
Member

tenzen-y commented Dec 1, 2023

/release-note-edit

Support for a mechanism to suspend a running Job without requeueing

@tenzen-y
Copy link
Member

/kind api-change

@k8s-ci-robot k8s-ci-robot added the kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API label Feb 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

A mechanism to suspend a running Job without requeueing
7 participants