[Bug] Finish CleanupJob early if the job is suspended. #2243

Conversation

mszadkow
Contributor

What this PR does / why we need it:
This fixes a bug that occurred when the job was suspended and `runPolicy.ttlSecondsAfterFinished` was also set.
In that situation, CleanupJob returned an error and the active status of the replicas could not be removed.

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #2239

Checklist:

  • Docs included if any changes are user facing

@mszadkow
Contributor Author

cc @tenzen-y

@mszadkow
Contributor Author

cc @alculquicondor

@mszadkow mszadkow force-pushed the bug/broken-preemption-on-suspended-with-ttl branch from 9b3b976 to c4d2858 Compare August 29, 2024 17:00

@mszadkow: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/ok-to-test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@coveralls

coveralls commented Aug 29, 2024

Pull Request Test Coverage Report for Build 10631502623

Details

  • 0 of 1 (0.0%) changed or added relevant line in 1 file are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage increased (+0.06%) to 31.801%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| pkg/controller.v1/common/job.go | 0 | 1 | 0.0% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| pkg/controller.v1/mpi/mpijob.go | 1 | 91.06% |

| Totals | Coverage Status |
|---|---|
| Change from base Build 10600609425: | 0.06% |
| Covered Lines: | 3950 |
| Relevant Lines: | 12421 |

💛 - Coveralls

@mszadkow mszadkow force-pushed the bug/broken-preemption-on-suspended-with-ttl branch from c4d2858 to 807c6cc Compare August 30, 2024 08:38
@google-oss-prow google-oss-prow bot added size/S and removed size/M labels Aug 30, 2024
Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>
Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>
@mszadkow mszadkow force-pushed the bug/broken-preemption-on-suspended-with-ttl branch from 84702d6 to 6a7748a Compare August 30, 2024 09:57
Entry("No error with completionTime is nil if suspended", &cleanUpCases{
	tfJob: tftestutil.NewTFJobWithCleanupJobDelay(1, 2, 0, nil),
	runPolicy: &kubeflowv1.RunPolicy{
		TTLSecondsAfterFinished: nil,
Member

Suggested change
- TTLSecondsAfterFinished: nil,
+ TTLSecondsAfterFinished: ptr.To[int32](10),

Shouldn't we specify ttlSecondsAfterFinished here?
Previously, the bug occurred when the Job had ttlSecondsAfterFinished set and had been suspended, right?

Contributor Author

@mszadkow mszadkow Aug 30, 2024

hmm yes, this should be another test case
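
To illustrate why the TTL needs to be set for the test to exercise the fix, here is a rough, self-contained sketch. Everything in it is hypothetical: `policy`, `cleanup`, and the `checkSuspend` flag are illustration-only names, and `to` is a local stand-in for `ptr.To` from `k8s.io/utils/ptr`. The function models a job whose completion time is nil, with the flag toggling the early return this PR adds.

```go
package main

import (
	"errors"
	"fmt"
)

// to returns a pointer to v; a local stand-in for ptr.To.
func to[T any](v T) *T { return &v }

// policy is a hypothetical, reduced run policy.
type policy struct {
	suspend *bool
	ttl     *int32
}

var errNoCompletionTime = errors.New("job completion time is nil, cannot clean up")

// cleanup models cleanup for a job that never completed.
// checkSuspend toggles the early return for suspended jobs.
func cleanup(p policy, checkSuspend bool) error {
	if p.ttl == nil {
		return nil // no TTL: nothing to do, suspended or not
	}
	if checkSuspend && p.suspend != nil && *p.suspend {
		return nil // suspended: skip cleanup instead of failing
	}
	return errNoCompletionTime
}

func main() {
	cases := []struct {
		name string
		p    policy
	}{
		{"suspended, ttl nil", policy{suspend: to(true), ttl: nil}},
		{"suspended, ttl 10", policy{suspend: to(true), ttl: to[int32](10)}},
	}
	for _, c := range cases {
		fmt.Printf("%-20s without check: %v, with check: %v\n",
			c.name, cleanup(c.p, false), cleanup(c.p, true))
	}
}
```

Under these assumptions, the nil-TTL case passes with or without the suspend check, so only the case with a TTL set can distinguish the fixed behavior from the old one.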

Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>
@mszadkow mszadkow force-pushed the bug/broken-preemption-on-suspended-with-ttl branch from 3db44ec to cbc8456 Compare August 30, 2024 10:59
@mszadkow mszadkow marked this pull request as ready for review August 30, 2024 11:28

@alculquicondor alculquicondor left a comment

/lgtm

Member

@tenzen-y tenzen-y left a comment

Great! Thanks!
/approve
/lgtm


[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 98df3a8 into kubeflow:master Aug 30, 2024
39 checks passed
tenzen-y pushed a commit to tenzen-y/training-operator that referenced this pull request Aug 30, 2024
* No cleaning up a job if the job is suspended.

Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>

* run fmt

Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>

* Another test case

Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>

---------

Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>
@tenzen-y tenzen-y mentioned this pull request Aug 30, 2024
1 task
tenzen-y pushed a commit to tenzen-y/training-operator that referenced this pull request Aug 30, 2024
* No cleaning up a job if the job is suspended.

Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
google-oss-prow bot pushed a commit that referenced this pull request Aug 30, 2024
* No cleaning up a job if the job is suspended.

Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Co-authored-by: Michał Szadkowski <michalszadkowski@yahoo.pl>
@mszadkow mszadkow deleted the bug/broken-preemption-on-suspended-with-ttl branch September 2, 2024 06:16
Development

Successfully merging this pull request may close these issues.

Broken preemption on TFJob with non default runPolicy.ttlSecondsAfterFinished
4 participants