[Bug] Finish CleanupJob early if the job is suspended. #2243

Merged
2 changes: 1 addition & 1 deletion pkg/controller.v1/common/job.go
@@ -421,7 +421,7 @@ func (jc *JobController) CleanupJob(runPolicy *apiv1.RunPolicy, jobStatus apiv1.
 	currentTime := time.Now()
 	metaObject, _ := job.(metav1.Object)
 	ttl := runPolicy.TTLSecondsAfterFinished
-	if ttl == nil {
+	if ttl == nil || trainutil.IsJobSuspended(runPolicy) {
 		return nil
 	}
 	duration := time.Second * time.Duration(*ttl)
24 changes: 24 additions & 0 deletions pkg/controller.v1/tensorflow/job_test.go
@@ -663,6 +663,30 @@ var _ = Describe("Test for controller.v1/common", func() {
 			wantTFJobIsRemoved: false,
 			wantErr:            false,
 		}),
+		Entry("No error with completionTime is nil if suspended", &cleanUpCases{
+			tfJob: tftestutil.NewTFJobWithCleanupJobDelay(1, 2, 0, nil),
+			runPolicy: &kubeflowv1.RunPolicy{
+				TTLSecondsAfterFinished: nil,
Member

Suggested change
-				TTLSecondsAfterFinished: nil,
+				TTLSecondsAfterFinished: ptr.To[int32](10),

Shouldn't we specify ttlSecondsAfterFinished here?
Previously, the bug occurred when the Job had ttlSecondsAfterFinished set and was suspended, right?

Contributor Author

@mszadkow Aug 30, 2024

Hmm, yes, this should be a separate test case.

+				Suspend:                 ptr.To(true),
+			},
+			jobStatus: kubeflowv1.JobStatus{
+				CompletionTime: nil,
+			},
+			wantTFJobIsRemoved: false,
+			wantErr:            false,
+		}),
+		Entry("No error with TTL is set and completionTime is nil, if suspended", &cleanUpCases{
+			tfJob: tftestutil.NewTFJobWithCleanupJobDelay(1, 2, 0, ptr.To[int32](10)),
+			runPolicy: &kubeflowv1.RunPolicy{
+				TTLSecondsAfterFinished: ptr.To[int32](10),
+				Suspend:                 ptr.To(true),
+			},
+			jobStatus: kubeflowv1.JobStatus{
+				CompletionTime: nil,
+			},
+			wantTFJobIsRemoved: false,
+			wantErr:            false,
+		}),
 		Entry("Error is occurred since completionTime is nil", &cleanUpCases{
 			tfJob: tftestutil.NewTFJobWithCleanupJobDelay(1, 2, 0, ptr.To[int32](10)),
 			runPolicy: &kubeflowv1.RunPolicy{