-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add support for ttl cleanup for finished jobsets #443
feat: add support for ttl cleanup for finished jobsets #443
Conversation
Hi @dejanzele. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
✅ Deploy Preview for kubernetes-sigs-jobset canceled.
|
Reference to the closed PR with original comments - #374 Should I add the metrics here also (I added the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this @dejanzele! I did a first pass reviewing the PR, once those comments are addressed I'll take another look.
pkg/controllers/jobset_controller.go
Outdated
if jobSetFinished(&js) { | ||
if err := r.deleteJobs(ctx, ownedJobs.active); err != nil { | ||
log.Error(err, "deleting jobs") | ||
return ctrl.Result{}, err | ||
} | ||
if js.Spec.TTLSecondsAfterFinished != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If TTLSecondsAfterFinished is set, we want to wait to delete child Jobs until after the TTL has expired. This gives the user time to inspect/debug the pods before everything is deleted, without having to formulate queries against cloud (or on-prem) log storage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense, I reordered the job deletion to be below
/ok-to-test |
@dejanzele are you still working on this? |
yes, I was off due to KubeCon, now I can resolve your comments |
751be2f
to
463ae7a
Compare
60b70e3
to
586fa41
Compare
586fa41
to
15fcc39
Compare
I will take another look at this today |
@dejanzele can you add an example JobSet yaml to the examples/simple/ folder which utilizes this feature? |
/hold I will take a quick look |
10a2309
to
6b60eed
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, one last minor comment.
6b60eed
to
ef0e4c6
Compare
@danielvegamyhre @ahg-g I think all of your comments have been addressed. |
/lgtm I'll leave approval to @ahg-g so he can confirm his comments have been resolved. |
// If the JobSet has expired, it deletes the JobSet. | ||
// If the JobSet has not expired, it returns the time after which the JobSet should be requeued. | ||
// If the JobSet does not have a TTLSecondsAfterFinished set, it returns 0. | ||
func (r *JobSetReconciler) executeTTLAfterFinishedPolicy(ctx context.Context, js *jobset.JobSet) (time.Duration, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest we don't make this a member function of the reconciler because it is living in a different file now, same pkg, but still a different file. You can simply pass in the clock as a parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also added unit tests using the fake client
js1 := testJobSet(ns1).TTLSecondsAfterFinished(2).Obj() | ||
js2 := testJobSet(ns2).TTLSecondsAfterFinished(2).Obj() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a better test is to have one with TTL set and one without
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed
// The following 2 checks do sanity checking for nil pointers in case of changes to the above function. | ||
// This logic should never be executed. | ||
if now == nil || finishAt == nil || expireAt == nil { | ||
log.V(2).Info("Warning: Calculated invalid expiration time. JobSet cleanup will be deferred.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use error pls instead of info and remove the warning prefix pls
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I log it as error or return an error?
I don't know should we fail Reconciliation if it goes into this loop, in theory we should never execute this logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would actually return an error here
…oving assertion which can flake
ef0e4c6
to
849ba92
Compare
@danielvegamyhre @ahg-g is there a chance this feature makes it into the v0.5.0 release? |
Lightly following this, we called this out in the release tracking issue. And it seems that the PR is almost there. So I think its possible. its the last item for the release. |
I think so, we plan to cut the release tomorrow but we can wait an extra day or two if necessary. |
Cool, I think all comments are addressed |
/label tide/merge-method-squash /lgtm Thanks! |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ahg-g, dejanzele The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@ahg-g I think you need to unhold also |
/hold cancel |
testutil.JobSetCompleted(ctx, k8sClient, js2, timeout) | ||
|
||
// Verify active jobs have been deleted after ttl has passed. | ||
testutil.ExpectJobsDeletionTimestamp(ctx, k8sClient, js1, testutil.NumExpectedJobs(js1)-1, timeout) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we just checking for the timestamp instead of checking that the jobs are actually deleted?
if err := k8sClient.Get(ctx, client.ObjectKeyFromObject(js2), &fresh2); err != nil { | ||
return false | ||
} | ||
return !fresh1.DeletionTimestamp.IsZero() && fresh2.DeletionTimestamp.IsZero() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here, why are we only checking that the deletion timestamp is set instead of checking that the jobset is notFound?
CHANGELOG
TTLSecondsAfterFinished *int32
inJobSetSpec
with validation for minimal value 0make manifests
andmake generate
to update autogenerated codeThis PR closes #279