-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add support for ttl after finished controller for cleaning up finished jobsets #374
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: dejanzele The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Hi @dejanzele. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@kannon92 @danielvegamyhre as I have issues running JobSet API locally on my M1, can you please help me verify this works? |
a52abad
to
7b80317
Compare
/cc @danielvegamyhre |
/ok-to-test |
7b80317
to
d714d55
Compare
return false | ||
} | ||
return !fresh.DeletionTimestamp.IsZero() | ||
}, 15*time.Second, 500*time.Millisecond).Should(gomega.BeTrue()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you also add a test case to verify jobs and pods are also deleted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't that be tested in the JobSet controller, as it is the responsibility of that controller? TTL After Finished controlled shouldn't know or care about cleanup logic besides marking JobSet for deletion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the integration tests, there is no job controller running, so no pods will exist.
We could add an E2E test case for this, creating a JobSet in an isolated namespace, completing it and letting the TTL expire, then checking that no resources exist in the namespace anymore.
Given this is a pretty large PR with substantial changes like adding a new controller, metrics etc I think an e2e test case would give us more confidence in merging this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great! Thanks for working on this. Glad you added metrics for the TTL deletion duration as well, that is a nice addition.
I added some comments, once those are addressed I'll take another look since this is a pretty large PR.
Also, have you manually tested theses changes, or just used unit tests and integration tests so far?
@@ -157,6 +163,22 @@ func setupControllers(mgr ctrl.Manager, certsReady chan struct{}) { | |||
os.Exit(1) | |||
} | |||
|
|||
clientset := versioned.NewForConfigOrDie(mgr.GetConfig()) | |||
sharedInformers := externalversions.NewSharedInformerFactory(clientset, 30*time.Minute) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the 30 minute parameter specifying? Can we define the 30 minute variable as a variable somewhere to make it more obvious what the 30 minute duration is referring to without looking up the docs for this function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to use informerfactory, kube-builder sets it up already under the client we get from the manager (mgr.GetClient()
)
log: log, | ||
} | ||
|
||
_, _ = jobSetInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are these 2 values being returned that we are ignoring? If one is an error, what reason do we have for ignoring it?
log logr.Logger, | ||
) *TTLAfterFinishedReconciler { | ||
config := workqueue.RateLimitingQueueConfig{Name: "ttl_jobsets_to_delete"} | ||
tc := &TTLAfterFinishedReconciler{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: can we name this variable something besides tc
? We usually use tc
as a variable name for test cases so this is a bit confusing since the following lines look like test code at first glance.
if errors.IsNotFound(err) { | ||
return nil | ||
} | ||
if err != nil { | ||
return err | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can be simplified to:
if client.IgnoreNotFound(err) != nil {
return err
}
See similar example here:
return ctrl.Result{}, client.IgnoreNotFound(err) |
if errors.IsNotFound(err) { | ||
return nil | ||
} | ||
if err != nil { | ||
return err | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be simplified in the same way as suggested in my comment above.
// The JobSet's TTL is assumed to have expired, but the JobSet TTL might be stale. | ||
// Before deleting the JobSet, do a final sanity check. | ||
// If TTL is modified before we do this check, we cannot be sure if the TTL truly expires. | ||
// The latest JobSet may have a different UID, but it's fine because the checks will be run again. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't quite understand this, it looks like we are fetching the JobSet and checking the TTL 2 times in a row in the same function, without any logic or delays in between. Why is getting the JobSet state the first time on line 202 not considered "fresh" enough?
} | ||
|
||
now := r.clock.Now() | ||
t, e, err := timeLeft(r.log, jobSet, &now) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we rename this variable to something more descriptive than t
and e
? The corresponding variable names in the timeLeft function itself seem good: remaining
and expireAt
return js.Spec.TTLSecondsAfterFinished != nil && jobSetFinished(js) | ||
} | ||
|
||
func getFinishAndExpireTime(js *jobsetv1alpha2.JobSet) (*time.Time, *time.Time, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: can we place the timeLeft
function above the getFinishAndExpireTime
so the layers of abstraction flow from top to bottom consistently?
return false | ||
} | ||
return !fresh.DeletionTimestamp.IsZero() | ||
}, 15*time.Second, 500*time.Millisecond).Should(gomega.BeTrue()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the integration tests, there is no job controller running, so no pods will exist.
We could add an E2E test case for this, creating a JobSet in an isolated namespace, completing it and letting the TTL expire, then checking that no resources exist in the namespace anymore.
Given this is a pretty large PR with substantial changes like adding a new controller, metrics etc I think an e2e test case would give us more confidence in merging this.
log: ctrl.Log.WithValues("controller", "TTLAfterFinished"), | ||
} | ||
|
||
go controller.Run(ctx, 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: can we define 1
as a variable somewhere so it's more clear it is referring to the number of workers, without the reader needing to look up this function and read it's parameters?
This applies to all places where Run() is called.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR, but kube-builder offers most the mechanics implemented in ttlafterfinished_controller.go file. I think the queue management stuff are not necessary.
Why not use the JobSet reconciler itself? Simply queue completed jobsets for the time remaining to delete it (using return ctrl.Result{RequeueAfter: timeToDelete}
)
@@ -157,6 +163,22 @@ func setupControllers(mgr ctrl.Manager, certsReady chan struct{}) { | |||
os.Exit(1) | |||
} | |||
|
|||
clientset := versioned.NewForConfigOrDie(mgr.GetConfig()) | |||
sharedInformers := externalversions.NewSharedInformerFactory(clientset, 30*time.Minute) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to use informerfactory, kube-builder sets it up already under the client we get from the manager (mgr.GetClient()
)
@ahg-g I based it on the TTL After Finished controller for Jobs in Kubernetes. |
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Hi @dejanzele, are you still working on this? |
closing this due to inactivity |
Hey @ahg-g, I was away on some other work, I plan to get back to this now if nobody took it in the meantime? |
@dejanzele sounds good, looking forward to it. |
CHANGELOG
TTLSecondsAfterFinished *int32
inJobSetSpec
with validation for minimal value 0 and default value 0This PR closes #279