Implement configurable failure policy. #537
Hi @jedwins1998. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
✅ Deploy Preview for kubernetes-sigs-jobset canceled.
There is one TODO left for an additional test I would like to add, and I still need to implement webhook validation for OnJobFailureReasons. Besides that, I consider the code ready for review.
/ok-to-test
Can we move all the helpers in jobset_controller.go specific to failure policies to a failure_policy.go file, and add unit tests for any important ones in failure_policy_test.go? Same as success_policy.go and success_policy_test.go.
I'll take a deeper look next week. Thanks for working on this!
Can do.
Did a quick pass while you are working on the refactor, looks good so far!
This is now done.
I didn't look at the integration tests yet
Please don't amend the commits; it makes the PR hard to review because we can't see the diff between revisions.
@ahg-g, there is a "Compare" button next to each amended commit that shows the diff. Does that not do what you are looking for, or are you referring to something else? Also, I am amending because I want to avoid what happened when I merged pull request 487 and all the individual commits were included.
It is hard to find and, more importantly, it doesn't allow the reviewer to add comments while comparing the two diffs.
We can squash at the end of the review.
Is it possible to mark the PR as intended for a squash merge? I would like to do it now so there is no chance of forgetting later. @danielvegamyhre had mentioned I can use a
Simply hold the PR, and once you get the all clear from the reviewers, send a squashed commit and cancel the hold. /hold
Thanks
Reviewed the tests
…indFirstFailedPolicyRuleAndJob.
…to be the first failure policy rule test.
…case names more clear.
I added `[failure policy]` to the beginning of the name of each test related to failure policies so that it is easier to select only those tests to run. I also updated the tests to check that `RestartsCountTowardsMax` increments only when expected.
LGTM after a few final comments are addressed. In a follow-up PR, we should add an example JobSet spec to the examples/ folder showcasing how the feature works, which we can use to do manual testing as well.
	RestartJobSetAndIgnoreMaxRestarts FailurePolicyAction = "RestartJobSetAndIgnoreMaxRestarts"
)

// FailurePolicyRule defines a FailurePolicyAction to be executed if a child job
(This is a general comment unrelated to this file) Now that feature gate support has been merged in #557 I think we should add a feature gate (default on) for this feature. If the feature is not enabled, fall back to the current behavior.
We can do this in a follow up PR.
I want to avoid a scenario where we publish the v0.6.0 release, an important customer starts using this feature, and they then hit a bug that slipped through the cracks. We could not simply downgrade to v0.5.0 to mitigate, because their JobSet spec (often defined in Python/Go code checked into their codebase) would use fields that only exist in v0.6.0. They would need an emergency rollout to revert their Python/Go code to a spec usable by JobSet v0.5.0 before downgrading the JobSet deployment to v0.5.0.
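To make the suggested fallback concrete, here is a minimal Go sketch of gating the new behavior behind a default-on flag. Everything in it is an assumption for illustration: the gate name `ConfigurableFailurePolicy`, the map-based registry, and the `handleFailure` helper are hypothetical and are not the feature gate API merged in #557.

```go
package main

import "fmt"

// featureGates is a hypothetical in-memory registry used only for this sketch.
// The gate name is assumed; it defaults to on, as suggested in the review.
var featureGates = map[string]bool{"ConfigurableFailurePolicy": true}

// enabled reports whether the named gate is switched on.
// Unknown gates are treated as off.
func enabled(name string) bool {
	return featureGates[name]
}

// handleFailure runs the rule-based path when the gate is on and
// falls back to the pre-existing behavior when it is off.
func handleFailure(ruleBased, legacy func() string) string {
	if enabled("ConfigurableFailurePolicy") {
		return ruleBased()
	}
	return legacy()
}

func main() {
	ruleBased := func() string { return "apply failure policy rules" }
	legacy := func() string { return "apply legacy restart behavior" }

	fmt.Println(handleFailure(ruleBased, legacy))

	// Turning the gate off restores the old code path without a spec change.
	featureGates["ConfigurableFailurePolicy"] = false
	fmt.Println(handleFailure(ruleBased, legacy))
}
```

Keeping the decision in one place means an operator could disable the feature without downgrading the deployment, which is the mitigation the comment is asking for.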
/lgtm Thanks for working on this! Will leave approval for @ahg-g
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ahg-g, jedwins1998 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/hold cancel
This pull request is to implement configurable failure policy.
There is one difference to note from the KEP. I added a new field to the JobSetStatus that tracks the number of restarts which count towards the restart limit. I then use this variable to allow some restarts to not count towards the maximum number of restarts.
This resolves #262.
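The restart accounting described above can be sketched roughly as follows. This is a simplified illustration and not the code from this PR: the `jobSetStatus` struct, its `Restarts` and `Failed` fields, the `applyFailureAction` helper, and the `RestartJobSet` action value are assumptions made for the example. Only `RestartJobSetAndIgnoreMaxRestarts` and `RestartsCountTowardsMax` appear in the conversation.

```go
package main

import "fmt"

// FailurePolicyAction names the action taken when a child job fails.
type FailurePolicyAction string

const (
	// RestartJobSet is assumed here as the action that counts towards the limit.
	RestartJobSet FailurePolicyAction = "RestartJobSet"
	// RestartJobSetAndIgnoreMaxRestarts is the action quoted in the review.
	RestartJobSetAndIgnoreMaxRestarts FailurePolicyAction = "RestartJobSetAndIgnoreMaxRestarts"
)

// jobSetStatus is a hypothetical, trimmed-down stand-in for JobSetStatus.
type jobSetStatus struct {
	Restarts                int32 // all restarts (assumed field)
	RestartsCountTowardsMax int32 // restarts that count towards the limit
	Failed                  bool  // assumed flag, set once the limit is exhausted
}

// applyFailureAction updates the status for one failure.
// Restarts that ignore the limit never touch RestartsCountTowardsMax.
func applyFailureAction(s *jobSetStatus, action FailurePolicyAction, maxRestarts int32) {
	switch action {
	case RestartJobSetAndIgnoreMaxRestarts:
		s.Restarts++
	case RestartJobSet:
		if s.RestartsCountTowardsMax >= maxRestarts {
			s.Failed = true
			return
		}
		s.Restarts++
		s.RestartsCountTowardsMax++
	}
}

func main() {
	s := &jobSetStatus{}
	applyFailureAction(s, RestartJobSetAndIgnoreMaxRestarts, 1) // not counted
	applyFailureAction(s, RestartJobSet, 1)                     // counted
	applyFailureAction(s, RestartJobSet, 1)                     // limit reached
	fmt.Printf("restarts=%d counted=%d failed=%t\n", s.Restarts, s.RestartsCountTowardsMax, s.Failed)
	// prints "restarts=2 counted=1 failed=true"
}
```

Tracking the counted restarts separately is what lets some restarts bypass the maximum while the limit is still enforced for the rest.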