-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add examples for three existing failure policy actions. #601
Add examples for three existing failure policy actions. #601
Conversation
Hi @jedwins1998. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
✅ Deploy Preview for kubernetes-sigs-jobset canceled.
|
/ok-to-test |
Add examples for each of the following failure policy actions: 1. FailJobSet, 2. RestartJobSet, 3. RestartJobSetAndIgnoreMaxRestarts.
…ilureReasons present.
…rereasons-present.yaml'.
…PodFailurePolicy.
Is this ready to merge? |
As of last week, Giuseppe (AI Infra team in GKE) was taking over this PR. |
I spoke with Giuseppe and I'll finish up this PR since I started it. |
I added an example similar to a host maintenance event. I also added short descriptions of the expected behavior in each example. I consider this PR ready to merge now. |
- action: RestartJobSetAndIgnoreMaxRestarts | ||
onJobFailureReasons: | ||
- PodFailurePolicy | ||
# The JobSet is restarted as normal when the leader job fails and the above rule is not matched. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does it mean for this to restart with maxRestarts 0
? It would fail right away right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that is correct. It would fail right away.
failurePolicy: | ||
maxRestarts: 3 | ||
rules: | ||
# The JobSet will restart and unlimited number of times when the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# The JobSet will restart and unlimited number of times when the | |
# The JobSet will restart an unlimited number of times when the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
echo "$i" | ||
sleep 1 | ||
done | ||
podFailurePolicy: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a comment here explaining this pod failure policy will trigger on host maintenance events when pods are evicted from the affected nodes, thus failing with a condition type of DisruptionTarget
?
Pod failure policy is a fairly new, advanced Job API feature that many users won't be familiar with.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
/retest |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: danielvegamyhre, jedwins1998 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Going to make a couple of small changes to this in a follow up |
Add examples for each of the following failure policy actions:
Fixes #600.