Implement configurable failure policy. #537

Merged 25 commits on May 21, 2024
Commits
f038a33
Implement configurable failure policy.
Apr 19, 2024
a3cecb1
Refactor findFirstFailedPolicyRuleAndJob, ruleIsApplicable, and TestF…
May 3, 2024
497a7c7
Move functions related to failure policy to failure_policy.go file.
May 6, 2024
9b5c252
Add name property to failure policy rules.
May 8, 2024
3ea78cf
Add log statement in ruleIsApplicable when parent replicatedJob is no…
May 8, 2024
dcd462b
Add definitions for the fields of eventParams.
May 8, 2024
92a3e8b
Implement event reasons and messages for each failure policy action.
May 8, 2024
898493f
Make the default failure policy rule action more clear.
May 8, 2024
e866397
Refactor TestFailurePolicyRuleIsApplicable to improve standardization.
May 8, 2024
ebac8e1
Remove outdated comment.
May 8, 2024
9af5c06
Create function parseFailJobOpts to remove closure in jobWithFailedCo…
May 8, 2024
ac8c3ea
Refactor TestJobSetDefaulting.
May 9, 2024
e91b6b0
Refactor TestValidateCreate.
May 9, 2024
4de124b
Add logging when a matching failure policy rule is found.
May 9, 2024
c5ae410
In TestValidateCreate, move the test for a valid failure policy rule …
May 9, 2024
9f0d08b
Add comment defining TestValidateCreate.
May 9, 2024
5b8f55d
Refactor TestFindFirstFailedPolicyRuleAndJob to make individual test …
May 14, 2024
0bdffeb
Update jobset controller integration tests related to failure policies.
May 17, 2024
a8354ff
Add check for value of .status.restarts field.
May 17, 2024
bf03055
Remove 'Reason' from the end of RestartJobSetAndIgnoreMaxRestartsActi…
May 20, 2024
715777a
Change the ordering of functions in failure_policy.go.
May 20, 2024
3fee011
Remove parentheses around '[]int'.
May 20, 2024
92fe2f7
Rename 'failurePolicyRule' to 'matchingFailurePolicyRule'.
May 20, 2024
6bd209f
Combine function call and error check into one line.
May 20, 2024
6a1ac21
Change 'gomega.BeComparableTo' to 'gomega.Equal'.
May 20, 2024
58 changes: 54 additions & 4 deletions api/jobset/v1alpha2/jobset_types.go
@@ -22,10 +22,11 @@ import (
 const (
 	JobSetNameKey string = "jobset.sigs.k8s.io/jobset-name"
 	ReplicatedJobReplicas string = "jobset.sigs.k8s.io/replicatedjob-replicas"
-	ReplicatedJobNameKey string = "jobset.sigs.k8s.io/replicatedjob-name"
-	JobIndexKey string = "jobset.sigs.k8s.io/job-index"
-	JobKey string = "jobset.sigs.k8s.io/job-key"
-	JobNameKey string = "job-name" // TODO(#26): Migrate to the fully qualified label name.
+	// ReplicatedJobNameKey is used to index into a Jobs labels and retrieve the name of the parent ReplicatedJob
+	ReplicatedJobNameKey string = "jobset.sigs.k8s.io/replicatedjob-name"
+	JobIndexKey string = "jobset.sigs.k8s.io/job-index"
+	JobKey string = "jobset.sigs.k8s.io/job-key"
+	JobNameKey string = "job-name" // TODO(#26): Migrate to the fully qualified label name.
 	// ExclusiveKey is an annotation that can be set on the JobSet or on a ReplicatedJob template.
 	// If set at the JobSet level, all child jobs from all ReplicatedJobs will be scheduled using exclusive
 	// job placement per topology group (defined as the label value).
@@ -130,6 +131,9 @@ type JobSetStatus struct {
 	// Restarts tracks the number of times the JobSet has restarted (i.e. recreated in case of RecreateAll policy).
 	Restarts int32 `json:"restarts,omitempty"`
 
+	// RestartsCountTowardsMax tracks the number of times the JobSet has restarted that counts towards the maximum allowed number of restarts.
+	RestartsCountTowardsMax int32 `json:"restartsCountTowardsMax,omitempty"`
+
 	// ReplicatedJobsStatus track the number of JobsReady for each replicatedJob.
 	// +optional
 	// +listType=map
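
Note on the two counters: the field comments above say that `Restarts` counts every restart while `RestartsCountTowardsMax` counts only restarts charged against `maxRestarts`. A minimal sketch of how the three actions defined further down in this file's diff might update them is given here. The `status` struct and `applyAction` function are hypothetical stand-ins written for illustration, and the increment logic is an assumption read off the field and action comments, not the controller code from this PR.

```go
package main

import "fmt"

// FailurePolicyAction mirrors the string type added in this diff.
type FailurePolicyAction string

const (
	FailJobSet                        FailurePolicyAction = "FailJobSet"
	RestartJobSet                     FailurePolicyAction = "RestartJobSet"
	RestartJobSetAndIgnoreMaxRestarts FailurePolicyAction = "RestartJobSetAndIgnoreMaxRestarts"
)

// status is a hypothetical, trimmed-down stand-in for JobSetStatus.
type status struct {
	Restarts                int32
	RestartsCountTowardsMax int32
	Failed                  bool
}

// applyAction updates st for a single job failure.
// Assumed logic for illustration only; it is not the controller's implementation.
func applyAction(st *status, action FailurePolicyAction, maxRestarts int32) {
	switch action {
	case FailJobSet:
		// Fail immediately, regardless of maxRestarts.
		st.Failed = true
	case RestartJobSetAndIgnoreMaxRestarts:
		// Restart without charging the failure against maxRestarts.
		st.Restarts++
	default: // RestartJobSet
		if st.RestartsCountTowardsMax >= maxRestarts {
			st.Failed = true
			return
		}
		st.Restarts++
		st.RestartsCountTowardsMax++
	}
}

func main() {
	st := &status{}
	actions := []FailurePolicyAction{RestartJobSetAndIgnoreMaxRestarts, RestartJobSet, RestartJobSet}
	for _, a := range actions {
		applyAction(st, a, 1) // maxRestarts = 1
	}
	fmt.Printf("%+v\n", *st) // prints {Restarts:2 RestartsCountTowardsMax:1 Failed:true}
}
```

In this sketch the ignored restart still shows up in `Restarts`, which is why the two fields can diverge.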
@@ -229,10 +233,56 @@ const (
 	OperatorAny Operator = "Any"
 )
 
+// FailurePolicyAction defines the action the JobSet controller will take for
+// a given FailurePolicyRule.
+type FailurePolicyAction string
+
+const (
+	// Fail the JobSet immediately, regardless of maxRestarts.
+	FailJobSet FailurePolicyAction = "FailJobSet"
+
+	// Restart the JobSet if the number of restart attempts is less than MaxRestarts.
+	// Otherwise, fail the JobSet.
+	RestartJobSet FailurePolicyAction = "RestartJobSet"
+
+	// Do not count the failure against maxRestarts.
+	RestartJobSetAndIgnoreMaxRestarts FailurePolicyAction = "RestartJobSetAndIgnoreMaxRestarts"
+)
+
+// FailurePolicyRule defines a FailurePolicyAction to be executed if a child job
danielvegamyhre (Contributor), May 17, 2024:

(This is a general comment unrelated to this file) Now that feature gate support has been merged in #557 I think we should add a feature gate (default on) for this feature. If the feature is not enabled, fall back to the current behavior.

We can do this in a follow up PR.

I want to avoid a scenario where we publish the v0.6.0 release and an important customer is using this feature, then they encounter a bug that slipped through the cracks, and we can't simply downgrade to v0.5.0 to mitigate because their JobSet spec (often defined in Python/Go code checked into their codebase) is using fields which only exist in v0.6.0 - thus requiring some emergency rollout on their end to revert their Python/Go code to a spec usable by JobSet v0.5.0, and then downgrade JobSet deployment to v0.5.0.

+// fails due to a reason listed in OnJobFailureReasons.
+type FailurePolicyRule struct {
+	// The name of the failure policy rule.
+	// The name is defaulted to 'failurePolicyRuleN' where N is the index of the failure policy rule.
+	// The name must match the regular expression "^[A-Za-z]([A-Za-z0-9_,:]*[A-Za-z0-9_])?$".
+	Name string `json:"name"`
+	// The action to take if the rule is matched.
+	// +kubebuilder:validation:Enum:=FailJobSet;RestartJobSet;RestartJobSetAndIgnoreMaxRestarts
+	Action FailurePolicyAction `json:"action"`
+	// The requirement on the job failure reasons. The requirement
+	// is satisfied if at least one reason matches the list.
+	// The rules are evaluated in order, and the first matching
+	// rule is executed.
+	// An empty list applies the rule to any job failure reason.
+	// +kubebuilder:validation:UniqueItems:true
+	OnJobFailureReasons []string `json:"onJobFailureReasons"`
+	// TargetReplicatedJobs are the names of the replicated jobs the operator applies to.
+	// An empty list will apply to all replicatedJobs.
+	// +optional
+	// +listType=atomic
+	TargetReplicatedJobs []string `json:"targetReplicatedJobs,omitempty"`
+}
+
 type FailurePolicy struct {
 	// MaxRestarts defines the limit on the number of JobSet restarts.
 	// A restart is achieved by recreating all active child jobs.
 	MaxRestarts int32 `json:"maxRestarts,omitempty"`
+
+	// List of failure policy rules for this JobSet.
+	// For a given Job failure, the rules will be evaluated in order,
+	// and only the first matching rule will be executed.
+	// If no matching rule is found, the RestartJobSet action is applied.
+	Rules []FailurePolicyRule `json:"rules,omitempty"`
 }
 
 type SuccessPolicy struct {
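
Note on rule evaluation: the field comments in this file describe first-match semantics. Rules are checked in order, an empty `OnJobFailureReasons` or `TargetReplicatedJobs` list matches anything, and `RestartJobSet` is applied when no rule matches. The sketch below illustrates that reading. The `failedJob` type and the `matches` and `actionFor` helpers are hypothetical names introduced here; the PR's own helpers (`ruleIsApplicable`, `findFirstFailedPolicyRuleAndJob`) are not shown on this page and may differ. It assumes Go 1.21+ for the `slices` package.

```go
package main

import (
	"fmt"
	"slices"
)

// Local copies of the API types from this diff, reduced to what the sketch needs.
type FailurePolicyAction string

const (
	FailJobSet                        FailurePolicyAction = "FailJobSet"
	RestartJobSet                     FailurePolicyAction = "RestartJobSet"
	RestartJobSetAndIgnoreMaxRestarts FailurePolicyAction = "RestartJobSetAndIgnoreMaxRestarts"
)

type FailurePolicyRule struct {
	Name                 string
	Action               FailurePolicyAction
	OnJobFailureReasons  []string
	TargetReplicatedJobs []string
}

// failedJob is a hypothetical summary of a failed child job.
type failedJob struct {
	ReplicatedJob string // name of the parent ReplicatedJob
	Reason        string // job failure reason
}

// matches reports whether rule applies to job. An empty list matches anything.
func matches(rule FailurePolicyRule, job failedJob) bool {
	reasonOK := len(rule.OnJobFailureReasons) == 0 ||
		slices.Contains(rule.OnJobFailureReasons, job.Reason)
	targetOK := len(rule.TargetReplicatedJobs) == 0 ||
		slices.Contains(rule.TargetReplicatedJobs, job.ReplicatedJob)
	return reasonOK && targetOK
}

// actionFor returns the action of the first matching rule,
// falling back to RestartJobSet when no rule matches.
func actionFor(rules []FailurePolicyRule, job failedJob) FailurePolicyAction {
	for _, rule := range rules {
		if matches(rule, job) {
			return rule.Action
		}
	}
	return RestartJobSet
}

func main() {
	rules := []FailurePolicyRule{
		{Name: "failFast", Action: FailJobSet, OnJobFailureReasons: []string{"PodFailurePolicy"}},
		{Name: "retryWorkers", Action: RestartJobSetAndIgnoreMaxRestarts, TargetReplicatedJobs: []string{"workers"}},
	}
	fmt.Println(actionFor(rules, failedJob{ReplicatedJob: "workers", Reason: "PodFailurePolicy"}))
	fmt.Println(actionFor(rules, failedJob{ReplicatedJob: "workers", Reason: "BackoffLimitExceeded"}))
	fmt.Println(actionFor(rules, failedJob{ReplicatedJob: "leader", Reason: "BackoffLimitExceeded"}))
}
```

Because evaluation stops at the first match, narrower rules need to be listed before broader ones; in the example, swapping the two rules would make `failFast` unreachable for jobs in `workers`.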
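
Note on rule names: the `Name` field comment gives a default of `failurePolicyRuleN` and a regular expression the name must match. The sketch below shows what that pattern accepts and rejects. `defaultRuleName` and `validRuleName` are hypothetical helpers written for illustration; how the PR's webhook actually performs defaulting and validation is not visible on this page, and whether N is zero-based is an assumption.

```go
package main

import (
	"fmt"
	"regexp"
)

// ruleNamePattern is the expression quoted in the FailurePolicyRule.Name comment.
var ruleNamePattern = regexp.MustCompile(`^[A-Za-z]([A-Za-z0-9_,:]*[A-Za-z0-9_])?$`)

// defaultRuleName builds the documented default name for the rule at index i
// (hypothetical helper; zero-based indexing is assumed).
func defaultRuleName(i int) string {
	return fmt.Sprintf("failurePolicyRule%d", i)
}

// validRuleName reports whether name satisfies the documented pattern.
func validRuleName(name string) bool {
	return ruleNamePattern.MatchString(name)
}

func main() {
	names := []string{defaultRuleName(0), "retry:oom", "a", "9lives", "trailing:", ""}
	for _, n := range names {
		fmt.Printf("%q valid=%t\n", n, validRuleName(n))
	}
}
```

The pattern requires a leading letter and forbids a trailing `,` or `:`, so `9lives` and `trailing:` are rejected while a single letter such as `a` is accepted.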
Generated files (diffs not rendered by default):

89 changes: 89 additions & 0 deletions api/jobset/v1alpha2/openapi_generated.go
34 changes: 33 additions & 1 deletion api/jobset/v1alpha2/zz_generated.deepcopy.go
16 changes: 15 additions & 1 deletion client-go/applyconfiguration/jobset/v1alpha2/failurepolicy.go
70 changes: 70 additions & 0 deletions client-go/applyconfiguration/jobset/v1alpha2/failurepolicyrule.go
15 changes: 12 additions & 3 deletions client-go/applyconfiguration/jobset/v1alpha2/jobsetstatus.go
2 changes: 2 additions & 0 deletions client-go/applyconfiguration/utils.go