-
Notifications
You must be signed in to change notification settings - Fork 59
Finish simplify retry behavior PR #623
Finish simplify retry behavior PR #623
Conversation
Signed-off-by: Dennis Keck <26092524+fellhorn@users.noreply.github.com>
Signed-off-by: Dennis Keck <26092524+fellhorn@users.noreply.github.com>
Signed-off-by: Dennis Keck <26092524+fellhorn@users.noreply.github.com>
Signed-off-by: Dennis Keck <26092524+fellhorn@users.noreply.github.com>
maxAttempts = uint32(*nCtx.Node().GetRetryStrategy().MinAttempts) | ||
} | ||
isEligible = currentAttempt < maxAttempts | ||
currentAttempt = nodeStatus.GetAttempts() + 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We removed nodeStatus.GetSystemFailures()
. As you can see in
currentAttempt = (nodeStatus.GetAttempts() + 1) - nodeStatus.GetSystemFailures()
(line 813)
GetAttempts()
is the sum of system failures and user failures
When creating the test table we did not notice this as we always only used a single retry for system failures. For our final tests we now did some more extensive search for one off errors.
} | ||
maxAttempts = uint32(config.GetConfig().NodeConfig.DefaultMaxAttempts) | ||
if nCtx.Node().GetRetryStrategy() != nil && nCtx.Node().GetRetryStrategy().MinAttempts != nil && *nCtx.Node().GetRetryStrategy().MinAttempts != 1 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
*nCtx.Node().GetRetryStrategy().MinAttempts != 1 # Changed this 0 -> 1
If a user sets 0
retries, MinAttempts
will be 1
:
https://github.com/fellhorn/flytepropeller/blob/feature/simplify-retries/pkg/compiler/transformers/k8s/utils.go#L24-L25
{"IneligibleSystemRetries", false, 0, 1, 1, 1, core.ExecutionError_SYSTEM, false}, | ||
{"IneligibleSystemRetries", false, 1, 1, 1, 1, core.ExecutionError_SYSTEM, false}, | ||
{"IgnoreCauseEligibleUserRetries", true, 0, 0, 2, 0, core.ExecutionError_USER, true}, | ||
{"IgnoreCauseIneligibleUserRetries", true, 1, 0, 2, 0, core.ExecutionError_USER, false}, | ||
{"IgnoreCauseEligibleSystemRetries", true, 0, 0, 2, 0, core.ExecutionError_SYSTEM, true}, | ||
{"IgnoreCauseIneligibleSystemRetries", true, 0, 1, 2, 0, core.ExecutionError_SYSTEM, false}, | ||
{"IgnoreCauseIneligibleSystemRetries", true, 1, 1, 2, 0, core.ExecutionError_SYSTEM, false}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests were incorrect as attempts must be >= systemFailures
(attempts is sum of user failures and system retries)
interruptible = false | ||
c.metrics.InterruptedThresholdHit.Inc(ctx) | ||
} | ||
} else { | ||
// Else a node is not considered interruptible if the system failures have exceeded the configured threshold | ||
if interruptible && ((c.interruptibleFailureThreshold > 0 && int32(s.GetSystemFailures()) >= c.interruptibleFailureThreshold) || | ||
(c.interruptibleFailureThreshold <= 0 && int32(s.GetSystemFailures()) > int32(c.maxNodeRetriesForSystemFailures)+c.interruptibleFailureThreshold)) { | ||
if interruptible && isAboveInterruptibleFailureThreshold(s.GetSystemFailures(), c.maxNodeRetriesForSystemFailures+1, c.interruptibleFailureThreshold) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
because we use >=
and use a shared function isAboveInterruptibleFailureThreshold
which takes maxAttempts
as argument
TL;DR
As discussed with @hamersaw this brings the last finishing touches and fixes some one off bugs, see comments below.
Type
Are all requirements met?