fix(executor): Retry kubectl on transient error #6472
Closes argoproj#6467 Signed-off-by: William Van Hevelingen <william.vanhevelingen@acquia.com>
Perhaps you can test this with a TRANSIENT_ERROR_PATTERN that captures an intentional error that needs to be retried?
@@ -34,7 +35,11 @@ func (we *WorkflowExecutor) ExecResource(action string, manifestPath string, fla
 	cmd := exec.Command("kubectl", args...)
 	log.Info(strings.Join(cmd.Args, " "))
 
-	out, err := cmd.Output()
+	var out []byte
+	err = retry.OnError(retry.DefaultBackoff, argoerr.IsTransientErr, func() error {
hmmm... what does this do, run the command X many times?
@sarabala1979 @whynowy I did not know about retry.OnError - better than argowait I think
agree with @alexec. argowait should be used. you can enhance argowait with onError() if you want it.
sorry - I actually meant the opposite - we should stop using argowait, and use retry.OnError instead
Example:
err := waitutil.Backoff(defaultRetry,
	func() (bool, error) {
		log.Infof("GCS List bucekt: %s, key: %s", artifact.GCS.Bucket, artifact.GCS.Key)
		client, err := g.newGCSClient()
		if err != nil {
			log.Warnf("Failed to create new GCS client: %v", err)
			return isTransientGCSErr(err), err
		}
		defer client.Close()
		files, err = listByPrefix(client, artifact.GCS.Bucket, artifact.GCS.Key, "")
		if err != nil {
			return isTransientGCSErr(err), err
		}
		return true, nil
	})
Becomes
err := retry.OnError(defaultRetry, isTransientGCSErr,
	func() error {
		log.Infof("GCS List bucekt: %s, key: %s", artifact.GCS.Bucket, artifact.GCS.Key)
		client, err := g.newGCSClient()
		if err != nil {
			log.Warnf("Failed to create new GCS client: %v", err)
			return err
		}
		defer client.Close()
		files, err = listByPrefix(client, artifact.GCS.Bucket, artifact.GCS.Key, "")
		if err != nil {
			return err
		}
		return nil
	})
ok
Codecov Report
@@            Coverage Diff             @@
##           master    #6472      +/-   ##
==========================================
+ Coverage   48.61%   48.67%   +0.06%
==========================================
  Files         260      262       +2
  Lines       18855    18983     +128
==========================================
+ Hits         9166     9240      +74
- Misses       8683     8707      +24
- Partials     1006     1036      +30
It's not working yet. https://gist.github.com/blkperl/524ba2b7b0765c8d289ed3bcb892a292 It didn't detect the error as transient but it did execute the
If it is a work in progress, do you want to put it into draft?
Signed-off-by: William Van Hevelingen <william.vanhevelingen@acquia.com> Co-Authored-By: Glenn Pratt <glenn.pratt@acquia.com>
🍏 Manual Test 🍏
Test 1: Cluster remains unavailable
Test 2: Cluster recovers half way through the retries
Test 3: K8s api is stable
Closes #6467
Signed-off-by: William Van Hevelingen william.vanhevelingen@acquia.com
Co-Authored-By: Glenn Pratt glenn.pratt@acquia.com
Checklist:
Tips:
git commit --signoff
Run make pre-commit -B to fix codegen or lint problems.
I'm not sure how to test this change.