Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(executor): Retry kubectl on transient error #6472

Merged
merged 7 commits into from
Sep 7, 2021

Conversation

blkperl
Copy link
Contributor

@blkperl blkperl commented Aug 2, 2021

Closes #6467

Signed-off-by: William Van Hevelingen william.vanhevelingen@acquia.com
Co-Authored-By: Glenn Pratt glenn.pratt@acquia.com

Checklist:

Tips:

  • Your PR needs to pass the required checks before it can be approved. If the check is not required (e.g. E2E tests) it does not need to pass
  • Sign-off your commits to pass the DCO check: git commit --signoff.
  • Run make pre-commit -B to fix codegen or lint problems.
  • Say how how you tested your changes. If you changed the UI, attach screenshots.

I'm not sure how to test this change.

Closes argoproj#6467

Signed-off-by: William Van Hevelingen <william.vanhevelingen@acquia.com>
blkperl and others added 2 commits August 4, 2021 12:01
Signed-off-by: William Van Hevelingen <william.vanhevelingen@acquia.com>
Copy link
Member

@terrytangyuan terrytangyuan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps you can test this with a TRANSIENT_ERROR_PATTERN that captures an intentional error that needs to be retried?

@@ -34,7 +35,11 @@ func (we *WorkflowExecutor) ExecResource(action string, manifestPath string, fla
cmd := exec.Command("kubectl", args...)
log.Info(strings.Join(cmd.Args, " "))

out, err := cmd.Output()
var out []byte
err = retry.OnError(retry.DefaultBackoff, argoerr.IsTransientErr, func() error {
Copy link
Contributor

@alexec alexec Aug 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm... what does this do, run the command X many times?
@sarabala1979 @whynowy I did not know about retry.OnError - better than argowait I think

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree with @alexec . argowait should be used. you can enhance argowait with onError() if you want it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry - I actually meant the opposite - we should stop using argowait, and use retry.OnError instead

Copy link
Contributor

@alexec alexec Aug 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Example:

	err := waitutil.Backoff(defaultRetry,
		func() (bool, error) {
			log.Infof("GCS List bucekt: %s, key: %s", artifact.GCS.Bucket, artifact.GCS.Key)
			client, err := g.newGCSClient()
			if err != nil {
				log.Warnf("Failed to create new GCS client: %v", err)
				return isTransientGCSErr(err), err
			}
			defer client.Close()
			files, err = listByPrefix(client, artifact.GCS.Bucket, artifact.GCS.Key, "")
			if err != nil {
				return isTransientGCSErr(err), err
			}
			return true, nil
		})

Becomes

	err := retry.OnError(defaultRetry, isTransientGCSErr,
		func() error {
			log.Infof("GCS List bucekt: %s, key: %s", artifact.GCS.Bucket, artifact.GCS.Key)
			client, err := g.newGCSClient()
			if err != nil {
				log.Warnf("Failed to create new GCS client: %v", err)
				return err
			}
			defer client.Close()
			files, err = listByPrefix(client, artifact.GCS.Bucket, artifact.GCS.Key, "")
			if err != nil {
				return err
			}
			return nil
		})

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@codecov
Copy link

codecov bot commented Aug 6, 2021

Codecov Report

Merging #6472 (7858ccb) into master (15e1db9) will increase coverage by 0.06%.
The diff coverage is 85.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #6472      +/-   ##
==========================================
+ Coverage   48.61%   48.67%   +0.06%     
==========================================
  Files         260      262       +2     
  Lines       18855    18983     +128     
==========================================
+ Hits         9166     9240      +74     
- Misses       8683     8707      +24     
- Partials     1006     1036      +30     
Impacted Files Coverage Δ
util/errors/errors.go 94.73% <77.77%> (-5.27%) ⬇️
workflow/executor/resource.go 38.86% <90.90%> (+6.77%) ⬆️
workflow/sync/chain_throttler.go 68.75% <0.00%> (-31.25%) ⬇️
server/auth/types/claims.go 76.74% <0.00%> (-7.26%) ⬇️
workflow/sync/throttler.go 93.20% <0.00%> (-3.39%) ⬇️
workflow/artifacts/s3/s3.go 36.78% <0.00%> (-2.73%) ⬇️
cmd/argo/commands/get.go 58.30% <0.00%> (-2.05%) ⬇️
workflow/controller/exit_handler.go 68.18% <0.00%> (-2.04%) ⬇️
cmd/argoexec/commands/emissary.go 50.35% <0.00%> (-1.44%) ⬇️
cmd/argo/commands/server.go 30.76% <0.00%> (-1.27%) ⬇️
... and 23 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 15e1db9...7858ccb. Read the comment docs.

Signed-off-by: William Van Hevelingen <william.vanhevelingen@acquia.com>
@blkperl
Copy link
Contributor Author

blkperl commented Aug 9, 2021

It's not working yet. https://gist.github.com/blkperl/524ba2b7b0765c8d289ed3bcb892a292

It didn't detect the error as transient but it did execute the IsTransientErr function because log.Warnf("Non-transient error: %v", err) was run

@alexec
Copy link
Contributor

alexec commented Aug 11, 2021

If it is a work in progress, do you want to put it into draft?

@blkperl blkperl closed this Aug 13, 2021
@blkperl blkperl reopened this Aug 13, 2021
@blkperl

This comment has been minimized.

@alexec alexec marked this pull request as draft August 13, 2021 22:59
blkperl and others added 2 commits September 3, 2021 09:35
Signed-off-by: William Van Hevelingen <william.vanhevelingen@acquia.com>
Co-Authored-By: Glenn Pratt <glenn.pratt@acquia.com>
@blkperl blkperl marked this pull request as ready for review September 3, 2021 20:14
@blkperl
Copy link
Contributor Author

blkperl commented Sep 3, 2021

🍏 Manual Test 🍏

Test 1: Cluster remains unavailable

k logs k8s-jobs-q4x6z -f
time="2021-09-03T20:15:57.314Z" level=info msg="Starting Workflow Executor" executorType=pns version=untagged
time="2021-09-03T20:15:57.321Z" level=info msg="Creating PNS executor (namespace: default, pod: k8s-jobs-q4x6z, pid: 6)"
time="2021-09-03T20:15:57.321Z" level=info msg="Creating a K8sAPI executor"
time="2021-09-03T20:15:57.321Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=default podName=k8s-jobs-q4x6z template="{\"name\":\"main\",\"inputs\":{},\"outputs\":{},\"metadata\":{},\"resource\":{\"action\":\"apply\",\"manifest\":\"apiVersion: v1\\nkind: ConfigMap\\nmetadata:\\n  name: test\\n  namespace: default\\ndata:\\n  keys: \\\"test\\\"\\n\"}}" version="&Version{Version:untagged,BuildDate:2021-09-03T20:14:44Z,GitCommit:7858ccb90995fe33d92ddc7fdc67a6b73c63edf3,GitTag:untagged,GitTreeState:clean,GoVersion:go1.16.7,Compiler:gc,Platform:linux/amd64,}"
time="2021-09-03T20:15:57.321Z" level=info msg="Loading manifest to /tmp/manifest.yaml"
time="2021-09-03T20:15:57.330Z" level=info msg="kubectl apply -f /tmp/manifest.yaml -o json"
time="2021-09-03T20:16:27.926Z" level=info msg="Transient error: exit status 1"
time="2021-09-03T20:16:27.938Z" level=info msg="kubectl apply -f /tmp/manifest.yaml -o json"
time="2021-09-03T20:16:58.194Z" level=info msg="Transient error: exit status 1"
time="2021-09-03T20:16:58.250Z" level=info msg="kubectl apply -f /tmp/manifest.yaml -o json"
time="2021-09-03T20:17:28.389Z" level=info msg="Transient error: exit status 1"
time="2021-09-03T20:17:28.656Z" level=info msg="kubectl apply -f /tmp/manifest.yaml -o json"
time="2021-09-03T20:17:58.782Z" level=info msg="Transient error: exit status 1"
time="2021-09-03T20:17:58.782Z" level=error msg="executor error: no more retries Unable to connect to the server: dial tcp 10.152.183.1:443: i/o timeout\ngit.luolix.top/argoproj/argo-workflows/v3/errors.Wrap\n\t/go/src/github.com/argoproj/argo-workflows/errors/errors.go:88\ngit.luolix.top/argoproj/argo-workflows/v3/workflow/executor.(*WorkflowExecutor).ExecResource\n\t/go/src/github.com/argoproj/argo-workflows/workflow/executor/resource.go:50\ngit.luolix.top/argoproj/argo-workflows/v3/cmd/argoexec/commands.execResource\n\t/go/src/github.com/argoproj/argo-workflows/cmd/argoexec/commands/resource.go:48\ngit.luolix.top/argoproj/argo-workflows/v3/cmd/argoexec/commands.NewResourceCommand.func1\n\t/go/src/github.com/argoproj/argo-workflows/cmd/argoexec/commands/resource.go:25\ngit.luolix.top/spf13/cobra.(*Command).execute\n\t/go/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:846\ngit.luolix.top/spf13/cobra.(*Command).ExecuteC\n\t/go/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:950\ngit.luolix.top/spf13/cobra.(*Command).Execute\n\t/go/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:887\nmain.main\n\t/go/src/github.com/argoproj/argo-workflows/cmd/argoexec/main.go:15\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:225\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1371"
time="2021-09-03T20:17:58.783Z" level=fatal msg="no more retries Unable to connect to the server: dial tcp 10.152.183.1:443: i/o timeout\ngit.luolix.top/argoproj/argo-workflows/v3/errors.Wrap\n\t/go/src/github.com/argoproj/argo-workflows/errors/errors.go:88\ngit.luolix.top/argoproj/argo-workflows/v3/workflow/executor.(*WorkflowExecutor).ExecResource\n\t/go/src/github.com/argoproj/argo-workflows/workflow/executor/resource.go:50\ngit.luolix.top/argoproj/argo-workflows/v3/cmd/argoexec/commands.execResource\n\t/go/src/github.com/argoproj/argo-workflows/cmd/argoexec/commands/resource.go:48\ngit.luolix.top/argoproj/argo-workflows/v3/cmd/argoexec/commands.NewResourceCommand.func1\n\t/go/src/github.com/argoproj/argo-workflows/cmd/argoexec/commands/resource.go:25\ngit.luolix.top/spf13/cobra.(*Command).execute\n\t/go/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:846\ngit.luolix.top/spf13/cobra.(*Command).ExecuteC\n\t/go/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:950\ngit.luolix.top/spf13/cobra.(*Command).Execute\n\t/go/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:887\nmain.main\n\t/go/src/github.com/argoproj/argo-workflows/cmd/argoexec/main.go:15\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:225\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1371"

Test 2: Cluster recovers half way through the retries

k logs k8s-jobs-fhtth -f
time="2021-09-03T20:20:28.034Z" level=info msg="Starting Workflow Executor" executorType=pns version=untagged
time="2021-09-03T20:20:28.040Z" level=info msg="Creating PNS executor (namespace: default, pod: k8s-jobs-fhtth, pid: 7)"
time="2021-09-03T20:20:28.040Z" level=info msg="Creating a K8sAPI executor"
time="2021-09-03T20:20:28.041Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=default podName=k8s-jobs-fhtth template="{\"name\":\"main\",\"inputs\":{},\"outputs\":{},\"metadata\":{},\"resource\":{\"action\":\"apply\",\"manifest\":\"apiVersion: v1\\nkind: ConfigMap\\nmetadata:\\n  name: test\\n  namespace: default\\ndata:\\n  keys: \\\"test\\\"\\n\"}}" version="&Version{Version:untagged,BuildDate:2021-09-03T20:14:44Z,GitCommit:7858ccb90995fe33d92ddc7fdc67a6b73c63edf3,GitTag:untagged,GitTreeState:clean,GoVersion:go1.16.7,Compiler:gc,Platform:linux/amd64,}"
time="2021-09-03T20:20:28.041Z" level=info msg="Loading manifest to /tmp/manifest.yaml"
time="2021-09-03T20:20:28.041Z" level=info msg="kubectl apply -f /tmp/manifest.yaml -o json"
time="2021-09-03T20:20:58.159Z" level=info msg="Transient error: exit status 1"
time="2021-09-03T20:20:58.170Z" level=info msg="kubectl apply -f /tmp/manifest.yaml -o json"
time="2021-09-03T20:21:16.163Z" level=info msg="Resource: default/configmap./test. SelfLink: api/v1/namespaces/default/configmaps/test"
time="2021-09-03T20:21:16.163Z" level=info msg="No output parameters"

Test 3: K8s api is stable

k logs k8s-jobs-2rmk8
time="2021-09-03T20:28:57.014Z" level=info msg="Starting Workflow Executor" executorType=pns version=untagged
time="2021-09-03T20:28:57.019Z" level=info msg="Creating PNS executor (namespace: default, pod: k8s-jobs-2rmk8, pid: 6)"
time="2021-09-03T20:28:57.019Z" level=info msg="Creating a K8sAPI executor"
time="2021-09-03T20:28:57.019Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=default podName=k8s-jobs-2rmk8 template="{\"name\":\"main\",\"inputs\":{},\"outputs\":{},\"metadata\":{},\"resource\":{\"action\":\"apply\",\"manifest\":\"apiVersion: v1\\nkind: ConfigMap\\nmetadata:\\n  name: test\\n  namespace: default\\ndata:\\n  keys: \\\"test\\\"\\n\"}}" version="&Version{Version:untagged,BuildDate:2021-09-03T20:14:44Z,GitCommit:7858ccb90995fe33d92ddc7fdc67a6b73c63edf3,GitTag:untagged,GitTreeState:clean,GoVersion:go1.16.7,Compiler:gc,Platform:linux/amd64,}"
time="2021-09-03T20:28:57.019Z" level=info msg="Loading manifest to /tmp/manifest.yaml"
time="2021-09-03T20:28:57.021Z" level=info msg="kubectl apply -f /tmp/manifest.yaml -o json"
time="2021-09-03T20:29:13.624Z" level=info msg="Resource: default/configmap./test. SelfLink: api/v1/namespaces/default/configmaps/test"
time="2021-09-03T20:29:13.624Z" level=info msg="No output parameters"

@blkperl blkperl requested a review from alexec September 3, 2021 20:31
@alexec alexec changed the title fix: Retry ExecResource on transient error. fix(executor): Retry kubectl on transient error Sep 7, 2021
@alexec alexec merged commit 9c76cc3 into argoproj:master Sep 7, 2021
@blkperl blkperl deleted the fix_6467 branch September 7, 2021 15:34
@sarabala1979 sarabala1979 mentioned this pull request Sep 9, 2021
68 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

commands.NewResourceCommand should retry when k8s api is unavailable
4 participants