Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(executor): fix GCS artifact retry #6302

Merged
merged 4 commits into from
Jul 23, 2021

Conversation

tczhao
Copy link
Member

@tczhao tczhao commented Jul 8, 2021

The current gcs artifact exponentialbackoff is not working.

  • fix retry
  • added only retry on transient error

gcs bucket accepts about 1000 write request initially, then autoscales up. link
When having over 1000 write request in a burst from stale, we will see some errors.

Tested with below workflows/config on gke:1.18.17-gke.1901 with

  • argo-workflow 3.1.1 + gcloud bucket1
  • argo-workflow 3.1.1 + gcloud bucket2 + this PR
Workflows
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: stress-test-
  labels:
    workflows.argoproj.io/container-runtime-executor: emissary
spec:
  entrypoint: loops-dag
  podGC:
    strategy: OnWorkflowSuccess
  templates:
  - name: loops-dag
    dag:
      tasks:
      - name: A
        template: gen-process-indices
      - name: B
        dependencies: [A]
        template: whalesay
        arguments:
          parameters:
          - {name: message, value: "{{item}}"}
        withParam: "{{tasks.A.outputs.result}}"
      - name: C
        dependencies: [B]
        template: whalesay
        arguments:
          parameters:
          - {name: message, value: C}
  - name: whalesay
    inputs:
      parameters:
      - name: message
    container:
      image: docker/whalesay:latest
      command: [cowsay]
      args: ["{{inputs.parameters.message}}"]
  - name: gen-process-indices
    script:
      image: python:alpine3.6
      command: [python]
      source: |
        import json
        import sys
        json.dump([i for i in range(0, 2000)], sys.stdout)
---
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: artifact-passing-
spec:
  entrypoint: artifact-example
  templates:
  - name: artifact-example
    steps:
    - - name: generate-artifact
        template: whalesay
    - - name: consume-artifact
        template: print-message
        arguments:
          artifacts:
          # bind message to the hello-art artifact
          # generated by the generate-artifact step
          - name: message
            from: "{{steps.generate-artifact.outputs.artifacts.hello-art}}"

  - name: whalesay
    container:
      image: docker/whalesay:latest
      command: [sh, -c]
      args: ["cowsay hello world | tee /tmp/hello_world.txt"]
    outputs:
      artifacts:
      # generate hello-art artifact from /tmp/hello_world.txt
      # artifacts can be directories as well as files
      - name: hello-art
        path: /tmp/hello_world.txt

  - name: print-message
    inputs:
      artifacts:
      # unpack the message input artifact
      # and put it at /tmp/message
      - name: message
        path: /tmp/message
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["cat /tmp/message"]
Config
    artifactRepository:
      archiveLogs: true
      gcs:
        bucket: bucket1
        keyFormat: "pipeline/log/{{workflow.name}}/{{pod.name}}"
        serviceAccountKeySecret:
          name: service-ac-key
          key: service-ac-key.json
    artifactRepository:
      archiveLogs: true
      gcs:
        bucket: bucket2
        keyFormat: "pipeline/log/{{workflow.name}}/{{pod.name}}"
        serviceAccountKeySecret:
          name: service-ac-key
          key: service-ac-key.json

argo-workflow 3.1.1 + gcloud bucket1

produce following error on wait container

wait container log
time="2021-07-15T15:27:29.299Z" level=info msg="Starting annotations monitor"
time="2021-07-15T15:27:29.299Z" level=info msg="Starting deadline monitor"
time="2021-07-15T15:28:53.315Z" level=info msg="Main container completed"
time="2021-07-15T15:28:53.315Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2021-07-15T15:28:53.315Z" level=info msg="Saving logs"
time="2021-07-15T15:28:53.336Z" level=info msg="GCS Save path: /tmp/argo/outputs/logs/main.log, key: pipeline/log/stress-test-2ztc4/stress-test-2ztc4-3
865573343/main.log"
time="2021-07-15T15:29:03.425Z" level=error msg="executor error: upload /tmp/argo/outputs/logs/main.log: writer close: Post \"https://storage.googleapi
s.com/upload/storage/v1/b/bucket1/o?alt=json&name=pipeline%2Flog%2Fstress-test-2ztc4%2Fstress-test-2ztc4-3865573343%2Fmain.log&prettyPrint
=false&projection=full&uploadType=multipart\": oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": net/http: TLS handshake timeou
t"
time="2021-07-15T15:29:03.425Z" level=info msg="No output parameters"
time="2021-07-15T15:29:03.425Z" level=info msg="No output artifacts"
time="2021-07-15T15:29:03.425Z" level=info msg="Killing sidecars []"
time="2021-07-15T15:29:03.426Z" level=info msg="Alloc=24648 TotalAlloc=29651 Sys=73553 NumGC=4 Goroutines=9"
time="2021-07-15T15:29:03.455Z" level=fatal msg="upload /tmp/argo/outputs/logs/main.log: writer close: Post \"https://storage.googleapis.com/upload/sto
rage/v1/b/bucket1/o?alt=json&name=pipeline%2Flog%2Fstress-test-2ztc4%2Fstress-test-2ztc4-3865573343%2Fmain.log&prettyPrint=false&projectio
n=full&uploadType=multipart\": oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": net/http: TLS handshake timeout"

argo-workflow 3.1.1 + gcloud bucket2 + this PR

workflow successfully finish
we can see that there's retry from the log

wait container log
time="2021-07-17T13:19:59.647Z" level=info msg="Starting Workflow Executor" executorType=emissary version=v3.1.1+a245fe6.dirty
time="2021-07-17T13:19:59.664Z" level=info msg="Executor initialized" includeScriptOutput=false namespace=default podName=stress-test-f65vd-1802293675 template="{\"name\":\"whalesay\",\"inputs\":{\"parameters\":[{\"name\":\"message\",\"value\":\"1857\"}]},\"outputs\":{},\"metadata\":{},\"container\":{\"name\":\"\",\"image\":\"docker/whalesay:latest\",\"command\":[\"cowsay\"],\"args\":[\"1857\"],\"resources\":{}},\"archiveLocation\":{\"archiveLogs\":true,\"gcs\":{\"bucket\":\"wx-lty-mmm-super-dev\",\"serviceAccountKeySecret\":{\"name\":\"service-ac-key\",\"key\":\"service-ac-key.json\"},\"key\":\"pipeline/log/stress-test-f65vd/stress-test-f65vd-1802293675\"}}}" version="&Version{Version:v3.1.1+a245fe6.dirty,BuildDate:2021-07-17T13:11:50Z,GitCommit:a245fe67db56d2808fb78c6079d08404cbee91aa,GitTag:v3.1.1,GitTreeState:dirty,GoVersion:go1.15.7,Compiler:gc,Platform:linux/amd64,}"
time="2021-07-17T13:19:59.664Z" level=info msg="Starting annotations monitor"
time="2021-07-17T13:19:59.666Z" level=info msg="Starting deadline monitor"
time="2021-07-17T13:20:19.669Z" level=info msg="Main container completed"
time="2021-07-17T13:20:19.669Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2021-07-17T13:20:19.669Z" level=info msg="Saving logs"
time="2021-07-17T13:20:19.749Z" level=info msg="GCS Save path: /tmp/argo/outputs/logs/main.log, key: pipeline/log/stress-test-f65vd/stress-test-f65vd-1802293675/main.log"
time="2021-07-17T13:20:31.870Z" level=info msg="GCS Save path: /tmp/argo/outputs/logs/main.log, key: pipeline/log/stress-test-f65vd/stress-test-f65vd-1802293675/main.log"
time="2021-07-17T13:20:32.032Z" level=info msg="not deleting local artifact" localArtPath=/tmp/argo/outputs/logs/main.log
time="2021-07-17T13:20:32.032Z" level=info msg="Successfully saved file: /tmp/argo/outputs/logs/main.log"
time="2021-07-17T13:20:32.032Z" level=info msg="No output parameters"
time="2021-07-17T13:20:32.032Z" level=info msg="No output artifacts"
time="2021-07-17T13:20:32.032Z" level=info msg="Annotating pod with output"
time="2021-07-17T13:20:32.053Z" level=info msg="Patch pods 200"
time="2021-07-17T13:20:32.063Z" level=info msg="Killing sidecars []"
time="2021-07-17T13:20:32.063Z" level=info msg="Alloc=23466 TotalAlloc=50845 Sys=73297 NumGC=5 Goroutines=11"

Checklist:

Tips:

  • Your PR needs to pass the required checks before it can be approved. If the check is not required (e.g. E2E tests) it does not need to pass
  • Sign-off your commits to pass the DCO check: git commit --signoff.
  • Run make pre-commit -B to fix codegen or lint problems.
  • Say how how you tested your changes. If you changed the UI, attach screenshots.

@codecov
Copy link

codecov bot commented Jul 8, 2021

Codecov Report

Merging #6302 (42ba3ec) into master (c80f4bc) will decrease coverage by 0.38%.
The diff coverage is 69.56%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #6302      +/-   ##
==========================================
- Coverage   49.18%   48.80%   -0.39%     
==========================================
  Files         252      253       +1     
  Lines       17976    18151     +175     
==========================================
+ Hits         8842     8859      +17     
- Misses       8166     8325     +159     
+ Partials      968      967       -1     
Impacted Files Coverage Δ
workflow/artifacts/gcs/gcs.go 9.14% <69.56%> (ø)
workflow/metrics/server.go 15.78% <0.00%> (-3.51%) ⬇️
workflow/controller/operator.go 72.45% <0.00%> (+0.14%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c80f4bc...42ba3ec. Read the comment docs.

@simster7
Copy link
Member

simster7 commented Jul 8, 2021

Is there an issue associated with this PR?

@tczhao
Copy link
Member Author

tczhao commented Jul 9, 2021

Hi @simster7,

Sorry I didn't create any issue for this.
I've added additional details in PR description.

Let me know if an issue is required

Copy link
Contributor

@alexec alexec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add comments on how you tested this?

@tczhao tczhao marked this pull request as draft July 15, 2021 16:13
Signed-off-by: Tianchu Zhao <evantczhao@gmail.com>
@tczhao tczhao force-pushed the feature/update-gcloud-api-packages branch from e4c0073 to 7ad386a Compare July 17, 2021 13:49
@tczhao tczhao marked this pull request as ready for review July 17, 2021 14:18
@tczhao
Copy link
Member Author

tczhao commented Jul 17, 2021

@alexec sorry, the original approach was incorrect,
I've updated the code
and added how I tested it in the description

@tczhao tczhao changed the title feat(go.mod): update gcloud packages, resolve gcloud storage not retry on transient error fix(executor): fix gcs artifact retry Jul 17, 2021
Signed-off-by: Tianchu Zhao <evantczhao@gmail.com>
@tczhao tczhao force-pushed the feature/update-gcloud-api-packages branch from 2c902eb to 54ec62c Compare July 18, 2021 04:54
@tczhao tczhao requested a review from alexec July 22, 2021 11:57
@alexec alexec changed the title fix(executor): fix gcs artifact retry fix(executor): fix GCS artifact retry Jul 22, 2021
@alexec
Copy link
Contributor

alexec commented Jul 22, 2021

@tczhao this looks good. I won't be doing any testing on this, and we do not have any automated e2e tests for this. Can you please add some notes about how you tested it?

@tczhao
Copy link
Member Author

tczhao commented Jul 22, 2021

@tczhao this looks good. I won't be doing any testing on this, and we do not have any automated e2e tests for this. Can you please add some notes about how you tested it?

Hi @alexec, the tests are included in the PR description, covers read and write scenario.
Let me know if I need to add more

@alexec alexec enabled auto-merge (squash) July 22, 2021 23:39
@alexec alexec merged commit 2df5f66 into argoproj:master Jul 23, 2021
This was referenced Jul 27, 2021
sarabala1979 pushed a commit that referenced this pull request Aug 3, 2021
Signed-off-by: Tianchu Zhao <evantczhao@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants