Experiencing issues with the PNS executor #1256
Update:
@Tomcli - is it possible to construct a smaller, portable workflow which can reproduce this? Also, there is a caveat to PNS that people need to be aware of: collection of artifacts from the base image layer is subject to race conditions when the main container exits too quickly. Basically, the main container needs to be running for a few seconds for the wait sidecar to reliably secure the file handle on its root filesystem. If the main container exits too quickly, the wait sidecar may not have been able to secure the file handle in time to successfully collect artifacts.
Yes, I don't expect privileged mode to help. However, an alternative workaround is to output the artifacts into an emptyDir volume mounted in the main container. In v2.3, when volumes are used, they are mirrored to the wait sidecar, which eliminates the race with artifact collection, because the wait sidecar has access to the volume long after the main container has completed.
Actually, I'm wrong. SYS_PTRACE is indeed needed when the user id of the main container is different from that of the wait sidecar.
I'm also experiencing this race condition. Trying to find a solution, but it does seem timing-related.
Hi @jessesuen, thanks for the reply. Since adding
Just to be clear, privileged is unnecessary, but SYS_PTRACE is. The latter is much more secure than running privileged pods.
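To make the distinction concrete, here is a minimal sketch of a plain Kubernetes pod that is granted only the SYS_PTRACE capability rather than privileged mode. The names are illustrative, and the PNS executor is meant to add this capability to the wait sidecar on its own, so this is just for orientation:

```yaml
# Illustrative pod: grants only SYS_PTRACE, far narrower than privileged: true
apiVersion: v1
kind: Pod
metadata:
  name: ptrace-demo              # hypothetical name
spec:
  shareProcessNamespace: true    # the shared process namespace PNS relies on
  containers:
  - name: main
    image: alpine:3.12
    command: [sh, -c, "sleep 60"]
    securityContext:
      capabilities:
        add: ["SYS_PTRACE"]      # lets this container's processes trace others in the shared namespace
```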
Thanks @jessesuen - we will give it a try. With respect to the k8sapi executor - do you have a viewpoint? Ideally that should be the solution to use with CRI-O?
@animeshsingh there are pros and cons to each executor:
IMO, PNS is the closest thing to the docker executor without the security concerns, and is what I recommend, except for the fact that it is the most immature.
Thanks @jessesuen for this comparison. Would the overhead of going through the k8s APIs outweigh the demerits introduced by the randomness of PNS? Given that workflows are expected to be long-running jobs, as opposed to a serverless model where bypassing the k8s API has its merits vis-à-vis response time, would it matter too much? Also, how important is it to store the artifacts in the base image layer?
My feeling is PNS is the best compromise between security and functionality.
The "randomness" of failing to collect artifacts is usually a non-issue unless containers are completing too quickly. Even then, you can mitigate this by outputting the artifact to an emptyDir, and then it would never be an issue.
Not necessary at all. It's just slightly more convenient not to have to define an emptyDir volume to collect artifacts. Closing bug since PNS has merged.
This is causing a bunch of race conditions in our stuff. Should we open a separate issue for this on PNS, or do you have any recommendations on how to deal with it properly?
@booninite yes. To ensure that the wait sidecar is able to collect outputs, instead of writing outputs into the base image layer (such as /tmp), output artifacts into an emptyDir volume (which gets mirrored into the wait sidecar). This ensures that the wait sidecar can collect the artifact without being subject to timing problems.
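A minimal sketch of that workaround, with hypothetical template and path names (not taken from this thread): declare an emptyDir volume, mount it in the main container, and point the output artifact at a path on that mount.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: emptydir-output-   # hypothetical name
spec:
  entrypoint: produce
  volumes:
  - name: out
    emptyDir: {}                   # mirrored into the wait sidecar (v2.3+)
  templates:
  - name: produce
    container:
      image: alpine:3.12
      command: [sh, -c, "echo hello > /out/result.txt"]
      volumeMounts:
      - name: out
        mountPath: /out            # write outputs here, not to the image layer
    outputs:
      artifacts:
      - name: result
        path: /out/result.txt      # remains readable after the main container exits
```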
@jessesuen we are still experiencing intermittent artifact-passing issues using emptyDir. Does the emptyDir additionally need to be mounted at a path that does not exist in the base image?
We are running ~5k workflows per month that all use PNS. We only see consistent issues with extremely short-duration steps, under 15 seconds.
Tying this to some other folks raising these issues in the Kubeflow community.
I see this same issue trying to pass a single file between my workflows; is the volume mount the solution?
yeah, seeing this with pns as well. Not sure what to do here...
Having the same issue. Running K3OS with CRI-O, so I can't use the docker executor. The other two, kubelet and k8sapi, simply won't work. Kubelet gives me a certificate error, which the helm chart doesn't give an option for ignoring, and k8sapi gives me errors like "function not found"...
@sarabala1979 is the workaround for this
I was finally able to get it running using the k8sapi executor.
Sadly this breaks the functionality of the built-in git solution, because apparently it cannot write into a volume. I had to write my own git clone script. Also, this kind of makes the artifact passing redundant, as I could just use this volume in every stage.
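For anyone else making the same switch: in v2.x the executor is selected in the workflow-controller configmap. A minimal sketch, assuming the default configmap name and namespace from the install manifests:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap   # default name in the install manifests
  namespace: argo
data:
  config: |
    # one of: docker, kubelet, k8sapi, pns
    containerRuntimeExecutor: k8sapi
```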
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Test it and see?
I'm experiencing a similar problem on some of my containers, using af03a74 with PNS. Other containers doing almost identical work succeed, and if I keep retrying the workflow everything succeeds eventually. Seems particular to PNS. Here's an example wait container log:
I think
We do not see
OK. Diagnosis: there is a timeout when trying to determine whether the pod has finished. We allow three attempts at 1-second intervals. The main container has completed (which we determine using the shared process namespace), but we ask the Kubernetes API for the actual result, and the API has not been updated yet. This could be mitigated by increasing the amount of time we allow the executor to poll for on line 375 of
@alexec Sure, I can play with the timing and see if I come up with a good PR-worthy solution. Thanks for the detailed analysis.
Thank you!
Hi guys, I was experiencing the same issue a lot recently. Following the comment from @alexec above, I tried installing a previous argo version, and everything works well as usual. The downgraded version I've installed is using
It appears to me today that in some cases you must grant privileged for PNS to work with output artifacts.
Maybe fixed in #4954. |
v3.0 will have a controller envvar name |
Hi @jessesuen, we are experimenting with the Argo PNS executor from PR #1214 and running it as the Kubeflow Pipelines backend. The workflow runs smoothly for most of the containers, except that we are experiencing a race condition with the last container in every workflow. Below are the workflow definition we have and the corresponding error logs from argoexec.
Failed wait container logs:
Workflow YAML file:
Related issues: #970
cc: @animeshsingh
Is this a BUG REPORT or FEATURE REQUEST?:
BUG REPORT
What happened:
The file handle was not secured before the main container exited.
What you expected to happen:
The file handle should be secured before the main container exits.
How to reproduce it (as minimally and precisely as possible):
Run the workflow definition above with the PNS executor.
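Since the original definition is not reproduced here, a hypothetical minimal workflow with the shape that triggers the race (a fast-exiting main container writing its output artifact into the base image layer) might look like this sketch:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: pns-race-repro-   # hypothetical name
spec:
  entrypoint: quick-step
  templates:
  - name: quick-step
    container:
      image: alpine:3.12
      # exits almost immediately, before the wait sidecar can secure a
      # handle on the main container's root filesystem
      command: [sh, -c, "echo hi > /tmp/out.txt"]
    outputs:
      artifacts:
      - name: out
        path: /tmp/out.txt        # base image layer, not a volume
```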
Anything else we need to know?:
Environment: