
Wait container stuck in a loop while main container finished successfully #6115

Closed

avifried1 opened this issue Jun 9, 2021 · 17 comments · Fixed by #6083

@avifried1

Summary

The wait container is occasionally stuck and the workflow keeps running until it times out, even though the main container has finished successfully.

wait container log:

time="2021-06-09T18:01:47.217Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=workflows --filter=label=io.kubernetes.pod.name=workflow-analytics-pipeline-cron-1623258000-2189159064"
time="2021-06-09T18:01:47.256Z" level=info msg="ignoring container \"main\" created at 2021-06-09 17:00:04 +0000 UTC, too long before process started"
time="2021-06-09T18:01:47.257Z" level=info msg="ignoring container \"wait\" created at 2021-06-09 17:00:03 +0000 UTC, too long before process started"
time="2021-06-09T18:01:48.257Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=workflows --filter=label=io.kubernetes.pod.name=workflow-analytics-pipeline-cron-1623258000-2189159064"
time="2021-06-09T18:01:48.307Z" level=info msg="ignoring container \"main\" created at 2021-06-09 17:00:04 +0000 UTC, too long before process started"
time="2021-06-09T18:01:48.307Z" level=info msg="ignoring container \"wait\" created at 2021-06-09 17:00:03 +0000 UTC, too long before process started"

The pod is in Completed status, the main container is terminated - Completed (exit code: 0), and the wait container is still running and ready. The workflow (a multi-step DAG workflow) hangs in Running, eventually terminating after exceeding its active deadline.
The main container is a very simple bash step that records a timestamp (sh -c 'date +%s > /tmp/timestamp.txt') and takes a few seconds at most to complete.
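For context, the failing step is roughly equivalent to a template like the following; a minimal sketch, assuming a plain container step (the actual workflow spec is not attached here, so the image and output wiring are illustrative):

templates:
  - name: generate-timestamp
    container:
      image: alpine:3.13                  # illustrative image; any shell-capable image works
      command: [sh, -c]
      args: ["date +%s > /tmp/timestamp.txt"]
    outputs:                              # assumption: the timestamp is consumed by a later DAG step
      parameters:
        - name: timestamp
          valueFrom:
            path: /tmp/timestamp.txt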

Diagnostics

Running self-managed Kubernetes on AWS.
Argo version 3.1.0-rc12
Docker executor

We've had this issue from time to time; we can't pinpoint the version in which it first appeared.
Examining the controller logs that mention the workflow name:

time="2021-06-09T17:00:00.066Z" level=info msg="Processing workflow" namespace=workflows workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T17:00:00.105Z" level=info msg="Updated phase -> Running" namespace=workflows workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T17:00:00.108Z" level=info msg="Retry nodeworkflow-analytics-pipeline-cron-1623258000 initialized Running" namespace=workflows workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T17:00:00.109Z" level=info msg="All of nodeworkflow-analytics-pipeline-cron-1623258000(0).generate-timestamp dependencies [] completed" namespace=workflows workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T17:00:00.109Z" level=info msg="DAG node workflow-analytics-pipeline-cron-1623258000-1105574991 initialized Running" namespace=workflows workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T17:00:00.110Z" level=info msg="Pod node workflow-analytics-pipeline-cron-1623258000-2189159064 initialized Pending" namespace=workflows workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T17:00:00.110Z" level=info msg="Retry node workflow-analytics-pipeline-cron-1623258000-2617208257 initialized Running" namespace=workflows workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T17:00:00.134Z" level=info msg="Created pod: workflow-analytics-pipeline-cron-1623258000(0).generate-timestamp(0) (workflow-analytics-pipeline-cron-1623258000-2189159064)" namespace=workflows workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T17:00:00.157Z" level=info msg="Workflow update successful" namespace=workflows phase=Running resourceVersion=165964949 workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T17:00:10.068Z" level=info msg="Updating node workflow-analytics-pipeline-cron-1623258000-2189159064 message: ContainerCreating" namespace=workflows workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T17:00:10.084Z" level=info msg="Workflow update successful" namespace=workflows phase=Running resourceVersion=165965126 workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T17:00:20.086Z" level=info msg="Processing workflow" namespace=workflows workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T17:00:32.164Z" level=info msg="Processing workflow" namespace=workflows workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T17:00:32.169Z" level=info msg="Updating node workflow-analytics-pipeline-cron-1623258000-2189159064 status Pending -> Running" namespace=workflows workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T17:00:32.416Z" level=info msg="Workflow update successful" namespace=workflows phase=Running resourceVersion=165965421 workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T17:00:42.418Z" level=info msg="Processing workflow" namespace=workflows workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T17:20:42.420Z" level=info msg="Processing workflow" namespace=workflows workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T17:40:42.421Z" level=info msg="Processing workflow" namespace=workflows workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T18:00:00.027Z" level=info msg="stopping 'workflow-analytics-pipeline-cron-1623258000'" namespace=workflows workflow=workflow-analytics-pipeline-cron
time="2021-06-09T18:00:10.049Z" level=info msg="Processing workflow" namespace=workflows workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T18:00:10.055Z" level=info msg="Applying shutdown deadline for pod workflow-analytics-pipeline-cron-1623258000-2189159064" namespace=workflows workflow=workflow-analytics-pipeline-cron-1623258000

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

@alexec
Contributor

alexec commented Jun 9, 2021

Excellent. Thank you. Can I please ask you to test the argoexec:dev-docker image? I think this should fix your issue.

@avifried1
Author

What do you suggest I test for? We weren't able to reproduce this consistently; it just happens sporadically.

@alexec
Contributor

alexec commented Jun 9, 2021

Can you soak it and see?

@avifried1
Author

avifried1 commented Jun 15, 2021

This issue is still happening in v3.1.0-rc14.
The main container status is terminated - Completed (exit code: 0), the wait container is still running and ready, and the pod status is Completed, but the workflow is still running.

The wait container is stuck in a loop:

time="2021-06-15T08:08:21.528Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=workflows --filter=label=io.kubernetes.pod.name=workflow-analytics-pipeline-cron-1623740400-3832141026"
time="2021-06-15T08:08:21.570Z" level=info msg="ignoring container \"main\" created at 2021-06-15 07:00:08 +0000 UTC, too long before process started"
time="2021-06-15T08:08:21.570Z" level=info msg="ignoring container \"wait\" created at 2021-06-15 07:00:04 +0000 UTC, too long before process started"

Running the following command manually inside the pod:

docker ps --all --no-trunc --format='{{.Status}}|{{.Label "io.kubernetes.container.name"}}|{{.ID}}|{{.CreatedAt}}' --filter=label=io.kubernetes.pod.namespace=workflows --filter=label=io.kubernetes.pod.name=workflow-analytics-pipeline-cron-1623740400-3832141026

Exited (0) About an hour ago|main|fd3f4355ca5538e0dbfa449c5f1c6a2d4b9dc913eb76527bdefc25cf7e0cafc2|2021-06-15 07:00:08 +0000 UTC
Up About an hour|wait|e7adf5bf1b185788a8d1418d76a05f841bc720a5ba0ffdd2a4001c743941e15a|2021-06-15 07:00:04 +0000 UTC
Up About an hour|POD|d64fc515554fa0cbbc10e31a015f18c6d5983506052366c930f56783cae300c7|2021-06-15 07:00:00 +0000 UTC

Would note that this workflow uses the retry strategy:

retryStrategy:
  limit: 1
  retryPolicy: OnError

@alexec alexec reopened this Jun 15, 2021
@alexec
Contributor

alexec commented Jun 15, 2021

This was not fixed.

@alexec alexec added this to the v3.0 milestone Jun 15, 2021
@alexec alexec self-assigned this Jun 15, 2021
@alexec
Contributor

alexec commented Jun 15, 2021

@avifried1 can you please attach the full logs? I don't have quite enough logs to diagnose.

@zhongwenqin

I have the same problem

@zhongwenqin

[root@host1 ~]# kubectl -n argo logs hello-world-xr5lw wait
time="2021-07-03T01:14:10.519Z" level=info msg="Starting Workflow Executor" version="{v3.0.3 2021-05-11T21:14:20Z 0207105 v3.0.3 clean go1.15.7 gc linux/amd64}"
time="2021-07-03T01:14:10.528Z" level=info msg="Creating a docker executor"
time="2021-07-03T01:14:10.528Z" level=info msg="Executor (version: v3.0.3, build_date: 2021-05-11T21:14:20Z) initialized (pod: argo/hello-world-xr5lw) with template:\n{"name":"whalesay","inputs":{},"outputs":{},"metadata":{},"container":{"name":"","image":"docker/whalesay:latest","command":["cowsay"],"args":["hello world"],"resources":{}},"archiveLocation":{"archiveLogs":true,"s3":{"endpoint":"minio:9000","bucket":"my-bucket","insecure":true,"accessKeySecret":{"name":"my-minio-cred","key":"accesskey"},"secretKeySecret":{"name":"my-minio-cred","key":"secretkey"},"key":"hello-world-xr5lw/hello-world-xr5lw"}}}"
time="2021-07-03T01:14:10.528Z" level=info msg="Starting annotations monitor"
time="2021-07-03T01:14:10.529Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label "io.kubernetes.container.name"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=argo --filter=label=io.kubernetes.pod.name=hello-world-xr5lw"
time="2021-07-03T01:14:10.529Z" level=info msg="Starting deadline monitor"
time="2021-07-03T01:14:11.594Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label "io.kubernetes.container.name"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=argo --filter=label=io.kubernetes.pod.name=hello-world-xr5lw"
time="2021-07-03T01:14:12.661Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label "io.kubernetes.container.name"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=argo --filter=label=io.kubernetes.pod.name=hello-world-xr5lw"
time="2021-07-03T01:14:13.724Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label "io.kubernetes.container.name"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=argo --filter=label=io.kubernetes.pod.name=hello-world-xr5lw"
time="2021-07-03T01:14:14.817Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label "io.kubernetes.container.name"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=argo --filter=label=io.kubernetes.pod.name=hello-world-xr5lw"
time="2021-07-03T01:14:15.877Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label "io.kubernetes.container.name"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=argo --filter=label=io.kubernetes.pod.name=hello-world-xr5lw"

@zhongwenqin

[root@host1 ~]# kubectl -n argo logs hello-world-xr5lw main


 _____________
< hello world >
 -------------
    \
     \
      \
                    ##        .
              ## ## ##       ==
           ## ## ## ##      ===
       /""""""""""""""""___/ ===
  ~~~ {~~ ~~~~ ~~~ ~~~~ ~~ ~ /  ===- ~~~
       \______ o          __/
        \    \        __/
          \____\______/

@zhongwenqin

[root@host1 ~]# kubectl -n argo get pod
NAME                                  READY   STATUS     RESTARTS   AGE
argo-server-594b5c8cb6-bjtbl          1/1     Running    0          19h
hello-world-xr5lw                     1/2     NotReady   0          13m
minio-77d6796f8d-p9bjj                1/1     Running    0          9d
postgres-546d9d68b-n7ptb              1/1     Running    0          9d
workflow-controller-df6fb694c-p5fvc   1/1     Running    2          9d
[root@host1 ~]# kubectl -n argo describe pod hello-world-xr5lw
Name:           hello-world-xr5lw
Namespace:      argo
Priority:       0
Node:           11.0.2.157/11.0.2.157
Start Time:     Sat, 03 Jul 2021 09:14:09 +0800
Labels:         workflows.argoproj.io/completed=false
                workflows.argoproj.io/workflow=hello-world-xr5lw
Annotations:    workflows.argoproj.io/node-name: hello-world-xr5lw
                workflows.argoproj.io/template:
                  {"name":"whalesay","inputs":{},"outputs":{},"metadata":{},"container":{"name":"","image":"docker/whalesay:latest","command":["cowsay"],"ar...
Status:         Running
IP:             10.244.1.207
IPs:
  IP:           10.244.1.207
Controlled By:  Workflow/hello-world-xr5lw
Containers:
  wait:
    Container ID:   containerd://707783fb47ba124b65e408a9e263c727f9518ed6d90cf9863ad31ae943d166c8
    Image:          argoproj/argoexec:v3.0.3
    Image ID:       docker.io/argoproj/argoexec@sha256:cf79ea62b8c9bf23bc5f5fd4dd1027d89c69353a83f62dd9d72d3e06f2e4c7a8
    Port:
    Host Port:
    Command:
      argoexec
      wait
      --loglevel
      info
    State:          Running
      Started:      Sat, 03 Jul 2021 09:14:10 +0800
    Ready:          True
    Restart Count:  0
    Environment:
      ARGO_POD_NAME:        hello-world-xr5lw (v1:metadata.name)
      GODEBUG:              x509ignoreCN=0
      ARGO_CONTAINER_NAME:  wait
    Mounts:
      /argo/podmetadata from podmetadata (rw)
      /argo/secret/my-minio-cred from my-minio-cred (ro)
      /var/run/docker.sock from docker-sock (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jfqfk (ro)
  main:
    Container ID:   containerd://73d5862ea4fb2e605798c2d1cde2ec4af74876b720bfd09cc16744bfafc8ac99
    Image:          docker/whalesay:latest
    Image ID:       sha256:c717279bbba020bf95ac72cf47b2c8abb3a383ad4b6996c1a7a9f2a7aaa480ad
    Port:
    Host Port:
    Command:
      cowsay
    Args:
      hello world
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sat, 03 Jul 2021 09:14:15 +0800
      Finished:     Sat, 03 Jul 2021 09:14:15 +0800
    Ready:          False
    Restart Count:  0
    Environment:
      ARGO_CONTAINER_NAME:  main
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jfqfk (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  podmetadata:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.annotations -> annotations
  docker-sock:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/docker.sock
    HostPathType:  Socket
  my-minio-cred:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  my-minio-cred
    Optional:    false
  kube-api-access-jfqfk:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:
    DownwardAPI:             true
QoS Class:       BestEffort
Node-Selectors:
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  Normal  Scheduled  14m   default-scheduler  Successfully assigned argo/hello-world-xr5lw to 11.0.2.157
  Normal  Pulled     14m   kubelet            Container image "argoproj/argoexec:v3.0.3" already present on machine
  Normal  Created    14m   kubelet            Created container wait
  Normal  Started    14m   kubelet            Started container wait
  Normal  Pulling    14m   kubelet            Pulling image "docker/whalesay:latest"
  Normal  Pulled     14m   kubelet            Successfully pulled image "docker/whalesay:latest" in 5.1757762s
  Normal  Created    14m   kubelet            Created container main
  Normal  Started    14m   kubelet            Started container main

@avifried1
Author

The wait container logs are basically an endless loop of:

time="2021-07-01T23:00:04.335Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=workflows --filter=label=io.kubernetes.pod.name=workflow-analytics-pipeline-cron-1625173200-1458802104"
time="2021-07-01T23:00:04.379Z" level=info msg="ignoring container \"main\" created at 2021-07-01 21:01:02 +0000 UTC, too long before process started"
time="2021-07-01T23:00:04.379Z" level=info msg="ignoring container \"wait\" created at 2021-07-01 21:00:56 +0000 UTC, too long before process started"
time="2021-07-01T23:00:04.379Z" level=info msg="ignoring container \"init\" created at 2021-07-01 21:00:37 +0000 UTC, too long before process started"

@raviatmaryh

Had the same issue as the others: the main container finishes successfully, but the wait container keeps looping with docker ps ...

I did not see this issue on a local minikube cluster, but ran into it on a DigitalOcean cluster.

For now, I've switched the executor to k8sapi and I'm not seeing any issues.
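For anyone who wants to try the same workaround, the executor is selected in the workflow-controller ConfigMap. A minimal sketch, assuming a default install in the argo namespace:

apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo                    # adjust to wherever the controller runs
data:
  containerRuntimeExecutor: k8sapi   # docker was the default when this issue was filed

Restarting the workflow-controller pod after applying the change is the simplest way to make sure it is picked up.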

@zhongwenqin

@raviatmaryh thank you, the issue has been solved.

@alexec alexec closed this as completed Jul 14, 2021
@avifried1
Author

Not sure why this was closed; the issue hasn't been solved - it has been circumvented by using the k8sapi executor. Using the default Docker executor still has this issue.

@alexec
Contributor

alexec commented Dec 8, 2021

The Docker executor is being sunset. You should switch to using either PNS or Emissary:

https://blog.argoproj.io/sunsetting-docker-executor-support-in-argo-workflows-77d352d601c?source=collection_home---------3-------------------------------
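For reference, this uses the same ConfigMap key as the k8sapi workaround sketched earlier in the thread (again a sketch; Emissary later became the default executor):

data:
  containerRuntimeExecutor: emissary   # or: pns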

@alexec alexec removed their assignment Jan 24, 2022
@alexec alexec removed this from the 2021 Q3 milestone Jan 24, 2022
@stale

stale bot commented Feb 7, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the problem/stale This has not had a response in some time label Feb 7, 2022
@stale stale bot removed the problem/stale This has not had a response in some time label Feb 7, 2022
@alexec
Contributor

alexec commented Feb 21, 2022

I think this has been fixed.

@alexec alexec closed this as completed Feb 21, 2022