
Wait container stuck in a loop while main container finished successfully #6115

Closed

avifried1 opened this issue Jun 9, 2021 · 17 comments · Fixed by #6083

@avifried1

Summary

The wait container is occasionally stuck and the workflow keeps running until it times out, even though the main container has finished successfully.

wait container log:

time="2021-06-09T18:01:47.217Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=workflows --filter=label=io.kubernetes.pod.name=workflow-analytics-pipeline-cron-1623258000-2189159064"
time="2021-06-09T18:01:47.256Z" level=info msg="ignoring container \"main\" created at 2021-06-09 17:00:04 +0000 UTC, too long before process started"
time="2021-06-09T18:01:47.257Z" level=info msg="ignoring container \"wait\" created at 2021-06-09 17:00:03 +0000 UTC, too long before process started"
time="2021-06-09T18:01:48.257Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=workflows --filter=label=io.kubernetes.pod.name=workflow-analytics-pipeline-cron-1623258000-2189159064"
time="2021-06-09T18:01:48.307Z" level=info msg="ignoring container \"main\" created at 2021-06-09 17:00:04 +0000 UTC, too long before process started"
time="2021-06-09T18:01:48.307Z" level=info msg="ignoring container \"wait\" created at 2021-06-09 17:00:03 +0000 UTC, too long before process started"

The pod is in Completed status, the main container is terminated - Completed (exit code: 0), and the wait container is still running and ready. The workflow (a multi-step DAG workflow) hangs in Running, eventually terminating after exceeding its active deadline.
The main container is a very simple bash step that records a timestamp (sh -c 'date +%s > /tmp/timestamp.txt') and takes a few seconds at most to complete.
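For context, the failing step is roughly equivalent to a template like the following; a minimal sketch, assuming a plain container step (the actual workflow spec is not attached here, so the image and output wiring are illustrative):

templates:
  - name: generate-timestamp
    container:
      image: alpine:3.13                  # illustrative image; any shell-capable image works
      command: [sh, -c]
      args: ["date +%s > /tmp/timestamp.txt"]
    outputs:                              # assumption: the timestamp is consumed by a later DAG step
      parameters:
        - name: timestamp
          valueFrom:
            path: /tmp/timestamp.txt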

Diagnostics

Running self-managed Kubernetes on AWS.
Argo version 3.1.0-rc12
Docker executor

We've had this issue from time to time; we can't pinpoint the version in which it first appeared.
Examining the controller logs that mention the workflow name:

time="2021-06-09T17:00:00.066Z" level=info msg="Processing workflow" namespace=workflows workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T17:00:00.105Z" level=info msg="Updated phase -> Running" namespace=workflows workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T17:00:00.108Z" level=info msg="Retry nodeworkflow-analytics-pipeline-cron-1623258000 initialized Running" namespace=workflows workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T17:00:00.109Z" level=info msg="All of nodeworkflow-analytics-pipeline-cron-1623258000(0).generate-timestamp dependencies [] completed" namespace=workflows workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T17:00:00.109Z" level=info msg="DAG node workflow-analytics-pipeline-cron-1623258000-1105574991 initialized Running" namespace=workflows workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T17:00:00.110Z" level=info msg="Pod node workflow-analytics-pipeline-cron-1623258000-2189159064 initialized Pending" namespace=workflows workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T17:00:00.110Z" level=info msg="Retry node workflow-analytics-pipeline-cron-1623258000-2617208257 initialized Running" namespace=workflows workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T17:00:00.134Z" level=info msg="Created pod: workflow-analytics-pipeline-cron-1623258000(0).generate-timestamp(0) (workflow-analytics-pipeline-cron-1623258000-2189159064)" namespace=workflows workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T17:00:00.157Z" level=info msg="Workflow update successful" namespace=workflows phase=Running resourceVersion=165964949 workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T17:00:10.068Z" level=info msg="Updating node workflow-analytics-pipeline-cron-1623258000-2189159064 message: ContainerCreating" namespace=workflows workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T17:00:10.084Z" level=info msg="Workflow update successful" namespace=workflows phase=Running resourceVersion=165965126 workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T17:00:20.086Z" level=info msg="Processing workflow" namespace=workflows workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T17:00:32.164Z" level=info msg="Processing workflow" namespace=workflows workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T17:00:32.169Z" level=info msg="Updating node workflow-analytics-pipeline-cron-1623258000-2189159064 status Pending -> Running" namespace=workflows workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T17:00:32.416Z" level=info msg="Workflow update successful" namespace=workflows phase=Running resourceVersion=165965421 workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T17:00:42.418Z" level=info msg="Processing workflow" namespace=workflows workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T17:20:42.420Z" level=info msg="Processing workflow" namespace=workflows workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T17:40:42.421Z" level=info msg="Processing workflow" namespace=workflows workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T18:00:00.027Z" level=info msg="stopping 'workflow-analytics-pipeline-cron-1623258000'" namespace=workflows workflow=workflow-analytics-pipeline-cron
time="2021-06-09T18:00:10.049Z" level=info msg="Processing workflow" namespace=workflows workflow=workflow-analytics-pipeline-cron-1623258000
time="2021-06-09T18:00:10.055Z" level=info msg="Applying shutdown deadline for pod workflow-analytics-pipeline-cron-1623258000-2189159064" namespace=workflows workflow=workflow-analytics-pipeline-cron-1623258000

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

@alexec
Contributor

alexec commented Jun 9, 2021

Excellent. Thank you. Can I please ask you to test the argoexec:dev-docker image? I think this should fix your issue.

@avifried1
Author

What do you suggest I test for? We weren't able to reproduce this consistently; it just happens sporadically.

@alexec
Contributor

alexec commented Jun 9, 2021

Can you soak it and see?

@avifried1
Author

avifried1 commented Jun 15, 2021

This issue is still happening in v3.1.0-rc14.
The main container status is terminated - Completed (exit code: 0), the wait container is still running and ready, and the pod status is Completed, but the workflow is still running.

The wait container is stuck in a loop:

time="2021-06-15T08:08:21.528Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=workflows --filter=label=io.kubernetes.pod.name=workflow-analytics-pipeline-cron-1623740400-3832141026"
time="2021-06-15T08:08:21.570Z" level=info msg="ignoring container \"main\" created at 2021-06-15 07:00:08 +0000 UTC, too long before process started"
time="2021-06-15T08:08:21.570Z" level=info msg="ignoring container \"wait\" created at 2021-06-15 07:00:04 +0000 UTC, too long before process started"

Running the following command manually inside the pod:

docker ps --all --no-trunc --format='{{.Status}}|{{.Label "io.kubernetes.container.name"}}|{{.ID}}|{{.CreatedAt}}' --filter=label=io.kubernetes.pod.namespace=workflows --filter=label=io.kubernetes.pod.name=workflow-analytics-pipeline-cron-1623740400-3832141026

Exited (0) About an hour ago|main|fd3f4355ca5538e0dbfa449c5f1c6a2d4b9dc913eb76527bdefc25cf7e0cafc2|2021-06-15 07:00:08 +0000 UTC
Up About an hour|wait|e7adf5bf1b185788a8d1418d76a05f841bc720a5ba0ffdd2a4001c743941e15a|2021-06-15 07:00:04 +0000 UTC
Up About an hour|POD|d64fc515554fa0cbbc10e31a015f18c6d5983506052366c930f56783cae300c7|2021-06-15 07:00:00 +0000 UTC

Would note that this workflow uses the retry strategy:

retryStrategy:
  limit: 1
  retryPolicy: OnError

@alexec alexec reopened this Jun 15, 2021
@alexec
Contributor

alexec commented Jun 15, 2021

This was not fixed.

@alexec alexec added this to the v3.0 milestone Jun 15, 2021
@alexec alexec self-assigned this Jun 15, 2021
@alexec
Contributor

alexec commented Jun 15, 2021

@avifried1 can you please attach the full logs? I don't have quite enough logs to diagnose.

@zhongwenqin

I have the same problem

@zhongwenqin

[root@host1 ~]# kubectl -n argo logs hello-world-xr5lw wait
time="2021-07-03T01:14:10.519Z" level=info msg="Starting Workflow Executor" version="{v3.0.3 2021-05-11T21:14:20Z 0207105 v3.0.3 clean go1.15.7 gc linux/amd64}"
time="2021-07-03T01:14:10.528Z" level=info msg="Creating a docker executor"
time="2021-07-03T01:14:10.528Z" level=info msg="Executor (version: v3.0.3, build_date: 2021-05-11T21:14:20Z) initialized (pod: argo/hello-world-xr5lw) with template:\n{"name":"whalesay","inputs":{},"outputs":{},"metadata":{},"container":{"name":"","image":"docker/whalesay:latest","command":["cowsay"],"args":["hello world"],"resources":{}},"archiveLocation":{"archiveLogs":true,"s3":{"endpoint":"minio:9000","bucket":"my-bucket","insecure":true,"accessKeySecret":{"name":"my-minio-cred","key":"accesskey"},"secretKeySecret":{"name":"my-minio-cred","key":"secretkey"},"key":"hello-world-xr5lw/hello-world-xr5lw"}}}"
time="2021-07-03T01:14:10.528Z" level=info msg="Starting annotations monitor"
time="2021-07-03T01:14:10.529Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label "io.kubernetes.container.name"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=argo --filter=label=io.kubernetes.pod.name=hello-world-xr5lw"
time="2021-07-03T01:14:10.529Z" level=info msg="Starting deadline monitor"
time="2021-07-03T01:14:11.594Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label "io.kubernetes.container.name"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=argo --filter=label=io.kubernetes.pod.name=hello-world-xr5lw"
time="2021-07-03T01:14:12.661Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label "io.kubernetes.container.name"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=argo --filter=label=io.kubernetes.pod.name=hello-world-xr5lw"
time="2021-07-03T01:14:13.724Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label "io.kubernetes.container.name"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=argo --filter=label=io.kubernetes.pod.name=hello-world-xr5lw"
time="2021-07-03T01:14:14.817Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label "io.kubernetes.container.name"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=argo --filter=label=io.kubernetes.pod.name=hello-world-xr5lw"
time="2021-07-03T01:14:15.877Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label "io.kubernetes.container.name"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=argo --filter=label=io.kubernetes.pod.name=hello-world-xr5lw"

@zhongwenqin

[root@host1 ~]# kubectl -n argo logs hello-world-xr5lw main


 _____________
< hello world >
 -------------
    \
     \
      \
                    ##        .
              ## ## ##       ==
           ## ## ## ##      ===
       /""""""""""""""""___/ ===
  ~~~ {~~ ~~~~ ~~~ ~~~~ ~~ ~ /  ===- ~~~
       \______ o          __/
        \    \        __/
          \____\______/

@zhongwenqin

[root@host1 ~]# kubectl -n argo get pod
NAME                                  READY   STATUS     RESTARTS   AGE
argo-server-594b5c8cb6-bjtbl          1/1     Running    0          19h
hello-world-xr5lw                     1/2     NotReady   0          13m
minio-77d6796f8d-p9bjj                1/1     Running    0          9d
postgres-546d9d68b-n7ptb              1/1     Running    0          9d
workflow-controller-df6fb694c-p5fvc   1/1     Running    2          9d
[root@host1 ~]# kubectl -n argo describe pod hello-world-xr5lw
Name:           hello-world-xr5lw
Namespace:      argo
Priority:       0
Node:           11.0.2.157/11.0.2.157
Start Time:     Sat, 03 Jul 2021 09:14:09 +0800
Labels:         workflows.argoproj.io/completed=false
                workflows.argoproj.io/workflow=hello-world-xr5lw
Annotations:    workflows.argoproj.io/node-name: hello-world-xr5lw
                workflows.argoproj.io/template:
                  {"name":"whalesay","inputs":{},"outputs":{},"metadata":{},"container":{"name":"","image":"docker/whalesay:latest","command":["cowsay"],"ar...
Status:         Running
IP:             10.244.1.207
IPs:
  IP:           10.244.1.207
Controlled By:  Workflow/hello-world-xr5lw
Containers:
  wait:
    Container ID:   containerd://707783fb47ba124b65e408a9e263c727f9518ed6d90cf9863ad31ae943d166c8
    Image:          argoproj/argoexec:v3.0.3
    Image ID:       docker.io/argoproj/argoexec@sha256:cf79ea62b8c9bf23bc5f5fd4dd1027d89c69353a83f62dd9d72d3e06f2e4c7a8
    Port:
    Host Port:
    Command:
      argoexec
      wait
      --loglevel
      info
    State:          Running
      Started:      Sat, 03 Jul 2021 09:14:10 +0800
    Ready:          True
    Restart Count:  0
    Environment:
      ARGO_POD_NAME:        hello-world-xr5lw (v1:metadata.name)
      GODEBUG:              x509ignoreCN=0
      ARGO_CONTAINER_NAME:  wait
    Mounts:
      /argo/podmetadata from podmetadata (rw)
      /argo/secret/my-minio-cred from my-minio-cred (ro)
      /var/run/docker.sock from docker-sock (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jfqfk (ro)
  main:
    Container ID:   containerd://73d5862ea4fb2e605798c2d1cde2ec4af74876b720bfd09cc16744bfafc8ac99
    Image:          docker/whalesay:latest
    Image ID:       sha256:c717279bbba020bf95ac72cf47b2c8abb3a383ad4b6996c1a7a9f2a7aaa480ad
    Port:
    Host Port:
    Command:
      cowsay
    Args:
      hello world
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sat, 03 Jul 2021 09:14:15 +0800
      Finished:     Sat, 03 Jul 2021 09:14:15 +0800
    Ready:          False
    Restart Count:  0
    Environment:
      ARGO_CONTAINER_NAME:  main
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jfqfk (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  podmetadata:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.annotations -> annotations
  docker-sock:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/docker.sock
    HostPathType:  Socket
  my-minio-cred:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  my-minio-cred
    Optional:    false
  kube-api-access-jfqfk:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:
    DownwardAPI:             true
QoS Class:       BestEffort
Node-Selectors:
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  Normal  Scheduled  14m   default-scheduler  Successfully assigned argo/hello-world-xr5lw to 11.0.2.157
  Normal  Pulled     14m   kubelet            Container image "argoproj/argoexec:v3.0.3" already present on machine
  Normal  Created    14m   kubelet            Created container wait
  Normal  Started    14m   kubelet            Started container wait
  Normal  Pulling    14m   kubelet            Pulling image "docker/whalesay:latest"
  Normal  Pulled     14m   kubelet            Successfully pulled image "docker/whalesay:latest" in 5.1757762s
  Normal  Created    14m   kubelet            Created container main
  Normal  Started    14m   kubelet            Started container main

@avifried1
Author

The wait container logs are basically an endless loop of:

time="2021-07-01T23:00:04.335Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=workflows --filter=label=io.kubernetes.pod.name=workflow-analytics-pipeline-cron-1625173200-1458802104"
time="2021-07-01T23:00:04.379Z" level=info msg="ignoring container \"main\" created at 2021-07-01 21:01:02 +0000 UTC, too long before process started"
time="2021-07-01T23:00:04.379Z" level=info msg="ignoring container \"wait\" created at 2021-07-01 21:00:56 +0000 UTC, too long before process started"
time="2021-07-01T23:00:04.379Z" level=info msg="ignoring container \"init\" created at 2021-07-01 21:00:37 +0000 UTC, too long before process started"

@raviatmaryh

Had the same issue as the others: the main container finishes successfully, but the wait container keeps looping with docker ps ...

I did not see this issue on a local minikube cluster, but ran into it on a DigitalOcean cluster.

For now, I've switched the executor to k8sapi and I'm not seeing any issues.
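For anyone who wants to try the same workaround, the executor is selected in the workflow-controller ConfigMap. A minimal sketch, assuming a default install in the argo namespace:

apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo                    # adjust to wherever the controller runs
data:
  containerRuntimeExecutor: k8sapi   # docker was the default when this issue was filed

Restarting the workflow-controller pod after applying the change is the simplest way to make sure it is picked up.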

@zhongwenqin

@raviatmaryh thank you, the issue has been solved.

@alexec alexec closed this as completed Jul 14, 2021
@avifried1
Author

Not sure why this was closed; the issue hasn't been solved - it has been circumvented by using the k8sapi executor. Using the default Docker executor still has this issue.

@alexec
Contributor

alexec commented Dec 8, 2021

The Docker executor is being sunset. You should switch to using either PNS or Emissary:

https://blog.argoproj.io/sunsetting-docker-executor-support-in-argo-workflows-77d352d601c?source=collection_home---------3-------------------------------
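For reference, this uses the same ConfigMap key as the k8sapi workaround sketched earlier in the thread (again a sketch; Emissary later became the default executor):

data:
  containerRuntimeExecutor: emissary   # or: pns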

@alexec alexec removed their assignment Jan 24, 2022
@alexec alexec removed this from the 2021 Q3 milestone Jan 24, 2022
@stale

stale bot commented Feb 7, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the problem/stale This has not had a response in some time label Feb 7, 2022
@stale stale bot removed the problem/stale This has not had a response in some time label Feb 7, 2022
@alexec
Contributor

alexec commented Feb 21, 2022

I think this has been fixed.

@alexec alexec closed this as completed Feb 21, 2022