ECS Agent was not able to stop the container #2808
Comments
@winningjai Can you please share the following details:
If this is reproducible, please enable debug-level logs for Docker and the Agent during the repro, and use the ECS logs collector.
Hi @shubham2892,
1. This is the first time we are facing this error. Since we are not sure about the cause, we have not been able to reproduce it, and it has occurred only once so far.
2. The affected instance is currently serving our production traffic. Is there any way to collect the logs without draining it? The ECS agent is already configured for debug-level logging, and the log I attached above is in debug mode.
Please take a look at the instance CPU graph below; it doesn't look overloaded. Also, we are running a few other Docker containers inside the instance that are not managed by ECS, and they were working fine at the same time. Let me know your comments on this.
We have the same issue. We ran some validation, and apparently it is a Docker issue. We could mitigate it by applying this patch (https://github.com/moby/moby/pull/41293/files) to the source RPM from Amazon Linux, rebuilding Docker, and installing the package. Since this is already fixed on the Docker mainline, I suggest AWS take a further look, as it is affecting more customers.
@barroca Thanks for pointing out the issue, I will take a further look into this.
If you want to reproduce the issue, you will need a cluster with services running that have a desired count. On the instance where they are running, run something that uses lots of I/O:
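(The exact command was not included in the comment; purely as an illustration, the hypothetical Go program below generates sustained disk writes by fsyncing every chunk, which should produce comparable I/O pressure.)

```go
// iopressure.go: a hypothetical stand-in for "something that uses lots of I/O".
// It writes ~4 GiB to a temporary file, calling fsync after every chunk so the
// writes actually hit the disk instead of sitting in the page cache.
package main

import (
	"log"
	"os"
)

func main() {
	chunk := make([]byte, 1<<20) // 1 MiB per write

	f, err := os.Create("/tmp/io_pressure.bin")
	if err != nil {
		log.Fatal(err)
	}
	defer os.Remove("/tmp/io_pressure.bin") // clean up after the run
	defer f.Close()

	for i := 0; i < 4096; i++ { // 4096 x 1 MiB ≈ 4 GiB
		if _, err := f.Write(chunk); err != nil {
			log.Fatal(err)
		}
		if err := f.Sync(); err != nil { // force real disk I/O on every chunk
			log.Fatal(err)
		}
	}
}
```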
Then, in the AWS console, stop one task from the instance. You will see that the containerd-shim process stops, but in docker ps you can still see the container. The ECS agent won't be able to continue, and the cluster will have a pending task.
Thanks for pointing out that bugfix @barroca. We have also tracked down this issue to a couple of places where it was possible for ContainerStop and ContainerKill to hang forever in the Docker engine: moby/moby#41586 and moby/moby#41588. It looks like that bugfix is in 19.03.14; we will look into updating our Amazon Linux Docker version, and once we have the two referenced PRs merged we will incorporate those changes as well.
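For context on why a hung ContainerStop surfaces as a DockerTimeoutError rather than blocking the agent forever, here is a minimal stdlib-only Go sketch (not the ECS agent's actual implementation) of bounding a container-stop call against the Docker Engine API with a context timeout; the container ID is a placeholder:

```go
// A sketch of bounding a container-stop call with a context timeout, mirroring
// the "timed out after waiting 1m0s" DockerTimeoutError seen in the agent logs.
// This talks to the Docker Engine API directly over the unix socket; it is NOT
// the ECS agent's actual code.
package main

import (
	"context"
	"fmt"
	"log"
	"net"
	"net/http"
	"time"
)

func stopWithTimeout(containerID string) error {
	// Route HTTP requests over /var/run/docker.sock, the socket named in the
	// "Cannot connect to the Docker daemon" error below.
	httpc := &http.Client{
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				return (&net.Dialer{}).DialContext(ctx, "unix", "/var/run/docker.sock")
			},
		},
	}

	// Bound the whole stop call at one minute so a wedged daemon surfaces as a
	// timeout error instead of hanging the caller indefinitely.
	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()

	// POST /containers/{id}/stop?t=30 asks the daemon to stop the container
	// with a 30-second grace period before SIGKILL.
	url := fmt.Sprintf("http://docker/containers/%s/stop?t=30", containerID)
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, nil)
	if err != nil {
		return err
	}
	resp, err := httpc.Do(req)
	if err != nil {
		return err // includes context.DeadlineExceeded when the daemon hangs
	}
	defer resp.Body.Close()

	// 204 = stopped, 304 = container was already stopped.
	if resp.StatusCode != http.StatusNoContent && resp.StatusCode != http.StatusNotModified {
		return fmt.Errorf("stop returned HTTP %d", resp.StatusCode)
	}
	return nil
}

func main() {
	// "your-container-id" is a placeholder; substitute a real container ID.
	if err := stopWithTimeout("your-container-id"); err != nil {
		log.Printf("stop failed: %v", err)
	}
}
```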
Thanks, @barroca. Have you tried any workaround that can be applied to the ECS cluster instances to avoid this issue until the Docker version is updated?
No workaround yet; I'm waiting for AWS to patch the Docker version. You could try rebuilding Docker from the latest source code and using that, although I wouldn't recommend it because it can cause other problems, and AWS won't provide support if you are using custom packages.
Any update on this? I had to roll back to the ECS-optimized image with the v1.49 agent.
Is it possible that the ECS agent is trying to unmount, or has already unmounted, the EFS volume before the ECS task container has exited? I wasn't able to find a way to test this theory.
Hi @sgtoj, regarding the EFS volume: the ECS agent unmounts EFS volumes only after the task is stopped.
I think this is a different issue than the one in the original post. A bug was introduced in ECS agent 1.50.3 such that the task network might be torn down before the container stops, which can cause the EFS mount to stop functioning before the container stops. This bug is fixed in ECS agent 1.51.0 (ref: #2838). @sgtoj, could you try upgrading to agent version 1.51.0 and see if the issue persists?
Closing this issue, please feel free to reopen if the issue persists after the Agent upgrade.
ECS Agent version: 1.41.1
ecs-agent.log.2021-02-10-16.log
ecs-agent.log.2021-02-10-17.log
Times to note:
ECS health check failure: 10-Feb-2021 16:26:35 UTC
New task successfully created: 10-Feb-2021 17:48:28 UTC
Initial state:
We have an ECS cluster that is attached to a load balancer via a target group, so ECS treats the target group health check as the ECS health check.
Today we encountered a health check failure, and ECS initiated a deregister request to stop the running task on the unhealthy instance.
However, the health check failure was only because the container did not respond to the ELB's health check API in time; the container itself was healthy and was serving incoming requests.
Expected Behaviour:
Since ECS initiated a deregister request, the ECS agent was expected to kill the container (even though it was healthy) and a new task was expected to be created in the ECS cluster.
Actual Behaviour:
In the ECS console, after about a minute, the target showed as deregistered successfully and a new task was started. But in the background, inside the instance, the old container was not stopped, and the ECS agent kept trying to stop it for more than 90 minutes. Since the old container was not cleaned up, ECS was not able to place a new task on the same instance (we don't maintain spare capacity).
From the ECS agent log we are seeing the error messages below.
level=info time=2021-02-10T16:26:36Z msg="Managed task [arn:aws:ecs:us-east-1:361190373704:task/DES-E1-USEZ/03ada87ee22e4f4996f1ad9224ace8cc]: task moving to stopped, adding to stopgroup with sequence number: 3" module=task_manager.go
level=debug time=2021-02-10T16:26:36Z msg="Updating task: [DES-E1-USEZ-td-desb:14 arn:aws:ecs:us-east-1:361190373704:task/DES-E1-USEZ/03ada87ee22e4f4996f1ad9224ace8cc, TaskStatus: (RUNNING->STOPPED) Containers: [DES-desb (RUNNING->RUNNING),]]" module=task.go
level=info time=2021-02-10T16:26:46Z msg="Managed task [arn:aws:ecs:us-east-1:361190373704:task/DES-E1-USEZ/03ada87ee22e4f4996f1ad9224ace8cc]: task not steady state or terminal; progressing it" module=task_manager.go
level=debug time=2021-02-10T16:26:46Z msg="Managed task [arn:aws:ecs:us-east-1:361190373704:task/DES-E1-USEZ/03ada87ee22e4f4996f1ad9224ace8cc]: progressing containers and resources in task" module=task_manager.go
level=debug time=2021-02-10T16:26:46Z msg="Managed task [arn:aws:ecs:us-east-1:361190373704:task/DES-E1-USEZ/03ada87ee22e4f4996f1ad9224ace8cc]: resource [cgroup] has already transitioned to or beyond the desired status REMOVED; current known is REMOVED" module=task_manager.go
level=info time=2021-02-10T16:26:46Z msg="Managed task [arn:aws:ecs:us-east-1:361190373704:task/DES-E1-USEZ/03ada87ee22e4f4996f1ad9224ace8cc]: waiting for event for task" module=task_manager.go
level=error time=2021-02-10T17:47:36Z msg="DockerGoClient: error stopping container 7cbac80a481d1a9c61f97a12f6d3c830e9ca12f2927f82260a4da51c44e2a0ff: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?" module=docker_client.go
level=info time=2021-02-10T17:00:36Z msg="Managed task [arn:aws:ecs:us-east-1:361190373704:task/DES-E1-USEZ/03ada87ee22e4f4996f1ad9224ace8cc]: 'DockerTimeoutError' error stopping container [DES-desb (Runtime ID: 7cbac80a481d1a9c61f97a12f6d3c830e9ca12f2927f82260a4da51c44e2a0ff)]. Ignoring state change: Could not transition to stopped; timed out after waiting 1m0s" module=task_manager.go
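As a quick check of the "Cannot connect to the Docker daemon" error, a small stdlib-only Go snippet like the one below (not part of the ECS agent; written here only for illustration) can ping the Engine API over the same unix socket to see whether dockerd is still answering:

```go
// A stdlib-only liveness check against the Docker Engine API's /_ping endpoint,
// over the same unix socket the agent could not reach.
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"time"
)

func main() {
	httpc := &http.Client{
		Timeout: 5 * time.Second, // fail fast if the daemon is wedged
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				return (&net.Dialer{}).DialContext(ctx, "unix", "/var/run/docker.sock")
			},
		},
	}

	resp, err := httpc.Get("http://docker/_ping")
	if err != nil {
		fmt.Println("daemon unreachable:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("daemon responded with HTTP", resp.StatusCode) // 200 means healthy
}
```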
Could someone help me to fix this?