
ECS Agent was not able to stop the container #2808

Closed

winningjai opened this issue Feb 12, 2021 · 12 comments

@winningjai

winningjai commented Feb 12, 2021

ECS Agent version: 1.41.1
ecs-agent.log.2021-02-10-16.log
ecs-agent.log.2021-02-10-17.log

Times to note:
ECS health check failure time: 10-Feb-2021 16:26:35 UTC
New task successfully created time: 10-Feb-2021 17:48:28 UTC

Initial state:
We have an ECS cluster attached to a load balancer via a target group, so the ECS service treats the target group health check as the task's health.
Today we encountered a health check failure, and the ECS service initiated a deregistration request to stop the running task on the unhealthy instance.
However, the health check only failed because the container could not respond to the ELB's health check API in time; the container itself was healthy and still serving incoming requests.

Expected Behaviour:
Since the ECS service initiated a deregistration request, the ECS agent was expected to stop the container (even though it was healthy) and a new task was expected to be created in the ECS cluster.

Actual Behaviour:
Within about a minute, the ECS console showed the target as successfully deregistered and a new task as started. But in the background on the instance, the old container was not stopped, and the ECS agent kept trying to stop it for more than 90 minutes. Because the old container was not cleaned up, a new task could not be created on the same instance (we don't maintain spare capacity).
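
For anyone checking the same mismatch, the container state on the instance can be compared against what the agent believes, using the standard Docker CLI and the agent's local introspection endpoint (the container name here is taken from the logs below):

# Docker's view of the container that should have been stopped
docker ps --filter "name=DES-desb"

# The ECS agent's view of its tasks (local introspection API)
curl -s http://localhost:51678/v1/tasks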

From the ECS agent log we are seeing the error messages below.
level=info time=2021-02-10T16:26:36Z msg="Managed task [arn:aws:ecs:us-east-1:361190373704:task/DES-E1-USEZ/03ada87ee22e4f4996f1ad9224ace8cc]: task moving to stopped, adding to stopgroup with sequence number: 3" module=task_manager.go
level=debug time=2021-02-10T16:26:36Z msg="Updating task: [DES-E1-USEZ-td-desb:14 arn:aws:ecs:us-east-1:361190373704:task/DES-E1-USEZ/03ada87ee22e4f4996f1ad9224ace8cc, TaskStatus: (RUNNING->STOPPED) Containers: [DES-desb (RUNNING->RUNNING),]]" module=task.go

level=info time=2021-02-10T16:26:46Z msg="Managed task [arn:aws:ecs:us-east-1:361190373704:task/DES-E1-USEZ/03ada87ee22e4f4996f1ad9224ace8cc]: task not steady state or terminal; progressing it" module=task_manager.go
level=debug time=2021-02-10T16:26:46Z msg="Managed task [arn:aws:ecs:us-east-1:361190373704:task/DES-E1-USEZ/03ada87ee22e4f4996f1ad9224ace8cc]: progressing containers and resources in task" module=task_manager.go
level=debug time=2021-02-10T16:26:46Z msg="Managed task [arn:aws:ecs:us-east-1:361190373704:task/DES-E1-USEZ/03ada87ee22e4f4996f1ad9224ace8cc]: resource [cgroup] has already transitioned to or beyond the desired status REMOVED; current known is REMOVED" module=task_manager.go
level=info time=2021-02-10T16:26:46Z msg="Managed task [arn:aws:ecs:us-east-1:361190373704:task/DES-E1-USEZ/03ada87ee22e4f4996f1ad9224ace8cc]: waiting for event for task" module=task_manager.go

level=error time=2021-02-10T17:47:36Z msg="DockerGoClient: error stopping container 7cbac80a481d1a9c61f97a12f6d3c830e9ca12f2927f82260a4da51c44e2a0ff: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?" module=docker_client.go

level=info time=2021-02-10T17:00:36Z msg="Managed task [arn:aws:ecs:us-east-1:361190373704:task/DES-E1-USEZ/03ada87ee22e4f4996f1ad9224ace8cc]: 'DockerTimeoutError' error stopping container [DES-desb (Runtime ID: 7cbac80a481d1a9c61f97a12f6d3c830e9ca12f2927f82260a4da51c44e2a0ff)]. Ignoring state change: Could not transition to stopped; timed out after waiting 1m0s" module=task_manager.go
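
Since one of the errors above says the agent cannot connect to the Docker daemon socket, a quick way to check that from the instance is (standard systemd and Docker commands; output will vary):

# Is the Docker service still reported as running?
sudo systemctl status docker

# Does the daemon still answer on its socket? These hang or error out if dockerd is wedged.
sudo docker info
curl --unix-socket /var/run/docker.sock http://localhost/version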

Could someone help me to fix this?

@shubham2892
Contributor

@winningjai Can you please share the following details:

  1. How often do you observe this error? Are you able to reproduce it? If yes, can you outline the steps to reproduce it?
  2. Please use the ECS logs collector (https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-logs-collector.html) to collect instance-level logs and send them to us at ecs-agent-external@amazon.com. The logs need to include the time window of the event. From the logs you shared, it looks like the Docker daemon stopped running; could it be that Docker stopped because of CPU overload?

If this is reproducible, please enable debug-level logs for Docker and the Agent during the repro, and use the ECS logs collector (a rough sketch of both steps is below).
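
(For reference, a minimal sketch of both steps based on the linked documentation; adjust paths for your AMI, and merge rather than overwrite any existing daemon.json:)

# Download and run the ECS logs collector on the affected instance
curl -O https://raw.githubusercontent.com/awslabs/ecs-logs-collector/master/ecs-logs-collector.sh
sudo bash ./ecs-logs-collector.sh

# Enable debug logging for the ECS agent, then restart the agent
echo "ECS_LOGLEVEL=debug" | sudo tee -a /etc/ecs/ecs.config

# Enable debug logging for the Docker daemon, then restart dockerd
echo '{"debug": true}' | sudo tee /etc/docker/daemon.json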

@winningjai
Author

Hi @shubham2892,
Thanks for the reply.
Please find answers to your questions below.

1. This is the first time we have faced this error. Since we are not sure what caused it, we have not been able to reproduce it; it has occurred only once so far.

2. The affected instance is currently serving our production traffic. Is there any way to collect the logs without draining the instance? The ECS agent is already configured to log at debug level, and the logs attached above are debug logs.

Please take a look at the instance CPU graph below.
[Screenshot: instance CPU utilization graph, 2021-02-18]

It doesn't look overloaded. Also, we run a few other Docker containers on the instance that are not managed by ECS, and they were working fine at the same time.

Let me know your thoughts on this.

@barroca

barroca commented Feb 23, 2021

We have the same issue. We ran some validation, and apparently it is a Docker issue. We could mitigate it by applying this patch (https://github.com/moby/moby/pull/41293/files) to the source RPM from Amazon Linux, rebuilding Docker, and installing the resulting package. Since this is already fixed on the Docker mainline, I suggest AWS take a further look, as it is affecting more customers.
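
A rough sketch of that flow with the standard RPM tooling (the exact spec file layout and patch numbering in the Amazon Linux source package may differ):

# Fetch and unpack the Docker source RPM shipped by Amazon Linux
sudo yum install -y yum-utils rpm-build
yumdownloader --source docker
rpm -ivh docker-*.src.rpm

# Add the backported patch from moby/moby#41293 to the spec file, then rebuild
rpmbuild -ba ~/rpmbuild/SPECS/docker.spec

# Install the rebuilt package
sudo yum localinstall -y ~/rpmbuild/RPMS/x86_64/docker-*.rpm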

@shubham2892
Contributor

@barroca Thanks for pointing out the issue, I will take a further look into this.

@barroca

barroca commented Feb 23, 2021

If you want to reproduce the issue, you will need a cluster with services running that have a desired count. On the instance where the tasks are running, run something that generates a lot of I/O:

fio --rw=randrw --name=test --size=50M --direct=1 --bs=1024k --numjobs=20 --group_reporting --runtime=1200 --time_based

Then, from the AWS console, stop one task on that instance. You will see that the containerd-shim process exits, but docker ps still shows the container.

The ECS agent won't be able to make progress, and the cluster will be left with a pending task.
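
To confirm the stuck state during the repro, something like this is enough (standard ps and Docker commands; <container-id> is a placeholder for whatever docker ps reports for the task's container):

# The containerd-shim process for the stopped task disappears...
ps aux | grep containerd-shim

# ...but Docker still reports the container as running
docker ps
docker inspect -f '{{.State.Status}} {{.State.Pid}}' <container-id>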

@sparrc
Contributor

sparrc commented Feb 23, 2021

Thanks for pointing out that bugfix, @barroca. We have also tracked this issue down to a couple of places where it was possible for ContainerStop and ContainerKill to hang forever in the Docker engine: moby/moby#41586 and moby/moby#41588

It looks like that bugfix is in 19.03.14. We will look into updating our Amazon Linux Docker version, and once the two referenced PRs are merged we will incorporate those changes as well.

@winningjai
Author

Thanks, @barroca
We are facing this issue in production, and it will take some time to roll these changes out to our production clusters. In the meantime, it would be good to have a workaround for this issue.

Have you tried any workaround that can be applied on the ECS cluster instances to avoid this issue until the Docker version is updated?

@barroca

barroca commented Feb 24, 2021

No workaround yet; I'm waiting for AWS to patch their Docker version. You could try rebuilding Docker from the latest source code and using that, although I wouldn't recommend it, because it can introduce other problems and AWS won't provide support if you are running custom packages.

@sgtoj

sgtoj commented Mar 30, 2021

Any update on a resolution? I had to roll back to the ECS-optimized image with the v1.49 agent.

  • the ECS agent wasn't able to stop (via the ECS API) Prometheus containers configured with EFS as their storage
  • the ECS agent logs showed the same issue mentioned in "ECS agent should handle EFS volume umount after a task container is killed" #2810
  • the container could be killed or stopped via docker stop or docker kill
  • the ECS agent logs had the same log entries as above
  • configuring Prometheus to use non-EFS storage allowed the container to be killed via the ECS API
    • EFS was still mounted in the container but not in use
  • the issue affected 5 of 5 Prometheus deployments in 5 different accounts
  • tried various ways to work around the issue, including but not limited to:
    • handling SIGTERM/SIGINT manually
    • sending SIGKILL to prometheus (non-PID 1)
    • using docker's --init
    • trying different distros (busybox with prom/prometheus and debian with bitnami/prometheus)
  • v1.49 is unaffected by this issue

Is it possible the ECS agent is trying to unmount, or has already unmounted, the EFS volume before the ECS task container has exited? I wasn't able to find a way to test this theory.
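
For what it's worth, the mount state can be inspected while the stop is stuck, assuming docker exec still works against the container (<container-id> is a placeholder):

# Is the EFS (NFS) mount still visible inside the container?
docker exec <container-id> sh -c 'mount | grep nfs'

# And on the host, are the NFS/EFS mounts still present?
mount | grep -i nfs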

@mythri-garaga
Contributor

Hi @sgtoj,
Thanks for reporting this issue. We are still waiting on the moby/moby#41586 and moby/moby#41588 bug fixes to be merged.

Regarding the EFS volume: the ECS agent unmounts EFS volumes only after the task is stopped.

@fenxiong
Contributor

fenxiong commented Apr 7, 2021

Is it possible ecs-agent is trying to unmount or has unmounted before the ecs task container has exited?

I think this is a different issue from the one in the original post. A bug was introduced in ECS agent 1.50.3 such that the task network might be torn down before the container stops, which can cause the EFS mount to stop functioning while the container is still running. This bug is fixed in ECS agent 1.51.0 (ref: #2838). @sgtoj, could you try upgrading to agent version 1.51.0 and see if the issue persists?
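
(In case it helps, checking the running agent version and upgrading on the ECS-optimized Amazon Linux 2 AMI is roughly the following; see the agent update documentation for other AMIs:)

# Agent version currently running on the instance
curl -s http://localhost:51678/v1/metadata

# On the ECS-optimized Amazon Linux 2 AMI the agent ships via the ecs-init package
sudo yum update -y ecs-init
sudo systemctl restart ecs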

@shubham2892
Contributor

Closing this issue; please feel free to reopen if it persists after the agent upgrade.
