
ECS Agent was not able to stop the container #2808

Closed

winningjai opened this issue Feb 12, 2021 · 12 comments

@winningjai

winningjai commented Feb 12, 2021

ECS Agent version: 1.41.1
ecs-agent.log.2021-02-10-16.log
ecs-agent.log.2021-02-10-17.log

Times to note:
ECS health check failure time: 10-Feb-2021 16:26:35 UTC
New task successfully created time: 10-Feb-2021 17:48:28 UTC

Initial state:
We have an ECS cluster attached to a load balancer via a target group, so the ECS service treats the target group health check as the task's health.
Today we encountered a health check failure, and the ECS service initiated a deregistration request to stop the running task on the unhealthy instance.
However, the health check only failed because the container could not respond to the ELB's health check API in time; the container itself was healthy and still serving incoming requests.

Expected Behaviour:
Since the ECS service initiated a deregistration request, the ECS agent was expected to stop the container (even though it was healthy) and a new task was expected to be created in the ECS cluster.

Actual Behaviour:
Within about a minute, the ECS console showed the target as successfully deregistered and a new task as started. But in the background on the instance, the old container was not stopped, and the ECS agent kept trying to stop it for more than 90 minutes. Because the old container was not cleaned up, a new task could not be created on the same instance (we don't maintain spare capacity).
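
For anyone checking the same mismatch, the container state on the instance can be compared against what the agent believes, using the standard Docker CLI and the agent's local introspection endpoint (the container name here is taken from the logs below):

# Docker's view of the container that should have been stopped
docker ps --filter "name=DES-desb"

# The ECS agent's view of its tasks (local introspection API)
curl -s http://localhost:51678/v1/tasks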

From the ECS agent log we are seeing the error messages below.
level=info time=2021-02-10T16:26:36Z msg="Managed task [arn:aws:ecs:us-east-1:361190373704:task/DES-E1-USEZ/03ada87ee22e4f4996f1ad9224ace8cc]: task moving to stopped, adding to stopgroup with sequence number: 3" module=task_manager.go
level=debug time=2021-02-10T16:26:36Z msg="Updating task: [DES-E1-USEZ-td-desb:14 arn:aws:ecs:us-east-1:361190373704:task/DES-E1-USEZ/03ada87ee22e4f4996f1ad9224ace8cc, TaskStatus: (RUNNING->STOPPED) Containers: [DES-desb (RUNNING->RUNNING),]]" module=task.go

level=info time=2021-02-10T16:26:46Z msg="Managed task [arn:aws:ecs:us-east-1:361190373704:task/DES-E1-USEZ/03ada87ee22e4f4996f1ad9224ace8cc]: task not steady state or terminal; progressing it" module=task_manager.go
level=debug time=2021-02-10T16:26:46Z msg="Managed task [arn:aws:ecs:us-east-1:361190373704:task/DES-E1-USEZ/03ada87ee22e4f4996f1ad9224ace8cc]: progressing containers and resources in task" module=task_manager.go
level=debug time=2021-02-10T16:26:46Z msg="Managed task [arn:aws:ecs:us-east-1:361190373704:task/DES-E1-USEZ/03ada87ee22e4f4996f1ad9224ace8cc]: resource [cgroup] has already transitioned to or beyond the desired status REMOVED; current known is REMOVED" module=task_manager.go
level=info time=2021-02-10T16:26:46Z msg="Managed task [arn:aws:ecs:us-east-1:361190373704:task/DES-E1-USEZ/03ada87ee22e4f4996f1ad9224ace8cc]: waiting for event for task" module=task_manager.go

level=error time=2021-02-10T17:47:36Z msg="DockerGoClient: error stopping container 7cbac80a481d1a9c61f97a12f6d3c830e9ca12f2927f82260a4da51c44e2a0ff: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?" module=docker_client.go

level=info time=2021-02-10T17:00:36Z msg="Managed task [arn:aws:ecs:us-east-1:361190373704:task/DES-E1-USEZ/03ada87ee22e4f4996f1ad9224ace8cc]: 'DockerTimeoutError' error stopping container [DES-desb (Runtime ID: 7cbac80a481d1a9c61f97a12f6d3c830e9ca12f2927f82260a4da51c44e2a0ff)]. Ignoring state change: Could not transition to stopped; timed out after waiting 1m0s" module=task_manager.go
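
Since one of the errors above says the agent cannot connect to the Docker daemon socket, a quick way to check that from the instance is (standard systemd and Docker commands; output will vary):

# Is the Docker service still reported as running?
sudo systemctl status docker

# Does the daemon still answer on its socket? These hang or error out if dockerd is wedged.
sudo docker info
curl --unix-socket /var/run/docker.sock http://localhost/version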

Could someone help me to fix this?

@shubham2892
Contributor

@winningjai Can you please share the following details:

  1. How often do you observe this error? Are you able to reproduce it? If yes, can you outline the steps to reproduce it?
  2. Please use the ECS logs collector (https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-logs-collector.html) to collect instance-level logs and send them to us at ecs-agent-external@amazon.com. The logs need to include the time window of the event. From the logs you shared, it looks like the Docker daemon stopped running; could it be that Docker stopped because of CPU overload?

If this is reproducible, please enable debug-level logs for Docker and the Agent during the repro, and use the ECS logs collector (a rough sketch of both steps is below).
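
(For reference, a minimal sketch of both steps based on the linked documentation; adjust paths for your AMI, and merge rather than overwrite any existing daemon.json:)

# Download and run the ECS logs collector on the affected instance
curl -O https://raw.githubusercontent.com/awslabs/ecs-logs-collector/master/ecs-logs-collector.sh
sudo bash ./ecs-logs-collector.sh

# Enable debug logging for the ECS agent, then restart the agent
echo "ECS_LOGLEVEL=debug" | sudo tee -a /etc/ecs/ecs.config

# Enable debug logging for the Docker daemon, then restart dockerd
echo '{"debug": true}' | sudo tee /etc/docker/daemon.json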

@winningjai
Author

Hi @shubham2892,
Thanks for the reply.
Please find answers to your questions below.

1. This is the first time we have faced this error. Since we are not sure what caused it, we have not been able to reproduce it; it has occurred only once so far.

2. The affected instance is currently serving our production traffic. Is there any way to collect the logs without draining the instance? The ECS agent is already configured to log at debug level, and the logs attached above are debug logs.

Please take a look at the instance CPU graph below.
[Screenshot: instance CPU utilization graph, 2021-02-18]

It doesn't look overloaded. Also, we run a few other Docker containers on the instance that are not managed by ECS, and they were working fine at the same time.

Let me know your thoughts on this.

@barroca

barroca commented Feb 23, 2021

We have the same issue. We ran some validation, and apparently it is a Docker issue. We could mitigate it by applying this patch (https://github.com/moby/moby/pull/41293/files) to the source RPM from Amazon Linux, rebuilding Docker, and installing the resulting package. Since this is already fixed on the Docker mainline, I suggest AWS take a further look, as it is affecting more customers.
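
A rough sketch of that flow with the standard RPM tooling (the exact spec file layout and patch numbering in the Amazon Linux source package may differ):

# Fetch and unpack the Docker source RPM shipped by Amazon Linux
sudo yum install -y yum-utils rpm-build
yumdownloader --source docker
rpm -ivh docker-*.src.rpm

# Add the backported patch from moby/moby#41293 to the spec file, then rebuild
rpmbuild -ba ~/rpmbuild/SPECS/docker.spec

# Install the rebuilt package
sudo yum localinstall -y ~/rpmbuild/RPMS/x86_64/docker-*.rpm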

@shubham2892
Contributor

@barroca Thanks for pointing out the issue, I will take a further look into this.

@barroca

barroca commented Feb 23, 2021

If you want to reproduce the issue, you will need a cluster with services running that have a desired count. On the instance where the tasks are running, run something that generates a lot of I/O:

fio --rw=randrw --name=test --size=50M --direct=1 --bs=1024k --numjobs=20 --group_reporting --runtime=1200 --time_based

Then, from the AWS console, stop one task on that instance. You will see that the containerd-shim process exits, but docker ps still shows the container.

The ECS agent won't be able to make progress, and the cluster will be left with a pending task.
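
To confirm the stuck state during the repro, something like this is enough (standard ps and Docker commands; <container-id> is a placeholder for whatever docker ps reports for the task's container):

# The containerd-shim process for the stopped task disappears...
ps aux | grep containerd-shim

# ...but Docker still reports the container as running
docker ps
docker inspect -f '{{.State.Status}} {{.State.Pid}}' <container-id>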

@sparrc
Contributor

sparrc commented Feb 23, 2021

Thanks for pointing out that bugfix, @barroca. We have also tracked this issue down to a couple of places where it was possible for ContainerStop and ContainerKill to hang forever in the Docker engine: moby/moby#41586 and moby/moby#41588

It looks like that bugfix is in 19.03.14. We will look into updating our Amazon Linux Docker version, and once the two referenced PRs are merged we will incorporate those changes as well.

@winningjai
Author

Thanks, @barroca
We are facing this issue in production, and it will take some time to roll these changes out to our production clusters. In the meantime, it would be good to have a workaround for this issue.

Have you tried any workaround that can be applied on the ECS cluster instances to avoid this issue until the Docker version is updated?

@barroca

barroca commented Feb 24, 2021

No workaround yet; I'm waiting for AWS to patch their Docker version. You could try rebuilding Docker from the latest source code and using that, although I wouldn't recommend it, because it can introduce other problems and AWS won't provide support if you are running custom packages.

@sgtoj

sgtoj commented Mar 30, 2021

Any update on a resolution? I had to roll back to the ECS-optimized image with the v1.49 agent.

  • the ECS agent wasn't able to stop (via the ECS API) Prometheus containers configured with EFS as their storage
  • the ECS agent logs showed the same issue mentioned in "ECS agent should handle EFS volume umount after a task container is killed" #2810
  • the container could be killed or stopped via docker stop or docker kill
  • the ECS agent logs had the same log entries as above
  • configuring Prometheus to use non-EFS storage allowed the container to be killed via the ECS API
    • EFS was still mounted in the container but not in use
  • the issue affected 5 of 5 Prometheus deployments in 5 different accounts
  • tried various ways to work around the issue, including but not limited to:
    • handling SIGTERM/SIGINT manually
    • sending SIGKILL to prometheus (non-PID 1)
    • using docker's --init
    • trying different distros (busybox with prom/prometheus and debian with bitnami/prometheus)
  • v1.49 is unaffected by this issue

Is it possible the ECS agent is trying to unmount, or has already unmounted, the EFS volume before the ECS task container has exited? I wasn't able to find a way to test this theory.
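
For what it's worth, the mount state can be inspected while the stop is stuck, assuming docker exec still works against the container (<container-id> is a placeholder):

# Is the EFS (NFS) mount still visible inside the container?
docker exec <container-id> sh -c 'mount | grep nfs'

# And on the host, are the NFS/EFS mounts still present?
mount | grep -i nfs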

@mythri-garaga
Contributor

Hi @sgtoj,
Thanks for reporting this issue. We are still waiting on the moby/moby#41586 and moby/moby#41588 bug fixes to be merged.

Regarding the EFS volume: the ECS agent unmounts EFS volumes only after the task is stopped.

@fenxiong
Contributor

fenxiong commented Apr 7, 2021

Is it possible ecs-agent is trying to unmount or has unmounted before the ecs task container has exited?

I think this is a different issue from the one in the original post. A bug was introduced in ECS agent 1.50.3 such that the task network might be torn down before the container stops, which can cause the EFS mount to stop functioning while the container is still running. This bug is fixed in ECS agent 1.51.0 (ref: #2838). @sgtoj, could you try upgrading to agent version 1.51.0 and see if the issue persists?
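
(In case it helps, checking the running agent version and upgrading on the ECS-optimized Amazon Linux 2 AMI is roughly the following; see the agent update documentation for other AMIs:)

# Agent version currently running on the instance
curl -s http://localhost:51678/v1/metadata

# On the ECS-optimized Amazon Linux 2 AMI the agent ships via the ecs-init package
sudo yum update -y ecs-init
sudo systemctl restart ecs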

@shubham2892
Contributor

Closing this issue; please feel free to reopen if it persists after the agent upgrade.
