ecs-agent container exiting or crashing or stopped #730

Closed
jamesongithub opened this issue Mar 13, 2017 · 29 comments
@jamesongithub

Just upgraded ecs-agent to 1.14 and Docker to 1.13.1, and started seeing the ecs-agent container randomly disconnecting from the ECS cluster.

ecs-agent logs

ecr.us-east-1.amazonaws.com/REDACTED:latest) (STOPPED->STOPPED) - Exit: 1"
2017-03-12T22:43:59Z [INFO] Removing Container Reference: REDACTED from Image State- sha256:7aea920f139a4e2337f05649ed4a7e092b671bf6533e21e19fa8d0669d0e75b3
2017-03-12T22:43:59Z [INFO] Saving state! module="statemanager"
2017-03-12T22:44:01Z [INFO] Saving state! module="statemanager"
2017-03-12T22:44:10Z [INFO] Cleaning up task's containers and data module="TaskEngine" task="REDACTED:1 arn:aws:ecs:us-east-1:REDACTED:task/f03394b5-fc1f-4d4b-9898-1289f06ef90a, Status: (STOPPED->STOPPED) Containers: REDACTED (STOPPED->STOPPED),]"
2017-03-12T22:44:10Z [INFO] Removing container module="TaskEngine" task="REDACTED:1 arn:aws:ecs:us-east-1:REDACTED:task/f03394b5-fc1f-4d4b-9898-1289f06ef90a, Status: (STOPPED->STOPPED) Containers: [REDACTED (STOPPED->STOPPED),]" container="REDACTED(REDACTED.dkr.ecr.us-east-1.amazonaws.com/REDACTED:latest) (STOPPED->STOPPED) - Exit: 1"
2017-03-12T22:44:10Z [INFO] Removing Container Reference: REDACTED from Image State- sha256:7aea920f139a4e2337f05649ed4a7e092b671bf6533e21e19fa8d0669d0e75b3
2017-03-12T22:44:10Z [INFO] Saving state! module="statemanager"
2017-03-12T22:44:11Z [INFO] Saving state! module="statemanager"
[remainder of the ecs-agent log is a long run of NUL (^@) bytes]

audit log

2017-03-12T22:43:21Z 200 172.17.0.10:51988 "/v2/credentials" "aws-sdk-nodejs/2.6.2 linux/v6.9.2" arn:aws:ecs:us-east-1:REDACTED:task/29570878-b181-44b0-bf83-c0041f70308f GetCredentials 1 REDACTED arn:aws:ecs:us-east-1:REDACTED:container-instance/39b5555c-29f0-40e4-92c0-68d286ba0ade
2017-03-12T22:43:29Z 200 172.17.0.10:52015 "/v2/credentials" "aws-sdk-go/1.5.5 (go1.7.3; linux; amd64)" arn:aws:ecs:us-east-1:REDACTED:task/b7f8caa5-75c3-4e61-a9d0-04b6affb519d GetCredentials 1 REDACTED arn:aws:ecs:us-east-1:REDACTED:container-instance/39b5555c-29f0-40e4-92c0-68d286ba0ade
2017-03-12T22:43:33Z 200 172.17.0.10:52020 "/v2/credentials" "aws-sdk-go/1.5.5 (go1.7.3; linux; amd64)" arn:aws:ecs:us-east-1:REDACTED:task/c953201a-06ff-442d-924d-561a3e6f967d GetCredentials 1 REDACTED arn:aws:ecs:us-east-1:REDACTED:container-instance/39b5555c-29f0-40e4-92c0-68d286ba0ade
2017-03-12T22:43:56Z 200 172.17.0.10:52072 "/v2/credentials" "aws-sdk-go/1.5.5 (go1.7.3; linux; amd64)" arn:aws:ecs:us-east-1:REDACTED:task/ff5f25dc-0dcb-4049-b758-a2fc38ef7f7d GetCredentials 1 REDACTED:aws:ecs:us-east-1:REDACTED:container-instance/39b5555c-29f0-40e4-92c0-68d286ba0ade
2017-03-12T22:43:56Z 200 172.17.0.11:56262 "/v2/credentials" "aws-sdk-go/1.5.5 (go1.7.3; linux; amd64)" arn:aws:ecs:us-east-1:REDACTED:task/1b431364-1344-4f76-a685-19d5c57b0848 GetCredentials 1 REDACTED-ecs arn:aws:ecs:us-east-1:REDACTED:container-instance/39b5555c-29f0-40e4-92c0-68d286ba0ade

docker log

time="2017-03-12T22:44:30.345331816Z" level=debug msg="devmapper: DeleteDevice START(hash=a7c00fc2e139eefeef4209b547a92e2121ecf106bee3c0d2acf0b6aa4fddcca5 syncDelete=false)"
time="2017-03-12T22:44:30.345356805Z" level=debug msg="devmapper: deactivateDevice START(a7c00fc2e139eefeef4209b547a92e2121ecf106bee3c0d2acf0b6aa4fddcca5)" 
time="2017-03-12T22:44:30.345418346Z" level=debug msg="devmapper: deactivateDevice END(a7c00fc2e139eefeef4209b547a92e2121ecf106bee3c0d2acf0b6aa4fddcca5)"
time="2017-03-12T22:44:30.360273479Z" level=debug msg="devmapper: unregisterDevice(27838, a7c00fc2e139eefeef4209b547a92e2121ecf106bee3c0d2acf0b6aa4fddcca5)"
time="2017-03-12T22:44:30.360968795Z" level=debug msg="devmapper: DeleteDevice END(hash=a7c00fc2e139eefeef4209b547a92e2121ecf106bee3c0d2acf0b6aa4fddcca5 syncDelete=false)"
time="2017-03-12T22:44:30.361018761Z" level=debug msg="devmapper: DeleteDevice START(hash=a7c00fc2e139eefeef4209b547a92e2121ecf106bee3c0d2acf0b6aa4fddcca5-init syncDelete=false)"
time="2017-03-12T22:44:30.361028821Z" level=debug msg="devmapper: deactivateDevice START(a7c00fc2e139eefeef4209b547a92e2121ecf106bee3c0d2acf0b6aa4fddcca5-init)"
time="2017-03-12T22:44:30.361057155Z" level=debug msg="devmapper: deactivateDevice END(a7c00fc2e139eefeef4209b547a92e2121ecf106bee3c0d2acf0b6aa4fddcca5-init)"
time="2017-03-12T22:44:30.386539876Z" level=debug msg="devmapper: unregisterDevice(27837, a7c00fc2e139eefeef4209b547a92e2121ecf106bee3c0d2acf0b6aa4fddcca5-init)"
time="2017-03-12T22:44:30.387158712Z" level=debug msg="devmapper: DeleteDevice END(hash=a7c00fc2e139eefeef4209b547a92e2121ecf106bee3c0d2acf0b6aa4fddcca5-init syncDelete=false)"
time="2017-03-12T22:44:30.387892604Z" level=debug msg="Calling GET /v1.17/containers/34323187c6f0e2560bb1a3b70c9afe255a880ca36d51df6552e0822ff5da254f/json" 
time="2017-03-12T22:44:30.387955093Z" level=error msg="Handler for GET /v1.17/containers/34323187c6f0e2560bb1a3b70c9afe255a880ca36d51df6552e0822ff5da254f/json returned error: No such container: 34323187c6f0e2560bb1a3b70c9afe255a880ca36d51df6552e0822ff5da254f"
time="2017-03-12T22:44:33.470099444Z" level=debug msg="Calling DELETE /v1.17/containers/REDACTED-e0af9cd5acdcd7b30700?v=1"


@jamesongithub
Author

jamesongithub commented Mar 13, 2017

I have a restart policy of always, but I'm not sure why it didn't restart the agent. It's like Docker wasn't even aware the container stopped.

@liwenwu-amazon
Contributor

@jamesongithub
Thank you for reporting this. Are you able to reproduce this crash? If so, can you provide the full set of stack traces?

@liwenwu-amazon
Contributor

This crash seems to have the same root cause as #707.

@jamesongithub
Author

@liwenwu-amazon I don't have stack traces, and I'm unable to repro consistently. Those logs are all I have. Curious why you think #707 is the same issue.

@richardpen

@jamesongithub From the logs you provided, I didn't see any abnormal messages. The connection shows false because the agent periodically disconnects and reconnects when there is no activity for a long time. How long does the connection stay false? The agent should reconnect in less than 2 minutes; if the connection stays false for longer than that, there may be a problem, and in that case please send me the ecs-agent logs. Also, could you tell me how this behavior impacted your production?

Thanks,
Peng

@jamesongithub
Author

Hi @richardpen. After it becomes false, I have never seen it connect again unless we manually go in and restart the agent. Why does the agent disconnect when there is no activity? What happens when a task needs to be spawned on that host during the 2 minutes?

All the logs I have are attached. After the agent disconnected, there were no more logs. This of course impacts us. The host basically doesn't accept new tasks.

@cabbruzzese
Contributor

@jamesongithub I'm sorry that you're still experiencing trouble. Do you have output from the ecs-init logs? Can you also send the output of docker info? If you do not feel comfortable sharing those logs on this GitHub issue, you can e-mail them to me at abbrcale at amazon.com.

@jamesongithub
Author

@cabbruzzese

docker info

Containers: 1
 Running: 1
 Paused: 0
 Stopped: 0
Images: 1
Server Version: 17.03.1-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 4ab9917febca54791c5f071a9d1f404867857fcc
runc version: 54296cf40ad8143b62dbcaa1d90e520a2136ddfe
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-64-generic
Operating System: Ubuntu 16.04.2 LTS
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 59.97 GiB
Name: REDACTED
ID: REDACTED
Docker Root Dir: /REDACTED
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 26
 Goroutines: 34
 System Time: 2017-04-22T02:57:03.238495078Z
 EventsListeners: 1
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

We don't have any ecs-init logs, since we're not using the AWS ECS-optimized AMIs.

Any thoughts on my questions from the last post? It seems strange that agents disconnecting just because there is no activity is regular behavior.

@aaithal
Contributor

aaithal commented Apr 26, 2017

Hi @jamesongithub, to answer some of the questions you asked in this thread:

Why does the agent disconnect when there is no activity? What happens when a task needs to be spawned on that host during the 2 minutes?

The agent establishes a long-running websocket connection with the ECS backend so that the backend communication service can send it state changes to apply. Since connection authorization and authentication happen at connection establishment, we force the agent to re-authenticate and re-authorize itself periodically by reconnecting.

The 2-minute duration is the worst case, reached only when backoff/retry kicks in; it should be much shorter during the normal course of operation.
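For illustration, the backoff/retry behavior described above can be sketched as capped exponential backoff with jitter. The constants below are hypothetical, not the agent's actual configuration; the cap simply mirrors the roughly 2-minute worst case mentioned here.

```python
import random

def reconnect_delays(base=0.25, cap=120.0, attempts=8, seed=0):
    """Sketch of capped exponential backoff with full jitter.

    base/cap/attempts are illustrative values only. Each attempt doubles
    the ceiling, clamps it at `cap`, and then sleeps a random duration in
    [0, ceiling] so many agents don't reconnect in lockstep.
    """
    rng = random.Random(seed)
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0, ceiling))  # full jitter
    return delays

if __name__ == "__main__":
    for i, d in enumerate(reconnect_delays()):
        print(f"attempt {i}: sleep {d:.2f}s")
```

The jitter means a reconnect usually happens well before the cap; only a pathological run of failures approaches the 2-minute ceiling.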

As for determining why the Agent has been disconnected for an extended duration, I did take a look at the logs that you sent to @richardpen and couldn't find anything that was obviously wrong with the Agent. I'm sorry to ask this of you again, but when you see this issue resurface, could you please make the Agent emit a stack trace by running docker kill -s USR1 ecs-agent before restarting it, and send us the log you get from that (you can send them to aithal at amazon dot com)?
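As background on the mechanism: sending USR1 makes the (Go) agent dump its stacks to its log. A rough Python analogue of that wiring, purely illustrative and not agent code, uses the standard faulthandler module; the sink must be a real file because the dump is written straight to the file descriptor from a C-level signal handler (POSIX only).

```python
import faulthandler
import os
import signal
import tempfile

def install_stack_dump_handler(sink):
    # Dump all thread stacks to `sink` whenever SIGUSR1 arrives.
    # `sink` needs a real fileno(); an in-memory buffer won't work.
    faulthandler.register(signal.SIGUSR1, file=sink, all_threads=True)

# Trigger the handler in-process, standing in for `docker kill -s USR1 ...`.
sink = tempfile.NamedTemporaryFile(mode="w+", delete=False)
install_stack_dump_handler(sink)
os.kill(os.getpid(), signal.SIGUSR1)
```

After the signal, the file contains a stack trace for every thread, which is the same kind of artifact being requested here.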

Thanks,
Anirudh

@samuelkarp
Contributor

@jamesongithub Just checking in; are you still experiencing problems? If so, were you able to run docker kill -s USR1 ecs-agent to generate debug information?

@jamesongithub
Author

Hi, I haven't seen it recently. However, we're running the ecs-agent container with an always restart policy now, so the problem may still exist.

@samuelkarp
Contributor

> Hi, I haven't seen it recently. However, we're running the ecs-agent container with an always restart policy now, so the problem may still exist.

Interesting, does that mean that the agent was crashing/exiting instead of just remaining disconnected?

@jamesongithub
Author

Probably. I've seen the latter, but it was a few versions ago.

@samuelkarp
Contributor

@jamesongithub Thanks. I'm going to close this issue for now since you're no longer seeing problems, though if you do experience problems again please let us know. In order for us to root-cause and understand what's going on, we'll need the detailed logging/debugging information that @aaithal asked for.

@jamesongithub
Author

@samuelkarp @aaithal I just hit this again. Agents are running, but disconnected. Can we reopen? I've sent logs to @aaithal and @cabbruzzese. If you give me your email @samuelkarp, I can send it to you as well.

@samuelkarp
Contributor

Yep, I'll reopen. I can ask @aaithal and @cabbruzzese for the logs.

@samuelkarp samuelkarp reopened this May 23, 2017
@samuelkarp
Contributor

@jamesongithub Can you send them directly to me at skarp at amazon.com?

@jamesongithub
Author

jamesongithub commented May 23, 2017

@samuelkarp done. thank you.

@samuelkarp
Contributor

@jamesongithub I haven't received the logs. Can you try maybe sending from a different email address?

@samuelkarp
Contributor

@jamesongithub Can you try sending the logs again?

@jamesongithub
Author

Hi @samuelkarp, I just sent over two emails. One attachment is ~24 MB, the other ~10 MB. I haven't received any rejection emails. Is there a maximum size on your end?

@samuelkarp
Contributor

@jamesongithub Could be...I only received the email with the ~10 MB attachment.

@jamesongithub
Author

OK, I broke them up and resent. Thanks!

@samuelkarp
Contributor

Hi @jamesongithub,

Thank you for sending the logs. I've spent some time over the past few days looking at them and wanted to update you with what I found and what is still remaining to be determined.

From your logs, the following line (repeated many times) stood out to me:

2017-05-23T08:00:05Z [WARN] Blocking cleanup for task <REDACTED> arn:aws:ecs:<REDACTED>:task/6e4f55fa-b3e5-4200-b756-f38c84180397, Status: (STOPPED->STOPPED) Containers: [<REDACTED> (STOPPED->STOPPED),] until the task has been reported stopped. SentStatus: RUNNING (6690/8640) 

From this, I've been able to determine that you're affected by one known bug, but I also believe that there are potentially two other bugs affecting you. The known bug involves extended retries for requests not being properly resigned; this is fixed by #786 and will be released with version 1.14.2 of the agent.
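The re-signing bug is easier to see with a sketch. This is in Python for brevity (the agent itself is Go), and FakeSigner and send_with_retries are hypothetical stand-ins rather than agent code; the point is that a long retry loop must produce a fresh signature for each attempt instead of replaying the original, now-stale one.

```python
class FakeSigner:
    """Stand-in for a SigV4-style signer. Real signatures embed a request
    timestamp and expire, which is why extended retry loops that reuse the
    original signature eventually start failing."""
    def __init__(self):
        self.signatures_issued = 0

    def sign(self, request):
        self.signatures_issued += 1
        request["signature"] = f"sig-{self.signatures_issued}"
        return request

def send_with_retries(request, transport, signer, max_attempts=3):
    """Re-sign the request before every attempt (the corrected behavior),
    rather than signing once up front and replaying a stale signature."""
    last_error = None
    for _ in range(max_attempts):
        signer.sign(request)  # fresh signature each retry
        try:
            return transport(request)
        except ConnectionError as err:
            last_error = err
    raise last_error
```

With signing inside the loop, every retry carries a current signature; with it hoisted above the loop, retries after the expiry window are rejected even once the network recovers.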

The first potential bug involves why some requests relating to the specific task above failed and got into extended retries. Along with the retries that I was seeing, I also saw some other network-related failures in the logs you sent. I've been looking for the requests I expect to see in the ECS service logs and haven't been able to find them so far (though I'm not done looking yet). Are you running the ECS agent behind an HTTP proxy? If so, do you know if the proxy experienced some sort of failure around 2017-05-21T00:15:05Z?

The second potential bug involves restoring state after a restart. What I was able to see from the ECS service logs led me to believe that the ECS agent may have restarted around 2017-05-24T02:03:53Z. Unfortunately, that time period is missing from the logs that you sent me. If the agent did restart at 2017-05-24T02:03:53Z and then failed to reconnect, that may indicate a different codepath is at fault. In order to get more information on this, I'd like to look at the ECS agent's state file, which is normally located at /var/lib/ecs/data/ecs_agent_data.json. If you still have that file, I'd appreciate it if you sent it to me (please feel free to remove anything that might be sensitive from that file before sending it; I'm primarily interested in the task ARNs and the various "state" fields that the agent records).
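For redacting a state file like this before sharing it, one minimal approach is to blank out string values while preserving keys, nesting, and the ARNs that were asked for. This sketch assumes nothing about the real schema of ecs_agent_data.json; any field names in the usage comment are hypothetical.

```python
def redact(value, keep=("arn:",)):
    """Recursively replace string values in a parsed JSON document with
    'REDACTED', preserving keys, nesting, and non-string values so the
    overall shape of the state stays inspectable.

    Strings containing any marker in `keep` are left intact (here, ARNs);
    extend `keep` to preserve other fields such as status strings.
    """
    if isinstance(value, dict):
        return {k: redact(v, keep) for k, v in value.items()}
    if isinstance(value, list):
        return [redact(v, keep) for v in value]
    if isinstance(value, str):
        return value if any(marker in value for marker in keep) else "REDACTED"
    return value

# Typical use (path from the comment above):
#   import json, sys
#   with open("/var/lib/ecs/data/ecs_agent_data.json") as f:
#       json.dump(redact(json.load(f)), sys.stdout, indent=2)
```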

Sam

@jamesongithub
Author

Hi @samuelkarp,

Thanks for the analysis.

Yes, we had a network issue around 2017-05-21T00:15:05Z. Basically our NAT failed, which prevented the agent from connecting to ECS. During this time, it probably went into extended retries. After the network issue was fixed, the agent still wasn't able to join the cluster.

For the second issue, I restarted the agent to see if it would help, but once it gets into that state, it seems to never recover. I'm not sure when the restart was; I think it was the 22nd or 23rd.

I'm sending you the agent state file by email.

Thanks
-james

@richardpen

@jamesongithub We have released 1.14.3, which includes the fix from #786. Can you upgrade to the latest version and see if the problem still exists?

Thanks,
Peng

@jamesongithub
Author

Yes, we've been running it for a few weeks now. We haven't seen the issue, although we haven't had any network issues yet either.

@richardpen

@jamesongithub I'm closing this for now, feel free to reopen it if you run into this issue in the future.
