ecs-agent container exiting or crashing or stopped #730

Closed
jamesongithub opened this issue Mar 13, 2017 · 29 comments
@jamesongithub

Just upgraded ecs-agent to 1.14 and Docker to 1.13.1, and started seeing the ecs-agent container randomly disconnecting from the ECS cluster.

ecs-agent logs

ecr.us-east-1.amazonaws.com/REDACTED:latest) (STOPPED->STOPPED) - Exit: 1"
2017-03-12T22:43:59Z [INFO] Removing Container Reference: REDACTED from Image State- sha256:7aea920f139a4e2337f05649ed4a7e092b671bf6533e21e19fa8d0669d0e75b3
2017-03-12T22:43:59Z [INFO] Saving state! module="statemanager"
2017-03-12T22:44:01Z [INFO] Saving state! module="statemanager"
2017-03-12T22:44:10Z [INFO] Cleaning up task's containers and data module="TaskEngine" task="REDACTED:1 arn:aws:ecs:us-east-1:REDACTED:task/f03394b5-fc1f-4d4b-9898-1289f06ef90a, Status: (STOPPED->STOPPED) Containers: REDACTED (STOPPED->STOPPED),]"
2017-03-12T22:44:10Z [INFO] Removing container module="TaskEngine" task="REDACTED:1 arn:aws:ecs:us-east-1:REDACTED:task/f03394b5-fc1f-4d4b-9898-1289f06ef90a, Status: (STOPPED->STOPPED) Containers: [REDACTED (STOPPED->STOPPED),]" container="REDACTED(REDACTED.dkr.ecr.us-east-1.amazonaws.com/REDACTED:latest) (STOPPED->STOPPED) - Exit: 1"
2017-03-12T22:44:10Z [INFO] Removing Container Reference: REDACTED from Image State- sha256:7aea920f139a4e2337f05649ed4a7e092b671bf6533e21e19fa8d0669d0e75b3
2017-03-12T22:44:10Z [INFO] Saving state! module="statemanager"
2017-03-12T22:44:11Z [INFO] Saving state! module="statemanager"
[remainder of the ecs-agent log is a long run of NUL (^@) bytes]

audit log

2017-03-12T22:43:21Z 200 172.17.0.10:51988 "/v2/credentials" "aws-sdk-nodejs/2.6.2 linux/v6.9.2" arn:aws:ecs:us-east-1:REDACTED:task/29570878-b181-44b0-bf83-c0041f70308f GetCredentials 1 REDACTED arn:aws:ecs:us-east-1:REDACTED:container-instance/39b5555c-29f0-40e4-92c0-68d286ba0ade
2017-03-12T22:43:29Z 200 172.17.0.10:52015 "/v2/credentials" "aws-sdk-go/1.5.5 (go1.7.3; linux; amd64)" arn:aws:ecs:us-east-1:REDACTED:task/b7f8caa5-75c3-4e61-a9d0-04b6affb519d GetCredentials 1 REDACTED arn:aws:ecs:us-east-1:REDACTED:container-instance/39b5555c-29f0-40e4-92c0-68d286ba0ade
2017-03-12T22:43:33Z 200 172.17.0.10:52020 "/v2/credentials" "aws-sdk-go/1.5.5 (go1.7.3; linux; amd64)" arn:aws:ecs:us-east-1:REDACTED:task/c953201a-06ff-442d-924d-561a3e6f967d GetCredentials 1 REDACTED arn:aws:ecs:us-east-1:REDACTED:container-instance/39b5555c-29f0-40e4-92c0-68d286ba0ade
2017-03-12T22:43:56Z 200 172.17.0.10:52072 "/v2/credentials" "aws-sdk-go/1.5.5 (go1.7.3; linux; amd64)" arn:aws:ecs:us-east-1:REDACTED:task/ff5f25dc-0dcb-4049-b758-a2fc38ef7f7d GetCredentials 1 REDACTED:aws:ecs:us-east-1:REDACTED:container-instance/39b5555c-29f0-40e4-92c0-68d286ba0ade
2017-03-12T22:43:56Z 200 172.17.0.11:56262 "/v2/credentials" "aws-sdk-go/1.5.5 (go1.7.3; linux; amd64)" arn:aws:ecs:us-east-1:REDACTED:task/1b431364-1344-4f76-a685-19d5c57b0848 GetCredentials 1 REDACTED-ecs arn:aws:ecs:us-east-1:REDACTED:container-instance/39b5555c-29f0-40e4-92c0-68d286ba0ade

docker log

time="2017-03-12T22:44:30.345331816Z" level=debug msg="devmapper: DeleteDevice START(hash=a7c00fc2e139eefeef4209b547a92e2121ecf106bee3c0d2acf0b6aa4fddcca5 syncDelete=false)"
time="2017-03-12T22:44:30.345356805Z" level=debug msg="devmapper: deactivateDevice START(a7c00fc2e139eefeef4209b547a92e2121ecf106bee3c0d2acf0b6aa4fddcca5)" 
time="2017-03-12T22:44:30.345418346Z" level=debug msg="devmapper: deactivateDevice END(a7c00fc2e139eefeef4209b547a92e2121ecf106bee3c0d2acf0b6aa4fddcca5)"
time="2017-03-12T22:44:30.360273479Z" level=debug msg="devmapper: unregisterDevice(27838, a7c00fc2e139eefeef4209b547a92e2121ecf106bee3c0d2acf0b6aa4fddcca5)"
time="2017-03-12T22:44:30.360968795Z" level=debug msg="devmapper: DeleteDevice END(hash=a7c00fc2e139eefeef4209b547a92e2121ecf106bee3c0d2acf0b6aa4fddcca5 syncDelete=false)"
time="2017-03-12T22:44:30.361018761Z" level=debug msg="devmapper: DeleteDevice START(hash=a7c00fc2e139eefeef4209b547a92e2121ecf106bee3c0d2acf0b6aa4fddcca5-init syncDelete=false)"
time="2017-03-12T22:44:30.361028821Z" level=debug msg="devmapper: deactivateDevice START(a7c00fc2e139eefeef4209b547a92e2121ecf106bee3c0d2acf0b6aa4fddcca5-init)"
time="2017-03-12T22:44:30.361057155Z" level=debug msg="devmapper: deactivateDevice END(a7c00fc2e139eefeef4209b547a92e2121ecf106bee3c0d2acf0b6aa4fddcca5-init)"
time="2017-03-12T22:44:30.386539876Z" level=debug msg="devmapper: unregisterDevice(27837, a7c00fc2e139eefeef4209b547a92e2121ecf106bee3c0d2acf0b6aa4fddcca5-init)"
time="2017-03-12T22:44:30.387158712Z" level=debug msg="devmapper: DeleteDevice END(hash=a7c00fc2e139eefeef4209b547a92e2121ecf106bee3c0d2acf0b6aa4fddcca5-init syncDelete=false)"
time="2017-03-12T22:44:30.387892604Z" level=debug msg="Calling GET /v1.17/containers/34323187c6f0e2560bb1a3b70c9afe255a880ca36d51df6552e0822ff5da254f/json" 
time="2017-03-12T22:44:30.387955093Z" level=error msg="Handler for GET /v1.17/containers/34323187c6f0e2560bb1a3b70c9afe255a880ca36d51df6552e0822ff5da254f/json returned error: No such container: 34323187c6f0e2560bb1a3b70c9afe255a880ca36d51df6552e0822ff5da254f"
time="2017-03-12T22:44:33.470099444Z" level=debug msg="Calling DELETE /v1.17/containers/REDACTED-e0af9cd5acdcd7b30700?v=1"


@jamesongithub
Author

jamesongithub commented Mar 13, 2017

I have a restart policy of always, but I'm not sure why it didn't restart the agent. It's like Docker wasn't even aware the container stopped.

@liwenwu-amazon
Contributor

@jamesongithub
Thank you for reporting this. Are you able to reproduce this crash? If so, can you provide the full set of stack traces?

@liwenwu-amazon
Contributor

This crash seems to have the same root cause as #707.

@jamesongithub
Author

@liwenwu-amazon I don't have stack traces, and I'm unable to repro consistently. Those logs are all I have. Curious why you think #707 is the same issue.

@richardpen

@jamesongithub From the logs you provided, I didn't see any abnormal messages. The connection shows false because the agent periodically disconnects and reconnects when there is no activity for a long time. How long does the connection stay false? The agent should reconnect in less than 2 minutes; if the connection stays false for longer than that, there may be a problem, and in that case please send me the ecs-agent logs. Also, could you tell me how this behavior impacted your production?

Thanks,
Peng

@jamesongithub
Author

Hi @richardpen. After it becomes false, I have never seen it connect again unless we manually go in and restart the agent. Why does the agent disconnect when there is no activity? What happens when a task needs to be spawned on that host during the 2 minutes?

All the logs I have are attached. After the agent disconnected, there were no more logs. This of course impacts us. The host basically doesn't accept new tasks.

@cabbruzzese
Contributor

@jamesongithub I'm sorry that you're still experiencing trouble. Do you have output from the ecs-init logs? Can you also send the output of docker info? If you do not feel comfortable sharing those logs on this GitHub issue, you can e-mail them to me at abbrcale at amazon.com.

@jamesongithub
Author

@cabbruzzese

docker info

Containers: 1
 Running: 1
 Paused: 0
 Stopped: 0
Images: 1
Server Version: 17.03.1-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 4ab9917febca54791c5f071a9d1f404867857fcc
runc version: 54296cf40ad8143b62dbcaa1d90e520a2136ddfe
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-64-generic
Operating System: Ubuntu 16.04.2 LTS
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 59.97 GiB
Name: REDACTED
ID: REDACTED
Docker Root Dir: /REDACTED
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 26
 Goroutines: 34
 System Time: 2017-04-22T02:57:03.238495078Z
 EventsListeners: 1
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

We don't have any ecs-init logs, since we're not using the AWS ECS-optimized AMIs.

Any thoughts on my questions from the last post? It seems strange that agents disconnecting just because there is no activity is regular behavior.

@aaithal
Contributor

aaithal commented Apr 26, 2017

Hi @jamesongithub, to answer some of the questions you asked in this thread:

Why does the agent disconnect when there is no activity? What happens when a task needs to be spawned on that host during the 2 minutes?

The agent establishes a long-running websocket connection with the ECS backend so that the backend communication service can send it state changes to apply. Since connection authorization and authentication happen at connection establishment, we force the agent to re-authenticate and re-authorize itself periodically by reconnecting.

The 2-minute duration is the worst case, reached only when backoff/retry kicks in; it should be much shorter during the normal course of operation.
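For illustration, the backoff/retry behavior described above can be sketched as capped exponential backoff with jitter. The constants below are hypothetical, not the agent's actual configuration; the cap simply mirrors the roughly 2-minute worst case mentioned here.

```python
import random

def reconnect_delays(base=0.25, cap=120.0, attempts=8, seed=0):
    """Sketch of capped exponential backoff with full jitter.

    base/cap/attempts are illustrative values only. Each attempt doubles
    the ceiling, clamps it at `cap`, and then sleeps a random duration in
    [0, ceiling] so many agents don't reconnect in lockstep.
    """
    rng = random.Random(seed)
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0, ceiling))  # full jitter
    return delays

if __name__ == "__main__":
    for i, d in enumerate(reconnect_delays()):
        print(f"attempt {i}: sleep {d:.2f}s")
```

The jitter means a reconnect usually happens well before the cap; only a pathological run of failures approaches the 2-minute ceiling.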

As for determining why the Agent has been disconnected for an extended duration, I did take a look at the logs that you sent to @richardpen and couldn't find anything that was obviously wrong with the Agent. I'm sorry to ask this of you again, but when you see this issue resurface, could you please make the Agent emit a stack trace by running docker kill -s USR1 ecs-agent before restarting it, and send us the log you get from that (you can send them to aithal at amazon dot com)?
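As background on the mechanism: sending USR1 makes the (Go) agent dump its stacks to its log. A rough Python analogue of that wiring, purely illustrative and not agent code, uses the standard faulthandler module; the sink must be a real file because the dump is written straight to the file descriptor from a C-level signal handler (POSIX only).

```python
import faulthandler
import os
import signal
import tempfile

def install_stack_dump_handler(sink):
    # Dump all thread stacks to `sink` whenever SIGUSR1 arrives.
    # `sink` needs a real fileno(); an in-memory buffer won't work.
    faulthandler.register(signal.SIGUSR1, file=sink, all_threads=True)

# Trigger the handler in-process, standing in for `docker kill -s USR1 ...`.
sink = tempfile.NamedTemporaryFile(mode="w+", delete=False)
install_stack_dump_handler(sink)
os.kill(os.getpid(), signal.SIGUSR1)
```

After the signal, the file contains a stack trace for every thread, which is the same kind of artifact being requested here.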

Thanks,
Anirudh

@samuelkarp
Contributor

@jamesongithub Just checking in; are you still experiencing problems? If so, were you able to run docker kill -s USR1 ecs-agent to generate debug information?

@jamesongithub
Author

Hi, I haven't seen it recently. However, we're running the ecs-agent container with an always restart policy now, so the problem may still exist.

@samuelkarp
Contributor

> Hi, I haven't seen it recently. However, we're running the ecs-agent container with an always restart policy now, so the problem may still exist.

Interesting, does that mean that the agent was crashing/exiting instead of just remaining disconnected?

@jamesongithub
Author

Probably. I've seen the latter, but it was a few versions ago.

@samuelkarp
Contributor

@jamesongithub Thanks. I'm going to close this issue for now since you're no longer seeing problems, though if you do experience problems again please let us know. In order for us to root-cause and understand what's going on, we'll need the detailed logging/debugging information that @aaithal asked for.

@jamesongithub
Author

@samuelkarp @aaithal I just hit this again. Agents are running, but disconnected. Can we reopen? I've sent logs to @aaithal and @cabbruzzese. If you give me your email @samuelkarp, I can send it to you as well.

@samuelkarp
Contributor

Yep, I'll reopen. I can ask @aaithal and @cabbruzzese for the logs.

@samuelkarp samuelkarp reopened this May 23, 2017
@samuelkarp
Contributor

@jamesongithub Can you send them directly to me at skarp at amazon.com?

@jamesongithub
Author

jamesongithub commented May 23, 2017

@samuelkarp done. thank you.

@samuelkarp
Contributor

@jamesongithub I haven't received the logs. Can you try maybe sending from a different email address?

@samuelkarp
Contributor

@jamesongithub Can you try sending the logs again?

@jamesongithub
Author

Hi @samuelkarp, I just sent over two emails. One attachment is ~24 MB, the other ~10 MB. I haven't received any rejection emails. Is there a maximum size on your end?

@samuelkarp
Contributor

@jamesongithub Could be...I only received the email with the ~10 MB attachment.

@jamesongithub
Author

OK, I broke them up and resent. Thanks!

@samuelkarp
Contributor

Hi @jamesongithub,

Thank you for sending the logs. I've spent some time over the past few days looking at them and wanted to update you with what I found and what is still remaining to be determined.

From your logs, the following line (repeated many times) stood out to me:

2017-05-23T08:00:05Z [WARN] Blocking cleanup for task <REDACTED> arn:aws:ecs:<REDACTED>:task/6e4f55fa-b3e5-4200-b756-f38c84180397, Status: (STOPPED->STOPPED) Containers: [<REDACTED> (STOPPED->STOPPED),] until the task has been reported stopped. SentStatus: RUNNING (6690/8640) 

From this, I've been able to determine that you're affected by one known bug, but I also believe that there are potentially two other bugs affecting you. The known bug involves extended retries for requests not being properly resigned; this is fixed by #786 and will be released with version 1.14.2 of the agent.
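The re-signing bug is easier to see with a sketch. This is in Python for brevity (the agent itself is Go), and FakeSigner and send_with_retries are hypothetical stand-ins rather than agent code; the point is that a long retry loop must produce a fresh signature for each attempt instead of replaying the original, now-stale one.

```python
class FakeSigner:
    """Stand-in for a SigV4-style signer. Real signatures embed a request
    timestamp and expire, which is why extended retry loops that reuse the
    original signature eventually start failing."""
    def __init__(self):
        self.signatures_issued = 0

    def sign(self, request):
        self.signatures_issued += 1
        request["signature"] = f"sig-{self.signatures_issued}"
        return request

def send_with_retries(request, transport, signer, max_attempts=3):
    """Re-sign the request before every attempt (the corrected behavior),
    rather than signing once up front and replaying a stale signature."""
    last_error = None
    for _ in range(max_attempts):
        signer.sign(request)  # fresh signature each retry
        try:
            return transport(request)
        except ConnectionError as err:
            last_error = err
    raise last_error
```

With signing inside the loop, every retry carries a current signature; with it hoisted above the loop, retries after the expiry window are rejected even once the network recovers.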

The first potential bug involves why some requests relating to the specific task above failed and got into extended retries. Along with the retries that I was seeing, I also saw some other network-related failures in the logs you sent. I've been looking for the requests I expect to see in the ECS service logs and haven't been able to find them so far (though I'm not done looking yet). Are you running the ECS agent behind an HTTP proxy? If so, do you know if the proxy experienced some sort of failure around 2017-05-21T00:15:05Z?

The second potential bug involves restoring state after a restart. What I was able to see from the ECS service logs led me to believe that the ECS agent may have restarted around 2017-05-24T02:03:53Z. Unfortunately, that time period is missing from the logs that you sent me. If the agent did restart at 2017-05-24T02:03:53Z and then failed to reconnect, that may indicate a different codepath is at fault. In order to get more information on this, I'd like to look at the ECS agent's state file, which is normally located at /var/lib/ecs/data/ecs_agent_data.json. If you still have that file, I'd appreciate it if you sent it to me (please feel free to remove anything that might be sensitive from that file before sending it; I'm primarily interested in the task ARNs and the various "state" fields that the agent records).
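For redacting a state file like this before sharing it, one minimal approach is to blank out string values while preserving keys, nesting, and the ARNs that were asked for. This sketch assumes nothing about the real schema of ecs_agent_data.json; any field names in the usage comment are hypothetical.

```python
def redact(value, keep=("arn:",)):
    """Recursively replace string values in a parsed JSON document with
    'REDACTED', preserving keys, nesting, and non-string values so the
    overall shape of the state stays inspectable.

    Strings containing any marker in `keep` are left intact (here, ARNs);
    extend `keep` to preserve other fields such as status strings.
    """
    if isinstance(value, dict):
        return {k: redact(v, keep) for k, v in value.items()}
    if isinstance(value, list):
        return [redact(v, keep) for v in value]
    if isinstance(value, str):
        return value if any(marker in value for marker in keep) else "REDACTED"
    return value

# Typical use (path from the comment above):
#   import json, sys
#   with open("/var/lib/ecs/data/ecs_agent_data.json") as f:
#       json.dump(redact(json.load(f)), sys.stdout, indent=2)
```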

Sam

@jamesongithub
Author

Hi @samuelkarp,

Thanks for the analysis.

Yes, we had a network issue around 2017-05-21T00:15:05Z. Basically our NAT failed, which prevented the agent from connecting to ECS. During this time, it probably went into extended retries. After the network issue was fixed, the agent still wasn't able to join the cluster.

For the second issue, I restarted the agent to see if it would help, but once it gets into that state, it seems to never recover. I'm not sure when the restart was; I think it was the 22nd or 23rd.

I'm sending you the agent state file by email.

Thanks
-james

@richardpen

@jamesongithub We have released 1.14.3, which includes the fix from #786. Can you upgrade to the latest version and see if the problem still exists?

Thanks,
Peng

@jamesongithub
Author

Yes, we've been running it for a few weeks now. We haven't seen the issue, although we haven't had any network issues yet either.

@richardpen

@jamesongithub I'm closing this for now, feel free to reopen it if you run into this issue in the future.
