Agent Running But Disconnected #781

Closed
tweibley opened this issue Apr 27, 2017 · 11 comments

Comments

@tweibley

We keep seeing various agents go AWOL (the agent is active but no longer connected). I caught one in that state tonight and got a dump from it, with a few relevant log entries too: https://gist.github.com/tweibley/1e7c8ecacd15f813f276c62c08472375.

The only other thing I noticed is that we have a bunch of log entries like:

2017-04-26T23:30:26Z [WARN] Connection closed for a valid reason: websocket: close 1000 (normal): ConnectionExpired: Reconnect to continue
2017-04-26T23:40:19Z [WARN] Connection closed for a valid reason: websocket: close 1000 (normal): ConnectionExpired: Reconnect to continue
2017-04-26T23:53:46Z [WARN] Connection closed for a valid reason: websocket: close 1000 (normal): ConnectionExpired: Reconnect to continue
2017-04-26T23:54:19Z [WARN] Connection closed for a valid reason: websocket: close 1000 (normal): ConnectionExpired: Reconnect to continue

So far I haven't been able to figure out exactly why this is happening... Doesn't look like we are anywhere near max open files soft/hard limit, etc.

@tweibley
Author

Continuing to see this behavior repeatedly. Looks like if left alone the agent reconnects within about 2-5 minutes.

@aaithal
Contributor

aaithal commented Apr 27, 2017

Hi @tweibley, as I recently commented on #730,

The Agent establishes a long-running websocket connection with the ECS backend so that the backend communication service can send it state changes to apply. Since connection authorization and authentication happen at connection establishment, we want to force the agent to re-authenticate and re-authorize itself periodically.

The 2-minute duration is the worst case in the event of backoff/retry; it should be much shorter than that during the normal course of operation.
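
For illustration, a minimal sketch of that reconnect-with-backoff pattern in Go (the connect() placeholder, the 1-second initial delay, and the 2-minute cap are assumptions for the example, not the agent's actual code):

// reconnect.go: reconnect loop with capped exponential backoff (sketch).
package main

import (
	"errors"
	"log"
	"math/rand"
	"time"
)

// connect is a placeholder for dialing the websocket and blocking until it closes.
func connect() error {
	return errors.New("ConnectionExpired: Reconnect to continue")
}

func reconnectLoop() {
	backoff := time.Second        // initial retry delay (assumed)
	const maxBackoff = 2 * time.Minute // worst-case gap between reconnect attempts (assumed)

	for {
		if err := connect(); err != nil {
			log.Printf("Connection closed: %v; retrying in %v", err, backoff)
		}
		// Sleep with a little jitter, then grow the delay up to the cap.
		time.Sleep(backoff + time.Duration(rand.Int63n(int64(time.Second))))
		backoff *= 2
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}

func main() {
	reconnectLoop()
}

Capping the backoff keeps reconnect gaps bounded even after repeated failures, which is why ~2 minutes is the expected worst case rather than the normal case.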

If you're seeing sustained long disconnection periods, could you please share the full Agent logs to help us debug the root cause?

You can use the ECS Logs Collector and share the artifact with us by emailing it to aithal at amazon dot com.

Thanks,
Anirudh

@tweibley
Author

I'm happy to collect some logs for you. In the example logs I provided above, is the timeline really representative of the rate at which you expect those disconnects to occur (to force re-auth)?

@aaithal
Contributor

aaithal commented Apr 27, 2017

is the timeline really representative of the rate at which you expect those disconnects to occur (to force re-auth)?

It's hard to tell, because the following lines could also be coming from 2 different websocket connections:

2017-04-27T00:33:13Z [WARN] Connection closed for a valid reason: websocket: close 1000 (normal): ConnectionExpired: Reconnect to continue
2017-04-27T00:35:07Z [WARN] ACS Connection hasn't had any activity for too long; closing connection

The ECS Agent establishes two connections, one with the backend communication service and the other with the metrics endpoint, and both of them follow the same pattern. It's hard to tell whether the first log line is coming from the metrics connection or the communication service connection, and having the complete Agent logs helps in that respect.
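
For context, a minimal sketch of that two-channel pattern in Go (the "ACS"/"TCS" labels and the maintain() body are hypothetical; the point is only that both channels emit the same "Connection closed" message, so a line without a channel label is ambiguous):

// Two independent websocket channels, each maintained by its own goroutine (sketch).
package main

import (
	"log"
	"time"
)

func maintain(name string) {
	for {
		// Placeholder for: dial the websocket and serve it until it closes.
		time.Sleep(10 * time.Minute)
		// Tagging the log line with the channel name would disambiguate
		// which connection expired.
		log.Printf("[%s] Connection closed for a valid reason: reconnecting", name)
	}
}

func main() {
	go maintain("ACS") // backend communication service (assumed label)
	go maintain("TCS") // telemetry/metrics endpoint (assumed label)
	select {}          // block forever
}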

Thanks again for being willing to share the logs.

@aaithal
Contributor

aaithal commented May 3, 2017

Hi @tweibley, thank you for sending those logs. They were extremely helpful and provide some indications as to why you're seeing sporadic long disconnection times from the backend Communication Service.

After correlating the Agent logs for the two Container Instances with our backend wire logs, I see 3 distinct occurrences of the Agent remaining disconnected from the backend for more than 2m (2m 18s, 3m 57s and 4m 6s) on 04/27/2017. The reasons can be classified into the following categories:

  1. ECS Agent stopped/exited and was restarted by ECS Init (Contributed to the 2m 18s and 4m 6s delays): The time to start the ECS Agent container contributed to the extended disconnection time. Unfortunately, the ECS Logs Collector does not collect ECS Init logs, and hence I'm unable to postulate why the ECS Agent exited or why it took so long to start again. I'll create a PR in the ECS Logs Collector to get that rectified. If, by luck, you still have ECS Init logs from 04/27/2017, please share them with me. I'll try to repro it on my end as well.
  2. ECS Agent incorrectly assumed the websocket channel was still connected (Contributed to the 3m 57s delay): The backend Communication Service disconnected the websocket channel at 02:02:34, but the ECS Agent thought it was still connected and only discovered that the channel was inactive at 02:06:40. The ECS Agent depends on a Heartbeat message sent from the Communication Service to determine whether the channel is active, but the timeout for checking the next heartbeat is currently set to 5m. I think the ECS Agent needs to be more proactive and reduce this to a value that's under a minute (see the sketch after this list).
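
For illustration, a minimal heartbeat-watchdog sketch in Go (the heartbeats channel, the closeConn callback, and the 5-minute value are assumptions mirroring the timeout described above, not the agent's actual implementation):

// Heartbeat watchdog: close the connection if no activity arrives within the timeout (sketch).
package main

import (
	"log"
	"time"
)

func watchHeartbeats(heartbeats <-chan struct{}, closeConn func(), timeout time.Duration) {
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	for {
		select {
		case <-heartbeats:
			// Activity seen; push the deadline out again.
			if !timer.Stop() {
				<-timer.C
			}
			timer.Reset(timeout)
		case <-timer.C:
			log.Print("Connection hasn't had any activity for too long; closing connection")
			closeConn()
			return
		}
	}
}

func main() {
	hb := make(chan struct{})
	go watchHeartbeats(hb, func() { log.Print("closed") }, 5*time.Minute)
	// In the agent, hb would be signaled on each heartbeat frame from the backend.
	select {}
}

Lowering the timeout passed to the watchdog is what would shrink the window in which the agent believes a dead channel is still connected.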

I'm marking this as a bug and will update the thread when I have more details to share.

Thanks,
Anirudh

@bpuglisi

We just ran into this issue with the 1.14.2 agent. Here are both the ecs-init and ecs-agent logs from around this time.

---- ecs-init.log ----------------------------------------------------------------------------
2017-06-12T20:07:42Z [INFO] pre-start
2017-06-12T20:07:42Z [INFO] Downloading Amazon EC2 Container Service Agent
2017-06-12T20:07:42Z [DEBUG] Downloading published md5sum from https://s3.amazonaws.com/amazon-ecs-agent/ecs-agent-v1.14.2.tar.md5
2017-06-12T20:07:43Z [DEBUG] Downloading Amazon EC2 Container Service Agent from https://s3.amazonaws.com/amazon-ecs-agent/ecs-agent-v1.14.2.tar
2017-06-12T20:07:43Z [DEBUG] Temp file /var/cache/ecs/ecs-agent.tar994935565
2017-06-12T20:07:45Z [DEBUG] Expected 5e19a3cfdd8f5dfe259febac99a55a67
2017-06-12T20:07:45Z [DEBUG] Calculated 5e19a3cfdd8f5dfe259febac99a55a67
2017-06-12T20:07:45Z [DEBUG] Attempting to rename /var/cache/ecs/ecs-agent.tar994935565 to /var/cache/ecs/ecs-agent.tar
2017-06-12T20:07:45Z [INFO] Loading Amazon EC2 Container Service Agent into Docker
2017-06-12T20:07:46Z [INFO] start
2017-06-12T20:07:46Z [INFO] No existing agent container to remove.
2017-06-12T20:07:46Z [INFO] Starting Amazon EC2 Container Service Agent

---- ecs-agent.log --------------------------------------------------------------------------

2017-06-12T20:07:47Z [INFO] Starting Agent: Amazon ECS Agent - v1.14.2 (35acca3)
2017-06-12T20:07:47Z [INFO] Loading configuration
2017-06-12T20:07:47Z [INFO] Checkpointing is enabled. Attempting to load state
2017-06-12T20:07:47Z [INFO] Loading state! module="statemanager"
2017-06-12T20:07:47Z [INFO] Event stream ContainerChange start listening...
2017-06-12T20:07:47Z [INFO] Detected Docker versions [1.17 1.18 1.19 1.20 1.21 1.22 1.23]
2017-06-12T20:07:47Z [INFO] Registering Instance with ECS
2017-06-12T20:07:47Z [INFO] Registered! module="api client"
2017-06-12T20:07:47Z [INFO] Registration completed successfully. I am running as 'arn:aws:ecs:XXXXXX:XXXXXXXXX:container-instance/XXXX-XXXX-XXXX-XXXX-XXX' in cluster 'XXXXXX'
2017-06-12T20:07:47Z [INFO] Saving state! module="statemanager"
2017-06-12T20:07:47Z [INFO] Beginning Polling for updates
2017-06-12T20:07:47Z [INFO] Event stream DeregisterContainerInstance start listening...
2017-06-12T20:07:47Z [INFO] Initializing stats engine
2017-06-12T20:07:47Z [INFO] NO_PROXY set:169.254.169.254,169.254.170.2,/var/run/docker.sock
2017-06-12T20:07:57Z [INFO] Saving state! module="statemanager"
2017-06-12T20:10:01Z [INFO] Handling http request module="Handlers" method="GET" from="[::1]:52228" uri="/v1/metadata"
2017-06-12T20:15:01Z [INFO] Handling http request module="Handlers" method="GET" from="[::1]:52282" uri="/v1/metadata"
2017-06-12T20:17:24Z [INFO] Saving state! module="statemanager"
2017-06-12T20:17:24Z [INFO] Pulling container XXXX(XXX) (NONE->RUNNING) concurrently. Task: XXXX:XXX arn:aws:ecs:XXXXXX:XXXX:task/XXXX-XXXX-XXXX-XXXX-XXXX, Status: (NONE->RUNNING) Containers: [XXXX (NONE->RUNNING),]
2017-06-12T20:17:47Z [INFO] Begin building map of eligible unused images for deletion
2017-06-12T20:20:01Z [INFO] Handling http request module="Handlers" method="GET" from="[::1]:52446" uri="/v1/metadata"
2017-06-12T20:21:47Z [INFO] Connection closed for a valid reason: websocket: close 1000 (normal): ConnectionExpired: Reconnect to continue
2017-06-12T20:25:01Z [INFO] Handling http request module="Handlers" method="GET" from="[::1]:52512" uri="/v1/metadata"
2017-06-12T20:27:43Z [INFO] Connection closed for a valid reason: websocket: close 1000 (normal): ConnectionExpired: Reconnect to continue
2017-06-12T20:27:46Z [INFO] Saving state! module="statemanager"
2017-06-12T20:27:47Z [INFO] Begin building map of eligible unused images for deletion
2017-06-12T20:30:01Z [INFO] Handling http request module="Handlers" method="GET" from="[::1]:52576" uri="/v1/metadata"
2017-06-12T20:35:01Z [INFO] Handling http request module="Handlers" method="GET" from="[::1]:52642" uri="/v1/metadata"
2017-06-12T20:35:47Z [INFO] Connection closed for a valid reason: websocket: close 1000 (normal): ConnectionExpired: Reconnect to continue
2017-06-12T20:37:47Z [INFO] Begin building map of eligible unused images for deletion
2017-06-12T20:40:01Z [INFO] Handling http request module="Handlers" method="GET" from="[::1]:52710" uri="/v1/metadata"
2017-06-12T20:45:01Z [INFO] Handling http request module="Handlers" method="GET" from="[::1]:52772" uri="/v1/metadata"
2017-06-12T20:47:47Z [INFO] Begin building map of eligible unused images for deletion
2017-06-12T20:48:47Z [INFO] Connection closed for a valid reason: websocket: close 1000 (normal): ConnectionExpired: Reconnect to continue
2017-06-12T20:50:01Z [INFO] Handling http request module="Handlers" method="GET" from="[::1]:52836" uri="/v1/metadata"
2017-06-12T20:51:22Z [INFO] Connection closed for a valid reason: websocket: close 1000 (normal): ConnectionExpired: Reconnect to continue
2017-06-12T20:51:24Z [INFO] Saving state! module="statemanager"
2017-06-12T20:55:01Z [INFO] Handling http request module="Handlers" method="GET" from="[::1]:52900" uri="/v1/metadata"
2017-06-12T20:57:47Z [INFO] Begin building map of eligible unused images for deletion

@samuelkarp
Contributor

@bpuglisi Your issue may be unrelated to what @tweibley reported. v1.14.2 of the ECS agent has a bug leading to a deadlock; see #833. We'll be releasing v1.14.3 that fixes that issue, but until then we recommend using v1.14.1.

@abozhinov

Hi @samuelkarp,
I have the same problem and use:
Agent version 1.14.3
Docker version 17.03.1-ce

2017-07-07T17:06:08Z [INFO] TaskHandler, Sending task change: TaskChange: arn:aws:ecs:eu-west-1:136432914479:task/90369435-dc1a-4b5d-a21e-608d9369266e -> STOPPED, Known Sent: NONE
2017-07-07T17:06:13Z [INFO] Pulling container code(136432914479.dkr.ecr.eu-west-1.amazonaws.com/code:v3.3.1) (NONE->RUNNING) concurrently. Task: WebServerTaskDefinition:117 arn:aws:ecs:eu-west-1:136432914479:task/1f9f4b9d-3d8e-44be-bb7f-c2e9fa9169e1, Status: (NONE->RUNNING) Containers: [web-server (NONE->RUNNING),code (NONE->RUNNING),]
2017-07-07T17:06:13Z [INFO] Pulling container code(136432914479.dkr.ecr.eu-west-1.amazonaws.com/code:v3.3.1) (NONE->RUNNING) concurrently. Task: WebServerTaskDefinition:117 arn:aws:ecs:eu-west-1:136432914479:task/513a02b4-1436-494d-901b-393b3c592f62, Status: (NONE->RUNNING) Containers: [code (NONE->RUNNING),web-server (NONE->RUNNING),]
2017-07-07T17:06:13Z [INFO] Updating container reference code in Image State - sha256:916941c365a9046223cb95afc8a0ba3d75c99aa0df084d3f380f885049910dc6
2017-07-07T17:06:13Z [INFO] Saving state! module="statemanager"
2017-07-07T17:06:13Z [INFO] Finished pulling container 136432914479.dkr.ecr.eu-west-1.amazonaws.com/code:v3.3.1 in 169.284246ms. Task: WebServerTaskDefinition:117 arn:aws:ecs:eu-west-1:136432914479:task/513a02b4-1436-494d-901b-393b3c592f62, Status: (NONE->RUNNING) Containers: [code (PULLED->RUNNING),web-server (NONE->RUNNING),]
2017-07-07T17:06:13Z [INFO] Creating container module="TaskEngine" task="WebServerTaskDefinition:117 arn:aws:ecs:eu-west-1:136432914479:task/513a02b4-1436-494d-901b-393b3c592f62, Status: (NONE->RUNNING) Containers: [code (PULLED->RUNNING),web-server (NONE->RUNNING),]" container="code(136432914479.dkr.ecr.eu-west-1.amazonaws.com/code:v3.3.1) (PULLED->RUNNING)"
2017-07-07T17:06:13Z [INFO] Created container name mapping for task WebServerTaskDefinition:117 arn:aws:ecs:eu-west-1:136432914479:task/513a02b4-1436-494d-901b-393b3c592f62, Status: (NONE->RUNNING) Containers: [code (PULLED->RUNNING),web-server (NONE->RUNNING),] - code(136432914479.dkr.ecr.eu-west-1.amazonaws.com/code:v3.3.1) (PULLED->RUNNING) -> ecs-WebServerTaskDefinition-117-code-92f2f6ebceffaba62700
2017-07-07T17:06:13Z [INFO] Saving state! module="statemanager"
2017-07-07T17:06:13Z [INFO] Updating container reference code in Image State - sha256:916941c365a9046223cb95afc8a0ba3d75c99aa0df084d3f380f885049910dc6
2017-07-07T17:06:13Z [INFO] Saving state! module="statemanager"
2017-07-07T17:06:13Z [INFO] Finished pulling container 136432914479.dkr.ecr.eu-west-1.amazonaws.com/code:v3.3.1 in 320.884444ms. Task: WebServerTaskDefinition:117 arn:aws:ecs:eu-west-1:136432914479:task/1f9f4b9d-3d8e-44be-bb7f-c2e9fa9169e1, Status: (NONE->RUNNING) Containers: [web-server (NONE->RUNNING),code (NONE->RUNNING),]
2017-07-07T17:06:13Z [INFO] Creating container module="TaskEngine" task="WebServerTaskDefinition:117 arn:aws:ecs:eu-west-1:136432914479:task/1f9f4b9d-3d8e-44be-bb7f-c2e9fa9169e1, Status: (NONE->RUNNING) Containers: [web-server (NONE->RUNNING),code (PULLED->RUNNING),]" container="code(136432914479.dkr.ecr.eu-west-1.amazonaws.com/code:v3.3.1) (PULLED->RUNNING)"
2017-07-07T17:06:13Z [INFO] Created container name mapping for task WebServerTaskDefinition:117 arn:aws:ecs:eu-west-1:136432914479:task/1f9f4b9d-3d8e-44be-bb7f-c2e9fa9169e1, Status: (NONE->RUNNING) Containers: [web-server (NONE->RUNNING),code (PULLED->RUNNING),] - code(136432914479.dkr.ecr.eu-west-1.amazonaws.com/code:v3.3.1) (PULLED->RUNNING) -> ecs-WebServerTaskDefinition-117-code-88fc9081e9a9b1cd5000
2017-07-07T17:06:13Z [INFO] Saving state! module="statemanager"
2017-07-07T17:06:18Z [INFO] Saving state! module="statemanager"
2017-07-07T17:09:10Z [INFO] Connection closed for a valid reason: websocket: close 1000 (normal): ConnectionExpired: Reconnect to continue

@adnxn
Contributor

adnxn commented Jul 17, 2017

Hey @abozhinov - What do the agent logs show after the 2017-07-07T17:09:10Z [INFO] Connection closed for a valid reason entry? The agent's behavior in response to the closed websocket connection is not clear from the logs provided. Does the agent eventually reconnect? Some more information would be helpful to determine if you are seeing the same bug we're tracking in this issue.

If you are seeing extended periods of disconnection, please collect full agent logs using the ECS logs collector and email them to me at adnkha at amazon dot com. Thanks!

@richardpen

@abozhinov @bpuglisi @tweibley We have improved this in the newly released agent v1.14.4; please upgrade to the new agent version. I'm closing this for now; feel free to reopen if you run into this issue in the future.
