ecs-agent container exiting or crashing or stopped #730
Comments
I have a restart policy of |
@jamesongithub |
This crash seems to have the same root cause as #707 |
@liwenwu-amazon I don't have traces, and I'm unable to repro consistently. Those logs are all I have. Curious why you think #707 is the same issue. |
@jamesongithub From the logs you provided, I didn't see any abnormal messages. The connection is false because the agent periodically disconnects and reconnects if there is no activity for a long time. How long does the connection stay false? The agent should connect back in less than 2 minutes; if the connection stays false for a long time, then there may be a problem, and in that case please send me the ecs-agent logs. Also, could you tell me how this behavior has impacted your production? Thanks, |
Hi @richardpen. After it becomes false, I have never seen it connect again unless we manually go in and restart the agent. Why does the agent disconnect when there is no activity? What happens when a task needs to be spawned on that host during the 2 minutes? All the logs I have are attached. After the agent disconnected, there were no more logs. This of course impacts us. The host basically doesn't accept new tasks. |
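For readers who want to spot this state without logging in to each host, here is a minimal sketch. It assumes the "connection is false" discussed above refers to the `agentConnected` field that the ECS `DescribeContainerInstances` API returns for each container instance. The function name and the sample data are hypothetical and exist only for illustration.

```python
from typing import Any, Dict, List


def disconnected_instances(response: Dict[str, Any]) -> List[str]:
    """Return ARNs of container instances whose agent is reported as disconnected.

    `response` is expected to be shaped like a DescribeContainerInstances
    result: a dict with a "containerInstances" list.
    """
    return [
        instance["containerInstanceArn"]
        for instance in response.get("containerInstances", [])
        # Treat a missing field as "not connected" so it gets flagged.
        if not instance.get("agentConnected", False)
    ]


# Hand-written sample data (not taken from the thread).
sample = {
    "containerInstances": [
        {"containerInstanceArn": "arn:example:instance/a", "agentConnected": True},
        {"containerInstanceArn": "arn:example:instance/b", "agentConnected": False},
    ]
}

print(disconnected_instances(sample))
```

In practice the dict would come from an SDK call that describes the cluster's container instances. Flagging an instance only after it has stayed disconnected for longer than the roughly 2 minute reconnect window mentioned above would avoid false alarms.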
@jamesongithub I'm sorry that you're still experiencing trouble. Do you have output from the ecs-init logs? Can you also send the output of docker info? If you do not feel comfortable sharing those logs on this github issue you can e-mail them to me at abbrcale at amazon.com |
docker info
We don't have any ecs-init logs, since we're not using the AWS ECS-optimized AMIs. Any thoughts on my questions from the last post? It seems strange that agents disconnecting just because there is no activity is regular behavior. |
Hi @jamesongithub, to answer some of the questions you asked in this thread:
The agent establishes a long-running websocket connection with the ECS backend so that the backend communication service can send it state changes to apply. Since connection authorization and authentication happen at connection establishment, we want to force the agent to re-authenticate and re-authorize itself periodically. The As far as determining why the Agent has been disconnected for an extended duration of time, I did take a look at the logs that you sent to @richardpen and couldn't find anything that was obviously wrong with the Agent. I'm sorry to ask this of you again, but when you see this issue resurface, could you please make the Agent emit a stack trace by running Thanks, |
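The disconnect-and-reconnect cycle described in this comment can be sketched as a loop that re-authenticates each time the server closes the session. This is only an illustration of the general pattern, not the agent's actual code: `keep_connected`, `backoff_delays`, the callables passed in, and the delay values are all assumptions made for the example.

```python
import time
from typing import Any, Callable, Iterator, Optional


def backoff_delays(base: float = 0.25, cap: float = 120.0,
                   factor: float = 2.0) -> Iterator[float]:
    """Yield exponentially growing delays, never exceeding `cap` seconds."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


def keep_connected(connect: Callable[[], Any],
                   serve: Callable[[Any], None],
                   sleep: Callable[[float], None] = time.sleep,
                   max_sessions: Optional[int] = None) -> int:
    """Reconnect every time a session ends; return the number of sessions served.

    connect() authenticates and returns a session, raising ConnectionError on
    failure. serve(session) blocks until the connection is closed, for example
    by a server-side idle timeout.
    """
    sessions = 0
    delays = backoff_delays()
    while max_sessions is None or sessions < max_sessions:
        try:
            session = connect()  # credentials are checked again here
        except ConnectionError:
            sleep(next(delays))  # wait longer after each consecutive failure
            continue
        delays = backoff_delays()  # a successful connect resets the backoff
        serve(session)
        sessions += 1
    return sessions
```

The `sleep` and `max_sessions` parameters exist so the loop can be exercised with fakes. A production loop would normally also add random jitter to the delays, which is omitted here to keep the sketch deterministic.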
@jamesongithub Just checking in; are you still experiencing problems? If so, were you able to run |
Hi, I haven't seen it recently. However, we're running the ecs-agent container with a |
Interesting, does that mean that the agent was crashing/exiting instead of just remaining disconnected? |
Probably. I've seen the latter, but it was a few versions ago. |
@jamesongithub Thanks. I'm going to close this issue for now since you're no longer seeing problems, though if you do experience problems again please let us know. In order for us to root-cause and understand what's going on, we'll need the detailed logging/debugging information that @aaithal asked for. |
@samuelkarp @aaithal I just hit this again. The agent is running, but disconnected. Can we reopen? I've sent logs to @aaithal and @cabbruzzese. If you give me your email, @samuelkarp, I can send them to you as well. |
Yep, I'll reopen. I can ask @aaithal and @cabbruzzese for the logs. |
@jamesongithub Can you send them directly to me at skarp at amazon.com? |
@samuelkarp done. thank you. |
@jamesongithub I haven't received the logs. Can you try maybe sending from a different email address? |
@jamesongithub Can you try sending the logs again? |
hi @samuelkarp, just sent over 2 emails. 1 attachment is ~24MB. the other ~ 10MB. haven't received any rejection emails. Is there a maximum size on your end? |
@jamesongithub Could be...I only received the email with the ~10 MB attachment. |
ok i broke them up and resent. thanks! |
Hi @jamesongithub, Thank you for sending the logs. I've spent some time over the past few days looking at them and wanted to update you with what I found and what is still remaining to be determined. From your logs, the following line (repeated many times) stood out to me:
From this, I've been able to determine that you're affected by one known bug, but I also believe that there are potentially two other bugs affecting you. The known bug involves extended retries for requests not being properly re-signed; this is fixed by #786 and will be released with version 1.14.2 of the agent. The first potential bug involves why some requests relating to the specific task above failed and got into extended retries. Along with the retries that I was seeing, I also saw some other network-related failures in the logs you sent. I've been looking for the requests I expect to see in the ECS service logs and haven't been able to find them so far (though I'm not done looking yet). Are you running the ECS agent behind an HTTP proxy? If so, do you know if the proxy experienced some sort of failure around 2017-05-21T00:15:05Z? The second potential bug involves restoring after a restart. What I was able to see from the ECS service logs led me to believe that the ECS agent may have restarted around 2017-05-24T02:03:53Z. Unfortunately, that time period is missing from the logs that you sent me. If the agent did restart at 2017-05-24T02:03:53Z and then fail to reconnect, that may indicate a different codepath is at fault. In order to get more information on this, I'd like to look at the ECS agent's state file, which is normally located at Sam |
Hi @samuelkarp, Thanks for the analysis. Yes, we had a network issue around 2017-05-21T00:15:05Z. Basically our NAT failed, which prevented the agent from connecting to ECS. During this time, it probably went into extended retries. After the network issue was fixed, the agent still wasn't able to join the cluster. For the 2nd issue, I restarted the agent to see if it would help, but once it gets into that state, it seems to never recover. I'm not sure when the restart was. I think it was the 22nd or 23rd. I'm sending you the agent state file by email. Thanks |
@jamesongithub We have released 1.14.3, which includes the fix from #786. Can you upgrade to the latest version and see if the problem still exists? Thanks, |
Yes, we've been running it for a few weeks now. Haven't seen it, although we haven't had any network issues yet. |
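To illustrate why a request that is retried for a long time may need to be signed again, here is a small sketch. It assumes the known bug above is about signatures going stale across retries, and it uses a deliberately simplified timestamped HMAC scheme rather than the signing process AWS actually uses. Every name, header, and the 300 second validity window is hypothetical.

```python
import hashlib
import hmac
from typing import Callable, Dict

MAX_SKEW_SECONDS = 300  # hypothetical validity window for a signature


def sign(secret: bytes, body: bytes, timestamp: int) -> Dict[str, str]:
    """Sign the body together with the time at which it was signed."""
    mac = hmac.new(secret, f"{timestamp}:".encode() + body, hashlib.sha256)
    return {"X-Timestamp": str(timestamp), "X-Signature": mac.hexdigest()}


def verify(secret: bytes, body: bytes, headers: Dict[str, str], now: int) -> bool:
    """Server side: reject signatures that are too old or do not match."""
    timestamp = int(headers["X-Timestamp"])
    if abs(now - timestamp) > MAX_SKEW_SECONDS:
        return False
    expected = sign(secret, body, timestamp)["X-Signature"]
    return hmac.compare_digest(expected, headers["X-Signature"])


def send_with_retries(send: Callable[[bytes, Dict[str, str]], bool],
                      secret: bytes, body: bytes,
                      clock: Callable[[], int], attempts: int) -> bool:
    """Retry a request, computing a fresh signature for every attempt."""
    for _ in range(attempts):
        # Signing inside the loop keeps the timestamp current. Signing once
        # before the loop would make late retries fail verification even
        # after the network has recovered.
        headers = sign(secret, body, clock())
        if send(body, headers):
            return True
    return False
```

With this scheme, a signature created at time 0 no longer verifies at time 800, so a client that reused its original headers would keep failing after an outage longer than the validity window, while a client that re-signs on each attempt succeeds as soon as connectivity returns.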
@jamesongithub I'm closing this for now, feel free to reopen it if you run into this issue in the future. |
Just upgraded ecs-agent to 1.14 and docker to 1.13.1 and started seeing the ecs-agent container randomly disconnecting from the ECS cluster.
ecs-agent logs
audit log
docker log