ECS agent keeps trying to fetch stats for non existent container #478

Closed
GeyseR opened this issue Aug 11, 2016 · 20 comments

GeyseR commented Aug 11, 2016

Hi!

We have a weird problem with the ecs-agent on some of the ECS instances in our clusters.
The ECS agent keeps trying indefinitely to fetch stats for a non-existent container.
It spams the ECS log with messages like:

2016-08-11T15:35:35Z [WARN] Error retrieving stats for container a6366dc2480e5516eae7a91c2696a11e4b390e7005d9d867452a65be7e58233e: No such container: a6366dc2480e5516eae7a91c2696a11e4b390e7005d9d867452a65be7e58233e
2016-08-11T15:35:35Z [WARN] Error retrieving stats for container a6366dc2480e5516eae7a91c2696a11e4b390e7005d9d867452a65be7e58233e: No such container: a6366dc2480e5516eae7a91c2696a11e4b390e7005d9d867452a65be7e58233e
2016-08-11T15:35:35Z [WARN] Error retrieving stats for container a6366dc2480e5516eae7a91c2696a11e4b390e7005d9d867452a65be7e58233e: No such container: a6366dc2480e5516eae7a91c2696a11e4b390e7005d9d867452a65be7e58233e

(more than 754 MB an hour)
... and the Docker log with:

time="2016-08-11T15:48:09.729334949Z" level=error msg="Handler for GET /v1.17/containers/a6366dc2480e5516eae7a91c2696a11e4b390e7005d9d867452a65be7e58233e/stats returned error: No such container: a6366dc2480e5516eae7a91c2696a11e4b390e7005d9d867452a65be7e58233e" 
time="2016-08-11T15:48:09.729642368Z" level=error msg="Handler for GET /v1.17/containers/a6366dc2480e5516eae7a91c2696a11e4b390e7005d9d867452a65be7e58233e/stats returned error: No such container: a6366dc2480e5516eae7a91c2696a11e4b390e7005d9d867452a65be7e58233e" 
time="2016-08-11T15:48:09.729956876Z" level=error msg="Handler for GET /v1.17/containers/a6366dc2480e5516eae7a91c2696a11e4b390e7005d9d867452a65be7e58233e/stats returned error: No such container: a6366dc2480e5516eae7a91c2696a11e4b390e7005d9d867452a65be7e58233e" 
time="2016-08-11T15:48:09.730276370Z" level=error msg="Handler for GET /v1.17/containers/a6366dc2480e5516eae7a91c2696a11e4b390e7005d9d867452a65be7e58233e/stats returned error: No such container: a6366dc2480e5516eae7a91c2696a11e4b390e7005d9d867452a65be7e58233e" 

(close to 1 GB an hour)

The docker and agent processes also consume close to 100% CPU.

We have the 1.11.0 agent on one machine and the latest 1.11.1 agent on another (in a different cluster).

The only change we have made to the agent settings is that the ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION parameter was set to 15m.
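
For reference, this parameter lives in /etc/ecs/ecs.config on the ECS-optimized AMI; a minimal sketch of the file (the cluster name is a placeholder):

$ cat /etc/ecs/ecs.config
ECS_CLUSTER=our-cluster
ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION=15m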

This looks like a very critical issue (at least for our company).
How can we avoid it? Or will it be fixed?

@richardpen richardpen self-assigned this Aug 11, 2016
@richardpen

@GeyseR Thanks for reporting this. We're aware of the issue and are working on a fix; we'll let you know when we have an update. As a temporary workaround, you can restart the agent with sudo stop ecs and sudo start ecs to get rid of the error.
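
For clarity, a sketch of that workaround on the ECS-optimized AMI (the docker ps check is optional and assumes the agent container keeps its default name, ecs-agent):

$ sudo stop ecs
$ sudo start ecs
$ sudo docker ps --filter name=ecs-agent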


GeyseR commented Aug 11, 2016

What about downgrading the ECS agent to an older version (by using older ECS-Optimized AMIs)?
Do you know in which version this bug first appeared?

@richardpen

@GeyseR This issue exists in all agent versions. Currently, restarting the agent is the only workaround. We are working on a fix and will let you know when we have an update.

@martinrehfeld

Might this be related? We are seeing this message in the agent log:

2016-08-12T08:53:05Z [WARN] Error retrieving stats for container e278c6d145aaf164dfeec6a030e86b37f3ff30ee67f9353300d23999a114a324: io: read/write on closed pipe

The named container does exist, though.

This would not be a problem in itself, but whenever this message is logged, the agent keeps one additional socket connection open until it eventually hits the maximum of 1024 file descriptors.

This started to happen with the latest amzn-ami-2016.03.f-amazon-ecs-optimized AMI (we had amzn-ami-2016.03.c-amazon-ecs-optimized running before, and it did not show this problem).
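
In case it helps while reproducing, a rough way to watch the agent's open descriptors from the host (this assumes the agent container is named ecs-agent, the default on the ECS-optimized AMI):

$ sudo ls /proc/$(docker inspect --format '{{.State.Pid}}' ecs-agent)/fd | wc -l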

Should I open a new issue for this instead?

@richardpen

@martinrehfeld Thanks for letting us know; this has the same root cause in the agent, so you don't need to open a new issue. We are working on a fix and will let you know when we have an update.


GeyseR commented Aug 15, 2016

Hi @richardpen!

Do you have an estimated date for a fix? This has become the most critical issue for our system, because almost every deployment leads to downtime.


lpetre commented Aug 15, 2016

Is it advisable to just periodically restart ecs with a cron job until this is fixed?
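
Something like the following, maybe (just a sketch; the file path and the every-six-hours schedule are arbitrary, and it assumes the upstart job is named ecs as in the restart commands above):

$ cat /etc/cron.d/restart-ecs-agent
0 */6 * * * root /sbin/stop ecs; /sbin/start ecs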


GeyseR commented Aug 15, 2016

Yes, this might work as a temporary workaround, but yesterday I had a problem starting the ecs-agent container after stopping the ecs service. The previous container hung in the "Removal in progress" status and prevented the ecs service from starting again.
In any case, it would be better to fix this in the service itself, because as I understand it this can affect many users who may not know about such workarounds.

@richardpen

@GeyseR @lpetre @martinrehfeld We already have a pull request #482 for this. If you'd like to use a pre-built agent version on your instance before we release our AMI, please send me an email at: penyin (at) amazon.com.

Thanks,
Peng

@samuelkarp

We've just released 1.12.1, which should fix this issue. Please let us know if you continue to run into problems.


jhovell commented Aug 22, 2016

@samuelkarp Is there an ECS AMI associated with this release, or do you recommend that customers manually update the agent on each host?

@samuelkarp

The new ECS AMI is amzn-ami-2016.03.h-amazon-ecs-optimized. We'll be updating our documentation shortly.


jhovell commented Aug 22, 2016

Thanks, I'll be watching http://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-optimized_AMI.html for AMI details.

@samuelkarp

@jhovell The documentation has been updated with new AMI IDs.

@ziggythehamster

@samuelkarp Don't forget the Marketplace page :)

@samuelkarp

@ziggythehamster We're coordinating with the Marketplace team to get the listing updated there.


GeyseR commented Aug 25, 2016

Hi, AWS team.

Unfortunately, we hit the same problem on the latest ECS-Optimized AMI with ECS agent v1.12.1 installed.

It looks like this issue needs to be reopened.

@richardpen

@GeyseR It's still possible to see a few error messages in the logs, since the container stop event may be handled right after metrics are collected (every second). But the agent shouldn't keep fetching metrics for the same stopped container. Do you see the same container id in the error logs? If the same container id keeps appearing, could you send me the agent logs here or at penyin (at) amazon.com?
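
If it helps, a quick way to count the error messages per container id (assuming the default log location on the ECS-optimized AMI):

$ grep -ho 'Error retrieving stats for container [0-9a-f]*' /var/log/ecs/ecs-agent.log.* | sort | uniq -c | sort -rn | head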


GeyseR commented Aug 25, 2016

Hi @richardpen!

Unfortunately, we have exactly that problem: high CPU load over a long period of time and a lot of identical log messages (docker + ecs).

I'm not sure which logs you want to receive.
We have 5.2 GB of ECS logs in total:

$ du -h /var/log/ecs/
5.2G    /var/log/ecs/

Here is the count of lines with identical messages from one of the log files:

$ cat /var/log/ecs/ecs-agent.log.2016-08-25-17 | grep 'Error retrieving stats for container 8ef5467a191faf688374f9f33135443274e28d0827152675203c0c938dd41d40' | wc -l
1833938
$ cat /var/log/ecs/ecs-agent.log.2016-08-25-17 | wc -l
1834011

Let me know if you need more info...

@samuelkarp

@GeyseR Can you open a case with AWS Support? It sounds like there might be something going on that's specific to your setup and we'd like to dig in with you in a setting where we can discuss the specifics of your situation.

fierlion pushed a commit to fierlion/amazon-ecs-agent that referenced this issue Mar 7, 2022
* Include amazon-ecs-volume-plugin and startup scripts in Debian Package (aws#450)

* add amazon-ecs-volume-plugin to rpm generic package (aws#462)

* Fix the issue that potentially curl target not present in bucket during release

* Update copyrights

Co-authored-by: Dennis Conrad <dennis.conrad@sainsburys.co.uk>