WARN messages when no Tasks are scheduled #506

Closed
miketheman opened this issue Aug 24, 2016 · 10 comments

@miketheman

During routine cluster management, we tend to bring up extra capacity in our cluster so that the new instances are ready to accept scheduled tasks.

We routinely see a behavior where the instance isn't running any scheduled tasks, yet emits WARN logs that look like this:

 [WARN] Error getting instance metrics: No task metrics to report

Now, I realize this may have something to do with the detection of other containers running on the instance. We run one Agent per container instance, which Task containers communicate with via host networking, similar to the approach described in the AWS Blog post.

Is the ECS Agent detecting the other running container, treating the instance as not idle, and then failing to collect task-related metrics because there are no ECS-managed tasks?

ref links:

@samuelkarp
Contributor

@miketheman We just released 1.12.1 which should have addressed a number of problems related to this. Were you seeing this with 1.12.1 or with a previous version?

@miketheman
Author

Hi @samuelkarp! Indeed, this was observed while bringing up new instances with the latest Agent version.

@samuelkarp
Contributor

@miketheman Thanks for confirming. Can you share the logs you're seeing? I tried (trivially) to reproduce this with a new instance running our 2016.03.h AMI and I'm not seeing any of those WARNs. If you're not comfortable sharing publicly, can you send them to me at skarp (at) amazon.com?

@miketheman
Author

@samuelkarp Here's an example:

2016-08-27T00:06:14Z [INFO] Creating poll dialer, host: ecs-t-1.us-east-1.amazonaws.com
2016-08-27T00:06:14Z [WARN] Error getting cpu stats, err: No data in the queue, container: &{64fe0843f0cd1c187b56cfe4fcc4a77d87eaede54396c6494702ff9370e84c30}
2016-08-27T00:06:14Z [WARN] Error getting instance metrics: No task metrics to report
2016-08-27T00:11:55Z [INFO] Creating poll dialer, host: ecs-a-1.us-east-1.amazonaws.com

Instance details:

[ec2-user@ip-something ~]$ uname -a
Linux ip-10-240-110-153 4.4.16-27.56.amzn1.x86_64 #1 SMP Fri Aug 12 23:25:10 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
[ec2-user@ip-something ~]$ cat /etc/issue
Amazon Linux AMI release 2016.03
Kernel \r on an \m
[ec2-user@ip-something ~]$ curl http://169.254.169.254/latest/meta-data/ami-id ; echo \n
ami-6bb2d67cn
[ec2-user@ip-something ~]$ rpm -qi ecs-init | grep Version
Version     : 1.12.1

@samuelkarp
Contributor

@miketheman This looks like maybe we didn't fully fix #478. Can you provide the following information? Some of this might be more sensitive, so you can either send it to me by email at skarp (at) amazon.com or open a case with AWS Support. If there is anything like credentials or auth tokens in the logs (from environment variables or command/entrypoint), please redact them. All of this should come from an instance that is currently affected:

  • Full (unfiltered, not truncated) logs for the agent (it would be awesome if they're at debug level)
  • The agent state file (located at /var/lib/ecs/data/ecs_agent_data.json)
  • The output of docker ps
  • The agent log file after running docker kill -s USR1 ecs-agent (this will emit a stack trace into the logs)

I'd like to correlate the container that shows up in the logs (64fe0843f0cd1c187b56cfe4fcc4a77d87eaede54396c6494702ff9370e84c30 in the log above) with the rest of what is happening (like the task it belongs to) and see if we can find what caused the agent to not get data (maybe the container is no longer running?).
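For readers who want to do the same correlation on their own logs, here is a minimal sketch that pulls container IDs out of the `Error getting cpu stats` WARN lines. It assumes the log format is exactly what was pasted earlier in this thread; the function name `containers_with_missing_stats` and the `SAMPLE` constant are made up for illustration and are not part of the agent.

```python
import re

# Pattern derived only from the WARN line quoted in this thread; other agent
# versions may format the line differently (assumption).
CPU_STATS_WARN = re.compile(
    r"^(?P<ts>\S+) \[WARN\] Error getting cpu stats, err: (?P<err>.*?), "
    r"container: &\{(?P<cid>[0-9a-f]+)\}$"
)

# Log excerpt copied from the comment above.
SAMPLE = """\
2016-08-27T00:06:14Z [INFO] Creating poll dialer, host: ecs-t-1.us-east-1.amazonaws.com
2016-08-27T00:06:14Z [WARN] Error getting cpu stats, err: No data in the queue, container: &{64fe0843f0cd1c187b56cfe4fcc4a77d87eaede54396c6494702ff9370e84c30}
2016-08-27T00:06:14Z [WARN] Error getting instance metrics: No task metrics to report
2016-08-27T00:11:55Z [INFO] Creating poll dialer, host: ecs-a-1.us-east-1.amazonaws.com
"""


def containers_with_missing_stats(lines):
    """Map each container ID to the timestamps of its 'cpu stats' WARN lines."""
    hits = {}
    for line in lines:
        match = CPU_STATS_WARN.match(line.strip())
        if match:
            hits.setdefault(match.group("cid"), []).append(match.group("ts"))
    return hits


if __name__ == "__main__":
    for cid, stamps in containers_with_missing_stats(SAMPLE.splitlines()).items():
        # The short ID can then be compared against the output of `docker ps`.
        print(cid[:12], stamps)
```

The resulting IDs can be matched by hand against `docker ps` and the agent state file to see which task, if any, each container belongs to.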

@miketheman
Author

@samuelkarp I have sent the logs to your amazon email address.

@samuelkarp
Contributor

@miketheman Thank you for sending all that information! I think I've narrowed this down to occurring when the agent is disconnected from and reconnects to the websocket it uses for reporting metrics. On a reconnect, it appears that the very first time it attempts to send metric data it emits this warning. I've now been able to reproduce this behavior myself, so we should be able to take it from here. Thank you for reporting this issue!

@abramche

abramche commented Dec 8, 2016

Still reproduces on 1.13.1

@samuelkarp
Contributor

@EugeneAbramchuk Thanks. We haven't fixed this yet since the only problem here is a spurious WARN message. We'll keep this issue updated as it gets fixed; alternatively, if you're looking for something to contribute, this is a fix we'd accept.

To restate what we think is going on a bit more clearly:

  1. The ECS agent opens a channel with docker stats
  2. The ECS agent opens a connection to TCS (the backend component of ECS that receives metric data)
  3. The ECS agent tries to publish metrics to TCS
  4. The ECS agent has not yet received any stats from Docker, so the queue is empty
  5. Publish aborts with the above error message
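The sequence above can be modelled in a few lines. This is only an illustrative sketch, not the agent's code (the agent is written in Go): `StatsQueue`, `publish_metrics`, `NoTaskMetricsError`, and `simulate_reconnect` are hypothetical names, and the TCS connection is not modelled at all.

```python
from collections import deque


class NoTaskMetricsError(Exception):
    """Stands in for the 'No task metrics to report' condition (hypothetical)."""


class StatsQueue:
    """Toy stand-in for the queue that the docker stats channel feeds."""

    def __init__(self):
        self._items = deque()

    def add(self, sample):
        self._items.append(sample)

    def drain(self):
        items = list(self._items)
        self._items.clear()
        return items


def publish_metrics(queue):
    """Return the samples to send, or abort if nothing has been collected."""
    samples = queue.drain()
    if not samples:
        raise NoTaskMetricsError("No task metrics to report")
    return samples


def simulate_reconnect():
    events = []
    queue = StatsQueue()  # step 1: stats channel is open, nothing received yet
    # step 2: the connection to the metrics backend would be opened here
    try:
        publish_metrics(queue)  # step 3: first publish attempt
    except NoTaskMetricsError as err:  # steps 4-5: empty queue aborts the publish
        events.append(f"[WARN] Error getting instance metrics: {err}")
    queue.add({"cpu": 1.5})  # the first stats sample arrives afterwards
    events.append(f"[INFO] published {len(publish_metrics(queue))} sample(s)")
    return events


if __name__ == "__main__":
    print("\n".join(simulate_reconnect()))
```

In this model the warning comes purely from ordering: the first publish runs before the first stats sample lands, and the next publish succeeds.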

There is no impact other than an annoying WARN message. However, there is a case where docker stats is completely broken and the stats queue is always empty. In that case we do want to raise the alarm, but to do so we need to define an SLA for docker stats. We don't have one yet, so we cannot tell whether docker stats is working as we expect.
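One possible way to tell the two cases apart, sketched purely as an assumption and not as the fix that was adopted, is to warn only after several consecutive empty publishes. The class name `EmptyQueueTracker` and the default threshold of 3 are arbitrary placeholders; choosing a real threshold is exactly the SLA question raised above.

```python
class EmptyQueueTracker:
    """Downgrade isolated empty publishes and escalate only persistent ones.

    Hypothetical helper: the threshold is an arbitrary placeholder, not a
    documented SLA for docker stats.
    """

    def __init__(self, max_consecutive_empty=3):
        self.max_consecutive_empty = max_consecutive_empty
        self.consecutive_empty = 0

    def record(self, sample_count):
        """Return the log level for this publish attempt: None, 'DEBUG' or 'WARN'."""
        if sample_count > 0:
            self.consecutive_empty = 0  # stats are flowing again
            return None
        self.consecutive_empty += 1
        if self.consecutive_empty >= self.max_consecutive_empty:
            return "WARN"  # queue has stayed empty; stats may be broken
        return "DEBUG"  # likely the transient empty queue after a reconnect


if __name__ == "__main__":
    tracker = EmptyQueueTracker(max_consecutive_empty=3)
    for count in [0, 5, 0, 0, 0]:
        print(count, tracker.record(count))
```

With this approach a single empty queue right after a reconnect would log at a lower level, while a queue that stays empty across several publish intervals would still surface as a WARN.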

@spy-tech

@samuelkarp What's the ETA on this fix?

liwenwu-amazon added a commit to liwenwu-amazon/amazon-ecs-agent that referenced this issue Dec 29, 2016
liwenwu-amazon added a commit to liwenwu-amazon/amazon-ecs-agent that referenced this issue Dec 29, 2016
liwenwu-amazon added a commit to liwenwu-amazon/amazon-ecs-agent that referenced this issue Jan 3, 2017
liwenwu-amazon added a commit to liwenwu-amazon/amazon-ecs-agent that referenced this issue Jan 4, 2017
liwenwu-amazon added a commit to liwenwu-amazon/amazon-ecs-agent that referenced this issue Jan 5, 2017
liwenwu-amazon added a commit to liwenwu-amazon/amazon-ecs-agent that referenced this issue Jan 5, 2017
liwenwu-amazon added a commit to liwenwu-amazon/amazon-ecs-agent that referenced this issue Jan 11, 2017
@samuelkarp samuelkarp added this to the 1.14.1 milestone Feb 1, 2017
jwerak pushed a commit to appuri/amazon-ecs-agent that referenced this issue Jun 8, 2017