
ECS agent slowly leaking memory and/or not returning it to the host system #2865

Closed
toraora opened this issue May 12, 2021 · 23 comments
Labels: kind/bug, kind/tracking (This issue is being tracked internally)

toraora commented May 12, 2021

Summary

Our monitoring indicates that the ECS agent's memory usage grows unbounded over time. It eventually trips certain high-memory alert thresholds that we have set up for our servers.

Description

The actual memory utilization of the services running on our ECS cluster is fairly constant. However, notice the upward trend of memory usage in the following graph:

[graph: ecs-memory-leak-3: agent memory usage trending upward while service memory stays flat]

Inspecting top on this host shows 16GB VIRT and 15.2GB RES for the agent process.

This trend exists in some, but not all, of the machines in our cluster:

[graph: ecs-memory-leak-2: per-host memory usage; the upward trend appears on some hosts but not others]

The agent is written in Go, so this may be caused by the runtime allocating heap space, garbage collecting, and then releasing the freed pages lazily (MADV_FREE), which the operating system does not reclaim until there is memory pressure: golang/go#42330

We're not sure though, and some guidance on how to pull debug information (pprof doesn't seem to be enabled) would be great. One thing we can test is setting GODEBUG=madvdontneed=1 to change the memory reclaiming behavior.
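If it comes to that, the test would look roughly like the following. This is a sketch only; it assumes ecs-init passes everything in /etc/ecs/ecs.config through to the agent container as environment variables, the same way it does for the ECS_* settings above, and that the agent runs as the ecs systemd service on the ECS-optimized AL2 AMI:

# Ask the Go runtime to release freed heap with MADV_DONTNEED instead of MADV_FREE
echo GODEBUG=madvdontneed=1 >> /etc/ecs/ecs.config
# Restart the agent so the new environment takes effect
systemctl restart ecs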

Expected Behavior

Memory usage of ecs-agent stays constant

Observed Behavior

Memory usage of ecs-agent grows over time

Environment Details

  • c5.4xlarge (16 vCPU, 32GB memory)
  • ECS-optimized Amazon Linux 2 AMI (kernel 4.14.225-169.362.amzn2.x86_64)
  • ECS agent version v1.51.0 (5c82161)
  • Docker version 19.03.13-ce

Agent configuration (set in userdata):

echo ECS_CLUSTER=${cluster_name} >> /etc/ecs/ecs.config
echo ECS_AVAILABLE_LOGGING_DRIVERS=[\"json-file\",\"syslog\",\"fluentd\"] >> /etc/ecs/ecs.config
echo ECS_DISABLE_IMAGE_CLEANUP=false >> /etc/ecs/ecs.config
echo ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION=30m >> /etc/ecs/ecs.config
echo ECS_RESERVED_MEMORY=64 >> /etc/ecs/ecs.config
echo ECS_LOGLEVEL=info >> /etc/ecs/ecs.config
echo ECS_AGENT_HEALTHCHECK_HOST=127.0.0.1 >> /etc/ecs/ecs.config

becryan commented May 15, 2021

I have been seeing extremely high CPU usage (> 95%), which never drops, on one build compared to others in our system. The only difference between the two systems is the ECS agent version (the one at 95% is running this version). This may explain things!


sparrc commented May 17, 2021

@toraora have you been able to test with setting GODEBUG=madvdontneed=1?

We haven't upgraded to golang 1.16 yet (which turns this setting on by default), but in some initial testing we did find the reported memory usage to be significantly lower.


toraora commented May 19, 2021

We have not; we deployed a cron that restarts the agent every day to avoid runaway memory usage.
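The workaround is roughly this (a sketch; the schedule is arbitrary and the ecs service name assumes the ECS-optimized Amazon Linux 2 AMI):

# /etc/cron.d/restart-ecs-agent: restart the agent daily to cap its memory growth
0 4 * * * root /usr/bin/systemctl restart ecs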

This looks like it could be caused by BoltDB and its page caching behavior: boltdb/bolt#253

Similar symptoms have been noticed and investigated in LXD: https://github.com/lxc/lxd/issues/5584


sparrc commented May 21, 2021

The boltdb issue you linked explicitly states that this is expected behavior: boltdb/bolt#253 (comment). Not sure if it is exactly that, but it seems likely.

We can try to repro and see if setting GODEBUG=madvdontneed=1 or using golang 1.16 solves the issue.

@toraora Are you actually seeing any OOM errors because of this? My understanding is that linux should have no issues reallocating this memory.


toraora commented May 21, 2021

I don't think this has caused OOMs, but we'll keep an eye out. We have generic high memory usage alerts across our fleet of servers that we may want to re-evaluate; this is actually stated as one of the reasons for the change in Go 1.16:

This generally leads to poor user experience, like confusing stats in top and other monitoring tools; and bad integration with management systems that respond to memory usage.


sparrc commented May 24, 2021

That makes sense. Our next agent release will hopefully be built with golang 1.16.4; we'll let you know once it's ready.

mssrivas (Contributor) commented:

#2993 is being worked on to potentially deal with this issue.

mssrivas added the kind/tracking (This issue is being tracked internally) label on Aug 24, 2021
angelcar (Contributor) commented:

Following up here. I added pprof and logged go runtime stats to see what the memory was doing (see #3001).

I then started a bunch of tasks running nginx and let the agent run for 24 hours; however, I did not see clear signs of a memory leak. Both the number of goroutines and the heap memory in use remained essentially flat throughout the test.

May I ask how you currently measure the memory consumed by the ECS agent?
Also, could you broadly describe what your hosts are doing on a typical day? (i.e. how many tasks are started/stopped, whether task definitions are large, which log driver you use, and which network mode?)
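Once you are on an agent build with pprof enabled, the same counters can be sampled directly; a minimal sketch, assuming the pprof handlers are served on the existing introspection port:

# Current goroutine count (first line of the text goroutine profile)
curl -s "http://localhost:51678/debug/pprof/goroutine?debug=1" | head -n 1
# Heap in use, from the MemStats block appended to the text heap profile
curl -s "http://localhost:51678/debug/pprof/heap?debug=1" | grep HeapInuse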

Thanks!


toraora commented Sep 1, 2021

Hey,

Memory usage is measured via top, e.g.

26448 root 20 0 21.6g 20.9g 10524 S 0.0 68.3 9852:57 agent
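To watch the trend over time we just sample the resident set size of the process, roughly like this (a sketch; it assumes the agent shows up as a process named agent on the host, as in the top line above):

# Log the agent's RSS (in KiB) once a minute
while true; do
  printf '%s ' "$(date +%FT%T)"
  ps -o rss= -C agent
  sleep 60
done >> /var/log/ecs-agent-rss.log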

  • a rough ballpark for the number of started tasks per host per day is probably about 100
  • each task definition consists of just two containers (the application itself and an Envoy sidecar) and a fixed set of about 20 environment variables
  • we are using the fluentd log driver, but the agent itself just logs to /var/log
  • network mode is mostly host, but we are experimenting with moving services to bridged networking


stewartcampbell commented Oct 18, 2021

We are seeing something similar, although our ECS instance runs about 5 short-running tasks every minute. As you can see below, the agent's memory usage increases over time and the agent needs to be restarted to recover the memory.

An additional point which may or may not make a difference is that each task connects to two EFS access points.

[graph: agent memory usage climbing steadily and dropping back only after the agent is restarted]

angelcar (Contributor) commented:

Hi,
To further assist, could you set ECS_ENABLE_RUNTIME_STATS to true and restart the agent? Then, after the agent has been running for a while, get a heap profile by executing: curl http://localhost:51678/debug/pprof/heap > heap.pprof

Please email heap.pprof file (generated with the command above), and /var/log/ecs/runtime-stats.log to ecs-agent-external@amazon.com.

note: the ECS_ENABLE_RUNTIME_STATS configuration option was added in agent version 1.55.2
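End to end, the collection steps look roughly like this (paths and the restart command assume the ECS-optimized Amazon Linux 2 AMI):

# Enable runtime stats (requires agent >= 1.55.2) and restart the agent
echo ECS_ENABLE_RUNTIME_STATS=true >> /etc/ecs/ecs.config
systemctl restart ecs

# After the agent has been running for a while, capture a heap profile
curl http://localhost:51678/debug/pprof/heap > heap.pprof

# Email heap.pprof and /var/log/ecs/runtime-stats.log to ecs-agent-external@amazon.com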


stewartcampbell commented Oct 19, 2021

I have sent the .pprof file as requested. There is no runtime-stats.log file on the server. Thanks.

angelcar (Contributor) commented:

Hi @stewartcampbell, I reviewed the memory profile you sent, but nothing stood out. Maybe the memory profile was taken too soon?

In your graph above, I see that the memory peaks in about 3 days' time. Would you be able to get another memory profile after the agent has been running for 2 or 3 days?

Also, could you confirm which version of the agent you're running, as well as the OS? It's odd that the /var/log/ecs/runtime-stats.log file is not there; if ECS_ENABLE_RUNTIME_STATS is true, it should be.


stewartcampbell commented Oct 19, 2021

Yep, sure, I'll send it again on Thursday. Agent version is 1.55.5. Thanks.

stewartcampbell commented:

Hi, I've sent the email again. As you can see below, stats including IO and CPU are still creeping up. The server's usage pattern over that time has not changed.

[graph: IO and CPU stats continuing to creep up over the same period]

Thanks.

angelcar (Contributor) commented:

Ack. Will take a deeper look, thank you!

angelcar (Contributor) commented:

The issue has been identified and fixed in #3069. I will post an update and close this issue once the changes are released.

stewartcampbell commented:

That's great news, thanks!


fenxiong commented Nov 8, 2021

The fix has been released in ECS agent version 1.57.0. Please upgrade to that version to resolve the issue.

Closing the issue. Feel free to let us know if the issue is not resolved after the upgrade.
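For instances built from the ECS-optimized Amazon Linux 2 AMI, the in-place upgrade is roughly the following (a sketch; check the agent update documentation for the method that matches your AMI, or update the agent through the ECS console/UpdateContainerAgent API):

# Update the agent package and restart the ecs service (ECS-optimized AL2 AMI)
yum update -y ecs-init
systemctl restart ecs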

fenxiong closed this as completed on Nov 8, 2021

ghost commented Sep 6, 2022

@angelcar does this issue also happen on the Amazon ECS-Optimized Amazon Linux AMI?
We are currently using the Amazon ECS-Optimized Amazon Linux AMI, whose latest ecs-agent is 1.51.0 and will not be updated in newer AMIs:

  • Amazon ECS-Optimized Amazon Linux AMI --> latest ami ecs-agent 1.51.0
  • Amazon ECS-Optimized Amazon Linux 2 AMI --> latest ami ecs-agent 1.62.2

reference:
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-agent-versions.html


sparrc commented Sep 6, 2022

Hi @fr-herrteuer, this issue could still happen on the latest Amazon Linux AMI (AL1). It is fixed in the latest AL2 AMI.


fcirone commented Jan 24, 2023

We are having a similar issue with ECS Fargate platform version 1.4.0.
Checking here: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/platform-linux-fargate.html
the last update was "November 5, 2020".

Could the same problem affect Fargate containers?

chienhanlin (Contributor) commented:

Hello @fcirone, ECS Fargate platform version 1.4.0 is managed by the Fargate agent, which is different from the ECS agent on EC2. To help the ECS team investigate the issue, please share more info and data by sending them to ecs-agent-external@amazon.com. Thank you.
