ECS agent slowly leaking memory and/or not returning it to the host system #2865
Comments
I have been seeing extremely high CPU usage (> 95%) that does not drop in one build compared to others in our system. The only difference between the two systems is the version of the ECS agent (with the 95% one being this version). This may explain things!
@toraora have you been able to test with setting `GODEBUG=madvdontneed=1`? We haven't upgraded to golang 1.16 yet (which turns this setting on by default), but in some initial testing we did find the reported memory usage to be significantly lower.
We have not; we deployed a cron that restarts the agent every day to avoid runaway memory usage. This looks like it could be caused by BoltDB and its page caching behavior: boltdb/bolt#253. Similar symptoms have been noticed and investigated in LXD: https://github.com/lxc/lxd/issues/5584
The boltdb issue you linked explicitly states that this is expected behavior: boltdb/bolt#253 (comment). Not sure if it is that exactly, but it seems likely. We can try to repro and see if setting `GODEBUG=madvdontneed=1` helps. @toraora Are you actually seeing any OOM errors because of this? My understanding is that Linux should have no issues reallocating this memory.
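For anyone who wants to check this locally, here is a minimal sketch (generic Go, not agent code) that logs the `runtime.MemStats` fields relevant to this discussion. Under MADV_FREE, heap counted in `HeapIdle` but not yet in `HeapReleased` can stay mapped and keep showing up in RSS until the kernel is under memory pressure:

```go
// Minimal sketch: periodically log heap that is in use, heap that is idle but
// still mapped, and heap already returned to the OS. A large HeapIdle with a
// small HeapReleased means the runtime is holding freed memory that still
// counts toward RSS.
package main

import (
	"log"
	"runtime"
	"time"
)

func main() {
	for range time.Tick(30 * time.Second) {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		log.Printf("HeapInuse=%dMiB HeapIdle=%dMiB HeapReleased=%dMiB Sys=%dMiB",
			m.HeapInuse>>20, m.HeapIdle>>20, m.HeapReleased>>20, m.Sys>>20)
	}
}
```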
I don't think this has caused OOMs, but we'll keep an eye out. We have generic high memory usage alerts across our fleet of servers that we may want to re-evaluate; this is actually stated as one of the reasons for the change in Go 1.16:
> On Linux, the runtime now defaults to releasing memory to the operating system promptly (using MADV_DONTNEED), rather than lazily when the operating system is under memory pressure (using MADV_FREE). This means process-level memory statistics like RSS will more accurately reflect the amount of physical memory being used by Go processes.
That makes sense; our next agent release will hopefully be built with golang 1.16.4. We'll let you know once it's ready.
#2993 is being worked on to potentially deal with this issue.
Following up here. I added pprof and logged Go runtime stats to see what the memory was doing (see #3001). I then started a bunch of tasks with nginx running and let the agent run for 24 hrs; however, I did not see clear signals of a memory leak. Both the number of goroutines and the heap memory in use remained pretty much flat throughout the test. May I ask how you measure the memory consumed by the ECS agent right now? Thanks!
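For reference, exposing pprof in a long-running Go daemon is typically a blank import plus a loopback-only HTTP listener. This is a minimal generic sketch, not necessarily how #3001 wires it into the agent:

```go
// Minimal sketch: expose the standard /debug/pprof/* endpoints on loopback only.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	go func() {
		// Bind to loopback only so the profiling endpoint is not reachable from outside the host.
		log.Println(http.ListenAndServe("127.0.0.1:6060", nil))
	}()
	select {} // stand-in for the daemon's real work
}
```

A heap snapshot can then be pulled with `go tool pprof http://127.0.0.1:6060/debug/pprof/heap`.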
Hey,
Memory usage is measured via
We are seeing something similar, although our ECS instance has about 5 short-running tasks executing every minute. As you can see below, the agent memory usage increases over time and needs to be reset to recover the memory. An additional point, which may or may not make a difference, is that each task connects to two EFS access points.
Hi, please email us the agent's memory profile (.pprof file) and the runtime-stats.log file.
I have sent the .pprof file as requested. There is no runtime-stats.log file on the server. Thanks.
Hi @stewartcampbell, I reviewed the memory profile you sent, but nothing stood out. Maybe the memory profile was taken too soon? In your graph above, I see that the memory peaks in about 3 days' time. Would you be able to get another memory profile after the agent has been running for 2 or 3 days? Also, could you confirm what version of the agent you're on, as well as the OS? It's odd that the runtime-stats.log file is missing.
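If fetching a profile on demand after several days of uptime is inconvenient, one option (a hedged sketch, not something the agent does out of the box; the path and file names are illustrative) is to have the process write heap profiles to disk on a timer using `runtime/pprof`:

```go
// Minimal sketch: dump a heap profile every 24h so a profile from day 2 or 3
// of uptime is already on disk when it is needed.
package main

import (
	"fmt"
	"log"
	"os"
	"runtime"
	"runtime/pprof"
	"time"
)

func main() {
	go func() {
		for range time.Tick(24 * time.Hour) {
			runtime.GC() // refresh allocation data before snapshotting
			name := fmt.Sprintf("/tmp/heap-%s.pprof", time.Now().Format("20060102-1504"))
			f, err := os.Create(name)
			if err != nil {
				log.Printf("create %s: %v", name, err)
				continue
			}
			if err := pprof.WriteHeapProfile(f); err != nil {
				log.Printf("write heap profile: %v", err)
			}
			f.Close()
		}
	}()
	select {} // stand-in for the daemon's real work
}
```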
Yep, sure, I'll send it again on Thursday. Agent version is 1.55.5. Thanks.
Ack. Will take a deeper look, thank you!
The issue has been identified and fixed in #3069. I will post an update and close this issue once the changes are released.
That's great news, thanks!
The fix has been released in ECS agent version 1.57.0. Please upgrade to that version to resolve the issue. Closing the issue; feel free to let us know if the issue is not resolved after the upgrade.
@angelcar does this issue also happen in the Amazon ECS-Optimized Amazon Linux AMI?
Reference:
Hi @fr-herrteuer, this issue could still happen in the latest Amazon Linux AMI (AL1). It will be fixed in the latest AL2 AMI.
We are having a similar issue with ECS Fargate Platform 1.4.0. Could the same problem affect Fargate containers?
Hello @fcirone, ECS Fargate Platform version 1.4.0 is managed by the Fargate agent, which is different from the ECS agent on EC2. To help ECS investigate the issue, please share more info and data by sending them to ecs-agent-external@amazon.com. Thank you.
Summary
Our monitoring indicates that the ECS agent's memory usage grows unbounded over time. It eventually trips certain high-memory alert thresholds that we have set up for our servers.
Description
The actual memory utilization of the services running on our ECS cluster is fairly constant. However, notice the upward trend of memory usage in the following graph:
Inspecting `top` on this host shows 16GB VIRT and 15.2GB RES for the agent process. This trend exists in some, but not all, of the machines in our cluster:
The agent is written in Go, so this may be caused by the runtime allocating heap space, garbage collecting, and then the operating system not reclaiming the memory: golang/go#42330
We're not sure though, and some guidance on how to pull debug information (pprof doesn't seem to be enabled) would be great. One thing we can test is setting `GODEBUG=madvdontneed=1` to change the memory-reclaiming behavior.
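A related experiment (a hypothetical sketch, not part of the agent) is to force the runtime to hand freed pages back to the OS on demand with `debug.FreeOSMemory()`; if RSS drops sharply afterwards, the growth is retained-but-free heap rather than a true leak:

```go
// Minimal sketch: return unused heap to the OS when the process receives SIGUSR1.
package main

import (
	"log"
	"os"
	"os/signal"
	"runtime/debug"
	"syscall"
)

func main() {
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGUSR1)
	go func() {
		for range sig {
			debug.FreeOSMemory() // forces a GC and returns as much unused memory to the OS as possible
			log.Println("returned unused heap to the OS")
		}
	}()
	select {} // stand-in for the daemon's real work
}
```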
Expected Behavior
Memory usage of ecs-agent stays constant
Observed Behavior
Memory usage of ecs-agent grows over time
Environment Details
Agent configuration (set in userdata):