
ECS agent slowly leaking memory and/or not returning it to the host system #2865

Closed
toraora opened this issue May 12, 2021 · 23 comments
Labels: kind/bug, kind/tracking (This issue is being tracked internally)

toraora commented May 12, 2021

Summary

Our monitoring indicates that the ECS agent's memory usage grows unbounded over time. It eventually trips certain high-memory alert thresholds that we have set up for our servers.

Description

The actual memory utilization of the services running on our ECS cluster is fairly constant. However, notice the upward trend of memory usage in the following graph:

[graph: ecs-memory-leak-3: agent memory usage trending upward while service memory stays flat]

Inspecting top on this host shows 16GB VIRT and 15.2GB RES for the agent process.

This trend exists in some, but not all, of the machines in our cluster:

[graph: ecs-memory-leak-2: per-host memory usage; the upward trend appears on some hosts but not others]

The agent is written in Go, so this may be caused by the runtime allocating heap space, garbage collecting, and then releasing the freed pages lazily (MADV_FREE), which the operating system does not reclaim until there is memory pressure: golang/go#42330

We're not sure though, and some guidance on how to pull debug information (pprof doesn't seem to be enabled) would be great. One thing we can test is setting GODEBUG=madvdontneed=1 to change the memory reclaiming behavior.
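If it comes to that, the test would look roughly like the following. This is a sketch only; it assumes ecs-init passes everything in /etc/ecs/ecs.config through to the agent container as environment variables, the same way it does for the ECS_* settings above, and that the agent runs as the ecs systemd service on the ECS-optimized AL2 AMI:

# Ask the Go runtime to release freed heap with MADV_DONTNEED instead of MADV_FREE
echo GODEBUG=madvdontneed=1 >> /etc/ecs/ecs.config
# Restart the agent so the new environment takes effect
systemctl restart ecs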

Expected Behavior

Memory usage of ecs-agent stays constant

Observed Behavior

Memory usage of ecs-agent grows over time

Environment Details

  • c5.4xlarge (16 vCPU, 32GB memory)
  • ECS-optimized Amazon Linux 2 AMI (kernel 4.14.225-169.362.amzn2.x86_64)
  • ECS agent version v1.51.0 (5c82161)
  • Docker version 19.03.13-ce

Agent configuration (set in userdata):

echo ECS_CLUSTER=${cluster_name} >> /etc/ecs/ecs.config
echo ECS_AVAILABLE_LOGGING_DRIVERS=[\"json-file\",\"syslog\",\"fluentd\"] >> /etc/ecs/ecs.config
echo ECS_DISABLE_IMAGE_CLEANUP=false >> /etc/ecs/ecs.config
echo ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION=30m >> /etc/ecs/ecs.config
echo ECS_RESERVED_MEMORY=64 >> /etc/ecs/ecs.config
echo ECS_LOGLEVEL=info >> /etc/ecs/ecs.config
echo ECS_AGENT_HEALTHCHECK_HOST=127.0.0.1 >> /etc/ecs/ecs.config

becryan commented May 15, 2021

I have been seeing extremely high CPU usage (> 95%), which never drops, on one build compared to others in our system. The only difference between the two systems is the ECS agent version (the one at 95% is running this version). This may explain things!


sparrc commented May 17, 2021

@toraora have you been able to test with setting GODEBUG=madvdontneed=1?

We haven't upgraded to golang 1.16 yet (which turns this setting on by default), but in some initial testing we did find the reported memory usage to be significantly lower.


toraora commented May 19, 2021

We have not; we deployed a cron that restarts the agent every day to avoid runaway memory usage.
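The workaround is roughly this (a sketch; the schedule is arbitrary and the ecs service name assumes the ECS-optimized Amazon Linux 2 AMI):

# /etc/cron.d/restart-ecs-agent: restart the agent daily to cap its memory growth
0 4 * * * root /usr/bin/systemctl restart ecs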

This looks like it could be caused by BoltDB and its page caching behavior: boltdb/bolt#253

Similar symptoms have been noticed and investigated in LXD: https://github.com/lxc/lxd/issues/5584


sparrc commented May 21, 2021

The boltdb issue you linked explicitly states that this is expected behavior: boltdb/bolt#253 (comment). Not sure if it is exactly that, but it seems likely.

We can try to repro and see if setting GODEBUG=madvdontneed=1 or using golang 1.16 solves the issue.

@toraora Are you actually seeing any OOM errors because of this? My understanding is that linux should have no issues reallocating this memory.


toraora commented May 21, 2021

I don't think this has caused OOMs, but we'll keep an eye out. We have generic high memory usage alerts across our fleet of servers that we may want to re-evaluate; this is actually stated as one of the reasons for the change in Go 1.16:

This generally leads to poor user experience, like confusing stats in top and other monitoring tools; and bad integration with management systems that respond to memory usage.


sparrc commented May 24, 2021

That makes sense. Our next agent release will hopefully be built with golang 1.16.4; we'll let you know once it's ready.

mssrivas (Contributor) commented:

#2993 is being worked on to potentially deal with this issue.

mssrivas added the kind/tracking (This issue is being tracked internally) label on Aug 24, 2021
angelcar (Contributor) commented:

Following up here. I added pprof and logged go runtime stats to see what the memory was doing (see #3001).

I then started a bunch of tasks running nginx and let the agent run for 24 hours; however, I did not see clear signs of a memory leak. Both the number of goroutines and the heap memory in use remained essentially flat throughout the test.

May I ask how you currently measure the memory consumed by the ECS agent?
Also, could you broadly describe what your hosts are doing on a typical day? (i.e. how many tasks are started/stopped, whether task definitions are large, which log driver you use, and which network mode?)
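Once you are on an agent build with pprof enabled, the same counters can be sampled directly; a minimal sketch, assuming the pprof handlers are served on the existing introspection port:

# Current goroutine count (first line of the text goroutine profile)
curl -s "http://localhost:51678/debug/pprof/goroutine?debug=1" | head -n 1
# Heap in use, from the MemStats block appended to the text heap profile
curl -s "http://localhost:51678/debug/pprof/heap?debug=1" | grep HeapInuse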

Thanks!


toraora commented Sep 1, 2021

Hey,

Memory usage is measured via top, e.g.

26448 root 20 0 21.6g 20.9g 10524 S 0.0 68.3 9852:57 agent
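To watch the trend over time we just sample the resident set size of the process, roughly like this (a sketch; it assumes the agent shows up as a process named agent on the host, as in the top line above):

# Log the agent's RSS (in KiB) once a minute
while true; do
  printf '%s ' "$(date +%FT%T)"
  ps -o rss= -C agent
  sleep 60
done >> /var/log/ecs-agent-rss.log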

  • a rough ballpark for the number of started tasks per host per day is probably about 100
  • each task definition consists of just two containers (the application itself and an Envoy sidecar) and a fixed set of about 20 environment variables
  • we are using the fluentd log driver, but the agent itself just logs to /var/log
  • network mode is mostly host, but we are experimenting with moving services to bridged networking


stewartcampbell commented Oct 18, 2021

We are seeing something similar, although our ECS instance runs about 5 short-running tasks every minute. As you can see below, the agent's memory usage increases over time and the agent needs to be restarted to recover the memory.

An additional point which may or may not make a difference is that each task connects to two EFS access points.

[graph: agent memory usage climbing steadily and dropping back only after the agent is restarted]

angelcar (Contributor) commented:

Hi,
To further assist, could you set ECS_ENABLE_RUNTIME_STATS to true and restart the agent? Then, after the agent has been running for a while, get a heap profile by executing: curl http://localhost:51678/debug/pprof/heap > heap.pprof

Please email heap.pprof file (generated with the command above), and /var/log/ecs/runtime-stats.log to ecs-agent-external@amazon.com.

note: the ECS_ENABLE_RUNTIME_STATS configuration option was added in agent version 1.55.2
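End to end, the collection steps look roughly like this (paths and the restart command assume the ECS-optimized Amazon Linux 2 AMI):

# Enable runtime stats (requires agent >= 1.55.2) and restart the agent
echo ECS_ENABLE_RUNTIME_STATS=true >> /etc/ecs/ecs.config
systemctl restart ecs

# After the agent has been running for a while, capture a heap profile
curl http://localhost:51678/debug/pprof/heap > heap.pprof

# Email heap.pprof and /var/log/ecs/runtime-stats.log to ecs-agent-external@amazon.com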


stewartcampbell commented Oct 19, 2021

I have sent the .pprof file as requested. There is no runtime-stats.log file on the server. Thanks.

angelcar (Contributor) commented:

Hi @stewartcampbell, I reviewed the memory profile you sent, but nothing stood out. Maybe the memory profile was taken too soon?

In your graph above, I see that the memory peaks in about 3 days' time. Would you be able to get another memory profile after the agent has been running for 2 or 3 days?

Also, could you confirm which version of the agent you're running, as well as the OS? It's odd that the /var/log/ecs/runtime-stats.log file is not there; if ECS_ENABLE_RUNTIME_STATS is true, it should be.


stewartcampbell commented Oct 19, 2021

Yep, sure, I'll send it again on Thursday. Agent version is 1.55.5. Thanks.

stewartcampbell commented:

Hi, I've sent the email again. As you can see below, stats including IO and CPU are still creeping up. The server's usage pattern over that time has not changed.

[graph: IO and CPU stats continuing to creep up over the same period]

Thanks.

angelcar (Contributor) commented:

Ack. Will take a deeper look, thank you!

angelcar (Contributor) commented:

The issue has been identified and fixed in #3069. I will post an update and close this issue once the changes are released.

stewartcampbell commented:

That's great news, thanks!


fenxiong commented Nov 8, 2021

The fix has been released in ECS agent version 1.57.0. Please upgrade to that version to resolve the issue.

Closing the issue. Feel free to let us know if the issue is not resolved after the upgrade.
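For instances built from the ECS-optimized Amazon Linux 2 AMI, the in-place upgrade is roughly the following (a sketch; check the agent update documentation for the method that matches your AMI, or update the agent through the ECS console/UpdateContainerAgent API):

# Update the agent package and restart the ecs service (ECS-optimized AL2 AMI)
yum update -y ecs-init
systemctl restart ecs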

fenxiong closed this as completed on Nov 8, 2021

ghost commented Sep 6, 2022

@angelcar does this issue also happen on the Amazon ECS-Optimized Amazon Linux AMI?
We are currently using the Amazon ECS-Optimized Amazon Linux AMI, whose latest ecs-agent is 1.51.0 and will not be updated in newer AMIs:

  • Amazon ECS-Optimized Amazon Linux AMI --> latest ami ecs-agent 1.51.0
  • Amazon ECS-Optimized Amazon Linux 2 AMI --> latest ami ecs-agent 1.62.2

reference:
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-agent-versions.html


sparrc commented Sep 6, 2022

Hi @fr-herrteuer, this issue could still happen on the latest Amazon Linux AMI (AL1). It is fixed in the latest AL2 AMI.


fcirone commented Jan 24, 2023

We are having a similar issue with ECS Fargate platform version 1.4.0.
Checking here: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/platform-linux-fargate.html
the last update was "November 5, 2020".

Could the same problem affect Fargate containers?

chienhanlin (Contributor) commented:

Hello @fcirone, ECS Fargate platform version 1.4.0 is managed by the Fargate agent, which is different from the ECS agent on EC2. To help the ECS team investigate the issue, please share more info and data by sending them to ecs-agent-external@amazon.com. Thank you.
