
Nomad server nodes using up all available host memory #12445

Closed
robsonj opened this issue Apr 4, 2022 · 9 comments · Fixed by #15097

Comments

@robsonj

robsonj commented Apr 4, 2022

So the problem here could be that we just need more resources for each of the server nodes; we currently have 3 nodes, each with 2 cores and 16 GB of memory.

For a while they run just fine, then about once a week they start to chew up all the available memory on their respective hosts. After that, the only way I've found to recover the cluster is to completely delete the serverdata directories on each of the server nodes and re-bootstrap the ACLs.

Is this just a symptom of not enough hardware resources, or is there something else in play here?

Thank you,
Jonathan

@shoenig
Member

shoenig commented Apr 4, 2022

Hi @robsonj there is some documentation on guidance for resources here.

However, this certainly seems concerning:

For a while they run just fine, then about once a week they start to chew up all the available memory on their respective hosts. After that, the only way I've found to recover the cluster is to completely delete the serverdata directories on each of the server nodes and re-bootstrap the ACLs.

If it were a simple memory leak, a plain restart of the agent should suffice; instead it sounds like there's an explosion of data happening in the persistent state store, which is then cached in memory, of which there is less available. Any chance you've got metrics on both memory and disk usage, so we can see if there is a correlation?

Next time an agent starts chewing up memory, see if you can grab a heap dump and stack trace using nomad operator debug, which will help track down a leak if there is one.
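A minimal capture, assuming a reasonably recent Nomad CLI, might look something like the sketch below; the flag values are illustrative, so check `nomad operator debug -h` for the options available in your version.

```sh
# Illustrative only: capture a debug bundle from all servers while memory is climbing.
# The resulting archive includes pprof heap and goroutine profiles that can be
# inspected afterwards with `go tool pprof`.
nomad operator debug -server-id=all -duration=2m -interval=30s
```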

@shoenig shoenig self-assigned this Apr 4, 2022
@shoenig shoenig added this to Needs Triage in Nomad - Community Issues Triage via automation Apr 4, 2022
@shoenig shoenig moved this from Needs Triage to Triaging in Nomad - Community Issues Triage Apr 4, 2022
@robsonj
Author

robsonj commented Apr 4, 2022

Hi @shoenig, I have a copy of the serverdata directory from before I removed it; would its size help you in any way?

One thing I was wondering is whether we just have too many pending jobs for the 16 GB of memory. Is there a way to set a limit in Nomad so that it rejects job submissions if too many are already in a pending state?

@robsonj
Author

robsonj commented Apr 4, 2022

@shoenig the serverdata directory is 9.8 GB in size. Drilling down into that directory shows 2 snapshot directories, each about 4.3 GB in size, plus a .tmp directory inside the snapshots directory that is about 349 MB. So I'm wondering if it's getting stuck in some kind of loop taking or restoring snapshots?

@robsonj
Author

robsonj commented May 9, 2022

This occurred again yesterday, seemingly at random. I had to shut everything down, remove the server data directory, and rebuild the cluster.

@shantanugadgil
Contributor

Something similar occurred on my setup on two separate occasions. The first time was legitimate, when I was running servers with very little memory (8 GB).
The second time was when a Consul server outage upset the Nomad servers. When the Consul servers came back online, the Nomad server "upset" began.

On both occasions, bumping the memory up to 64 GB (!!) eventually allowed the Nomad server ASG rotation to work.

ref: https://discuss.hashicorp.com/t/sudden-increase-in-nomad-server-memory/39743/2

During the outage, when I manually add one node and delete one node, memory consumption peaks at up to 34 GB and, over the course of about an hour, comes back down to the usual value of 10 GB or less.

@shantanugadgil
Contributor

Is there any specific investigation going on into the high server memory usage for particular profiles of workloads?

We have been able to contain server memory usage (sort of) by running nomad system gc and nomad system reconcile summaries every 30 (thirty) minutes.
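For context, a rough sketch of scheduling that cleanup with cron might look like the following; the address is a placeholder and the 30-minute cadence simply mirrors what is described above, so adapt both (and any ACL token handling) to your cluster.

```sh
# Hypothetical cron entries: force a garbage collection and reconcile job
# summaries every 30 minutes. NOMAD_ADDR below is a placeholder address.
*/30 * * * *  NOMAD_ADDR=https://nomad.example.com:4646 nomad system gc
*/30 * * * *  NOMAD_ADDR=https://nomad.example.com:4646 nomad system reconcile summaries
```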

@robsonj
Author

robsonj commented Nov 23, 2022

We gave up on Nomad for queueing. I believe it stores everything internally as JSON, and memory growth seems exponential in the number of items you have queued; we wanted to queue up about 1,000,000 items. We've moved to an external JIT queue to feed Nomad, which maintains a queue of about 1,000,000 jobs in 300 MB of memory.

@tgross
Member

tgross commented Nov 23, 2022

We've got #15097 open which should fix a bug around garbage collection of evals that impacts batch workloads in particular.

@tgross tgross linked a pull request Nov 23, 2022 that will close this issue
Nomad - Community Issues Triage automation moved this from Triaging to Done Jan 31, 2023
@tgross
Member

tgross commented Jan 31, 2023

Fixed in #15097, which will ship in Nomad 1.5.0 (with backports).

@tgross tgross added this to the 1.5.0 milestone Jan 31, 2023