
Nomad server nodes using up all available host memory #12445

Closed
robsonj opened this issue Apr 4, 2022 · 9 comments · Fixed by #15097

Comments

@robsonj

robsonj commented Apr 4, 2022

So the problem here could be that we just need more resources for each of the server nodes; we currently have 3 nodes, each with 2 cores and 16 GB of memory.

For a while they run just fine, then about once a week they start to chew up all the available memory on their respective hosts. After that, the only way I've found to recover the cluster is to completely delete the serverdata directories on each of the server nodes and re-bootstrap the ACLs.

Is this just a symptom of not enough hardware resources, or is there something else in play here?

Thank you,
Jonathan

@shoenig
Member

shoenig commented Apr 4, 2022

Hi @robsonj there is some documentation on guidance for resources here.

However, this certainly seems concerning:

For a while they run just fine, then about once a week they start to chew up all the available memory on their respective hosts. After that, the only way I've found to recover the cluster is to completely delete the serverdata directories on each of the server nodes and re-bootstrap the ACLs.

If it were a simple memory leak, a plain restart of the agent should suffice; instead it sounds like there's an explosion of data happening in the persistent state store, which is then cached in memory, of which there is less available. Any chance you've got metrics on both memory and disk usage, so we can see if there is a correlation?

Next time an agent starts chewing up memory, see if you can grab a heap dump and stack trace using nomad operator debug, which will help track down a leak if there is one.
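A minimal capture, assuming a reasonably recent Nomad CLI, might look something like the sketch below; the flag values are illustrative, so check `nomad operator debug -h` for the options available in your version.

```sh
# Illustrative only: capture a debug bundle from all servers while memory is climbing.
# The resulting archive includes pprof heap and goroutine profiles that can be
# inspected afterwards with `go tool pprof`.
nomad operator debug -server-id=all -duration=2m -interval=30s
```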

@shoenig shoenig self-assigned this Apr 4, 2022
@shoenig shoenig added this to Needs Triage in Nomad - Community Issues Triage via automation Apr 4, 2022
@shoenig shoenig moved this from Needs Triage to Triaging in Nomad - Community Issues Triage Apr 4, 2022
@robsonj
Author

robsonj commented Apr 4, 2022

Hi @shoenig, I have a copy of the serverdata directory from before I removed it; would its size help you in any way?

One thing I was wondering is whether we just have too many pending jobs for the 16 GB of memory. Is there a way to set a limit in Nomad so that it rejects job submissions if too many are already in a pending state?

@robsonj
Author

robsonj commented Apr 4, 2022

@shoenig the serverdata directory is 9.8 GB in size. Drilling down into that directory shows 2 snapshot directories, each about 4.3 GB in size, plus a .tmp directory inside the snapshots directory that is about 349 MB. So I'm wondering if it's getting stuck in some kind of loop taking or restoring snapshots?

@robsonj
Author

robsonj commented May 9, 2022

This occurred again yesterday, seemingly at random. I had to shut everything down, remove the server data directory, and rebuild the cluster.

@shantanugadgil
Contributor

Something similar occurred on my setup on two separate occasions. The first time was legitimate, when I was running servers with very little memory (8 GB).
The second time was when a Consul server outage upset the Nomad servers. When the Consul servers came back online, the Nomad server "upset" began.

On both occasions, bumping the memory up to 64 GB (!!) eventually allowed the Nomad server ASG rotation to work.

ref: https://discuss.hashicorp.com/t/sudden-increase-in-nomad-server-memory/39743/2

During the outage, when I manually add one node and delete one node, memory consumption peaks at up to 34 GB and, over the course of about an hour, comes back down to the usual value of 10 GB or less.

@shantanugadgil
Contributor

Is there any specific investigation going on into the high server memory usage for particular profiles of workloads?

We have been able to contain server memory usage (sort of) by running nomad system gc and nomad system reconcile summaries every 30 (thirty) minutes.
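For context, a rough sketch of scheduling that cleanup with cron might look like the following; the address is a placeholder and the 30-minute cadence simply mirrors what is described above, so adapt both (and any ACL token handling) to your cluster.

```sh
# Hypothetical cron entries: force a garbage collection and reconcile job
# summaries every 30 minutes. NOMAD_ADDR below is a placeholder address.
*/30 * * * *  NOMAD_ADDR=https://nomad.example.com:4646 nomad system gc
*/30 * * * *  NOMAD_ADDR=https://nomad.example.com:4646 nomad system reconcile summaries
```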

@robsonj
Author

robsonj commented Nov 23, 2022

We gave up on Nomad for queueing. I believe it stores everything internally as JSON, and memory growth seems exponential in the number of items you have queued; we wanted to queue up about 1,000,000 items. We've moved to an external JIT queue to feed Nomad, which maintains a queue of about 1,000,000 jobs in 300 MB of memory.

@tgross
Member

tgross commented Nov 23, 2022

We've got #15097 open which should fix a bug around garbage collection of evals that impacts batch workloads in particular.

@tgross tgross linked a pull request Nov 23, 2022 that will close this issue
Nomad - Community Issues Triage automation moved this from Triaging to Done Jan 31, 2023
@tgross
Member

tgross commented Jan 31, 2023

Fixed in #15097, which will ship in Nomad 1.5.0 (with backports).

@tgross tgross added this to the 1.5.0 milestone Jan 31, 2023