High client memory usage #17789
Hi @Kamilcuk! When you say "run a lot of jobs", do you mean that the allocations are still running when the memory is high, or that even when the allocations stop (e.g. batch jobs), the memory is left behind? In the first case, this is high memory usage (which is bad), whereas in the second case that'd be a memory leak (which is worse!). Can you take two Agent Runtime profiles (a heap profile and a goroutine profile) from a client having this problem?
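For reference, a minimal sketch of how such a profile can be pulled from the agent pprof HTTP endpoint (the host is a placeholder, and the endpoint is only reachable with enable_debug set or a suitable ACL token):

```bash
# Fetch a heap profile from the agent pprof endpoint (host is a placeholder).
curl -s -o heap.bin "http://host:4646/v1/agent/pprof/heap"

# Inspect the top allocators locally with the Go toolchain.
go tool pprof -top heap.bin
```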
Hi! I would say it is both.
I sent an email. I included the output from nomad node status.
@Kamilcuk I haven't done a detailed analysis yet, but what jumps out at me immediately is that all 3 machines you sent heap dumps for have large memory allocations for Raft. This means they are servers, not just clients. The server has to keep the entire history of evaluations, jobs, allocations, etc. in the in-memory state store until they are GC'd, and that's precisely what I see in the heap profile. Running the server and client on the same machine is not recommended, because it causes the Nomad control plane to compete for CPU, memory, and network resources with your workloads. At the very least you need to set aside a large amount of reserved memory.
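A minimal sketch of the reserved-resources suggestion, assuming the agent loads its configuration from the /etc/nomad.d directory and runs under systemd (the values are illustrative, not recommendations):

```bash
# Set aside CPU and memory for the Nomad control plane via the client
# "reserved" block; drop it in as a separate config file so it merges
# with the existing nomad.hcl (values are assumptions).
cat > /etc/nomad.d/reserved.hcl <<'EOF'
client {
  reserved {
    cpu    = 500   # MHz reserved for Nomad and the OS
    memory = 1024  # MB reserved for Nomad and the OS
  }
}
EOF
systemctl restart nomad
```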
Hi @Kamilcuk, I did a little more follow-up, and I actually don't see any client-related code in the heap or goroutine profiles you've given me. That suggests either that the client-related code is very "quiet", such that the sampling profiler doesn't see it (which is ideal in steady-state operations but a little surprising), or that you've taken profiles from servers that aren't running the client. Can you make sure you've taken the profiles from the clients and not the servers?
Thanks for the response! I included the script I used; it was just curl http://host:4646/agent/pprof/heap. How can I confirm for sure that it is not a server, then? It is not listed in nomad server members. There is no server {} block at all in the nomad.hcl configuration file, and the default is false.
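One way to double-check from the API, as a sketch (the jq path is an assumption about the shape of the /v1/agent/self payload):

```bash
# A client-only agent should report server mode as disabled
# (the exact jq path is assumed, not verified here).
curl -s http://host:4646/v1/agent/self | jq '.config.Server.Enabled'

# A server-mode agent would also appear in the server member list.
nomad server members
```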
If you pass the node_id query parameter on the agent pprof API, the profile is collected from that client instead of the server answering the request.
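A sketch of such a call, matching the corrected invocation that appears later in this thread (the node ID is a placeholder):

```bash
# Forward the pprof request to a specific client via node_id
# (the ID below is a placeholder).
nodeid="4ae9d5e6-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
nomad operator api "/v1/agent/pprof/heap?node_id=${nodeid}" > client_heap.bin
```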
Sorry, I did not notice that option! I resent the email; fingers crossed it is fine now.
Nomad server leader bloating on memory is something we have observed too. We had periodic manual restarts as a workaround. Now we have detailed monitoring of memory, with Slack alerts at a "warning" threshold and an auto restart of the Nomad process at the "critical" threshold. The server leader has to be restarted about every 14 days.
refs: https://discuss.hashicorp.com/t/sudden-increase-in-nomad-server-memory/39743
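A minimal sketch of the restart-at-critical-threshold part of that setup (the threshold, process name, and systemd unit are assumptions, not the actual tooling described above):

```bash
#!/usr/bin/env bash
# Restart Nomad when its share of system memory crosses a critical
# threshold (threshold and unit name are assumptions).
CRITICAL_PCT=25

# Sum %MEM across all nomad processes.
used=$(ps -C nomad -o %mem= | awk '{s += $1} END {printf "%.0f", s}')

if [ "${used:-0}" -ge "$CRITICAL_PCT" ]; then
  systemctl restart nomad
fi
```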
This ticket is not about the servers. If you have a new report on that after 1.5.0, please open a new issue @shantanugadgil |
@Kamilcuk all the new profiles are still showing server heaps, and there's a bunch of other stuff I haven't asked for in here. Here's what I need you to do: take the heap and goroutine profiles from exactly one Nomad client instance that's demonstrating the problem you're seeing, and zip those up without anything else.
OK, I'll open a new issue. |
@Kamilcuk I took a second look at the bundle you sent with all the unannotated stuff you added. I think I've figured out why you're not getting the right profiles. The get() function in your script uses the wrong query parameter: the agent pprof API expects node_id, not nodeid:

```diff
 get() {
   url=$1
   set -x
-  nomad operator api "/v1/agent/pprof/$url?nodeid=$nodeid&seconds=5" > "${client}_${1}_$(date +%s).bin"
+  nomad operator api "/v1/agent/pprof/$url?node_id=$nodeid&seconds=5" > "${client}_${1}_$(date +%s).bin"
 }
```
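For completeness, a usage sketch of the corrected script that also matches the earlier request to zip up the profiles from a single client (the variable values are placeholders):

```bash
client="client-01"                            # label used in the output file names
nodeid="4ae9d5e6-xxxx-xxxx-xxxx-xxxxxxxxxxxx" # the client's node ID

get heap
get goroutine

# Zip up only these profiles, nothing else.
zip "${client}_profiles.zip" "${client}"_*.bin
```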
I am not able to reproduce the high usage since the clients were restarted some time ago, so I am closing this. Thanks!
Nomad version
Operating system and Environment details
Fedora 29
Issue
The memory usage of Nomad clients constantly goes up and stays high.
Reproduction steps
Install the Nomad client on our cluster. Run a lot of jobs.
Expected Result
The Nomad client should use close to no memory.
Actual Result
9% of memory is a lot of memory (5 GB) for a process that is currently sitting around and doing nothing. The VIRT, RES, and SHR values are also high.
Restarting the agent reduces the memory back to 0.1%.
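For reference, a quick way to see the figures mentioned above (%MEM plus the VIRT/RES-style columns) for the Nomad process:

```bash
# Virtual (VSZ), resident (RSS), and percentage memory for nomad.
ps -C nomad -o pid,%mem,vsz,rss,comm
```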