High client memory usage #17789

Closed
Kamilcuk opened this issue Jul 3, 2023 · 13 comments

Comments

@Kamilcuk
Contributor

Kamilcuk commented Jul 3, 2023

Nomad version

$ nomad --version
Nomad v1.5.6
BuildDate 2023-05-19T18:26:13Z
Revision 8af70885c02ab921dedbdf6bc406a1e886866f80

Operating system and Environment details

Fedora 29

Issue

The memory usage of Nomad clients is constantly going up and stays high.

Reproduction steps

Install the Nomad client on our cluster. Run a lot of jobs.

Expected Result

The Nomad client should use close to no memory.

Actual Result

$ top -b -n 10 -d 1 -p $(pgrep -f 'nomad agent')
top - 04:24:27 up 41 days, 11:36,  1 user,  load average: 0.06, 0.08, 0.15
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.0 sy,  0.2 ni, 99.6 id,  0.0 wa,  0.0 hi,  0.2 si,  0.0 st
MiB Mem : 128802.9 total,  98298.4 free,  28623.8 used,   1880.8 buff/cache
MiB Swap:   8192.0 total,   3407.2 free,   4784.8 used.  99259.2 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
14939 sysavtb+  39  19   32.0g  11.3g 191204 S   6.7   9.0   1314:50 nomad

9% of memory (about 11 GB resident) is a lot of memory for a process that is currently sitting around doing nothing. The VIRT, RES, and SHR values are also high.

Restarting the agent reduces the memory back to 0.1%.

@tgross
Member

tgross commented Jul 12, 2023

Hi @Kamilcuk! When you say "run a lot of jobs" do you mean that the allocations are still running when the memory is high, or that even when the allocations stop (e.g. batch jobs), the memory is left behind? In the first case, this is high memory usage (which is bad), whereas in the second case that'd be a memory leak (which is worse!).

Can you take two Agent Runtime profiles from a client having this problem? A heap profile from /agent/pprof/heap, and a goroutine profile from /agent/pprof/goroutine. You can email those to the receive-only email address nomad-oss-debug@hashicorp.com. Include a link to this issue (although I'll keep an eye out for it as well). It would also help if you could let us know how many allocations are running on that client at the time you make the profile.
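
For reference, a minimal sketch of how those two profiles could be collected with curl against the agent HTTP API (the address, output filenames, and any ACL token are assumptions, not something specified in this thread):

# Collect the requested profiles from the affected client's agent API.
# NOMAD_ADDR is a placeholder; with ACLs enabled, a token with sufficient
# agent permissions (or enable_debug = true) may be needed for these endpoints.
NOMAD_ADDR=${NOMAD_ADDR:-http://127.0.0.1:4646}
curl -s "$NOMAD_ADDR/v1/agent/pprof/heap"      -o heap.prof
curl -s "$NOMAD_ADDR/v1/agent/pprof/goroutine" -o goroutine.prof
# For context, how many allocations are running on this node:
nomad node status -self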

@Kamilcuk
Contributor Author

In the first case, this is high memory usage (which is bad), whereas in the second case that'd be a memory leak

Hi! I would say it is both.

how many allocations are running on that client at the time you make the profile.

I sent an email. I included the output from nomad node status.

@tgross
Member

tgross commented Jul 12, 2023

@Kamilcuk I haven't done a detailed analysis yet but what jumps out at me immediately is that all 3 machines you sent heap dumps from have large memory allocations for Raft. This means they are servers, not just clients. The server has to keep the entire history of evaluations, jobs, allocations, etc. in the in-memory state store until they are GC'd, and that's precisely what I see in the heap profile. Running the server and client on the same machine is not recommended, because it causes the Nomad control plane to compete for CPU, memory, and network resources with your workloads. At the very least you need to set aside a large amount of reserved memory.
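
As a rough illustration of that last point, headroom for the agent is set aside with the client reserved block; the config path and the values below are placeholders, not a recommendation for this cluster:

# Drop a snippet into the agent's config directory (path is an assumption)
# so the agent and OS get resources that are never offered to workloads.
cat > /etc/nomad.d/reserved.hcl <<'EOF'
client {
  reserved {
    cpu    = 500   # MHz set aside for the agent and OS (illustrative)
    memory = 2048  # MiB set aside for the agent and OS (illustrative)
  }
}
EOF
systemctl restart nomad   # assumes a systemd-managed agent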

@tgross
Member

tgross commented Jul 13, 2023

Hi @Kamilcuk I did a little more follow-up and I actually don't see any client-related code in the heap or goroutine profiles you've given me. That suggests that either the client-related code is very "quiet" such that the sampling profiler doesn't see it (which is ideal in steady-state operations but a little surprising), or that you've taken profiles from servers that aren't running the client. Can you make sure you've taken the profiles from the clients and not the servers?

@Kamilcuk
Contributor Author

Kamilcuk commented Jul 13, 2023

Thanks for the response. I included the script I used; it was just curl http://host:4646/agent/pprof/heap. How can I confirm for sure that it is not a server, then? It is not listed in nomad server members, there is no server {} block at all in the nomad.hcl configuration file, and the default is false.
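
A sketch of the checks just described (the config path is an assumption):

# The node does not show up in the server list...
nomad server members
# ...and the agent configuration contains no server {} block:
grep -n 'server' /etc/nomad.d/nomad.hcl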

@tgross
Member

tgross commented Jul 13, 2023

If you pass the node_id parameter it will make sure the request is forwarded to a specific node, regardless of the host you make the request to.
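
A sketch of what that forwarded request could look like, using the nomad operator api CLI wrapper (the node id is a placeholder taken from nomad node status):

# Forward the profile request to a specific client node, regardless of
# which agent the HTTP request lands on.
NODE_ID="<client-node-id>"
nomad operator api "/v1/agent/pprof/heap?node_id=$NODE_ID"      > heap.prof
nomad operator api "/v1/agent/pprof/goroutine?node_id=$NODE_ID" > goroutine.prof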

@Kamilcuk
Contributor Author

Sorry, I did not notice that option! I resent the email, fingers crossed it is fine now.

@shantanugadgil
Contributor

Memory bloat on the Nomad server leader is something we have observed too.

We run periodic system gc and system reconcile summaries, but that doesn't help much.
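
For context, that periodic cleanup is presumably the following two CLI commands, run on a schedule (e.g. from cron):

# Force garbage collection of dead jobs, evaluations, and allocations,
# and rebuild the job summaries held by the servers.
nomad system gc
nomad system reconcile summaries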

Now we have detailed monitoring of memory, with Slack alerts at a "warning" threshold and an automatic restart of the Nomad process at the "critical" threshold.

The server leader has to be restarted about every 14 days.

refs:

https://discuss.hashicorp.com/t/sudden-increase-in-nomad-server-memory/39743

#14842

@tgross
Member

tgross commented Jul 17, 2023

This ticket is not about the servers. If you have a new report on that after 1.5.0, please open a new issue @shantanugadgil

@tgross
Member

tgross commented Jul 17, 2023

@Kamilcuk all the new profiles are still showing server heaps and there's a bunch of other stuff I haven't asked for in here. Here's what I need you to do:

  • Set enable_debug = true on one of the clients
  • ssh onto that instance
  • hit :4646/debug/pprof/heap and :4646/debug/pprof/goroutine?debug=2 (note: without any leading v1). That hits the underlying pprof interface directly, bypassing any possible Nomad RPC forwarding.

Do this to exactly 1 Nomad client instance that's demonstrating the problem you're seeing, and zip that up without anything else.
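
A sketch of that collection as shell commands run on the client itself (the loopback address and file names are assumptions):

# Requires enable_debug = true in the client's agent configuration.
# These hit the Go pprof handlers directly on the local agent, with no
# Nomad RPC forwarding involved.
curl -s "http://127.0.0.1:4646/debug/pprof/heap"              -o heap.prof
curl -s "http://127.0.0.1:4646/debug/pprof/goroutine?debug=2" -o goroutine.txt
zip client-profiles.zip heap.prof goroutine.txt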

@shantanugadgil
Contributor

This ticket is not about the servers. If you have a new report on that after 1.5.0, please open a new issue @shantanugadgil

OK, I'll open a new issue.
Things did improve a bit after 1.4.3 but the gist of the story for me remains the same (server leader memory keeps increasing) 🙂

@tgross self-assigned this Jul 28, 2023
@tgross changed the title from "High memory usage" to "High client memory usage" Aug 2, 2023
@tgross
Member

tgross commented Aug 7, 2023

@Kamilcuk I took a second look at the bundle you sent with all the unannotated stuff you added. I think I've figured out why you're not getting the right profiles. The script.sh file uses the query parameter nodeid, but the correct query param is node_id. So that's not getting forwarded from whatever host you're hitting via the API.

get() {
	url=$1
	set -x
-	nomad operator api "/v1/agent/pprof/$url?nodeid=$nodeid&seconds=5" > "${client}_${1}_$(date +%s).bin"
+	nomad operator api "/v1/agent/pprof/$url?node_id=$nodeid&seconds=5" > "${client}_${1}_$(date +%s).bin"
}

@Kamilcuk
Contributor Author

Kamilcuk commented Aug 7, 2023

I have not been able to reproduce the high usage since the clients were restarted some time ago, so I am closing this. Thanks!

@Kamilcuk closed this as completed Aug 7, 2023