High client memory usage #17789

Closed
Kamilcuk opened this issue Jul 3, 2023 · 13 comments

Comments

@Kamilcuk
Contributor

Kamilcuk commented Jul 3, 2023

Nomad version

$ nomad --version
Nomad v1.5.6
BuildDate 2023-05-19T18:26:13Z
Revision 8af70885c02ab921dedbdf6bc406a1e886866f80

Operating system and Environment details

Fedora 29

Issue

The memory usage of Nomad clients is constantly going up and stays high.

Reproduction steps

Install the Nomad client on our cluster. Run a lot of jobs.

Expected Result

The Nomad client should use close to no memory.

Actual Result

$ top -b -n 10 -d 1 -p $(pgrep -f 'nomad agent')
top - 04:24:27 up 41 days, 11:36,  1 user,  load average: 0.06, 0.08, 0.15
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.0 sy,  0.2 ni, 99.6 id,  0.0 wa,  0.0 hi,  0.2 si,  0.0 st
MiB Mem : 128802.9 total,  98298.4 free,  28623.8 used,   1880.8 buff/cache
MiB Swap:   8192.0 total,   3407.2 free,   4784.8 used.  99259.2 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
14939 sysavtb+  39  19   32.0g  11.3g 191204 S   6.7   9.0   1314:50 nomad

9% of memory (about 11 GB resident) is a lot of memory for a process that is currently sitting around doing nothing. The VIRT, RES, and SHR values are also high.

Restarting the agent reduces the memory back to 0.1%.

@tgross
Member

tgross commented Jul 12, 2023

Hi @Kamilcuk! When you say "run a lot of jobs" do you mean that the allocations are still running when the memory is high, or that even when the allocations stop (e.g. batch jobs), the memory is left behind? In the first case, this is high memory usage (which is bad), whereas in the second case that'd be a memory leak (which is worse!).

Can you take two Agent Runtime profiles from a client having this problem? A heap profile from /agent/pprof/heap, and a goroutine profile from /agent/pprof/goroutine. You can email those to the receive-only email address nomad-oss-debug@hashicorp.com. Include a link to this issue (although I'll keep an eye out for it as well). It would also help if you could let us know how many allocations are running on that client at the time you make the profile.
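
For reference, a minimal sketch of how those two profiles could be collected with curl against the agent HTTP API (the address, output filenames, and any ACL token are assumptions, not something specified in this thread):

# Collect the requested profiles from the affected client's agent API.
# NOMAD_ADDR is a placeholder; with ACLs enabled, a token with sufficient
# agent permissions (or enable_debug = true) may be needed for these endpoints.
NOMAD_ADDR=${NOMAD_ADDR:-http://127.0.0.1:4646}
curl -s "$NOMAD_ADDR/v1/agent/pprof/heap"      -o heap.prof
curl -s "$NOMAD_ADDR/v1/agent/pprof/goroutine" -o goroutine.prof
# For context, how many allocations are running on this node:
nomad node status -self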

@Kamilcuk
Contributor Author

In the first case, this is high memory usage (which is bad), whereas in the second case that'd be a memory leak

Hi! I would say it is both.

how many allocations are running on that client at the time you make the profile.

I sent an email. I included the output from nomad node status.

@tgross
Member

tgross commented Jul 12, 2023

@Kamilcuk I haven't done a detailed analysis yet but what jumps out at me immediately is that all 3 machines you sent heap dumps from have large memory allocations for Raft. This means they are servers, not just clients. The server has to keep the entire history of evaluations, jobs, allocations, etc. in the in-memory state store until they are GC'd, and that's precisely what I see in the heap profile. Running the server and client on the same machine is not recommended, because it causes the Nomad control plane to compete for CPU, memory, and network resources with your workloads. At the very least you need to set aside a large amount of reserved memory.
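
As a rough illustration of that last point, headroom for the agent is set aside with the client reserved block; the config path and the values below are placeholders, not a recommendation for this cluster:

# Drop a snippet into the agent's config directory (path is an assumption)
# so the agent and OS get resources that are never offered to workloads.
cat > /etc/nomad.d/reserved.hcl <<'EOF'
client {
  reserved {
    cpu    = 500   # MHz set aside for the agent and OS (illustrative)
    memory = 2048  # MiB set aside for the agent and OS (illustrative)
  }
}
EOF
systemctl restart nomad   # assumes a systemd-managed agent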

@tgross
Member

tgross commented Jul 13, 2023

Hi @Kamilcuk I did a little more follow-up and I actually don't see any client-related code in the heap or goroutine profiles you've given me. That suggests that either the client-related code is very "quiet" such that the sampling profiler doesn't see it (which is ideal in steady-state operations but a little surprising), or that you've taken profiles from servers that aren't running the client. Can you make sure you've taken the profiles from the clients and not the servers?

@Kamilcuk
Contributor Author

Kamilcuk commented Jul 13, 2023

Thanks for the response. I included the script I used; it was just curl http://host:4646/agent/pprof/heap. How can I confirm for sure that it is not a server, then? It is not listed in nomad server members, there is no server {} block at all in the nomad.hcl configuration file, and the default is false.
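
A sketch of the checks just described (the config path is an assumption):

# The node does not show up in the server list...
nomad server members
# ...and the agent configuration contains no server {} block:
grep -n 'server' /etc/nomad.d/nomad.hcl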

@tgross
Member

tgross commented Jul 13, 2023

If you pass the node_id parameter it will make sure the request is forwarded to a specific node, regardless of the host you make the request to.
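
A sketch of what that forwarded request could look like, using the nomad operator api CLI wrapper (the node id is a placeholder taken from nomad node status):

# Forward the profile request to a specific client node, regardless of
# which agent the HTTP request lands on.
NODE_ID="<client-node-id>"
nomad operator api "/v1/agent/pprof/heap?node_id=$NODE_ID"      > heap.prof
nomad operator api "/v1/agent/pprof/goroutine?node_id=$NODE_ID" > goroutine.prof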

@Kamilcuk
Contributor Author

Sorry, I did not notice that option! I resent the email, fingers crossed it is fine now.

@shantanugadgil
Contributor

Memory bloat on the Nomad server leader is something we have observed too.

We run periodic system gc and system reconcile summaries, but that doesn't help much.
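
For context, that periodic cleanup is presumably the following two CLI commands, run on a schedule (e.g. from cron):

# Force garbage collection of dead jobs, evaluations, and allocations,
# and rebuild the job summaries held by the servers.
nomad system gc
nomad system reconcile summaries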

Now we have detailed monitoring of memory, with Slack alerts at a "warning" threshold and an automatic restart of the Nomad process at the "critical" threshold.

The server leader has to be restarted about every 14 days.

refs:

https://discuss.hashicorp.com/t/sudden-increase-in-nomad-server-memory/39743

#14842

@tgross
Member

tgross commented Jul 17, 2023

This ticket is not about the servers. If you have a new report on that after 1.5.0, please open a new issue @shantanugadgil

@tgross
Member

tgross commented Jul 17, 2023

@Kamilcuk all the new profiles are still showing server heaps and there's a bunch of other stuff I haven't asked for in here. Here's what I need you to do:

  • Set enable_debug = true on one of the clients
  • ssh onto that instance
  • hit :4646/debug/pprof/heap and :4646/debug/pprof/goroutine?debug=2 (note: without any leading v1). That hits the underlying pprof interface directly, bypassing any possible Nomad RPC forwarding.

Do this to exactly 1 Nomad client instance that's demonstrating the problem you're seeing, and zip that up without anything else.
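
A sketch of that collection as shell commands run on the client itself (the loopback address and file names are assumptions):

# Requires enable_debug = true in the client's agent configuration.
# These hit the Go pprof handlers directly on the local agent, with no
# Nomad RPC forwarding involved.
curl -s "http://127.0.0.1:4646/debug/pprof/heap"              -o heap.prof
curl -s "http://127.0.0.1:4646/debug/pprof/goroutine?debug=2" -o goroutine.txt
zip client-profiles.zip heap.prof goroutine.txt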

@shantanugadgil
Contributor

This ticket is not about the servers. If you have a new report on that after 1.5.0, please open a new issue @shantanugadgil

OK, I'll open a new issue.
Things did improve a bit after 1.4.3 but the gist of the story for me remains the same (server leader memory keeps increasing) 🙂

@tgross self-assigned this Jul 28, 2023
@tgross changed the title from "High memory usage" to "High client memory usage" Aug 2, 2023
@tgross
Member

tgross commented Aug 7, 2023

@Kamilcuk I took a second look at the bundle you sent with all the unannotated stuff you added. I think I've figured out why you're not getting the right profiles. The script.sh file uses the query parameter nodeid, but the correct query param is node_id. So that's not getting forwarded from whatever host you're hitting via the API.

get() {
	url=$1
	set -x
-	nomad operator api "/v1/agent/pprof/$url?nodeid=$nodeid&seconds=5" > "${client}_${1}_$(date +%s).bin"
+	nomad operator api "/v1/agent/pprof/$url?node_id=$nodeid&seconds=5" > "${client}_${1}_$(date +%s).bin"
}

@Kamilcuk
Contributor Author

Kamilcuk commented Aug 7, 2023

I have not been able to reproduce the high usage since the clients were restarted some time ago, so I am closing this. Thanks!

@Kamilcuk closed this as completed Aug 7, 2023