
High memory usage on Nomad agent due to static buffer in template client? #18508

Closed · deverton-godaddy opened this issue Sep 15, 2023 · 5 comments · Fixed by #18524

@deverton-godaddy (Contributor)

Nomad version

$ nomad version
Nomad v1.5.8
BuildDate 2023-07-21T13:48:41Z
Revision b3a653dff306e57edcd63bce6c4a54c0918f2f3d

Operating system and Environment details

Amazon Linux 2 on amd64 and aarch64

Issue

The Nomad agent has surprisingly high memory usage for our use case. We're seeing 8 GB of memory usage for about 1000 allocations on a single machine.

Reproduction steps

Not entirely sure of the exact scenario that causes this, but it's possibly related to the number of templates. Each of the 1000 tasks has one template, which, for reasons discussed below, lines up with some of the memory usage numbers.

Expected Result

Not using 8 GB of RAM for 1000 allocations with one template each.

Actual Result

We enabled pprof and grabbed a heap dump. The top 10 entries are shown below; there is one pretty large outlier.

Showing nodes accounting for 1.43GB, 86.87% of 1.64GB total
Dropped 393 nodes (cum <= 0.01GB)
Showing top 10 nodes out of 111
      flat  flat%   sum%        cum   cum%
    1.24GB 75.74% 75.74%     1.24GB 75.74%  google.golang.org/grpc/test/bufconn.newPipe
    0.03GB  1.92% 77.66%     0.06GB  3.80%  github.com/hashicorp/nomad/client/taskenv.(*Builder).Build
    0.02GB  1.43% 79.09%     0.03GB  1.85%  github.com/hashicorp/nomad/client/taskenv.(*Builder).buildEnv
    0.02GB  1.39% 80.48%     0.02GB  1.48%  encoding/json.(*Decoder).refill
    0.02GB  1.34% 81.82%     0.02GB  1.34%  fmt.Sprintf
    0.02GB  1.29% 83.11%     0.04GB  2.51%  github.com/hashicorp/nomad/client/taskenv.(*Builder).setNode
    0.02GB  1.14% 84.24%     0.02GB  1.22%  io.copyBuffer
    0.02GB  1.01% 85.25%     0.02GB  1.01%  github.com/armon/go-metrics.(*Metrics).filterLabels
    0.01GB  0.87% 86.12%     0.01GB  0.87%  bufio.NewReaderSize
    0.01GB  0.75% 86.87%     0.01GB  0.75%  golang.org/x/exp/maps.Clone[...]

Tracing this through the code lands on the line ln := bufconn.Listen(1024 * 1024), which

  1. seems like quite a large static buffer, and
  2. uses test code from the gRPC library in a non-test context.

However, the only place that seems to call it is agent.setupClient(), which should be called once at startup.

So I'm a bit stumped on how we end up with so many instances of the buffer.

The full profile chart is below:
[heap profile graph: profile001]
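
For reference, here is a rough sketch of the helper/bufconndialer wrapper in question. The New function and the 1 MiB Listen call match the code linked above; the DialContext method and its signature are a reconstruction (assumed to mirror http.Transport.DialContext) and may not match the real file exactly.

package bufconndialer

import (
    "context"
    "net"

    "google.golang.org/grpc/test/bufconn"
)

// BufConnWrapper wraps a bufconn.Listener so in-process clients can reach
// the agent's API without a real network socket.
type BufConnWrapper struct {
    listener *bufconn.Listener
}

// New returns a net.Listener the agent can serve on, plus a wrapper whose
// DialContext can be plugged into an HTTP transport. Note that Listen only
// records the buffer size; the two per-connection buffers are allocated
// when a connection is actually dialed.
func New() (net.Listener, *BufConnWrapper) {
    ln := bufconn.Listen(1024 * 1024)
    return ln, &BufConnWrapper{listener: ln}
}

// DialContext returns an in-memory connection to the wrapped listener.
func (w *BufConnWrapper) DialContext(ctx context.Context, _, _ string) (net.Conn, error) {
    return w.listener.DialContext(ctx)
}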

@tgross (Member) commented Sep 15, 2023

Hi @deverton-godaddy! First, thanks so much for showing up with a heap profile here! 😀

I think you're hitting an unfortunate design constraint around template blocks, which is that we embed consul-template to run the templating engine, and each running template is its own instance of consul-template. Each template runner that connects "back" to Nomad for Nomad services or variables needs to open its own connection. That bufconn.Listen method only configures the listener; it doesn't allocate the buffers until a connection is made, and then it creates two buffers of that size (1 MiB here). That adds up quite a bit.
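
To make the "two buffers per connection" point concrete, here is a small, self-contained Go sketch (not Nomad code, and the numbers are illustrative). It assumes the current google.golang.org/grpc/test/bufconn behavior, where each dialed connection eagerly allocates a send pipe and a receive pipe of the configured size:

package main

import (
    "fmt"
    "net"
    "runtime"

    "google.golang.org/grpc/test/bufconn"
)

func main() {
    const bufSize = 1024 * 1024 // 1 MiB, the size used by the Nomad helper

    ln := bufconn.Listen(bufSize)
    defer ln.Close()

    // Accept in the background so the Dial calls below don't block.
    go func() {
        for {
            if _, err := ln.Accept(); err != nil {
                return
            }
        }
    }()

    var before runtime.MemStats
    runtime.GC()
    runtime.ReadMemStats(&before)

    // Each Dial creates a fresh pair of in-memory pipes (send and receive),
    // each with bufSize capacity, so N connections cost roughly N*2*bufSize.
    conns := make([]net.Conn, 0, 100)
    for i := 0; i < 100; i++ {
        c, err := ln.Dial()
        if err != nil {
            panic(err)
        }
        conns = append(conns, c)
    }

    var after runtime.MemStats
    runtime.ReadMemStats(&after)
    fmt.Printf("roughly %d MiB of heap growth for %d connections\n",
        (after.HeapAlloc-before.HeapAlloc)/(1024*1024), len(conns))
}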

The heap profile you're showing has 1272 MiB of heap used by these buffers (at 2 × 1 MiB per connection, that's roughly 636 connections), so I suspect maybe only ~600 of the 1000 templates are querying Nomad services or variables.

Given all that:

  • Having a MiB-sized buffer for both tx/rx does seem like a lot. It looks like this code was lifted from Vault (which does, or at least used to do, the same thing, but not at the scale Nomad runs at, where we potentially have 100s or 1000s of templates), so I'm going to take that back to the team to figure out whether we can safely resize that or make it a user-configurable buffer.
  • This bit of gRPC testing code was itself lifted out of Vault in client: add Nomad template service functionality to runner #12458. I can talk to @jrasell and @schmichael to get their thoughts on how we might tighten that up.
  • Also, the entire heap in your profile is 1680MiB which is a lot less than the 8GiB RSS you're reporting, so it might also be helpful to get a goroutine profile (with ?debug=2) to verify there's no unexpected excess goroutine spawning either.
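
If it helps, here is a minimal sketch of pulling that goroutine profile programmatically, assuming the client agent was started with enable_debug = true so the standard Go /debug/pprof routes are exposed on the agent's HTTP port (the address and output filename below are placeholders):

package main

import (
    "io"
    "net/http"
    "os"
)

func main() {
    // debug=2 returns full, human-readable stacks for every goroutine,
    // which makes it easy to spot unexpected goroutine growth.
    resp, err := http.Get("http://127.0.0.1:4646/debug/pprof/goroutine?debug=2")
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    out, err := os.Create("goroutines.txt")
    if err != nil {
        panic(err)
    }
    defer out.Close()

    if _, err := io.Copy(out, resp.Body); err != nil {
        panic(err)
    }
}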

@tgross (Member) commented Sep 15, 2023

Still chatting with the team about this, but in the meanwhile I did a quick smoke test cranking that buffer down to only 10k, and it worked just fine even with Variables that were 4x that size (as I'd expect). Obviously that means more read calls on that buffer, so I'm not totally certain we'd recommend it yet without doing some more testing on the impact across many allocations.

But if you wanted to try it yourself on a client, the diff was:

diff --git a/helper/bufconndialer/bufconndialer.go b/helper/bufconndialer/bufconndialer.go
index 2a7650d57d..22acb22758 100644
--- a/helper/bufconndialer/bufconndialer.go
+++ b/helper/bufconndialer/bufconndialer.go
@@ -24,7 +24,7 @@ type BufConnWrapper struct {
 // New returns a new BufConnWrapper with a new bufconn.Listener. The wrapper
 // provides a dialer for creating connections to the listener.
 func New() (net.Listener, *BufConnWrapper) {
-       ln := bufconn.Listen(1024 * 1024)
+       ln := bufconn.Listen(1024 * 10)
        return ln, &BufConnWrapper{listener: ln}
 }

(Or split the difference and do 1024 * 100 if that feels dodgy.)

tgross added a commit that referenced this issue Sep 15, 2023
In #12458 we added an in-memory connection buffer so that template runners that
want access to the Nomad API for Service Registration and Variables can
communicate with Nomad without having to create a real HTTP client. The size of
this buffer (1 MiB) was taken directly from its usage in Vault, and each
connection makes 2 such buffers (send and receive). Because each template runner
has its own connection, when there are large numbers of allocations this adds up
to significant memory usage.

The largest Nomad Variable payload is 64KiB, plus a small amount of
metadata. Service Registration responses are much smaller, and we don't include
check results in them (as Consul does), so the size is relatively bounded. We
should be able to safely reduce the size of the buffer by a factor of 10 or more
without forcing the template runner to make multiple read calls over the buffer.

Fixes: #18508
@tgross (Member) commented Sep 15, 2023

After some discussion with the team, I've opened #18524 with a draft PR setting the buffer to 100KiB so as to allow reading the largest Variable with a single read call.

@deverton-godaddy (Contributor, Author)

I'll dig into why we see a much higher RSS at some point, because I was also surprised that the heap profile didn't match what we see at the system level. I figured I'd report the obvious outlier first, since it seemed like it might be an easy fix, and it looks like it is.

You mention that "each running template is its own instance of consul-template", which suggests that if a job has, say, 4 templates, there are 4 connections and so 4 MB of buffers (before the patch)? How feasible is it that this might eventually be converted to a single shared buffer (or a pool of them)? At the density and template use we're targeting, even the 10x reduction in memory usage might not be enough.

tgross added a commit that referenced this issue Sep 18, 2023
@tgross (Member) commented Sep 18, 2023

You mention that "each running template is its own instance of consul-template", which suggests that if a job has, say, 4 templates, there are 4 connections and so 4 MB of buffers (before the patch)?

Correct.

How feasible is it that this might eventually be converted to a single shared buffer (or a pool of them)? At the density and template use we're targeting, even the 10x reduction in memory usage might not be enough.

We're probably looking to make template runners more isolated, rather than less. The sandbox the runners live in is currently all code in the agent, and this has been a historic source of privilege escalation bugs that we'd like to solve by design rather than by relying on being infallible developers (😀). Something more like what we've done with artifact blocks, where the runner is in a separate process, is probably where we want to go. That'd also let us do things like persist template state across client agent restarts (#13313) and keep template updates coming even when the client is down.

But we don't want to do that kind of thing and then make the memory usage problem worse, for sure, so we'd likely need to rework a lot of the consul-template internals so that we end up with a very lean process. In the meantime, I've just landed the PR to reduce the buffer size. I might try to look at making that a user-configurable option, because I suspect we could even get away with 10k instead of 100k for a lot of users, but I'm not going to be able to look at that for a few weeks at best.

@tgross added this to the 1.6.x milestone Sep 18, 2023