
Memory leak #3420

Closed
jzvelc opened this issue Oct 19, 2017 · 11 comments

Comments

@jzvelc commented Oct 19, 2017

Nomad version

0.6.3

Operating system and Environment details

Ubuntu 16.04.03 LTS (GNU/Linux 4.4.0-1038-aws x86_64)

Issue

We are running 3 Nomad servers and 5 Nomad clients.
On a daily basis we experience issues with Nomad clients crashing or consuming all available host memory. The issues started at Oct 19 07:45:28. I suspect this is related to permissions and GC when running Nomad as a non-root user (failed to remove alloc dir - permission denied).

The artifact was downloaded and extracted to local/data/app/cache.
This folder is then mounted to /data/app/cache:

volumes = [
    "local/data/app/cache:/data/app/cache"
]

This probably causes permission issues, since the uid and gid are wrong:
/var/lib/nomad/alloc/6a536fb2-e1b0-606c-5146-ff3ccd5023ca/php-fpm/local/data/app/cache/articles/twig/b8/b86ff870da894b5cdcf66b38c2e920b8e124d7fe7471327e2062929dcc2a6d16.php

-rw-r--r-- 1 82 82
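
For reference, a minimal sketch of how such a task might be laid out; this is not the reporter's actual job, and the image name and artifact URL are hypothetical. The point is that the container writes into the mounted local/ directory, so files there can end up owned by the container's uid/gid (82 here), which a non-root Nomad client may not be able to remove during GC.

task "php-fpm" {
  driver = "docker"

  # Hypothetical artifact source; extracted into the task's local/ dir.
  artifact {
    source      = "https://example.com/app-cache.tar.gz"
    destination = "local/data/app/cache"
  }

  config {
    image = "example/php-fpm:latest"

    # Bind-mount the downloaded cache into the container.
    volumes = [
      "local/data/app/cache:/data/app/cache"
    ]
  }
}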

Nomad Client logs

nomad client logs

@schmichael (Member)

I suspect that this is due to GC issues I'm working on as we speak! Is there any chance you could bump the log level to debug on a node and paste a similar subset?
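
A minimal sketch of bumping the log level in the agent configuration (the file name is illustrative; the same can be achieved with the -log-level=DEBUG command line flag):

# agent.hcl (illustrative file name)
log_level = "DEBUG"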

@jzvelc (Author) commented Oct 27, 2017

Here it is:
https://pastebin.com/CLew3MXC

Now I am also running Nomad as the root user, which solved the permission denied errors.

@schmichael (Member) commented Oct 27, 2017

Excellent, thanks for the updated logs. This in particular is definitely a bug I'll look into:

Oct 27 09:05:56 nomad-general-i-0e3e8913a0825a291 nomad[18779]: panic: runtime error: invalid memory address or nil pointer dereference
Oct 27 09:05:56 nomad-general-i-0e3e8913a0825a291 nomad[18779]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xf84516]
Oct 27 09:05:56 nomad-general-i-0e3e8913a0825a291 nomad[18779]: goroutine 165 [running]:
Oct 27 09:05:56 nomad-general-i-0e3e8913a0825a291 nomad[18779]: github.com/hashicorp/nomad/client.(*TaskRunner).Destroy(0x0, 0xc42063c000)
Oct 27 09:05:56 nomad-general-i-0e3e8913a0825a291 nomad[18779]:         /opt/gopath/src/github.com/hashicorp/nomad/client/task_runner.go:1754 +0x26
Oct 27 09:05:56 nomad-general-i-0e3e8913a0825a291 nomad[18779]: github.com/hashicorp/nomad/client.(*AllocRunner).destroyTaskRunners(0xc42019b4a0, 0xc42063c000)
Oct 27 09:05:56 nomad-general-i-0e3e8913a0825a291 nomad[18779]:         /opt/gopath/src/github.com/hashicorp/nomad/client/alloc_runner.go:903 +0x3db
Oct 27 09:05:56 nomad-general-i-0e3e8913a0825a291 nomad[18779]: github.com/hashicorp/nomad/client.(*AllocRunner).Run(0xc42019b4a0)
Oct 27 09:05:56 nomad-general-i-0e3e8913a0825a291 nomad[18779]:         /opt/gopath/src/github.com/hashicorp/nomad/client/alloc_runner.go:873 +0xd9e
Oct 27 09:05:56 nomad-general-i-0e3e8913a0825a291 nomad[18779]: created by github.com/hashicorp/nomad/client.(*Client).restoreState
Oct 27 09:05:56 nomad-general-i-0e3e8913a0825a291 nomad[18779]:         /opt/gopath/src/github.com/hashicorp/nomad/client/client.go:633 +0x729
Oct 27 09:05:56 nomad-general-i-0e3e8913a0825a291 nomad[18779]: panic: runtime error: invalid memory address or nil pointer dereference
Oct 27 09:05:56 nomad-general-i-0e3e8913a0825a291 nomad[18779]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xf84516]

Regarding the permissions issues: had you initially run Nomad as root and then restarted it as a non-root user? I'm not sure how else you would get permission issues. Generally we recommend running Nomad as root, although the docker and raw_exec drivers should work fine as non-root users.

Are you still getting OOM killed after running as root again? If so, do you have any logs from when that happens?

@jzvelc (Author) commented Nov 2, 2017

I initially ran it as the nomad user. The problem was that Nomad wasn't able to GC downloaded artifacts due to permission issues. I solved this by switching to the root user.

Note that I am still constantly getting the following warning:

[WARN] client: garbage collection due to number of allocations is over the limit (50) skipped because no terminal allocations

@schmichael (Member)

If you have over 50 running allocations on that node, that warning can be ignored. One way to check, if you have curl and jq installed:

$ curl -s localhost:4646/v1/node/e16562df/allocations | jq '. | length'
26

To get rid of the warning, you can bump the max_allocs setting, which also makes GC less aggressive.
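
A sketch of what that looks like in the client stanza; gc_max_allocs is assumed to be the config key behind the max_allocs setting mentioned above (default 50):

client {
  enabled = true

  # Allow more allocations on the node before count-based GC kicks in.
  gc_max_allocs = 100
}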

However, if you do not have 50 running allocations on this node, you were probably bitten by the bug fixed in #3445. The fix will be released in 0.7.1, but I've attached a build if you'd like to test it.

linux_amd64.zip

@schmichael (Member)

I've also submitted a follow-up PR to lower that log level in most situations: #3490

@jzvelc (Author) commented Nov 3, 2017

I switched to the build that you provided, but the client just won't start:
https://pastebin.com/RJvvNe5t

schmichael added a commit that referenced this issue Nov 3, 2017
Fixes the panic mentioned in
#3420 (comment)

While a leader task dying serially stops all follower tasks, the
synchronizing of state is asynchronous. Nomad can shut down before all
follower tasks have updated their state to dead, thus saving the state
necessary to hit this panic: *have a non-terminal alloc with a dead
leader.*

The actual fix is a simple nil check to not assume a non-terminal
alloc's leader has a TaskRunner.
schmichael added a commit that referenced this issue Nov 3, 2017
schmichael added a commit that referenced this issue Nov 3, 2017
@schmichael (Member)

It appears to be a bug unrelated to your previous issues: Nomad panics when restoring an alloc whose leader task failed before the previous shutdown.

Am I correct in assuming you have an allocation with a leader = true task? If not, I can keep digging!

If so please test the binary attached to #3502 if you're able.
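
For context, a minimal sketch of a group with a leader task (group, task, and driver names are hypothetical); when the leader task dies, Nomad stops the remaining follower tasks in the group:

group "app" {
  task "main" {
    driver = "docker"
    leader = true
    # ...
  }

  task "log-shipper" {
    driver = "docker"
    # ...
  }
}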

schmichael added a commit that referenced this issue Nov 13, 2017
@dadgar (Contributor) commented Nov 14, 2017

@jzvelc Would you be opposed to closing this? I believe Michael has a fix in #3502, but there is no information regarding a memory leak in this issue, and it seems instead to have been a permissions issue that you resolved.

@jzvelc (Author) commented Nov 14, 2017

I tested the linux_amd64.zip build you provided, and I can confirm that our Nomad clients are now stable. I wasn't able to test #3502, but I believe it will solve the issues with failed leader tasks.

jzvelc closed this as completed Nov 14, 2017
schmichael added a commit that referenced this issue Nov 15, 2017
@github-actions bot commented Dec 6, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked this as resolved and limited conversation to collaborators Dec 6, 2022