
Memory leak #3420

Closed
jzvelc opened this issue Oct 19, 2017 · 11 comments

Comments

@jzvelc commented Oct 19, 2017

Nomad version

0.6.3

Operating system and Environment details

Ubuntu 16.04.03 LTS (GNU/Linux 4.4.0-1038-aws x86_64)

Issue

We are running 3 Nomad servers and 5 Nomad clients.
On a daily basis we experience issues with Nomad clients crashing or consuming all available host memory. The issues started at Oct 19 07:45:28. I suspect this is related to permissions and GC when running Nomad as a non-root user (failed to remove alloc dir - permission denied).

The artifact was downloaded and extracted to local/data/app/cache.
This folder is then mounted to /data/app/cache:

volumes = [
    "local/data/app/cache:/data/app/cache"
]

This probably causes permission issues, since the uid and gid are wrong:
/var/lib/nomad/alloc/6a536fb2-e1b0-606c-5146-ff3ccd5023ca/php-fpm/local/data/app/cache/articles/twig/b8/b86ff870da894b5cdcf66b38c2e920b8e124d7fe7471327e2062929dcc2a6d16.php

-rw-r--r-- 1 82 82
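
For reference, a minimal sketch of how such a task might be laid out; this is not the reporter's actual job, and the image name and artifact URL are hypothetical. The point is that the container writes into the mounted local/ directory, so files there can end up owned by the container's uid/gid (82 here), which a non-root Nomad client may not be able to remove during GC.

task "php-fpm" {
  driver = "docker"

  # Hypothetical artifact source; extracted into the task's local/ dir.
  artifact {
    source      = "https://example.com/app-cache.tar.gz"
    destination = "local/data/app/cache"
  }

  config {
    image = "example/php-fpm:latest"

    # Bind-mount the downloaded cache into the container.
    volumes = [
      "local/data/app/cache:/data/app/cache"
    ]
  }
}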

Nomad Client logs

nomad client logs

@schmichael (Member)

I suspect that this is due to GC issues I'm working on as we speak! Is there any chance you could bump the log level to debug on a node and paste a similar subset?
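
A minimal sketch of bumping the log level in the agent configuration (the file name is illustrative; the same can be achieved with the -log-level=DEBUG command line flag):

# agent.hcl (illustrative file name)
log_level = "DEBUG"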

@jzvelc (Author) commented Oct 27, 2017

Here it is:
https://pastebin.com/CLew3MXC

Now I am also running Nomad as the root user, which solved the permission denied errors.

@schmichael (Member) commented Oct 27, 2017

Excellent, thanks for the updated logs. This in particular is definitely a bug I'll look into:

Oct 27 09:05:56 nomad-general-i-0e3e8913a0825a291 nomad[18779]: panic: runtime error: invalid memory address or nil pointer dereference
Oct 27 09:05:56 nomad-general-i-0e3e8913a0825a291 nomad[18779]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xf84516]
Oct 27 09:05:56 nomad-general-i-0e3e8913a0825a291 nomad[18779]: goroutine 165 [running]:
Oct 27 09:05:56 nomad-general-i-0e3e8913a0825a291 nomad[18779]: github.com/hashicorp/nomad/client.(*TaskRunner).Destroy(0x0, 0xc42063c000)
Oct 27 09:05:56 nomad-general-i-0e3e8913a0825a291 nomad[18779]:         /opt/gopath/src/github.com/hashicorp/nomad/client/task_runner.go:1754 +0x26
Oct 27 09:05:56 nomad-general-i-0e3e8913a0825a291 nomad[18779]: github.com/hashicorp/nomad/client.(*AllocRunner).destroyTaskRunners(0xc42019b4a0, 0xc42063c000)
Oct 27 09:05:56 nomad-general-i-0e3e8913a0825a291 nomad[18779]:         /opt/gopath/src/github.com/hashicorp/nomad/client/alloc_runner.go:903 +0x3db
Oct 27 09:05:56 nomad-general-i-0e3e8913a0825a291 nomad[18779]: github.com/hashicorp/nomad/client.(*AllocRunner).Run(0xc42019b4a0)
Oct 27 09:05:56 nomad-general-i-0e3e8913a0825a291 nomad[18779]:         /opt/gopath/src/github.com/hashicorp/nomad/client/alloc_runner.go:873 +0xd9e
Oct 27 09:05:56 nomad-general-i-0e3e8913a0825a291 nomad[18779]: created by github.com/hashicorp/nomad/client.(*Client).restoreState
Oct 27 09:05:56 nomad-general-i-0e3e8913a0825a291 nomad[18779]:         /opt/gopath/src/github.com/hashicorp/nomad/client/client.go:633 +0x729
Oct 27 09:05:56 nomad-general-i-0e3e8913a0825a291 nomad[18779]: panic: runtime error: invalid memory address or nil pointer dereference
Oct 27 09:05:56 nomad-general-i-0e3e8913a0825a291 nomad[18779]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xf84516]

Regarding the permissions issues: had you initially run Nomad as root and then restarted it as a non-root user? I'm not sure how else you would get permission issues. Generally we recommend running Nomad as root, although the docker and raw_exec drivers should work fine as non-root users.

Are you still getting OOM killed after running as root again? If so, do you have any logs from when that happens?

@jzvelc (Author) commented Nov 2, 2017

I initially ran it as the nomad user. The problem was that Nomad wasn't able to GC downloaded artifacts due to permission issues. I solved this by switching to the root user.

Note that I am still constantly getting the following warning:

[WARN] client: garbage collection due to number of allocations is over the limit (50) skipped because no terminal allocations

@schmichael (Member)

If you have over 50 running allocations on that node, that warning can be ignored. One way to check, if you have curl and jq installed:

$ curl -s localhost:4646/v1/node/e16562df/allocations | jq '. | length'
26

To get rid of the warning, you can bump the max_allocs setting, which also makes GC less aggressive.
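
A sketch of what that looks like in the client stanza; gc_max_allocs is assumed to be the config key behind the max_allocs setting mentioned above (default 50):

client {
  enabled = true

  # Allow more allocations on the node before count-based GC kicks in.
  gc_max_allocs = 100
}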

However, if you do not have 50 running allocations on this node, you were probably bitten by the bug fixed in #3445. The fix will be released in 0.7.1, but I've attached a build if you'd like to test it.

linux_amd64.zip

@schmichael (Member)

I've also submitted a follow-up PR to lower that log level in most situations: #3490

@jzvelc (Author) commented Nov 3, 2017

I switched to the build that you provided, but the client just won't start:
https://pastebin.com/RJvvNe5t

schmichael added a commit that referenced this issue Nov 3, 2017
Fixes the panic mentioned in
#3420 (comment)

While a leader task dying serially stops all follower tasks, the
synchronizing of state is asynchronous. Nomad can shut down before all
follower tasks have updated their state to dead, thus saving the state
necessary to hit this panic: *have a non-terminal alloc with a dead
leader.*

The actual fix is a simple nil check to not assume a non-terminal
alloc's leader has a TaskRunner.
schmichael added a commit that referenced this issue Nov 3, 2017
schmichael added a commit that referenced this issue Nov 3, 2017
@schmichael (Member)

It appears to be a bug unrelated to your previous issues: Nomad panics when restoring an alloc whose leader task failed before the previous shutdown.

Am I correct in assuming you have an allocation with a leader = true task? If not, I can keep digging!

If so please test the binary attached to #3502 if you're able.
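
For context, a minimal sketch of a group with a leader task (group, task, and driver names are hypothetical); when the leader task dies, Nomad stops the remaining follower tasks in the group:

group "app" {
  task "main" {
    driver = "docker"
    leader = true
    # ...
  }

  task "log-shipper" {
    driver = "docker"
    # ...
  }
}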

schmichael added a commit that referenced this issue Nov 13, 2017
@dadgar (Contributor) commented Nov 14, 2017

@jzvelc Would you be opposed to closing this? I believe Michael has a fix in #3502, but there is no information regarding a memory leak in this issue, and it seems instead to have been a permissions issue that you resolved.

@jzvelc (Author) commented Nov 14, 2017

I tested the linux_amd64.zip build you provided, and I can confirm that our Nomad clients are now stable. I wasn't able to test #3502, but I believe it will solve the issues with failed leader tasks.

jzvelc closed this as completed Nov 14, 2017
schmichael added a commit that referenced this issue Nov 15, 2017
@github-actions bot commented Dec 6, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked this as resolved and limited conversation to collaborators Dec 6, 2022