SIGSEGV on startup of nomad client since 0.5.3 #2256

Closed
hynek opened this issue Jan 31, 2017 · 10 comments · Fixed by #2257

Comments

@hynek
Contributor

hynek commented Jan 31, 2017

Nomad version

Output from nomad version

Nomad v0.5.3

64-bit; tried both the LXC and non-LXC builds.

Operating system and Environment details

Host: Ubuntu Xenial, running in a LXC container.
Docker for Jobs:

Client:
 Version:      1.13.0
 API version:  1.25
 Go version:   go1.7.3
 Git commit:   49bf474
 Built:        Tue Jan 17 09:58:26 2017
 OS/Arch:      linux/amd64

Server:
 Version:      1.13.0
 API version:  1.25 (minimum version 1.12)
 Go version:   go1.7.3
 Git commit:   49bf474
 Built:        Tue Jan 17 09:58:26 2017
 OS/Arch:      linux/amd64
 Experimental: false

Issue

After updating to 0.5.3, nomad agent crashes on startup if started in client mode.

The crash seems to depend on whether jobs are present on the node (I upgraded without draining and ended up with a useless cluster). I’ve attached traces for both cases.

Currently I’m unable to get one of my nodes back up. :( The others for some reason are working again.

Let me know if you need any more intel or if you have any hints on how to resolve this…

Reproduction steps

Start it.

Nomad Server logs (if appropriate)

n/a, server works fine.

Nomad Client logs (if appropriate)

No jobs, clean docker

c-2001:~# /usr/local/bin/nomad agent -config /etc/nomad/client.hcl
    Loaded configuration from /etc/nomad/client.hcl
==> Starting Nomad agent...
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x5ed497]

goroutine 70 [running]:
panic(0x10227c0, 0xc420012090)
	/opt/go/src/runtime/panic.go:500 +0x1a1
github.com/hashicorp/nomad/client/driver.(*CreatedResources).Copy(0x0, 0x1255710)
	/opt/gopath/src/github.com/hashicorp/nomad/client/driver/driver.go:108 +0x57
github.com/hashicorp/nomad/client.(*TaskRunner).SaveState(0xc42000f600, 0x0, 0x0)
	/opt/gopath/src/github.com/hashicorp/nomad/client/task_runner.go:307 +0xc5
github.com/hashicorp/nomad/client.(*TaskRunner).setState(0xc42000f600, 0x119d299, 0x7, 0xc420189d40)
	/opt/gopath/src/github.com/hashicorp/nomad/client/task_runner.go:338 +0x3c
github.com/hashicorp/nomad/client.(*TaskRunner).createDriver.func1(0x11b3444, 0x17, 0xc4201dcde0, 0x2, 0x2)
	/opt/gopath/src/github.com/hashicorp/nomad/client/task_runner.go:380 +0x1f0
github.com/hashicorp/nomad/client/driver.(*DockerDriver).pullImage(0xc420454cd0, 0xc4201861c0, 0xc42039c640, 0xc42011d0b0, 0x1c, 0xc42011d0cd, 0x6, 0x2, 0x6)
	/opt/gopath/src/github.com/hashicorp/nomad/client/driver/docker.go:1007 +0x2bc
github.com/hashicorp/nomad/client/driver.(*DockerDriver).createImage(0xc420454cd0, 0xc4201861c0, 0xc42039c640, 0xc4203c6b60, 0x0, 0x0)
	/opt/gopath/src/github.com/hashicorp/nomad/client/driver/docker.go:972 +0x1af
github.com/hashicorp/nomad/client/driver.(*DockerDriver).Prestart(0xc420454cd0, 0xc4201dc080, 0xc420184820, 0x0, 0x0, 0xc4200ffe1c)
	/opt/gopath/src/github.com/hashicorp/nomad/client/driver/docker.go:383 +0xe2
github.com/hashicorp/nomad/client.(*TaskRunner).startTask(0xc42000f600, 0xc420142960, 0x0)
	/opt/gopath/src/github.com/hashicorp/nomad/client/task_runner.go:1160 +0x27d
github.com/hashicorp/nomad/client.(*TaskRunner).run(0xc42000f600)
	/opt/gopath/src/github.com/hashicorp/nomad/client/task_runner.go:902 +0x38a
github.com/hashicorp/nomad/client.(*TaskRunner).Run(0xc42000f600)
	/opt/gopath/src/github.com/hashicorp/nomad/client/task_runner.go:444 +0x6a1
created by github.com/hashicorp/nomad/client.(*AllocRunner).RestoreState
	/opt/gopath/src/github.com/hashicorp/nomad/client/alloc_runner.go:190 +0x891
c-2001:~#

Jobs present, still running inside Docker

Jan 31 11:16:53 c-2001 nomad-client[901]:     Loaded configuration from /etc/nomad/client.hcl
Jan 31 11:16:53 c-2001 nomad-client[901]: ==> Starting Nomad agent...
Jan 31 11:16:57 c-2001 nomad-client[901]: ==> Nomad agent configuration:
Jan 31 11:16:57 c-2001 nomad-client[901]:                  Atlas: <disabled>
Jan 31 11:16:57 c-2001 nomad-client[901]:                 Client: true
Jan 31 11:16:57 c-2001 nomad-client[901]:              Log Level: INFO
Jan 31 11:16:57 c-2001 nomad-client[901]:                 Region: global (DC: scaleup)
Jan 31 11:16:57 c-2001 nomad-client[901]:                 Server: false
Jan 31 11:16:57 c-2001 nomad-client[901]:                Version: 0.5.3
Jan 31 11:16:57 c-2001 nomad-client[901]: ==> Nomad agent started! Log data will stream in below:
Jan 31 11:16:57 c-2001 nomad-client[901]:     2017/01/31 11:16:53.707555 [INFO] client: using state directory /vrmd/nomad/client
Jan 31 11:16:57 c-2001 nomad-client[901]:     2017/01/31 11:16:53.707632 [INFO] client: using alloc directory /vrmd/nomad/alloc
Jan 31 11:16:57 c-2001 nomad-client[901]:     2017/01/31 11:16:53.708129 [INFO] fingerprint.cgroups: cgroups are available
Jan 31 11:16:57 c-2001 nomad-client[901]:     2017/01/31 11:16:53.712605 [INFO] fingerprint.consul: consul agent is available
Jan 31 11:16:57 c-2001 nomad-client[901]:     2017/01/31 11:16:57.737729 [INFO] driver.docker: re-attaching to docker process: 5e46cc5404ba8634d31b586be4bb7b03381be29f67c7987cf900a559fe3d0071
Jan 31 11:16:57 c-2001 nomad-client[901]:     2017/01/31 11:16:57.738667 [ERR] client: failed to open handle to task 'httpbin' for alloc '9c30dfb4-d9b3-6adb-148d-eea7b53eee9d': Failed to find container 5e46cc5404ba8634d31b586be4bb7b03381be29f67c7987cf900a559fe3d0071
Jan 31 11:16:57 c-2001 nomad-client[901]:     2017/01/31 11:16:57.740146 [INFO] driver.docker: re-attaching to docker process: 4407630c7f5a072387402dc64b323b0113ceac614ea8310e4c2ecf32ddcde64f
Jan 31 11:16:57 c-2001 nomad-client[901]:     2017/01/31 11:16:57.740957 [ERR] client: failed to open handle to task 'hashi-ui' for alloc 'c6440e03-e17e-64f6-704b-c656598c474e': Failed to find container 4407630c7f5a072387402dc64b323b0113ceac614ea8310e4c2ecf32ddcde64f
Jan 31 11:16:57 c-2001 nomad-client[901]:     2017/01/31 11:16:57.740990 [INFO] client: Node ID "1353a057-dcfb-def7-3385-87c75884b01e"
Jan 31 11:16:57 c-2001 nomad-client[901]: panic: runtime error: invalid memory address or nil pointer dereference
Jan 31 11:16:57 c-2001 nomad-client[901]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x5eb974]
Jan 31 11:16:57 c-2001 nomad-client[901]: goroutine 28 [running]:
Jan 31 11:16:57 c-2001 nomad-client[901]: panic(0x10001e0, 0xc420010080)
Jan 31 11:16:57 c-2001 nomad-client[901]: #011/opt/go/src/runtime/panic.go:500 +0x1a1
Jan 31 11:16:57 c-2001 nomad-client[901]: github.com/hashicorp/nomad/client/driver.(*CreatedResources).Merge(0x0, 0xc4200241d0)
Jan 31 11:16:57 c-2001 nomad-client[901]: #011/opt/gopath/src/github.com/hashicorp/nomad/client/driver/driver.go:127 +0xe4
Jan 31 11:16:57 c-2001 nomad-client[901]: github.com/hashicorp/nomad/client.(*TaskRunner).startTask(0xc42013d1e0, 0xc42040bda0, 0x0)
Jan 31 11:16:57 c-2001 nomad-client[901]: #011/opt/gopath/src/github.com/hashicorp/nomad/client/task_runner.go:1164 +0x2e3
Jan 31 11:16:57 c-2001 nomad-client[901]: github.com/hashicorp/nomad/client.(*TaskRunner).run(0xc42013d1e0)
Jan 31 11:16:57 c-2001 nomad-client[901]: #011/opt/gopath/src/github.com/hashicorp/nomad/client/task_runner.go:902 +0x38a
Jan 31 11:16:57 c-2001 nomad-client[901]: github.com/hashicorp/nomad/client.(*TaskRunner).Run(0xc42013d1e0)
Jan 31 11:16:57 c-2001 nomad-client[901]: #011/opt/gopath/src/github.com/hashicorp/nomad/client/task_runner.go:444 +0x6a1
Jan 31 11:16:57 c-2001 nomad-client[901]: created by github.com/hashicorp/nomad/client.(*AllocRunner).RestoreState
Jan 31 11:16:57 c-2001 nomad-client[901]: #011/opt/gopath/src/github.com/hashicorp/nomad/client/alloc_runner.go:190 +0x891
@tantra35
Contributor

tantra35 commented Jan 31, 2017

We hit the same crash after upgrading Nomad to 0.5.3 without a node-drain:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x5eb647]

goroutine 335 [running]:
panic(0x10001e0, 0xc420012060)
        /opt/go/src/runtime/panic.go:500 +0x1a1
github.com/hashicorp/nomad/client/driver.(*CreatedResources).Copy(0x0, 0x122e5b0)
        /opt/gopath/src/github.com/hashicorp/nomad/client/driver/driver.go:108 +0x57
github.com/hashicorp/nomad/client.(*TaskRunner).SaveState(0xc42008b4a0, 0x0, 0x0)
        /opt/gopath/src/github.com/hashicorp/nomad/client/task_runner.go:307 +0xc5
github.com/hashicorp/nomad/client.(*AllocRunner).saveTaskRunnerState(0xc4200be1e0, 0xc42008b4a0, 0x1, 0x1)
        /opt/gopath/src/github.com/hashicorp/nomad/client/alloc_runner.go:249 +0x35
github.com/hashicorp/nomad/client.(*AllocRunner).SaveState(0xc4200be1e0, 0xc420450150, 0xc420830c10)
        /opt/gopath/src/github.com/hashicorp/nomad/client/alloc_runner.go:215 +0xe9
github.com/hashicorp/nomad/client.(*Client).saveState(0xc4202cd040, 0x1188f94, 0x13)
        /opt/gopath/src/github.com/hashicorp/nomad/client/client.go:604 +0x120
github.com/hashicorp/nomad/client.(*Client).runAllocs(0xc4202cd040, 0xc4202ecbc0)
        /opt/gopath/src/github.com/hashicorp/nomad/client/client.go:1507 +0x797
github.com/hashicorp/nomad/client.(*Client).run(0xc4202cd040)
        /opt/gopath/src/github.com/hashicorp/nomad/client/client.go:969 +0x119
created by github.com/hashicorp/nomad/client.NewClient
        /opt/gopath/src/github.com/hashicorp/nomad/client/client.go:298 +0xc7a
    Loaded configuration from /etc/nomad/nomad.json

After some cleanup (we stopped the jobs that had been placed on the upgraded node and also ran node-drain on it), Nomad started working as expected.

@dadgar
Contributor

dadgar commented Jan 31, 2017

Hey, sorry this happened 👎. To recover, can you delete the client's data_dir and bring it back up?

@dadgar
Contributor

dadgar commented Jan 31, 2017

We will make sure 0.5.4 allows an in-place upgrade path for those who would like to wait!

@tantra35
Contributor

tantra35 commented Jan 31, 2017

Alex, may I conclude that running nomad node-drain first and then upgrading is safe?

@dadgar
Contributor

dadgar commented Jan 31, 2017

Potentially not. The client has some state files in the data_dir that it tries to restore from. In 0.5.3 we introduced new fields in that state_file and the upgrade isn't being handled properly it seems.

So I suggest you run nomad node-drain, then delete the data_dir, and bring the client back up.
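
For readers following along: both traces above show a `CreatedResources` method (`Copy` or `Merge`) being called with a nil receiver (the first argument is `0x0`). That fits the explanation that state written before 0.5.3 lacks the new field. The Go sketch below illustrates this general failure mode and two common guards. It is not Nomad's actual code or the actual fix: the `taskState` struct, its JSON layout, and the `restore` helper are hypothetical, and the `CreatedResources` type here is only a stand-in for the one named in the trace.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// CreatedResources is a hypothetical stand-in for the type named in the
// stack trace; the real Nomad definition may differ.
type CreatedResources struct {
	Resources map[string][]string
}

// Copy returns a deep copy. The nil check makes it safe to call on a nil
// receiver, which is the situation the traces show (receiver 0x0).
func (r *CreatedResources) Copy() *CreatedResources {
	if r == nil {
		return nil
	}
	out := &CreatedResources{Resources: make(map[string][]string, len(r.Resources))}
	for k, v := range r.Resources {
		out.Resources[k] = append([]string(nil), v...)
	}
	return out
}

// taskState is a hypothetical persisted snapshot. CreatedResources plays the
// role of a field that only newer releases write.
type taskState struct {
	HandleID         string            `json:"handle_id"`
	CreatedResources *CreatedResources `json:"created_resources"`
}

// restore decodes a snapshot and fills in fields an older release never wrote,
// so later code does not have to handle a nil pointer.
func restore(data []byte) (*taskState, error) {
	var s taskState
	if err := json.Unmarshal(data, &s); err != nil {
		return nil, err
	}
	if s.CreatedResources == nil {
		s.CreatedResources = &CreatedResources{Resources: map[string][]string{}}
	}
	return &s, nil
}

func main() {
	// A snapshot as an older release might have written it: the
	// created_resources key is absent, so the pointer decodes as nil.
	old := []byte(`{"handle_id": "abc123"}`)

	s, err := restore(old)
	if err != nil {
		panic(err)
	}
	fmt.Println(s.CreatedResources.Copy() != nil) // prints "true"

	// Even without the default in restore, the guarded Copy does not panic.
	var missing *CreatedResources
	fmt.Println(missing.Copy() == nil) // prints "true"
}
```

Either guard alone would avoid this particular panic in the sketch; using both keeps old snapshots loadable and the methods safe to call on a nil value.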

@dadgar dadgar added this to the v0.5.4 milestone Jan 31, 2017
@schmichael schmichael self-assigned this Jan 31, 2017
@schmichael
Member

Repro'd in like 30s using Nomad 0.5.2 and 0.5.3 binaries with the example.nomad Redis job.

Very embarrassed I let this slip in. Fix coming.

schmichael added a commit that referenced this issue Jan 31, 2017
Combined with b522c47 this fixes #2256

Without these two commits in place, upgrades to 0.5.3 panic.
schmichael added a commit that referenced this issue Jan 31, 2017
@holtwilkins
Contributor

holtwilkins commented Feb 1, 2017

Hey @schmichael, any idea when 0.5.4 will be up at https://releases.hashicorp.com/nomad/ ?

EDIT: it's there now!

@jippi
Contributor

jippi commented Feb 1, 2017

@schmichael we love you anyway!

@schmichael
Member

@holtwilkins It would have been up sooner, but this was only the second time I've driven a release and I was pretty slow at it. Thanks for your patience!

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 16, 2022