
Nomad server failed and fails to recover - panic on start #4207

Closed
djenriquez opened this issue Apr 25, 2018 · 13 comments · Fixed by #4215

@djenriquez

djenriquez commented Apr 25, 2018

Nomad version

Nomad v0.8.0 (751b42a)

Operating system and Environment details

Issue

We had a development environment (thank goodness) go down all of a sudden... not really sure what happened, but none of the servers are able to establish leadership. Attempting to restart Nomad shows the following panic:

    2018/04/25 04:24:47.759383 [INFO] nomad: cluster leadership acquired
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0xa8 pc=0x10eaa49]

goroutine 146 [running]:
github.com/hashicorp/nomad/nomad/drainer.handleTaskGroup(0xc420b07d18, 0xc420b07600, 0xc42030c880, 0xc420b20700, 0x5, 0x8, 0x1, 0xc420b206c0, 0xc420b075f0, 0x4760e2)
	/opt/gopath/src/github.com/hashicorp/nomad/nomad/drainer/watch_jobs.go:355 +0x779
github.com/hashicorp/nomad/nomad/drainer.handleJob(0xc420b07d18, 0xc4203a9380, 0xc4207d3dc0, 0x5, 0x8, 0x1, 0xc4203a9380, 0x0, 0x0)
	/opt/gopath/src/github.com/hashicorp/nomad/nomad/drainer/watch_jobs.go:324 +0x575
github.com/hashicorp/nomad/nomad/drainer.(*drainingJobWatcher).watch(0xc420287e30)
	/opt/gopath/src/github.com/hashicorp/nomad/nomad/drainer/watch_jobs.go:214 +0x4b2
created by github.com/hashicorp/nomad/nomad/drainer.NewDrainingJobWatcher
	/opt/gopath/src/github.com/hashicorp/nomad/nomad/drainer/watch_jobs.go:88 +0x1e0

Can't get a leader up, can't get things running, and Nomad's outage recovery procedure doesn't work either. Looks like there's bad state somewhere?

Reproduction steps

Not really sure how to reproduce this; we found the cluster already in this state.

@djenriquez
Author

Just upgraded the binary to 0.8.1 and tried again:

Nomad Version

Nomad v0.8.1 (46aa11b)

    2018/04/25 04:29:59.500372 [INFO] nomad: cluster leadership acquired
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0xa8 pc=0x10eab99]

goroutine 146 [running]:
github.com/hashicorp/nomad/nomad/drainer.handleTaskGroup(0xc420e2dd18, 0xc420e2d600, 0xc42030a380, 0xc420d684e0, 0x3, 0x4, 0x1, 0xc420b16240, 0xc420e2d5f0, 0x4760e2)
	/opt/gopath/src/github.com/hashicorp/nomad/nomad/drainer/watch_jobs.go:355 +0x779
github.com/hashicorp/nomad/nomad/drainer.handleJob(0xc420e2dd18, 0xc4203a1380, 0xc420d683c0, 0x3, 0x4, 0x1, 0xc4203a1380, 0x0, 0x0)
	/opt/gopath/src/github.com/hashicorp/nomad/nomad/drainer/watch_jobs.go:324 +0x575
github.com/hashicorp/nomad/nomad/drainer.(*drainingJobWatcher).watch(0xc4202685b0)
	/opt/gopath/src/github.com/hashicorp/nomad/nomad/drainer/watch_jobs.go:214 +0x4b2
created by github.com/hashicorp/nomad/nomad/drainer.NewDrainingJobWatcher
	/opt/gopath/src/github.com/hashicorp/nomad/nomad/drainer/watch_jobs.go:88 +0x1e0

@djenriquez
Author

I was able to get past this by building a binary that checks whether the node is nil. Once that server was up, all of the other servers were able to join with the 0.8.0 release binary just fine. I then shut down the dev build, switched that server back to 0.8.0, and now everything is happily running 0.8.0 again.

See PR #4208
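
For anyone curious, here is a minimal, self-contained sketch of the kind of nil-node guard described above. The types are simplified stand-ins rather than the real nomad/structs definitions, and the actual change is in the linked PR; this only illustrates the pattern:

package main

import "fmt"

// Simplified stand-ins for the structs involved in the panic; the real
// types live in nomad/structs and are far richer.
type DrainStrategy struct{ Deadline int }

type Node struct{ DrainStrategy *DrainStrategy }

type Alloc struct{ NodeID string }

func main() {
	// One healthy draining node, plus an allocation whose node has been
	// garbage collected and is therefore missing from state.
	nodes := map[string]*Node{
		"node-a": {DrainStrategy: &DrainStrategy{Deadline: 60}},
	}
	allocs := []Alloc{{NodeID: "node-a"}, {NodeID: "node-gone"}}

	for _, alloc := range allocs {
		node := nodes[alloc.NodeID] // nil if the node is no longer in state
		if node == nil {
			// Without this guard, node.DrainStrategy below is a nil pointer
			// dereference, the same class of panic as in the trace above.
			fmt.Printf("alloc on %q: node missing from state, skipping\n", alloc.NodeID)
			continue
		}
		if node.DrainStrategy != nil {
			fmt.Printf("alloc on %q: node is draining\n", alloc.NodeID)
		}
	}
}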

@chelseakomlo
Contributor

Thanks for the PR, and the detailed description. Would you mind also including your agent and job configuration? We would like to reproduce on our end as well.

Were you starting from a fresh cluster state?

@djenriquez
Author

djenriquez commented Apr 25, 2018 via email

@djenriquez
Author

@chelseakomlo Looks like that deployment is in that bad state again. I need to fix it by doing what I did yesterday with the dev binary, but I should have logs to help out if needed.

@djenriquez
Author

djenriquez commented Apr 25, 2018

This is where it all came crashing down, not sure if these logs are useful though...

Apr 25 05:42:32 ip-10-2-9-105 nomad[13448]: worker: failed to dequeue evaluation: rpc error: EOF
Apr 25 05:42:32 ip-10-2-9-105 nomad[13448]: worker: failed to dequeue evaluation: rpc error: stream closed
Apr 25 05:42:33 ip-10-2-9-105 nomad[13448]: worker: failed to dequeue evaluation: rpc error: stream closed
Apr 25 05:42:33 ip-10-2-9-105 nomad[13448]: worker: failed to dequeue evaluation: rpc error: failed to get conn: dial tcp 10.2.19.112:4647: getsockopt: connection refused
Apr 25 05:42:33 ip-10-2-9-105 nomad[13448]: raft: Rejecting vote request from 10.2.11.3:4647 since we have a leader: 10.2.19.112:4647
Apr 25 05:42:34 ip-10-2-9-105 nomad[13448]: worker: failed to dequeue evaluation: rpc error: failed to get conn: dial tcp 10.2.19.112:4647: getsockopt: connection refused
Apr 25 05:42:35 ip-10-2-9-105 nomad[13448]: raft: Heartbeat timeout from "10.2.19.112:4647" reached, starting election

This server then retried elections unsuccessfully until I corrected it a few minutes ago.

One server reported:

Apr 25 05:42:32 ip-10-2-11-3 nomad[13504]: worker: failed to dequeue evaluation: rpc error: EOF
Apr 25 05:42:32 ip-10-2-11-3 nomad[13504]: worker: failed to dequeue evaluation: rpc error: stream closed
Apr 25 05:42:33 ip-10-2-11-3 nomad[13504]: worker: failed to dequeue evaluation: rpc error: stream closed
Apr 25 05:42:33 ip-10-2-11-3 nomad[13504]: worker: failed to dequeue evaluation: rpc error: failed to get conn: rpc error: lead thread didn't get connection
Apr 25 05:42:33 ip-10-2-11-3 nomad[13504]: raft: Heartbeat timeout from "10.2.19.112:4647" reached, starting election
Apr 25 05:42:33 ip-10-2-11-3 nomad[13504]: raft: Node at 10.2.11.3:4647 [Candidate] entering Candidate state in term 1393
Apr 25 05:42:33 ip-10-2-11-3 nomad[13504]: raft: Failed to make RequestVote RPC to {Voter 10.2.19.112:4647 10.2.19.112:4647}: dial tcp 10.2.19.112:4647: getsockopt: connection refused
Apr 25 05:42:35 ip-10-2-11-3 nomad[13504]: raft: Duplicate RequestVote for same term: 1393
Apr 25 05:42:35 ip-10-2-11-3 nomad[13504]: raft: Election timeout reached, restarting election
Apr 25 05:42:35 ip-10-2-11-3 nomad[13504]: raft: Node at 10.2.11.3:4647 [Candidate] entering Candidate state in term 1394
Apr 25 05:42:35 ip-10-2-11-3 nomad[13504]: raft: Failed to make RequestVote RPC to {Voter 10.2.19.112:4647 10.2.19.112:4647}: dial tcp 10.2.19.112:4647: getsockopt: connection refused
Apr 25 05:42:35 ip-10-2-11-3 nomad[13504]: raft: Election won. Tally: 2
Apr 25 05:42:35 ip-10-2-11-3 nomad[13504]: raft: Node at 10.2.11.3:4647 [Leader] entering Leader state
Apr 25 05:42:35 ip-10-2-11-3 nomad[13504]: raft: Added peer 10.2.9.105:4647, starting replication
Apr 25 05:42:35 ip-10-2-11-3 nomad[13504]: raft: Added peer 10.2.19.112:4647, starting replication
Apr 25 05:42:35 ip-10-2-11-3 nomad[13504]: nomad: cluster leadership acquired
Apr 25 05:42:35 ip-10-2-11-3 nomad[13504]: raft: Failed to AppendEntries to {Voter 10.2.19.112:4647 10.2.19.112:4647}: dial tcp 10.2.19.112:4647: getsockopt: connection refused
Apr 25 05:42:35 ip-10-2-11-3 nomad[13504]: raft: pipelining replication to peer {Voter 10.2.9.105:4647 10.2.9.105:4647}
Apr 25 05:42:35 ip-10-2-11-3 nomad[13504]: raft: Failed to AppendEntries to {Voter 10.2.19.112:4647 10.2.19.112:4647}: dial tcp 10.2.19.112:4647: getsockopt: connection refused
Apr 25 05:42:35 ip-10-2-11-3 init: nomad main process (13504) terminated with status 2

The third and final server reported:

Apr 25 05:40:52 ip-10-2-19-112 nomad[19679]: nomad.heartbeat: node 'b39baaa0-897f-5c82-9342-fb4317fed09e' TTL expired
Apr 25 05:42:32 ip-10-2-19-112 init: nomad main process (19679) terminated with status 2

I believe 10.2.11.3 was the leader, as it has the most recent log line stating:

Apr 25 05:42:35 ip-10-2-11-3 nomad[13504]: nomad: cluster leadership acquired

vs

Apr 25 05:07:43 ip-10-2-19-112 nomad[19679]: nomad: cluster leadership acquired

and

Apr 25 04:19:05 ip-10-2-9-105 nomad[12831]: nomad: cluster leadership acquired

EDIT:
10.2.19.112 was the leader; it was the last server to report leadership acquisition before the crash. 10.2.11.3's last leadership acquisition was actually:

Apr 25 05:02:55 ip-10-2-11-3 nomad[13271]: nomad: cluster leadership acquired

Only after 10.2.19.112 crashed did 10.2.11.3 acquire leadership at Apr 25 05:42:35.

@burdandrei
Contributor

Just received this in a production cluster =(
Any better ideas than running a binary compiled from @dadgar's PR?

@qkate
Contributor

qkate commented Apr 25, 2018

We're working on getting a release out with this fix in it soon (on the order of days).

@burdandrei
Contributor

burdandrei commented Apr 25, 2018

One thing to add: one of the Nomad clients was, for some reason, added to the Raft peer list:

Node                           ID               Address          State     Voter  RaftProtocol
(unknown)                      10.1.3.92:4647   10.1.3.92:4647   follower  true   unknown
i-0b06d30a175f35dd2.us-east-1  10.1.7.103:4647  10.1.7.103:4647  follower  true   2
i-017fcda8499a8da74.us-east-1  10.1.1.65:4647   10.1.1.65:4647   leader    true   2
i-026ffbbc26159973d.us-east-1  10.1.9.90:4647   10.1.9.90:4647   follower  true   2
i-0a84f22df5e58d875.us-east-1  10.1.9.143:4647  10.1.9.143:4647  follower  true   2
i-0e1487563f2f252a7.us-east-1  10.1.3.58:4647   10.1.3.58:4647   follower  true   2

@burdandrei
Contributor

UPD:
Looks like an old server's IP was reused because of autoscaling, and it made Nomad a bit crazy.
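
If anyone else ends up with a stale "(unknown)" entry like the one above, one way to drop it (assuming the server that used to own that IP really is gone) is the Raft operator command, pointed at the address shown in the list-peers output:

nomad operator raft remove-peer -peer-address=10.1.3.92:4647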

@djenriquez
Author

How old are we talking, @burdandrei? Terminated just hours or days ago? Is this something the garbage collector usually catches, but the IP was reused too soon?

@burdandrei
Contributor

The IP was reused within hours.

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 30, 2022