
Nomad server failed and fails to recover - panic on start #4207

Closed
djenriquez opened this issue Apr 25, 2018 · 13 comments · Fixed by #4215

@djenriquez

djenriquez commented Apr 25, 2018

Nomad version

Nomad v0.8.0 (751b42a)

Operating system and Environment details

Issue

We had a development environment (thank goodness) go down all of a sudden... not really sure what happened, but none of the servers are able to establish leadership. Attempting to restart Nomad shows the following panic:

    2018/04/25 04:24:47.759383 [INFO] nomad: cluster leadership acquired
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0xa8 pc=0x10eaa49]

goroutine 146 [running]:
github.com/hashicorp/nomad/nomad/drainer.handleTaskGroup(0xc420b07d18, 0xc420b07600, 0xc42030c880, 0xc420b20700, 0x5, 0x8, 0x1, 0xc420b206c0, 0xc420b075f0, 0x4760e2)
	/opt/gopath/src/github.com/hashicorp/nomad/nomad/drainer/watch_jobs.go:355 +0x779
github.com/hashicorp/nomad/nomad/drainer.handleJob(0xc420b07d18, 0xc4203a9380, 0xc4207d3dc0, 0x5, 0x8, 0x1, 0xc4203a9380, 0x0, 0x0)
	/opt/gopath/src/github.com/hashicorp/nomad/nomad/drainer/watch_jobs.go:324 +0x575
github.com/hashicorp/nomad/nomad/drainer.(*drainingJobWatcher).watch(0xc420287e30)
	/opt/gopath/src/github.com/hashicorp/nomad/nomad/drainer/watch_jobs.go:214 +0x4b2
created by github.com/hashicorp/nomad/nomad/drainer.NewDrainingJobWatcher
	/opt/gopath/src/github.com/hashicorp/nomad/nomad/drainer/watch_jobs.go:88 +0x1e0

Can't get a leader up, can't get things running, and Nomad's outage recovery procedure doesn't work either. Looks like there's bad state somewhere?

Reproduction steps

Not really sure how to reproduce this; we found the cluster already in this state.

@djenriquez
Author

Just upgraded the binary to 0.8.1 and tried again:

Nomad Version

Nomad v0.8.1 (46aa11b)

    2018/04/25 04:29:59.500372 [INFO] nomad: cluster leadership acquired
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0xa8 pc=0x10eab99]

goroutine 146 [running]:
github.com/hashicorp/nomad/nomad/drainer.handleTaskGroup(0xc420e2dd18, 0xc420e2d600, 0xc42030a380, 0xc420d684e0, 0x3, 0x4, 0x1, 0xc420b16240, 0xc420e2d5f0, 0x4760e2)
	/opt/gopath/src/github.com/hashicorp/nomad/nomad/drainer/watch_jobs.go:355 +0x779
github.com/hashicorp/nomad/nomad/drainer.handleJob(0xc420e2dd18, 0xc4203a1380, 0xc420d683c0, 0x3, 0x4, 0x1, 0xc4203a1380, 0x0, 0x0)
	/opt/gopath/src/github.com/hashicorp/nomad/nomad/drainer/watch_jobs.go:324 +0x575
github.com/hashicorp/nomad/nomad/drainer.(*drainingJobWatcher).watch(0xc4202685b0)
	/opt/gopath/src/github.com/hashicorp/nomad/nomad/drainer/watch_jobs.go:214 +0x4b2
created by github.com/hashicorp/nomad/nomad/drainer.NewDrainingJobWatcher
	/opt/gopath/src/github.com/hashicorp/nomad/nomad/drainer/watch_jobs.go:88 +0x1e0

@djenriquez
Author

I was able to get past this by building a binary that checks whether the node is nil. Once that server was up, all of the other servers were able to join with the 0.8.0 release binary just fine. I then shut down the dev build, switched that server back to 0.8.0, and now everything is happily running 0.8.0 again.

See PR #4208
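
For anyone curious, here is a minimal, self-contained sketch of the kind of nil-node guard described above. The types are simplified stand-ins rather than the real nomad/structs definitions, and the actual change is in the linked PR; this only illustrates the pattern:

package main

import "fmt"

// Simplified stand-ins for the structs involved in the panic; the real
// types live in nomad/structs and are far richer.
type DrainStrategy struct{ Deadline int }

type Node struct{ DrainStrategy *DrainStrategy }

type Alloc struct{ NodeID string }

func main() {
	// One healthy draining node, plus an allocation whose node has been
	// garbage collected and is therefore missing from state.
	nodes := map[string]*Node{
		"node-a": {DrainStrategy: &DrainStrategy{Deadline: 60}},
	}
	allocs := []Alloc{{NodeID: "node-a"}, {NodeID: "node-gone"}}

	for _, alloc := range allocs {
		node := nodes[alloc.NodeID] // nil if the node is no longer in state
		if node == nil {
			// Without this guard, node.DrainStrategy below is a nil pointer
			// dereference, the same class of panic as in the trace above.
			fmt.Printf("alloc on %q: node missing from state, skipping\n", alloc.NodeID)
			continue
		}
		if node.DrainStrategy != nil {
			fmt.Printf("alloc on %q: node is draining\n", alloc.NodeID)
		}
	}
}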

@chelseakomlo
Contributor

Thanks for the PR, and the detailed description. Would you mind also including your agent and job configuration? We would like to reproduce on our end as well.

Were you starting from a fresh cluster state?

@djenriquez
Author

djenriquez commented Apr 25, 2018 via email

@djenriquez
Author

@chelseakomlo Looks like that deployment is in that bad state again. I need to fix it by doing what I did yesterday with the dev binary, but I should have logs to help out if needed.

@djenriquez
Author

djenriquez commented Apr 25, 2018

This is where it all came crashing down, not sure if these logs are useful though...

Apr 25 05:42:32 ip-10-2-9-105 nomad[13448]: worker: failed to dequeue evaluation: rpc error: EOF
Apr 25 05:42:32 ip-10-2-9-105 nomad[13448]: worker: failed to dequeue evaluation: rpc error: stream closed
Apr 25 05:42:33 ip-10-2-9-105 nomad[13448]: worker: failed to dequeue evaluation: rpc error: stream closed
Apr 25 05:42:33 ip-10-2-9-105 nomad[13448]: worker: failed to dequeue evaluation: rpc error: failed to get conn: dial tcp 10.2.19.112:4647: getsockopt: connection refused
Apr 25 05:42:33 ip-10-2-9-105 nomad[13448]: raft: Rejecting vote request from 10.2.11.3:4647 since we have a leader: 10.2.19.112:4647
Apr 25 05:42:34 ip-10-2-9-105 nomad[13448]: worker: failed to dequeue evaluation: rpc error: failed to get conn: dial tcp 10.2.19.112:4647: getsockopt: connection refused
Apr 25 05:42:35 ip-10-2-9-105 nomad[13448]: raft: Heartbeat timeout from "10.2.19.112:4647" reached, starting election

This server then retried elections unsuccessfully until I corrected it a few minutes ago.

One server reported:

Apr 25 05:42:32 ip-10-2-11-3 nomad[13504]: worker: failed to dequeue evaluation: rpc error: EOF
Apr 25 05:42:32 ip-10-2-11-3 nomad[13504]: worker: failed to dequeue evaluation: rpc error: stream closed
Apr 25 05:42:33 ip-10-2-11-3 nomad[13504]: worker: failed to dequeue evaluation: rpc error: stream closed
Apr 25 05:42:33 ip-10-2-11-3 nomad[13504]: worker: failed to dequeue evaluation: rpc error: failed to get conn: rpc error: lead thread didn't get connection
Apr 25 05:42:33 ip-10-2-11-3 nomad[13504]: raft: Heartbeat timeout from "10.2.19.112:4647" reached, starting election
Apr 25 05:42:33 ip-10-2-11-3 nomad[13504]: raft: Node at 10.2.11.3:4647 [Candidate] entering Candidate state in term 1393
Apr 25 05:42:33 ip-10-2-11-3 nomad[13504]: raft: Failed to make RequestVote RPC to {Voter 10.2.19.112:4647 10.2.19.112:4647}: dial tcp 10.2.19.112:4647: getsockopt: connection refused
Apr 25 05:42:35 ip-10-2-11-3 nomad[13504]: raft: Duplicate RequestVote for same term: 1393
Apr 25 05:42:35 ip-10-2-11-3 nomad[13504]: raft: Election timeout reached, restarting election
Apr 25 05:42:35 ip-10-2-11-3 nomad[13504]: raft: Node at 10.2.11.3:4647 [Candidate] entering Candidate state in term 1394
Apr 25 05:42:35 ip-10-2-11-3 nomad[13504]: raft: Failed to make RequestVote RPC to {Voter 10.2.19.112:4647 10.2.19.112:4647}: dial tcp 10.2.19.112:4647: getsockopt: connection refused
Apr 25 05:42:35 ip-10-2-11-3 nomad[13504]: raft: Election won. Tally: 2
Apr 25 05:42:35 ip-10-2-11-3 nomad[13504]: raft: Node at 10.2.11.3:4647 [Leader] entering Leader state
Apr 25 05:42:35 ip-10-2-11-3 nomad[13504]: raft: Added peer 10.2.9.105:4647, starting replication
Apr 25 05:42:35 ip-10-2-11-3 nomad[13504]: raft: Added peer 10.2.19.112:4647, starting replication
Apr 25 05:42:35 ip-10-2-11-3 nomad[13504]: nomad: cluster leadership acquired
Apr 25 05:42:35 ip-10-2-11-3 nomad[13504]: raft: Failed to AppendEntries to {Voter 10.2.19.112:4647 10.2.19.112:4647}: dial tcp 10.2.19.112:4647: getsockopt: connection refused
Apr 25 05:42:35 ip-10-2-11-3 nomad[13504]: raft: pipelining replication to peer {Voter 10.2.9.105:4647 10.2.9.105:4647}
Apr 25 05:42:35 ip-10-2-11-3 nomad[13504]: raft: Failed to AppendEntries to {Voter 10.2.19.112:4647 10.2.19.112:4647}: dial tcp 10.2.19.112:4647: getsockopt: connection refused
Apr 25 05:42:35 ip-10-2-11-3 init: nomad main process (13504) terminated with status 2

The third and final server reported:

Apr 25 05:40:52 ip-10-2-19-112 nomad[19679]: nomad.heartbeat: node 'b39baaa0-897f-5c82-9342-fb4317fed09e' TTL expired
Apr 25 05:42:32 ip-10-2-19-112 init: nomad main process (19679) terminated with status 2

I believe 10.2.11.3 was the leader, as it has the most recent log line stating:

Apr 25 05:42:35 ip-10-2-11-3 nomad[13504]: nomad: cluster leadership acquired

vs

Apr 25 05:07:43 ip-10-2-19-112 nomad[19679]: nomad: cluster leadership acquired

and

Apr 25 04:19:05 ip-10-2-9-105 nomad[12831]: nomad: cluster leadership acquired

EDIT:
10.2.19.112 was the leader; it was the last server to report leadership acquisition before the crash. 10.2.11.3's last leadership acquisition was actually:

Apr 25 05:02:55 ip-10-2-11-3 nomad[13271]: nomad: cluster leadership acquired

Only after 10.2.19.112 crashed did 10.2.11.3 acquire leadership at Apr 25 05:42:35.

@burdandrei
Contributor

Just received this in a production cluster =(
Any better ideas than running a binary compiled from @dadgar's PR?

@qkate
Contributor

qkate commented Apr 25, 2018

We're working on getting a release out with this fix in it soon (on the order of days).

@burdandrei
Contributor

burdandrei commented Apr 25, 2018

One thing to add: one of the Nomad clients was, for some reason, added to the Raft peer list:

Node                           ID               Address          State     Voter  RaftProtocol
(unknown)                      10.1.3.92:4647   10.1.3.92:4647   follower  true   unknown
i-0b06d30a175f35dd2.us-east-1  10.1.7.103:4647  10.1.7.103:4647  follower  true   2
i-017fcda8499a8da74.us-east-1  10.1.1.65:4647   10.1.1.65:4647   leader    true   2
i-026ffbbc26159973d.us-east-1  10.1.9.90:4647   10.1.9.90:4647   follower  true   2
i-0a84f22df5e58d875.us-east-1  10.1.9.143:4647  10.1.9.143:4647  follower  true   2
i-0e1487563f2f252a7.us-east-1  10.1.3.58:4647   10.1.3.58:4647   follower  true   2

@burdandrei
Contributor

UPD:
Looks like an old server's IP was reused because of autoscaling, and it made Nomad a bit crazy.
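
If anyone else ends up with a stale "(unknown)" entry like the one above, one way to drop it (assuming the server that used to own that IP really is gone) is the Raft operator command, pointed at the address shown in the list-peers output:

nomad operator raft remove-peer -peer-address=10.1.3.92:4647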

@djenriquez
Author

How old are we talking, @burdandrei? Terminated just hours or days ago? Is this something the garbage collector usually catches, but the IP was reused too soon?

@burdandrei
Contributor

The IP was reused within hours.

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 30, 2022