Fix panic draining when alloc on non-existent node #4215

Merged: dadgar merged 3 commits into master from b-drain on Apr 25, 2018

dadgar (Contributor) commented Apr 25, 2018

This PR fixes a panic that occurred when a job was being drained and the job had
an allocation that referenced a garbage-collected node.

Added a unit test around the function that was panicking, plus an integration test that adds an allocation pointing to a non-existent node for every job type and then triggers a drain. This tests that all parts of the code base handle the nil node (both tests panic without the fix and pass with it).

Fixes #4207
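
To illustrate the failure mode described above, here is a minimal sketch of tolerating an allocation whose node no longer exists. The `Node` and `Allocation` types, the map-based store, and the `drainingAllocs` helper are hypothetical stand-ins made up for this example; they are not Nomad's actual structs or state store API.

```go
package main

import "fmt"

// Node and Allocation are simplified, hypothetical stand-ins for illustration.
// They are not Nomad's structs.
type Node struct {
	ID       string
	Draining bool
}

type Allocation struct {
	ID     string
	NodeID string
}

// drainingAllocs returns the IDs of allocations whose node is draining.
// A lookup that misses (for example, because the node was garbage collected)
// yields a nil *Node, which is skipped instead of dereferenced.
func drainingAllocs(nodes map[string]*Node, allocs []Allocation) []string {
	var out []string
	for _, a := range allocs {
		node := nodes[a.NodeID] // nil when the node no longer exists
		if node == nil {
			continue
		}
		if node.Draining {
			out = append(out, a.ID)
		}
	}
	return out
}

func main() {
	nodes := map[string]*Node{"n1": {ID: "n1", Draining: true}}
	allocs := []Allocation{
		{ID: "a1", NodeID: "n1"},
		{ID: "a2", NodeID: "gone"}, // references a node that is not in the store
	}
	fmt.Println(drainingAllocs(nodes, allocs))
}
```

Without the `node == nil` check, reading `node.Draining` for the second allocation would dereference a nil pointer and panic.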

dadgar changed the title from "Fix panic draining when alloc on non-existant node" to "Fix panic draining when alloc on non-existent node" on Apr 25, 2018
burdandrei (Contributor) commented Apr 25, 2018

Just compiled this and pushed it to a flapping production cluster. It became the leader and stopped flapping!

qkate (Contributor) commented Apr 25, 2018

@burdandrei Awesome, thanks for confirming the fix! I saw your comment over on 4207 as well, just wanted to let you know we'll be cutting another release soon (on the order of days) with this fix in it.

burdandrei (Contributor) commented

@qkate release is great! let's add #3882 there too =)

@@ -125,6 +125,11 @@ func (n *drainingNode) DrainingJobs() ([]structs.NamespacedID, error) {
	n.l.RLock()
	defer n.l.RUnlock()

	// Should never happen
	if n.node == nil || n.node.DrainStrategy == nil {
		return nil, fmt.Errorf("node doesn't have a drain strategy set")
Review comment (Member)

Is returning an error much better than panicking? Won't it still cause leader flapping?

Reply (Contributor, Author)

It should be safe given how we use it: https://github.com/hashicorp/nomad/blob/master/nomad/drainer/watch_nodes.go#L51-L71

Update will only be called if the node is draining, so the check is mostly there to keep future callers from introducing bad logic.
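
As a rough sketch of the trade-off discussed in this review thread, the example below shows a guard clause that turns a would-be nil-pointer panic into an error the caller can handle. The `Node`, `DrainStrategy`, and `deadline` names are simplified assumptions for illustration only; just the shape of the nil check mirrors the diff above.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// DrainStrategy and Node are simplified, hypothetical stand-ins,
// not Nomad's structs.
type DrainStrategy struct{ Deadline int }

type Node struct {
	ID            string
	DrainStrategy *DrainStrategy
}

type drainingNode struct {
	l    sync.RWMutex
	node *Node
}

var errNoDrainStrategy = errors.New("node doesn't have a drain strategy set")

// deadline reads a field behind two pointers. The guard returns an error
// when either pointer is nil, so the caller can log it and move on instead
// of the whole process crashing.
func (n *drainingNode) deadline() (int, error) {
	n.l.RLock()
	defer n.l.RUnlock()

	if n.node == nil || n.node.DrainStrategy == nil {
		return 0, errNoDrainStrategy
	}
	return n.node.DrainStrategy.Deadline, nil
}

func main() {
	missing := &drainingNode{} // node was never set or has been removed
	if _, err := missing.deadline(); err != nil {
		fmt.Println("skipping:", err)
	}

	ok := &drainingNode{node: &Node{ID: "n1", DrainStrategy: &DrainStrategy{Deadline: 30}}}
	d, _ := ok.deadline()
	fmt.Println("deadline:", d)
}
```

An error confines the failure to one call, while a panic in a long-running server goroutine typically takes down the process; whether the caller then retries or skips the node is a separate design decision.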

dadgar merged commit f595e9f into master on Apr 25, 2018
dadgar deleted the b-drain branch on April 25, 2018 23:03
ortz (Contributor) commented Apr 26, 2018

great! I encountered it yesterday in our production environment :(

mlehner616 commented

Close one, I encountered this when testing draining in our staging environment. Thanks for the fix. I will be upgrading our prod cluster from 0.8.1 to 0.8.3 ASAP.

barotn commented Aug 31, 2018

I upgraded to 0.8.4 and the cluster ran fine for a few days. We then experienced some issues, and since then I have not been able to rebuild the cluster.

Aug 31 12:38:22 ip-10-40-240-80 nomad[16228]: 2018/08/31 12:38:22 [DEBUG] serf: messageJoinType: ip-10-40-240-80.eu-west-1
Aug 31 12:38:23 ip-10-40-240-80 nomad[16228]: 2018/08/31 12:38:23 [DEBUG] serf: messageJoinType: ip-10-40-240-80.eu-west-1
Aug 31 12:38:23 ip-10-40-240-80 nomad[16228]: 2018/08/31 12:38:23 [DEBUG] serf: messageLeaveType: ip-10-40-241-37.eu-west-1
Aug 31 12:38:23 ip-10-40-240-80 nomad[16228]: 2018/08/31 12:38:23 [DEBUG] serf: messageJoinType: ip-10-40-241-37.eu-west-1
Aug 31 12:38:24 ip-10-40-240-80 nomad[16228]: 2018/08/31 12:38:24 [DEBUG] serf: messageJoinType: ip-10-40-241-37.eu-west-1
Aug 31 12:38:24 ip-10-40-240-80 nomad[16228]: 2018/08/31 12:38:24 [DEBUG] serf: messageLeaveType: ip-10-40-241-37.eu-west-1
Aug 31 12:38:24 ip-10-40-240-80 nomad[16228]: 2018/08/31 12:38:24 [DEBUG] serf: messageJoinType: ip-10-40-241-37.eu-west-1
Aug 31 12:38:24 ip-10-40-240-80 nomad[16228]: 2018/08/31 12:38:24 [DEBUG] serf: messageLeaveType: ip-10-40-241-37.eu-west-1
Aug 31 12:38:25 ip-10-40-240-80 nomad[16228]: 2018/08/31 12:38:25 [DEBUG] serf: messageLeaveType: ip-10-40-241-37.eu-west-1
Aug 31 12:38:26 ip-10-40-240-80 nomad[16228]: 2018/08/31 12:38:26 [DEBUG] serf: messageLeaveType: ip-10-40-241-37.eu-west-1
Aug 31 12:38:26 ip-10-40-240-80 nomad[16228]: 2018/08/31 12:38:26 [DEBUG] serf: messageJoinType: ip-10-40-241-37.eu-west-1
Aug 31 12:38:26 ip-10-40-240-80 nomad[16228]: 2018/08/31 12:38:26.237107 [ERR] http: Request /v1/status/peers, error: No cluster leader
Aug 31 12:38:26 ip-10-40-240-80 nomad[16228]: 2018/08/31 12:38:26.237525 [DEBUG] http: Request GET /v1/status/peers (5.111428064s)
Aug 31 12:38:26 ip-10-40-240-80 nomad[16228]: 2018/08/31 12:38:26 [DEBUG] serf: messageJoinType: ip-10-40-241-37.eu-west-1
Aug 31 12:38:26 ip-10-40-240-80 nomad[16228]: 2018/08/31 12:38:26 [DEBUG] serf: messageLeaveType: ip-10-40-241-37.eu-west-1
Aug 31 12:38:26 ip-10-40-240-80 nomad[16228]: 2018/08/31 12:38:26.675554 [ERR] http: Request /v1/status/peers, error: No cluster leader
Aug 31 12:38:26 ip-10-40-240-80 nomad[16228]: 2018/08/31 12:38:26.676075 [DEBUG] http: Request GET /v1/status/peers (5.018258886s)
Aug 31 12:38:26 ip-10-40-240-80 nomad[16228]: 2018/08/31 12:38:26 [DEBUG] serf: messageJoinType: ip-10-40-241-37.eu-west-1
Aug 31 12:38:26 ip-10-40-240-80 nomad[16228]: 2018/08/31 12:38:26 [DEBUG] serf: messageJoinType: ip-10-40-241-37.eu-west-1
Aug 31 12:38:26 ip-10-40-240-80 nomad[16228]: 2018/08/31 12:38:26.756785 [ERR] http: Request /v1/status/peers, error: No cluster leader
Aug 31 12:38:26 ip-10-40-240-80 nomad[16228]: 2018/08/31 12:38:26.757314 [DEBUG] http: Request GET /v1/status/peers (5.14042012s)
Aug 31 12:38:26 ip-10-40-240-80 nomad[16228]: 2018/08/31 12:38:26 [DEBUG] memberlist: TCP connection from=10.40.241.37:39448
Aug 31 12:38:27 ip-10-40-240-80 nomad[16228]: 2018/08/31 12:38:27 [DEBUG] serf: messageJoinType: ip-10-40-241-37.eu-west-1
Aug 31 12:38:27 ip-10-40-240-80 nomad[16228]: 2018/08/31 12:38:27 [DEBUG] serf: messageJoinType: ip-10-40-241-37.eu-west-1
Aug 31 12:38:27 ip-10-40-240-80 nomad[16228]: 2018/08/31 12:38:27.473917 [ERR] worker: failed to dequeue evaluation: No cluster leader
Aug 31 12:38:27 ip-10-40-240-80 nomad[16228]: 2018/08/31 12:38:27 [DEBUG] serf: messageJoinType: ip-10-40-241-37.eu-west-1
Aug 31 12:38:27 ip-10-40-240-80 nomad[16228]: 2018/08/31 12:38:27 [DEBUG] serf: messageJoinType: ip-10-40-241-37.eu-west-1
Aug 31 12:38:28 ip-10-40-240-80 nomad[16228]: 2018/08/31 12:38:28 [DEBUG] serf: messageJoinType: ip-10-40-241-37.eu-west-1
Aug 31 12:38:28 ip-10-40-240-80 nomad[16228]: 2018/08/31 12:38:28 [DEBUG] memberlist: TCP connection from=172.17.0.3:58526
Aug 31 12:38:28 ip-10-40-240-80 nomad[16228]: 2018/08/31 12:38:28.199927 [DEBUG] server.nomad: lost contact with Nomad quorum, falling back to Consul for server list.

=========================

nomad server members
Name                        Address        Port  Status  Leader  Protocol  Build  Datacenter  Region
ip-10-40-240-237.eu-west-1  10.40.240.237  4648  alive   false   2         0.8.4  ccoe-dev    eu-west-1
ip-10-40-240-62.eu-west-1   10.40.240.62   4648  alive   false   2         0.8.4  ccoe-dev    eu-west-1
ip-10-40-241-124.eu-west-1  10.40.241.124  4648  alive   false   2         0.8.4  ccoe-dev    eu-west-1

Error determining leaders: 1 error(s) occurred:

  • Region "eu-west-1": Unexpected response code: 500 (No cluster leader)

=====================================

github-actions commented

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators on Feb 27, 2023
Linked issue: Nomad server failed and fails to recover - panic on start