Fix panic draining when alloc on non-existent node #4215

dadgar · 2018-04-25T20:37:50Z

This PR fixes a panic that occurred when a job was being drained and the job had
an allocation that referenced a garbage-collected node.

Added a unit test around the function that was panicking, plus an integration test that adds an allocation pointing to a non-existent node for every job type and then triggers a drain. This verifies that all parts of the code base handle the nil node (both tests panic without the fix and pass with it).
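
For readers unfamiliar with the failure mode, here is a minimal sketch of the nil-node case described above. The types and the `isDraining` helper are hypothetical, simplified stand-ins and are not Nomad's actual structs or code; the sketch only assumes that looking up a garbage-collected node yields nil.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for illustration only.
type DrainStrategy struct{}

type Node struct {
	ID            string
	DrainStrategy *DrainStrategy
}

type Allocation struct {
	ID     string
	NodeID string
}

// isDraining reports whether the node an allocation references is draining.
// A garbage-collected node is absent from the map, so the lookup yields nil
// and must be checked before DrainStrategy is read.
func isDraining(nodes map[string]*Node, alloc *Allocation) bool {
	node := nodes[alloc.NodeID]
	if node == nil {
		// Without this guard, node.DrainStrategy would panic.
		return false
	}
	return node.DrainStrategy != nil
}

func main() {
	nodes := map[string]*Node{
		"n1": {ID: "n1", DrainStrategy: &DrainStrategy{}},
	}
	fmt.Println(isDraining(nodes, &Allocation{ID: "a1", NodeID: "n1"}))
	fmt.Println(isDraining(nodes, &Allocation{ID: "a2", NodeID: "gone"}))
}
```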

Fixes #4207

burdandrei · 2018-04-25T21:05:40Z

Just compiled this and pushed it to a flapping production cluster. It became leader and stopped flapping!

qkate · 2018-04-25T21:11:14Z

@burdandrei Awesome, thanks for confirming the fix! I saw your comment over on 4207 as well, just wanted to let you know we'll be cutting another release soon (on the order of days) with this fix in it.

burdandrei · 2018-04-25T21:35:23Z

@qkate release is great! let's add #3882 there too =)

schmichael · 2018-04-25T22:40:55Z

nomad/drainer/draining_node.go

@@ -125,6 +125,11 @@ func (n *drainingNode) DrainingJobs() ([]structs.NamespacedID, error) {
 	n.l.RLock()
 	defer n.l.RUnlock()

+	// Should never happen
+	if n.node == nil || n.node.DrainStrategy == nil {
+		return nil, fmt.Errorf("node doesn't have a drain strategy set")


Is returning an error much better than panicking? Won't it cause leader flapping as well?

It should be safe how we use it: https://github.com/hashicorp/nomad/blob/master/nomad/drainer/watch_nodes.go#L51-L71

Update will only be called if the node is draining, so the check is mainly there to keep future uses from introducing bad logic.
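
As a rough illustration of the error-versus-panic trade-off discussed here, the sketch below mirrors the shape of the guard in the diff using hypothetical, simplified types (the `drainingJobs` method and its string return values are assumptions for illustration, not the real `DrainingJobs` signature). The point is that a caller receives an error it can log or handle instead of the process crashing.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Hypothetical, simplified stand-ins for illustration only.
type DrainStrategy struct{}

type Node struct {
	DrainStrategy *DrainStrategy
}

type drainingNode struct {
	l    sync.RWMutex
	node *Node
}

var errNoDrainStrategy = errors.New("node doesn't have a drain strategy set")

// drainingJobs returns an error instead of dereferencing a nil node or a nil
// drain strategy, so the caller decides how to react.
func (n *drainingNode) drainingJobs() ([]string, error) {
	n.l.RLock()
	defer n.l.RUnlock()

	if n.node == nil || n.node.DrainStrategy == nil {
		return nil, errNoDrainStrategy
	}
	// Placeholder: a real implementation would collect the job IDs here.
	return []string{}, nil
}

func main() {
	dn := &drainingNode{} // node is nil, as if it had been garbage collected
	if _, err := dn.drainingJobs(); err != nil {
		fmt.Println("skipping node:", err)
	}
}
```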

ortz · 2018-04-26T08:05:15Z

great! I encountered it yesterday in our production environment :(

mlehner616 · 2018-06-06T19:45:44Z

Close one, I encountered this when testing draining in our staging environment. Thanks for the fix. I will be upgrading our prod cluster from 0.8.1 to 0.8.3 ASAP.

barotn · 2018-08-31T15:13:29Z

I upgraded to 0.8.4 and the cluster ran fine for a few days. We then experienced some issues, and since then I have not been able to rebuild the cluster.

2018/08/31 12:38:22 [DEBUG] serf: messageJoinType: ip-10-40-240-80.eu-west-1
2018/08/31 12:38:23 [DEBUG] serf: messageJoinType: ip-10-40-240-80.eu-west-1
2018/08/31 12:38:23 [DEBUG] serf: messageLeaveType: ip-10-40-241-37.eu-west-1
2018/08/31 12:38:23 [DEBUG] serf: messageJoinType: ip-10-40-241-37.eu-west-1
2018/08/31 12:38:24 [DEBUG] serf: messageJoinType: ip-10-40-241-37.eu-west-1
2018/08/31 12:38:24 [DEBUG] serf: messageLeaveType: ip-10-40-241-37.eu-west-1
2018/08/31 12:38:24 [DEBUG] serf: messageJoinType: ip-10-40-241-37.eu-west-1
2018/08/31 12:38:24 [DEBUG] serf: messageLeaveType: ip-10-40-241-37.eu-west-1
2018/08/31 12:38:25 [DEBUG] serf: messageLeaveType: ip-10-40-241-37.eu-west-1
2018/08/31 12:38:26 [DEBUG] serf: messageLeaveType: ip-10-40-241-37.eu-west-1
2018/08/31 12:38:26 [DEBUG] serf: messageJoinType: ip-10-40-241-37.eu-west-1
2018/08/31 12:38:26.237107 [ERR] http: Request /v1/status/peers, error: No cluster leader
2018/08/31 12:38:26.237525 [DEBUG] http: Request GET /v1/status/peers (5.111428064s)
2018/08/31 12:38:26 [DEBUG] serf: messageJoinType: ip-10-40-241-37.eu-west-1
2018/08/31 12:38:26 [DEBUG] serf: messageLeaveType: ip-10-40-241-37.eu-west-1
2018/08/31 12:38:26.675554 [ERR] http: Request /v1/status/peers, error: No cluster leader
2018/08/31 12:38:26.676075 [DEBUG] http: Request GET /v1/status/peers (5.018258886s)
2018/08/31 12:38:26 [DEBUG] serf: messageJoinType: ip-10-40-241-37.eu-west-1
2018/08/31 12:38:26 [DEBUG] serf: messageJoinType: ip-10-40-241-37.eu-west-1
2018/08/31 12:38:26.756785 [ERR] http: Request /v1/status/peers, error: No cluster leader
2018/08/31 12:38:26.757314 [DEBUG] http: Request GET /v1/status/peers (5.14042012s)
2018/08/31 12:38:26 [DEBUG] memberlist: TCP connection from=10.40.241.37:39448
2018/08/31 12:38:27 [DEBUG] serf: messageJoinType: ip-10-40-241-37.eu-west-1
2018/08/31 12:38:27 [DEBUG] serf: messageJoinType: ip-10-40-241-37.eu-west-1
2018/08/31 12:38:27.473917 [ERR] worker: failed to dequeue evaluation: No cluster leader
2018/08/31 12:38:27 [DEBUG] serf: messageJoinType: ip-10-40-241-37.eu-west-1
2018/08/31 12:38:27 [DEBUG] serf: messageJoinType: ip-10-40-241-37.eu-west-1
2018/08/31 12:38:28 [DEBUG] serf: messageJoinType: ip-10-40-241-37.eu-west-1
2018/08/31 12:38:28 [DEBUG] memberlist: TCP connection from=172.17.0.3:58526
2018/08/31 12:38:28.199927 [DEBUG] server.nomad: lost contact with Nomad quorum, falling back to Consul for server list.

=========================

nomad server members
Name                        Address        Port  Status  Leader  Protocol  Build  Datacenter  Region
ip-10-40-240-237.eu-west-1  10.40.240.237  4648  alive   false   2         0.8.4  ccoe-dev    eu-west-1
ip-10-40-240-62.eu-west-1   10.40.240.62   4648  alive   false   2         0.8.4  ccoe-dev    eu-west-1
ip-10-40-241-124.eu-west-1  10.40.241.124  4648  alive   false   2         0.8.4  ccoe-dev    eu-west-1

Error determining leaders: 1 error(s) occurred:

Region "eu-west-1": Unexpected response code: 500 (No cluster leader)

=====================================

github-actions · 2023-02-27T02:18:20Z

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

dadgar changed the title ~~Fix panic draining when alloc on non-existant node~~ Fix panic draining when alloc on non-existent node Apr 25, 2018

dadgar mentioned this pull request Apr 25, 2018

Check if drain alloc node exists #4208

Closed

schmichael approved these changes Apr 25, 2018

dadgar added 3 commits April 25, 2018 16:00

Fix detecting drain strategy on GC'd node

9fd5847

Safety guard

913a4d3

Changelog

7bdbe43

dadgar force-pushed the b-drain branch from 8ca8484 to 7bdbe43 April 25, 2018 23:01

dadgar merged commit f595e9f into master Apr 25, 2018

dadgar deleted the b-drain branch April 25, 2018 23:03

github-actions bot locked as resolved and limited conversation to collaborators Feb 27, 2023
