
Possible deadlock situation when leader is flapping #6852

Closed
2 tasks done
hanshasselberg opened this issue Nov 29, 2019 · 2 comments · Fixed by #6863 or #7079


hanshasselberg commented Nov 29, 2019

Overview of the Issue

While running some load tests, we saw two of the servers with more goroutines than we expected, and the number kept growing even after the load testing was done. The logs showed that the leader was flapping; one important detail is that the same server lost and re-acquired leadership within the same second.

Dumping the goroutines helped us understand what was going on:

  • monitorLeadership starts leaderLoop and adds it to a wait group:
    leaderLoop.Add(1)
    go func(ch chan struct{}) {
        defer leaderLoop.Done()
        s.leaderLoop(ch)
    }(weAreLeaderCh)
  • when raft signals that leadership is lost, monitorLeadership waits on that wait group before it can finally shut down. It is done this way to make sure Consul only ever runs a single leaderLoop:
    close(weAreLeaderCh)
    leaderLoop.Wait()
    weAreLeaderCh = nil
  • the channel that is used to signal raft leadership changes has a buffer capacity of only 1:
    raftNotifyCh := make(chan bool, 1)
  • if raft leadership flaps quickly, Consul is still waiting on the above wait group while raft is blocked: it already wrote true to the channel to indicate that leadership was acquired and now wants to write false, but that send cannot proceed because the buffer is full. At this point the raft leader loop is blocked (see the sketch after this list):
    select {
    case notify <- false:
    case <-r.shutdownCh:
        // On shutdown, make a best effort but do not block
        select {
        case notify <- false:
        default:
        }
    }
  • autopilot started PromoteNonVoters while the server was still leader, and that call is still running; it now blocks on getting the raft configuration, because raft no longer services its leader loop while it is waiting to write to the notify channel:
    func (d *AutopilotDelegate) PromoteNonVoters(conf *autopilot.Config, health autopilot.OperatorHealthReply) ([]raft.Server, error) {
        future := d.server.raft.GetConfiguration()
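
To make the blocking step concrete, here is a minimal, self-contained sketch of the pattern (not Consul code; the names notifyCh and leaderLoop only mirror the snippets above): a producer standing in for raft writes leadership changes into a channel with a buffer of 1, while the consumer standing in for monitorLeadership is stuck in WaitGroup.Wait() and never drains the channel, so the producer's second send blocks forever.

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    func main() {
        notifyCh := make(chan bool, 1) // capacity 1, like raftNotifyCh
        var leaderLoop sync.WaitGroup

        // Stand-in for leaderLoop: it never returns, modelling the real case
        // where revokeLeadership waits on autopilot, which waits on raft.
        leaderLoop.Add(1)
        go func() {
            defer leaderLoop.Done()
            select {} // blocked forever
        }()

        // Stand-in for raft: acquire leadership, then lose it right away.
        go func() {
            notifyCh <- true // fills the single buffer slot
            fmt.Println("raft: sent true")
            notifyCh <- false               // blocks: buffer full, nobody reads
            fmt.Println("raft: sent false") // never reached
        }()

        // Stand-in for monitorLeadership: it saw the earlier leadership change
        // and now waits for leaderLoop before it will read notifyCh again.
        done := make(chan struct{})
        go func() {
            leaderLoop.Wait() // never returns
            close(done)
        }()

        select {
        case <-done:
            fmt.Println("no deadlock")
        case <-time.After(2 * time.Second):
            fmt.Println("deadlocked: raft blocked on notifyCh, monitor blocked on Wait()")
        }
    }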

Fixes

  • buffer the channel more; the raft docs even recommend that:
    // NotifyCh is used to provide a channel that will be notified of leadership
    // changes. Raft will block writing to this channel, so it should either be
    // buffered or aggressively consumed.
    NotifyCh chan<- bool
  • restructure Consul so that the notify channel is consumed aggressively, decoupled from the leader loop (see the sketch below)
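
A rough sketch of what that decoupling could look like (illustrative names such as forwardLatest, rawCh, and latestCh are mine, not Consul's): a small forwarder goroutine consumes raft's notify channel as fast as raft writes to it, and keeps only the most recent leadership state around for the slower consumer.

    package main

    import "fmt"

    // forwardLatest drains rawCh as fast as raft can write to it and keeps only
    // the most recent leadership state in latestCh (capacity 1, written only
    // here), so a burst of flapping can never block raft's leader loop behind a
    // slow leaderLoop shutdown.
    func forwardLatest(rawCh <-chan bool, latestCh chan bool) {
        for v := range rawCh {
            // Discard a stale, unread value so the send below cannot block.
            select {
            case <-latestCh:
            default:
            }
            latestCh <- v
        }
    }

    func main() {
        rawCh := make(chan bool, 1)    // what raft writes to
        latestCh := make(chan bool, 1) // what the leadership monitor reads from

        done := make(chan struct{})
        go func() {
            forwardLatest(rawCh, latestCh)
            close(done)
        }()

        // Simulate fast flapping: the forwarder consumes promptly, so the
        // producer is never stuck behind a slow leaderLoop shutdown.
        rawCh <- true
        rawCh <- false
        rawCh <- true
        close(rawCh)
        <-done

        fmt.Println("latest leadership state observed by the consumer:", <-latestCh)
    }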

Related

Raw goroutine dump of the deadlocked server:

https://gist.github.com/banks/e6d14f2f94a49bbba48d52023b643008

banks added the type/bug (Feature does not function as expected) label Nov 29, 2019
hanshasselberg added a commit that referenced this issue Dec 2, 2019
Fixes #6852.

Increasing the buffer helps recovery from leader flapping. It lowers
the chances of the flapping leader getting into a deadlock situation like
the one described in #6852.

* [ ] test in an actual cluster

banks commented Dec 13, 2019

For posterity: I tried to re-follow this chain, and there are a couple of links missing from the description that I had to figure out again.

  • close(weAreLeaderCh)
    leaderLoop.Wait()
    weAreLeaderCh = nil

The channel close terminates leaderLoop, which set up a defer s.revokeLeadership() here:
    defer s.revokeLeadership()
  • revokeLeadership stops a number of leader-only activities, including autopilot:
    s.autopilot.Stop()
  • autopilot.Stop closes its shutdown channel and then waits on its internal wait group for all of its background health checking to stop (to be sure that the server is not doing any "leaderish" things any more):
    close(a.shutdownCh)
    a.waitGroup.Wait()
    a.enabled = false
  • as explained above, if autopilot has just triggered a promotion (which needs to read raft status) while still leader, that call can get blocked on the raft leader loop, which is itself waiting to deliver the next leadership-lost notification. A compressed sketch of how the cycle closes follows this list.
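
To illustrate that last link, here is a small, self-contained sketch (hypothetical names, not Consul code) of how a Stop() shaped like autopilot.Stop() can hang: it waits on a WaitGroup whose one outstanding goroutine is itself waiting on a raft call that will never be serviced.

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    type autopilotSketch struct {
        shutdownCh chan struct{}
        waitGroup  sync.WaitGroup
    }

    // Stop mirrors the shape quoted above: close the shutdown channel, then
    // wait for the background promotion/health goroutines to drain.
    func (a *autopilotSketch) Stop() {
        close(a.shutdownCh)
        a.waitGroup.Wait()
    }

    func main() {
        a := &autopilotSketch{shutdownCh: make(chan struct{})}

        // Stand-in for the goroutine that called PromoteNonVoters and is now
        // waiting on a raft future; it cannot observe shutdownCh because it is
        // stuck inside the blocking raft call, and raft will never answer while
        // its leader loop is blocked writing to the full notify channel.
        raftAnswers := make(chan struct{}) // never written to or closed
        a.waitGroup.Add(1)
        go func() {
            defer a.waitGroup.Done()
            <-raftAnswers
        }()

        // revokeLeadership calls Stop(), which now hangs on waitGroup.Wait().
        stopped := make(chan struct{})
        go func() {
            a.Stop()
            close(stopped)
        }()

        select {
        case <-stopped:
            fmt.Println("autopilot stopped cleanly")
        case <-time.After(2 * time.Second):
            fmt.Println("autopilot.Stop() is stuck: the cycle is closed")
        }
    }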

This is still pretty hard to hold in your head! But a possible sequence of events here between the 3+ goroutines involved in the deadlock cycle could be:

| Seq | Raft | Consul monitorLeadership | Consul leaderLoop | Autopilot |
|-----|------|--------------------------|-------------------|-----------|
| 1 | Leadership lost | | | |
| 2 | notifyCh <- false unblocks | | | |
| 3 | starts candidate loop | | | |
| 4 | | | | interval times out, triggers promoteServers |
| 5 | | closes weAreLeaderCh and waits for leaderLoop to return | | |
| 6 | | | revokeLeadership waits for autopilot to stop | |
| 7 | wins election | | | |
| 8 | starts new leader loop, sends notifyCh <- true, filling the chan buffer | | | |
| 9 | | | | PromoteNonVoters requests raft status (blocked waiting for the raft loop to service it) |
| 10 | Leadership lost again | | | |
| 11 | notifyCh <- false blocks | blocked waiting for leaderLoop to return | blocked waiting for autopilot.Stop() | blocked waiting for raft.Status() to be serviced by raft |


banks commented Dec 13, 2019

Bear in mind that this is on a heavily loaded server, so some of these operations that should take microseconds are likely struggling to get CPU time, and raft is flappy because it is not being serviced in a timely way. So if this seems implausible or hard to replicate, it would be in a controlled environment; once a server is overloaded, the chances of hitting something like this are much higher.

hanshasselberg self-assigned this Jan 20, 2020
hanshasselberg added a commit that referenced this issue Jan 22, 2020
* Increase raft notify buffer.

Fixes #6852.

Increasing the buffer helps recovery from leader flapping. It lowers
the chances of the flapping leader getting into a deadlock situation like
the one described in #6852.