
Possible deadlock situation when leader is flapping #6852

Closed
2 tasks done
hanshasselberg opened this issue Nov 29, 2019 · 2 comments · Fixed by #6863 or #7079


hanshasselberg commented Nov 29, 2019

Overview of the Issue

While running some load tests, we saw two of the servers with more goroutines than we expected, and the number kept growing even after the load testing was done. The logs showed that the leader was flapping; one important detail is that the same server lost and re-acquired leadership within the same second.

Dumping the goroutines helped us understand what was going on:

  • monitorLeadership starts leaderLoop and adds it to a wait group:
    leaderLoop.Add(1)
    go func(ch chan struct{}) {
        defer leaderLoop.Done()
        s.leaderLoop(ch)
    }(weAreLeaderCh)
  • when raft signals that leadership is lost, monitorLeadership waits on that wait group before it can finally shut down. It is done this way to make sure Consul only ever runs a single leaderLoop:
    close(weAreLeaderCh)
    leaderLoop.Wait()
    weAreLeaderCh = nil
  • the channel that is used to signal raft leadership changes has a buffer capacity of only 1:
    raftNotifyCh := make(chan bool, 1)
  • if raft leadership flaps quickly, Consul is still waiting on the above wait group while raft is blocked: it already wrote true to the channel to indicate that leadership was acquired and now wants to write false, but that send cannot proceed because the buffer is full. At this point the raft leader loop is blocked (see the sketch after this list):
    select {
    case notify <- false:
    case <-r.shutdownCh:
        // On shutdown, make a best effort but do not block
        select {
        case notify <- false:
        default:
        }
    }
  • autopilot started PromoteNonVoters while the server was still leader, and that call is still running; it now blocks on getting the raft configuration, because raft no longer services its leader loop while it is waiting to write to the notify channel:
    func (d *AutopilotDelegate) PromoteNonVoters(conf *autopilot.Config, health autopilot.OperatorHealthReply) ([]raft.Server, error) {
        future := d.server.raft.GetConfiguration()
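
To make the blocking step concrete, here is a minimal, self-contained sketch of the pattern (not Consul code; the names notifyCh and leaderLoop only mirror the snippets above): a producer standing in for raft writes leadership changes into a channel with a buffer of 1, while the consumer standing in for monitorLeadership is stuck in WaitGroup.Wait() and never drains the channel, so the producer's second send blocks forever.

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    func main() {
        notifyCh := make(chan bool, 1) // capacity 1, like raftNotifyCh
        var leaderLoop sync.WaitGroup

        // Stand-in for leaderLoop: it never returns, modelling the real case
        // where revokeLeadership waits on autopilot, which waits on raft.
        leaderLoop.Add(1)
        go func() {
            defer leaderLoop.Done()
            select {} // blocked forever
        }()

        // Stand-in for raft: acquire leadership, then lose it right away.
        go func() {
            notifyCh <- true // fills the single buffer slot
            fmt.Println("raft: sent true")
            notifyCh <- false               // blocks: buffer full, nobody reads
            fmt.Println("raft: sent false") // never reached
        }()

        // Stand-in for monitorLeadership: it saw the earlier leadership change
        // and now waits for leaderLoop before it will read notifyCh again.
        done := make(chan struct{})
        go func() {
            leaderLoop.Wait() // never returns
            close(done)
        }()

        select {
        case <-done:
            fmt.Println("no deadlock")
        case <-time.After(2 * time.Second):
            fmt.Println("deadlocked: raft blocked on notifyCh, monitor blocked on Wait()")
        }
    }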

Fixes

  • buffer the channel more; the raft docs even recommend that:
    // NotifyCh is used to provide a channel that will be notified of leadership
    // changes. Raft will block writing to this channel, so it should either be
    // buffered or aggressively consumed.
    NotifyCh chan<- bool
  • restructure Consul so that the notify channel is consumed aggressively, decoupled from the leader loop (see the sketch below)
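
A rough sketch of what that decoupling could look like (illustrative names such as forwardLatest, rawCh, and latestCh are mine, not Consul's): a small forwarder goroutine consumes raft's notify channel as fast as raft writes to it, and keeps only the most recent leadership state around for the slower consumer.

    package main

    import "fmt"

    // forwardLatest drains rawCh as fast as raft can write to it and keeps only
    // the most recent leadership state in latestCh (capacity 1, written only
    // here), so a burst of flapping can never block raft's leader loop behind a
    // slow leaderLoop shutdown.
    func forwardLatest(rawCh <-chan bool, latestCh chan bool) {
        for v := range rawCh {
            // Discard a stale, unread value so the send below cannot block.
            select {
            case <-latestCh:
            default:
            }
            latestCh <- v
        }
    }

    func main() {
        rawCh := make(chan bool, 1)    // what raft writes to
        latestCh := make(chan bool, 1) // what the leadership monitor reads from

        done := make(chan struct{})
        go func() {
            forwardLatest(rawCh, latestCh)
            close(done)
        }()

        // Simulate fast flapping: the forwarder consumes promptly, so the
        // producer is never stuck behind a slow leaderLoop shutdown.
        rawCh <- true
        rawCh <- false
        rawCh <- true
        close(rawCh)
        <-done

        fmt.Println("latest leadership state observed by the consumer:", <-latestCh)
    }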

Related

Raw goroutine dump of the deadlocked server:

https://gist.github.com/banks/e6d14f2f94a49bbba48d52023b643008

banks added the type/bug (Feature does not function as expected) label Nov 29, 2019
hanshasselberg added a commit that referenced this issue Dec 2, 2019
Fixes #6852.

Increasing the buffer helps recovery from leader flapping. It lowers
the chances of the flapping leader getting into a deadlock situation like
the one described in #6852.

* [ ] test in an actual cluster

banks commented Dec 13, 2019

For posterity: I tried to re-follow this chain, and there are a couple of links missing from the description that I had to figure out again.

  • close(weAreLeaderCh)
    leaderLoop.Wait()
    weAreLeaderCh = nil

The channel close terminates leaderLoop, which set up a defer s.revokeLeadership() here:
    defer s.revokeLeadership()
  • revokeLeadership stops a number of leader-only activities, including autopilot:
    s.autopilot.Stop()
  • autopilot.Stop closes its shutdown channel and then waits on its internal wait group for all of its background health checking to stop (to be sure that the server is not doing any "leaderish" things any more):
    close(a.shutdownCh)
    a.waitGroup.Wait()
    a.enabled = false
  • as explained above, if autopilot has just triggered a promotion (which needs to read raft status) while still leader, that call can get blocked on the raft leader loop, which is itself waiting to deliver the next leadership-lost notification. A compressed sketch of how the cycle closes follows this list.
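
To illustrate that last link, here is a small, self-contained sketch (hypothetical names, not Consul code) of how a Stop() shaped like autopilot.Stop() can hang: it waits on a WaitGroup whose one outstanding goroutine is itself waiting on a raft call that will never be serviced.

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    type autopilotSketch struct {
        shutdownCh chan struct{}
        waitGroup  sync.WaitGroup
    }

    // Stop mirrors the shape quoted above: close the shutdown channel, then
    // wait for the background promotion/health goroutines to drain.
    func (a *autopilotSketch) Stop() {
        close(a.shutdownCh)
        a.waitGroup.Wait()
    }

    func main() {
        a := &autopilotSketch{shutdownCh: make(chan struct{})}

        // Stand-in for the goroutine that called PromoteNonVoters and is now
        // waiting on a raft future; it cannot observe shutdownCh because it is
        // stuck inside the blocking raft call, and raft will never answer while
        // its leader loop is blocked writing to the full notify channel.
        raftAnswers := make(chan struct{}) // never written to or closed
        a.waitGroup.Add(1)
        go func() {
            defer a.waitGroup.Done()
            <-raftAnswers
        }()

        // revokeLeadership calls Stop(), which now hangs on waitGroup.Wait().
        stopped := make(chan struct{})
        go func() {
            a.Stop()
            close(stopped)
        }()

        select {
        case <-stopped:
            fmt.Println("autopilot stopped cleanly")
        case <-time.After(2 * time.Second):
            fmt.Println("autopilot.Stop() is stuck: the cycle is closed")
        }
    }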

This is still pretty hard to hold in your head! But a possible sequence of events here between the 3+ goroutines involved in the deadlock cycle could be:

| Seq | Raft | Consul monitorLeadership | Consul leaderLoop | Autopilot |
|-----|------|--------------------------|-------------------|-----------|
| 1 | Leadership lost | | | |
| 2 | notifyCh <- false unblocks | | | |
| 3 | starts candidate loop | | | |
| 4 | | | | interval times out, triggers promoteServers |
| 5 | | closes weAreLeaderCh and waits for leaderLoop to return | | |
| 6 | | | revokeLeadership waits for autopilot to stop | |
| 7 | wins election | | | |
| 8 | starts new leader loop, sends notifyCh <- true, filling the chan buffer | | | |
| 9 | | | | PromoteNonVoters requests raft status (blocked waiting for the raft loop to service it) |
| 10 | Leadership lost again | | | |
| 11 | notifyCh <- false blocks | blocked waiting for leaderLoop to return | blocked waiting for autopilot.Stop() | blocked waiting for raft.Status() to be serviced by raft |


banks commented Dec 13, 2019

Bear in mind that this is on a heavily loaded server, so some of these operations that should take microseconds are likely struggling to get CPU time, and raft is flappy because it is not being serviced in a timely way. So if this seems implausible or hard to replicate, it would be in a controlled environment; once a server is overloaded, the chances of hitting something like this are much higher.

hanshasselberg self-assigned this Jan 20, 2020
hanshasselberg added a commit that referenced this issue Jan 22, 2020
* Increase raft notify buffer.

Fixes #6852.

Increasing the buffer helps recovery from leader flapping. It lowers
the chances of the flapping leader getting into a deadlock situation like
the one described in #6852.