Handle Nomad leadership flapping (attempt 2) #6977

Merged 6 commits into master on Jan 28, 2020

Conversation

@notnoop (Contributor) commented on Jan 22, 2020

Another attempt at #6951.

Fixes a deadlock in leadership handling when leadership flaps.

Context

Raft propagates leadership transitions to Nomad through a NotifyCh channel. Raft blocks when writing to this channel, so the channel must be buffered or aggressively consumed[1]. Otherwise, Raft blocks indefinitely in raft.runLeader until the channel is consumed[2] and does not move on to executing the follower-related logic (in raft.runFollower).

While Raft's runLeader defer function is blocked, Raft cannot process any other operations. For example, the run{Leader|Follower} methods consume raft.applyCh, so while the runLeader defer is blocked, all Raft log applications and configuration lookups block indefinitely.

Sadly, leaderLoop and establishLeadership make a few Raft calls! establishLeadership attempts to auto-create the autopilot/scheduler config [3], and leaderLoop attempts to check the Raft configuration [4]. All of these calls occur without a timeout.

Thus, if leadership flaps quickly while leaderLoop/establishLeadership is running and hits any of these Raft calls, the Raft handler deadlocks forever.
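
To make the circular wait concrete, here is a toy model using plain channels - not real Raft or Nomad code, and the names (notifyCh, raftCallDone) are purely illustrative. One goroutine plays the role of runLeader's defer and blocks publishing a transition, while the main goroutine plays establishLeadership and blocks on a "Raft call" that only the other goroutine can complete:

package main

import (
	"fmt"
	"time"
)

func main() {
	// Unbuffered, like an unbuffered NotifyCh.
	notifyCh := make(chan bool)
	// Stands in for Raft work (log applies, configuration lookups) that
	// is only serviced once the "runLeader" goroutine can make progress.
	raftCallDone := make(chan struct{})

	// Stand-in for raft.runLeader's defer: it must publish the "lost
	// leadership" transition before Raft can service anything else.
	go func() {
		notifyCh <- false   // blocks until someone consumes notifyCh
		close(raftCallDone) // never reached while notifyCh is unconsumed
	}()

	// Stand-in for establishLeadership/leaderLoop: it makes a Raft call
	// with no timeout before it gets back to consuming notifyCh.
	select {
	case <-raftCallDone:
		fmt.Println("raft call completed")
	case <-time.After(500 * time.Millisecond):
		fmt.Println("deadlock: raft is blocked writing to notifyCh; we are blocked on the raft call")
	}
}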

The above explains some of the observations in #4749, though I'm not fully sure it explains some of the yamux goroutines; I will follow up there.

Approach

This PR introduces a helper channel that effectively proxies NotifyCh but drops all but the last leadership notification value. If Raft leadership flaps multiple times while we attempt to gain or drop leadership, we only take the step once. One edge case is handled explicitly: when we step down but then gain leadership again. A sketch of the idea is shown below.
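
A minimal sketch of the idea, assuming the dropButLastChannel(sourceCh, shutdownCh) signature exercised by the test later in this conversation; the PR's actual implementation differs in details (it is written as a goto-based state machine and, per the discussion below, also handles a closed source channel):

package main

import (
	"fmt"
	"time"
)

// dropButLastChannel proxies sourceCh to the returned channel, but if the
// consumer is slow, keeps only the most recent value and drops the rest.
func dropButLastChannel(sourceCh <-chan bool, shutdownCh <-chan struct{}) <-chan bool {
	dstCh := make(chan bool)

	go func() {
		var lastValue bool

		for {
			// Block until the next leadership transition arrives.
			select {
			case lastValue = <-sourceCh:
			case <-shutdownCh:
				return
			}

			// Try to hand it to the consumer; if newer transitions
			// arrive first, keep only the most recent one.
		DELIVER:
			for {
				select {
				case lastValue = <-sourceCh:
				case dstCh <- lastValue:
					break DELIVER
				case <-shutdownCh:
					return
				}
			}
		}
	}()

	return dstCh
}

func main() {
	shutdownCh := make(chan struct{})
	defer close(shutdownCh)

	src := make(chan bool)
	dst := dropButLastChannel(src, shutdownCh)

	// Simulate rapid flapping while the consumer is busy elsewhere.
	go func() {
		for _, v := range []bool{true, false, true, false, false, true} {
			src <- v
		}
	}()

	time.Sleep(10 * time.Millisecond)
	// Only the latest transition is observed.
	fmt.Println("leader:", <-dst)
}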

Follow-up PRs:

Consul had a similar issue in hashicorp/consul#6852. I chose a slightly different fix than simply increasing the channel buffer.

We should follow up with a few leadership-related changes that Consul introduced:

Additionally, establishLeadership ideally needs to be fast so Nomad can react quickly, and we shouldn't attempt to apply Raft log transactions there. The autopilot/scheduler config creation might make more sense as a watcher for Serf membership changes rather than for leadership changes, e.g. we should create the scheduler config when all servers can honor the log entry, which may happen without a leadership transition (e.g. when the last follower upgrades).

[1] https://github.com/hashicorp/raft/blob/d90d6d6bdacf1b35d66940b07be515b074d89e88/config.go#L190-L193
[2] https://github.com/hashicorp/raft/blob/d90d6d6bdacf1b35d66940b07be515b074d89e88/raft.go#L425-L436
[3] nomad/leader.go, lines 198 to 202 in 2a89e47:

s.getOrCreateAutopilotConfig()
s.autopilot.Start()
// Initialize scheduler configuration
s.getOrCreateSchedulerConfig()

[4] nomad/leader.go, line 877 in 2a89e47:

configFuture := s.raft.GetConfiguration()

Mahmood Ali added 2 commits January 22, 2020 10:55
Fixes a deadlock in leadership handling if leadership flapped.

Raft propagates leadership transitions to Nomad through a NotifyCh channel.
Raft blocks when writing to this channel, so the channel must be buffered or
aggressively consumed[1]. Otherwise, Raft blocks indefinitely in `raft.runLeader`
until the channel is consumed[2] and does not move on to executing the
follower-related logic (in `raft.runFollower`).

While Raft's `runLeader` defer function is blocked, Raft cannot process any
other operations.  For example, the `run{Leader|Follower}` methods consume
`raft.applyCh`, so while the runLeader defer is blocked, all Raft log
applications and configuration lookups block indefinitely.

Sadly, `leaderLoop` and `establishLeadership` make a few Raft calls!
`establishLeadership` attempts to auto-create the autopilot/scheduler config [3],
and `leaderLoop` attempts to check the Raft configuration [4].  All of these
calls occur without a timeout.

Thus, if leadership flaps quickly while `leaderLoop/establishLeadership` is
running and hits any of these Raft calls, the Raft handler _deadlocks_ forever.

Depending on how many times it flapped and where exactly we got stuck, I suspect
it's possible to get into the following state:

* Agent metrics/stats HTTP and RPC calls hang as they check raft.Configurations
* raft.State remains in the Leader state, and the server attempts to handle RPC calls
  (e.g. node/alloc updates) and these hang as well

As we create goroutines per RPC call, the number of goroutines grows over time
and may trigger out-of-memory errors in addition to missed updates.

[1] https://github.com/hashicorp/raft/blob/d90d6d6bdacf1b35d66940b07be515b074d89e88/config.go#L190-L193
[2] https://github.com/hashicorp/raft/blob/d90d6d6bdacf1b35d66940b07be515b074d89e88/raft.go#L425-L436
[3] https://github.com/hashicorp/nomad/blob/2a89e477465adbe6a88987f0dcb9fe80145d7b2f/nomad/leader.go#L198-L202
[4] https://github.com/hashicorp/nomad/blob/2a89e477465adbe6a88987f0dcb9fe80145d7b2f/nomad/leader.go#L877
@notnoop added this to the 0.10.3 milestone on Jan 22, 2020
@notnoop self-assigned this on Jan 22, 2020
@notnoop added this to Triaged in Nomad - Community Issues Triage via automation on Jan 22, 2020
// this select before the implementation goroutine dequeues last value.
//
// However, never got it to fail in test - so leaving it now to see if it ever fails;
// and if/when test fails, we can learn of how much of an issue it is and adjust
Member:

We can fail it pretty trivially if we push the values into the channel concurrently with the test thread, but I'm not sure that tells us anything other than that we didn't get a chance to consume everything on the channel. If we pull all the values off, we're fine:

package main

import (
	"fmt"
	"testing"
	"time"

	"github.com/stretchr/testify/assert"
)

func TestDropButLastChannel(t *testing.T) {

	testFunc := func(t *testing.T) {
		t.Parallel()
		shutdownCh := make(chan struct{})

		src := make(chan bool)
		dst := dropButLastChannel(src, shutdownCh)

		timeoutDuration := 1 * time.Millisecond

		go func() {
			src <- false
			src <- false
			src <- false
			src <- false
			src <- false
			src <- false
			src <- true
			src <- false
			src <- true
		}()

		var v bool
	BREAK:
		for {
			select {
			case v = <-dst:
				fmt.Println("ok")
			case <-time.After(timeoutDuration):
				break BREAK
			}
		}

		assert.True(t, v)
		close(shutdownCh)
	}

	for i := 0; i < 1000; i++ {
		t.Run(fmt.Sprintf("test-%d", i), testFunc)
	}
}

Contributor Author (@notnoop):

Correct - if it's running on a different goroutine, we have no guarantee of delivery. This feels like another test to add.

Here, I wanted to test that intermediate messages get sent but dropped when no receive is happening on the channel - so I made the sends happen in the same goroutine. Though, in its current form, we still cannot 100% guarantee that the first message we receive is the last sent message; this hasn't happened in practice yet, hence my comment.

Your test is good to have, in that we should check that, ultimately, we always deliver the last message last.

Contributor Author (@notnoop):

BTW - seeing your test, I realized I didn't support closing the source channel (Raft doesn't attempt to close the notify channel); I updated the function to handle the close signal by always attempting to deliver the last known value, just in case someone adopts the function for something else in the future.

nomad/leader.go Outdated
continue
}
leaderStep := func(isLeader bool) {
switch {
Member:

Now that this has been pulled out of the select, could this switch be an if isLeader with an early return? (See the illustrative sketch below.)
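
For illustration only, the suggestion amounts to something like the following; the branch bodies are placeholders, not Nomad's actual leaderStep logic:

package main

import "fmt"

func main() {
	// Placeholder handlers, standing in for the real gained/lost logic.
	gainedLeadership := func() { fmt.Println("gained leadership") }
	lostLeadership := func() { fmt.Println("lost leadership") }

	// Instead of a switch, branch on isLeader and return early.
	leaderStep := func(isLeader bool) {
		if isLeader {
			gainedLeadership()
			return
		}
		lostLeadership()
	}

	leaderStep(true)
	leaderStep(false)
}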

nomad/leader.go Outdated
// Raft transitions that haven't been applied to the FSM
// yet.
// Ensure that that FSM caught up and eval queues are refreshed
s.logger.Error("cluster leadership flapped, lost and gained leadership immediately. Leadership flaps indicate a cluster wide problems (e.g. networking).")
Member:

Let's make this Warn since there's no directly actionable error.

nomad/leader.go Outdated
} else {
// Server gained but lost leadership immediately
// before it reacted; nothing to do, move on
s.logger.Error("cluster leadership flapped, gained and lost leadership immediately. Leadership flaps indicate a cluster wide problems (e.g. networking).")
Member:

I think this might be more accurate? I'm not sure gaining and losing leadership before establishing it indicates a cluster-wide problem.

Suggested change
s.logger.Error("cluster leadership flapped, gained and lost leadership immediately. Leadership flaps indicate a cluster wide problems (e.g. networking).")
s.logger.Warn("cluster leadership gained and lost. Could indicate network issues, memory paging, or high CPU load.")

AFAICT we only use the Raft defaults for leadership election (a 1s heartbeat timeout causes an election, and elections have a 1s timeout) and don't expose them for customization. It seems like clusters without demanding scheduler throughput or latency requirements may prefer higher timeouts to reduce elections (and therefore flapping).

If we allowed Raft to be configurable, we could at least point users toward those docs when these cases are encountered.

Contributor Author (@notnoop):

re: configuration - agreed - we should consider porting Consul's raft_multiplier [1] and study their defaults. In some Consul testing, they increased their default leadership timeout to 5 seconds to account for low-powered instances (e.g. t2.medium).

[1] https://www.consul.io/docs/install/performance.html#minimum-server-requirements
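
For context (not something this PR changes): hashicorp/raft exposes these timeouts on raft.Config, so a rough sketch of what relaxing them would look like, if Nomad surfaced such a knob, is below; the 5s values are illustrative, roughly what a raft_multiplier of 5 means in Consul:

package main

import (
	"fmt"
	"time"

	"github.com/hashicorp/raft"
)

func main() {
	// hashicorp/raft defaults ship with 1s heartbeat and election timeouts.
	cfg := raft.DefaultConfig()
	fmt.Println("defaults:", cfg.HeartbeatTimeout, cfg.ElectionTimeout)

	// Hypothetical relaxed settings for clusters that prefer fewer
	// elections over fast failover; Nomad did not expose these knobs
	// at the time of this PR.
	cfg.HeartbeatTimeout = 5 * time.Second
	cfg.ElectionTimeout = 5 * time.Second
	fmt.Println("relaxed:", cfg.HeartbeatTimeout, cfg.ElectionTimeout)
}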

// wait for first message
select {
case lv = <-sourceCh:
goto ENQUEUE_DST
Member:

Unnecessary.

Suggested change
goto ENQUEUE_DST

Contributor Author (@notnoop):

It's unnecessary indeed. I'd like to keep it, though, just because I find it easier to see all the state-machine transitions as goto statements.

@tgross (Member) left a comment:

LGTM

Mahmood Ali added 2 commits January 28, 2020 09:06
Always deliver last value then send close signal.
@notnoop merged commit 771c8ff into master on Jan 28, 2020
Nomad - Community Issues Triage automation moved this from Triaged to Done on Jan 28, 2020
@notnoop deleted the b-leadership-flapping-2 branch on Jan 28, 2020 at 16:40
@github-actions bot commented:

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions bot locked as resolved and limited conversation to collaborators on Jan 19, 2023