
Potential fixes for flapping leadership #6951

Closed · wants to merge 2 commits

Conversation

@notnoop (Contributor) commented Jan 16, 2020

This PR picks up some of the issues found in #4749. That issue covers other problems as well, but this change addresses one of the problems associated with leadership flapping.

In particular, this PR addresses two issues:

First, if a non-leader is promoted but then immediately loses leadership, don't bother establishing leadership and attempting Barrier raft transactions. Because the notification channel is buffered, we can peek at the rest of the channel before establishing leadership to see whether it was already lost. This should only occur when leadership is flapping while leaderLoop is running or shutting down.
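A minimal sketch of that peek-and-drain idea, using the `suppressLeadershipFlaps` signature shown in the diff later in this conversation (the body here is an illustrative assumption, not the PR's actual implementation, and it assumes the channel is never closed):

```go
// suppressLeadershipFlaps dequeues any buffered leadership notifications and
// returns the most recent buffered state, plus whether anything was dequeued.
// If the last buffered value says we already lost leadership, the caller can
// skip establishing leadership entirely.
func suppressLeadershipFlaps(isLeader bool, ch <-chan bool) (leader, suppressed bool) {
	leader = isLeader
	for {
		select {
		case v := <-ch:
			suppressed = true
			leader = v
		default:
			// nothing left in the buffer; leader holds the latest state seen
			return leader, suppressed
		}
	}
}
```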

Second, this fixes a condition where we may reattempt to reconcile and establish leadership even after we lost raft leadership.

Consider the case where a step-down occurs during the leaderLoop Barrier call and/or the call times out. The Barrier call times out after 2m by default, but the reconcile interval defaults to 1m. Thus, both the <-stopCh and <-interval clauses are ready in the WAIT select statement. Go may arbitrarily choose one, resulting in a potentially unnecessary Barrier call.

Here, we prioritize honoring stopCh and ensure we return early if Barrier or reconciliation fails.
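For illustration, here is a rough sketch of that prioritization under assumed names (`stopCh`, `interval`, `reconcile`); it is not the PR's actual `leaderLoop` code, just the pattern of re-checking the stop channel before doing more work:

```go
package leadership

import "time"

// waitLoop sketches the WAIT loop described above. Both select cases can be
// ready at once (e.g. a 2m Barrier timeout alongside a 1m reconcile interval),
// and Go picks between ready cases at random, so we re-check stopCh before
// reconciling and return early on failure.
func waitLoop(stopCh <-chan struct{}, interval <-chan time.Time, reconcile func() error) {
	for {
		select {
		case <-stopCh:
			return
		case <-interval:
			// Prioritize the stop signal over the timer tick.
			select {
			case <-stopCh:
				return
			default:
			}
			if err := reconcile(); err != nil {
				// Return early rather than looping again after a failure.
				return
			}
		}
	}
}
```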

Mahmood Ali added 2 commits January 16, 2020 10:09
This fixes a condition where we may attempt to reconcile and establish
leadership even after `stopCh` or `shutdownCh` is closed.

Consider the case where a step down occurs during the leaderLoop Barrier
call and/or it times out.  The Barrier call times out after 2m by
default, but reconcile interval defaults to 1m.  Thus, both `<-stopCh` and
`<-interval` clauses are ready in the `WAIT` select statement.  Go
may arbitrarily choose one, resulting in a potentially unnecessary
Barrier call.

Here, we prioritize honoring `stopCh` and ensure we return early if
Barrier or reconciliation fails.
If a non-leader is promoted but then immediately loses leadership, don't
bother establishing leadership or attempting Barrier raft transactions.
@notnoop notnoop added this to the 0.10.4 milestone Jan 16, 2020
@notnoop notnoop self-assigned this Jan 16, 2020
@notnoop notnoop added this to Triaged in Nomad - Community Issues Triage via automation Jan 16, 2020
@notnoop notnoop moved this from Triaged to In Progress in Nomad - Community Issues Triage Jan 16, 2020

// if gained and lost leadership immediately, move on without emitting error
if suppressed && !isLeader && weAreLeaderCh == nil {
	s.logger.Info("cluster leadership acquired but lost immediately")
Member

Does flapping indicate something an operator should be aware of such as excessive cpu load or network issues? If so we might want to document that in at least a code comment as I'm worried this log message may flummox the first users who hit it. If it does hint at some degenerate state, users who see this might already be stressed and confused, so if there's any more hints or context we can offer we should.

Member

To add to this, I've often found as an operator that info log level is indecisive. If it's something I need to pay attention to and take action on, it should probably be at warn. Otherwise it's telling me "hey this is something you want to know about" without telling me what to do with that information.

(My rule of thumb has often been that info is for information at startup like "I'm listening on port 1234 and running with plugins X and Y", and then not used after that, but I don't think we have firm guidelines anywhere on Nomad.)

Contributor

Hm, does this eliminate the leaderCh overflow? I don't think so. For example, it does not remove the situation where leadership is lost; you just increase the leaderCh buffer size, which IMHO is not enough in all cases.

// Returns:
// leader: last buffered leadership state
// suppressed: true if method dequeued elements from channel
func suppressLeadershipFlaps(isLeader bool, ch <-chan bool) (leader, suppressed bool) {
Member

Effectively what we're doing here is that if we get a notification that we're now the leader, we drain the notification channel to make sure that there hasn't been more than one notification since we handled the last one (in the outer loop of monitorLeadership).

While this solves for a very tight loop of leadership flapping, I think we can still get into a flapping state if the leadership transitions are happening at a rate which is slower than the outer loop of monitorLeadership -- in that case there will only ever be one notification in the queue at a time but the leadership loop will still be continuously flapping.

(Moving the WAIT: label in leaderLoop on the other hand I suspect will be really valuable for more realistic flapping cases.)
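For context, a simplified, assumed shape of that outer loop (the real `monitorLeadership` in Nomad differs) might look like the following; it shows why a single buffered notification per iteration still drives a full establish/revoke cycle even when the channel is drained:

```go
package leadership

// monitorLeadershipSketch is a hypothetical simplification of the outer loop
// referred to above. Each notification on leaderCh starts or stops the leader
// loop, so when transitions arrive slower than this loop consumes them, the
// buffer only ever holds one element at a time and draining it cannot
// coalesce the flaps.
func monitorLeadershipSketch(leaderCh <-chan bool, shutdownCh <-chan struct{}, startLeaderLoop func() (stop func())) {
	var stop func()
	for {
		select {
		case isLeader := <-leaderCh:
			if isLeader && stop == nil {
				stop = startLeaderLoop() // establish leadership
			} else if !isLeader && stop != nil {
				stop() // step down
				stop = nil
			}
		case <-shutdownCh:
			return
		}
	}
}
```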


-// Setup the leader channel
-leaderCh := make(chan bool, 1)
+// Set up a channel for reliable leader notifications.
+leaderCh := make(chan bool, 10)
@tantra35 (Contributor) commented Jan 20, 2020

:-) Hm, in many cases this will be enough, but why extend this value? You already added the suppressLeadershipFlaps function, which prevents leaderCh from overflowing. Just in case?

@notnoop (Contributor, Author) commented Jan 23, 2020

Closing this one as it's superseded by #6977.

@notnoop notnoop closed this Jan 23, 2020
Nomad - Community Issues Triage automation moved this from In Progress to Done Jan 23, 2020
@schmichael schmichael modified the milestones: 0.10.4, 0.10.3 Jan 30, 2020
@github-actions (bot)

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 19, 2023