
Potential fixes for flapping leadership #6951

Closed · wants to merge 2 commits

Conversation

@notnoop (Contributor) commented Jan 16, 2020

This PR picks up some of the issues found in #4749. That issue covers other problems as well, but this change addresses one of the problems associated with leadership flapping.

In particular, this PR addresses two issues:

First, if a non-leader is promoted but then immediately loses leadership, don't bother establishing leadership and attempting Barrier raft transactions. Because the notification channel is buffered, we can peek at the rest of the channel before establishing leadership to see whether it was already lost. This should only occur when leadership is flapping while leaderLoop is running or shutting down.
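A minimal sketch of that peek-and-drain idea, using the `suppressLeadershipFlaps` signature shown in the diff later in this conversation (the body here is an illustrative assumption, not the PR's actual implementation, and it assumes the channel is never closed):

```go
// suppressLeadershipFlaps dequeues any buffered leadership notifications and
// returns the most recent buffered state, plus whether anything was dequeued.
// If the last buffered value says we already lost leadership, the caller can
// skip establishing leadership entirely.
func suppressLeadershipFlaps(isLeader bool, ch <-chan bool) (leader, suppressed bool) {
	leader = isLeader
	for {
		select {
		case v := <-ch:
			suppressed = true
			leader = v
		default:
			// nothing left in the buffer; leader holds the latest state seen
			return leader, suppressed
		}
	}
}
```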

Second, this fixes a condition where we may reattempt to reconcile and establish leadership even after we lost raft leadership.

Consider the case where a step-down occurs during the leaderLoop Barrier call and/or the call times out. The Barrier call times out after 2m by default, but the reconcile interval defaults to 1m. Thus, both the <-stopCh and <-interval clauses are ready in the WAIT select statement. Go may arbitrarily choose one, resulting in a potentially unnecessary Barrier call.

Here, we prioritize honoring stopCh and ensure we return early if Barrier or reconciliation fails.
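For illustration, here is a rough sketch of that prioritization under assumed names (`stopCh`, `interval`, `reconcile`); it is not the PR's actual `leaderLoop` code, just the pattern of re-checking the stop channel before doing more work:

```go
package leadership

import "time"

// waitLoop sketches the WAIT loop described above. Both select cases can be
// ready at once (e.g. a 2m Barrier timeout alongside a 1m reconcile interval),
// and Go picks between ready cases at random, so we re-check stopCh before
// reconciling and return early on failure.
func waitLoop(stopCh <-chan struct{}, interval <-chan time.Time, reconcile func() error) {
	for {
		select {
		case <-stopCh:
			return
		case <-interval:
			// Prioritize the stop signal over the timer tick.
			select {
			case <-stopCh:
				return
			default:
			}
			if err := reconcile(); err != nil {
				// Return early rather than looping again after a failure.
				return
			}
		}
	}
}
```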

Mahmood Ali added 2 commits January 16, 2020 10:09
This fixes a condition where we may attempt to reconcile and establish
leadership even after `stopCh` or `shutdownCh` is closed.

Consider the case where a step down occurs during the leaderLoop Barrier
call and/or it times out.  The Barrier call times out after 2m by
default, but reconcile interval defaults to 1m.  Thus, both `<-stopCh` and
`<-interval` clauses are ready in the `WAIT` select statement.  Go
may arbitrarily choose one, resulting in a potentially unnecessary
Barrier call.

Here, we prioritize honoring `stopCh` and ensure we return early if
Barrier or reconciliation fails.
If a non-leader is promoted but then immediately loses leadership, don't
bother establishing leadership or attempting Barrier raft transactions.
@notnoop notnoop added this to the 0.10.4 milestone Jan 16, 2020
@notnoop notnoop self-assigned this Jan 16, 2020
@notnoop notnoop added this to Triaged in Nomad - Community Issues Triage via automation Jan 16, 2020
@notnoop notnoop moved this from Triaged to In Progress in Nomad - Community Issues Triage Jan 16, 2020

// if gained and lost leadership immediately, move on without emitting error
if suppressed && !isLeader && weAreLeaderCh == nil {
	s.logger.Info("cluster leadership acquired but lost immediately")
Member

Does flapping indicate something an operator should be aware of such as excessive cpu load or network issues? If so we might want to document that in at least a code comment as I'm worried this log message may flummox the first users who hit it. If it does hint at some degenerate state, users who see this might already be stressed and confused, so if there's any more hints or context we can offer we should.

Member

To add to this, I've often found as an operator that info log level is indecisive. If it's something I need to pay attention to and take action on, it should probably be at warn. Otherwise it's telling me "hey this is something you want to know about" without telling me what to do with that information.

(My rule of thumb has often been that info is for information at startup like "I'm listening on port 1234 and running with plugins X and Y", and then not used after that, but I don't think we have firm guidelines anywhere on Nomad.)

Contributor

Hm, does this eliminate the leaderCh overflow? I don't think so. For example, it does not remove the situation where leadership is lost; you just increase the leaderCh buffer size, which IMHO is not enough in all cases.

// Returns:
// leader: last buffered leadership state
// suppressed: true if method dequeued elements from channel
func suppressLeadershipFlaps(isLeader bool, ch <-chan bool) (leader, suppressed bool) {
Member

Effectively what we're doing here is that if we get a notification that we're now the leader, we drain the notification channel to make sure that there hasn't been more than one notification since we handled the last one (in the outer loop of monitorLeadership).

While this solves for a very tight loop of leadership flapping, I think we can still get into a flapping state if the leadership transitions are happening at a rate which is slower than the outer loop of monitorLeadership -- in that case there will only ever be one notification in the queue at a time but the leadership loop will still be continuously flapping.

(Moving the WAIT: label in leaderLoop on the other hand I suspect will be really valuable for more realistic flapping cases.)
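For context, a simplified, assumed shape of that outer loop (the real `monitorLeadership` in Nomad differs) might look like the following; it shows why a single buffered notification per iteration still drives a full establish/revoke cycle even when the channel is drained:

```go
package leadership

// monitorLeadershipSketch is a hypothetical simplification of the outer loop
// referred to above. Each notification on leaderCh starts or stops the leader
// loop, so when transitions arrive slower than this loop consumes them, the
// buffer only ever holds one element at a time and draining it cannot
// coalesce the flaps.
func monitorLeadershipSketch(leaderCh <-chan bool, shutdownCh <-chan struct{}, startLeaderLoop func() (stop func())) {
	var stop func()
	for {
		select {
		case isLeader := <-leaderCh:
			if isLeader && stop == nil {
				stop = startLeaderLoop() // establish leadership
			} else if !isLeader && stop != nil {
				stop() // step down
				stop = nil
			}
		case <-shutdownCh:
			return
		}
	}
}
```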


-// Setup the leader channel
-leaderCh := make(chan bool, 1)
+// Set up a channel for reliable leader notifications.
+leaderCh := make(chan bool, 10)
@tantra35 (Contributor) commented Jan 20, 2020

:-) Hm, in many cases this will be enough, but why extend this value? You already added the suppressLeadershipFlaps function, which prevents leaderCh from overflowing. Just in case?

@notnoop (Contributor, Author) commented Jan 23, 2020

Closing this one as it's superseded by #6977.

@notnoop notnoop closed this Jan 23, 2020
Nomad - Community Issues Triage automation moved this from In Progress to Done Jan 23, 2020
@schmichael schmichael modified the milestones: 0.10.4, 0.10.3 Jan 30, 2020
@github-actions (bot)

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 19, 2023