-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix rare leadership transfer failures when writes happen during transfer #581
Conversation
… happening during the transfer. Fix the problem partially: it can still occur if we dispatchLogs after the transfer message has been sent to the target, before it wins the election.
… changes: we weren't properly respecting stopCh.
…tionTimeout before resuming applies. Fix is in scare quotes because I'm not all sure this is acceptable.
…lel, and improve test log output.
…fail before responding to the transfer request.
…, which is racy in that in between calling pollState and setting up event monitoring, the cluster could've elected a leader. When that happens, Leader() errors returning 0 leaders.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks great and don't see any issues.
I did have one question in line about whether or not an alternative was easier to read/reason about but I'm not really convinced either way to be honest and I think both are effectively evuivalent so not blocking!
// Wait for up to ElectionTimeout before flagging the | ||
// leadership transfer as done and unblocking applies in | ||
// the leaderLoop. | ||
select { | ||
case <-time.After(r.config().ElectionTimeout): | ||
err := fmt.Errorf("leadership transfer timeout") | ||
r.logger.Debug(err.Error()) | ||
future.respond(err) | ||
case <-leftLeaderLoop: | ||
r.logger.Debug("lost leadership during transfer (expected)") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about just staying in the loop here instead of starting a new ElectionTimeout
ticker and having to duplicate the leftLeaderLoop code path too?
I was originally expecting that we'd not wait for another ElectionTimeout
but rather keep waiting for up to the original one we started for the transfer to complete. I don't think it's terrible to do it this way but it strikes me that it's less code and maybe simpler to reason able if the behaviour is just: block until leadership transfer works or one election timeout from handling the request...
The rationale for using ElectionTimeout here presumably was that, LeadershipTransfer is only an optimization to avoid just ungracefully stopping and having to wait for an election timeout... so letting leader transfer take longer than a whole election timeout is a little bit defeating the point and we should probably return fast and let the old leader just shut down if it wants to.
I think that rationale gets a bit more subtle in cases like Autopilot upgrade where we rely on Leadership Transfer working before we move on to the next phase rather than just shutting down anyway.
tl;dr, I don't think the behaviour here is bad or wrong especially and it beats the bug without this wait. If it turns out easier to follow to duplicate the code and make the behavior more explicit like you have here I'm OK with it. Just curious if you tried simply replacing this else
branch with continue
which I think is also a correct fix?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I tried that. It got messy due to the other cases wanting to read from doneCh. In the end I decided this was clearer.
@@ -2337,17 +2337,71 @@ func TestRaft_LeadershipTransferWithOneNode(t *testing.T) { | |||
} | |||
} | |||
|
|||
func TestRaft_LeadershipTransferWithWrites(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have anecdata about roughly how frequently this reproduced a bug before your fix locally? I assume it's still non-deterministic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't give you numbers, but it didn't take many tries to provoke a failure. It can still fail if I up the concurrency enough, e.g. if I run 16 parallel -race instances. But I suspect the same is true for many other tests, as the failures are things like timeouts due to heartbeats not being processed quickly enough.
The problem I'm trying to fix: after we send the TimeoutNow during a leader transfer, we remain the leader for a little while. During that time we allow writes, which can result in the upcoming election being lost by our chosen target, if it doesn't have the highest index at the time when it's asking for votes.
The fix: wait for up to ElectionTimeout after the TimeoutNow before we allow writes to proceed.