kv: don't clear raftRequestQueue of right-hand side of Range split #64028
Merged: craig merged 1 commit into cockroachdb:master from nvanbenschoten:nvanbenschoten/fixLeaderAfterSplit1 on Apr 22, 2021
Conversation
This commit fixes a test flake of `TestLeaderAfterSplit` I observed in CI and which we've seen at least once in cockroachdb#43564 (comment). I bisected the flake back to a591707, but that wasn't the real source of the flakiness - the move from `multiTestContext` to `TestCluster` just changed the transport mechanism between replicas and revealed an existing bug.

The real issue here was that, upon applying a split, any previously established `raftRequestQueue` to the RHS replica was discarded. The effect of this is that we could see the following series of events:

```
1. r1 is created from a split
2. r1 campaigns to establish a leader for the new range
3. r1 sends MsgPreVote msgs to r2 and r3
4. s2 and s3 both receive the messages for the uninitialized r2 and r3, respectively
5. raftRequestQueues are established for r2 and r3, and the MsgPreVotes are added
6. the split triggers to create r2 and r3 finally fire
7. the raftRequestQueues for r2 and r3 are discarded
8. the election stalls indefinitely, because the test sets RaftElectionTimeoutTicks=1000000
```

Of course, in real deployments, `RaftElectionTimeoutTicks` will never be set so high, so a new election will be called again after about 3 seconds. Still, this could cause about 3 seconds of unavailability immediately after a split even in real deployments, so it seems worthwhile to fix.

This commit fixes the issue by removing the logic to discard an uninitialized replica's `raftRequestQueue` upon applying a split that initializes the replica. That logic looks quite intentional, but if we look back at when it was added, we see that it wasn't entirely deliberate. The code was added in d3b0e73, which extracted everything except the call to `s.mu.replicas.Delete(int64(rangeID))` from `unlinkReplicaByRangeIDLocked`. So the change wasn't intentionally discarding the queue; it was just trying not to change the existing behavior.

This change is safe and does not risk leaking the `raftRequestQueue` because we are removing from `s.mu.uninitReplicas` but will immediately call into `addReplicaInternalLocked` to add an initialized replica.

Release notes (bug fix): Fix a rare race that could lead to a 3 second stall before a Raft leader was elected on a Range immediately after it was split off from its left-hand neighbor.
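The mechanism can be shown with a minimal Go sketch. This is not CockroachDB's actual code; `store`, `rangeID`, `raftMsg`, `enqueue`, and `applySplit` are simplified stand-ins for illustration only. The point is that messages buffered for a not-yet-initialized replica must be handed off, not dropped, when the split trigger finally creates that replica.

```go
package main

import (
	"fmt"
	"sync"
)

type rangeID int64

type raftMsg struct {
	from, to int
	kind     string // e.g. "MsgPreVote"
}

type store struct {
	mu struct {
		sync.Mutex
		uninitReplicas map[rangeID]struct{}
		// raftRequestQueues buffers messages addressed to replicas the store
		// has heard about but not yet created.
		raftRequestQueues map[rangeID][]raftMsg
	}
}

// enqueue is called when a message arrives for an uninitialized replica
// (steps 4-5 in the sequence above).
func (s *store) enqueue(id rangeID, m raftMsg) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.mu.uninitReplicas[id] = struct{}{}
	s.mu.raftRequestQueues[id] = append(s.mu.raftRequestQueues[id], m)
}

// applySplit initializes the right-hand-side replica (steps 6-7). The buggy
// behavior was to throw the queue away at this point; here the pending
// messages are drained and returned so they can still be delivered.
func (s *store) applySplit(id rangeID) []raftMsg {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.mu.uninitReplicas, id) // replica is now initialized
	pending := s.mu.raftRequestQueues[id]
	delete(s.mu.raftRequestQueues, id) // queue handed off, not lost
	return pending
}

func main() {
	s := &store{}
	s.mu.uninitReplicas = map[rangeID]struct{}{}
	s.mu.raftRequestQueues = map[rangeID][]raftMsg{}

	s.enqueue(2, raftMsg{from: 1, to: 2, kind: "MsgPreVote"})
	pending := s.applySplit(2)
	fmt.Println(pending) // the buffered vote request is still deliverable
}
```

In the buggy ordering, the equivalent of `pending` was simply discarded, so the RHS replica never saw the MsgPreVote and the election stalled until the next election timeout.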
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request on Apr 21, 2021:
In cockroachdb#64028, we fixed a long-standing flake in `TestLeaderAfterSplit`. However, the test had actually gotten more flaky recently, which I bisected back to df826cd. The problem we occasionally see with the test is that all three replicas of a post-split Range call an election, resulting in a hung vote. Since the test is configured with `RaftElectionTimeoutTicks=1000000`, a follow-up election is never called, so the test times out.

After some debugging, I found that the range would occasionally split while the non-leaseholder nodes (n2 and n3) thought that the leaseholder node (n1) was not live. This meant that their call to `shouldCampaignOnWake` in the split trigger considered the RHS's epoch-based lease to be invalid (state = ERROR). So all three replicas would call an election and the test would get stuck.

The offending commit introduced this new flake because of this change: cockroachdb@df826cd#diff-488a090afc4b6eaf56cd6d13b347bac67cb3313ce11c49df9ee8cd95fd73b3e8R454. Now that the call to `MaybeGossipNodeLiveness` is asynchronous on the node-liveness range, it was possible for two calls to `MaybeGossipNodeLiveness` to race, one asynchronously triggered by `leasePostApplyLocked` and one synchronously triggered by `handleReadWriteLocalEvalResult` due to a node liveness update. This allowed for the following ordering of events:

```
- async call reads liveness(nid:1 epo:0 exp:0,0)
- sync call writes and then reads liveness(nid:1 epo:1 exp:1619645671.921265300,0)
- sync call adds liveness(nid:1 epo:1 exp:1619645671.921265300,0) to gossip
- async call adds liveness(nid:1 epo:0 exp:0,0) to gossip
```

Once this had occurred, n2 and n3 never again considered n1 live. Gossip never recovered from this state because the liveness record was never heartbeated again, due to the test's configuration of `RaftElectionTimeoutTicks=1000000`.

This commit fixes the bug by ensuring that all calls to `MaybeGossipNodeLiveness` and `MaybeGossipSystemConfig` hold the raft mutex. This provides the necessary serialization to avoid data races, which was actually already documented on `MaybeGossipSystemConfig`.
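The serialization argument can be illustrated with a small Go sketch. Again, these are simplified stand-ins (`replica`, `liveness`, `maybeGossipNodeLiveness`, `heartbeat`), not CockroachDB's real API: if the read of the liveness record and the publish to gossip happen under one mutex, the stale interleaving shown above, where an old epoch is published after a newer one, cannot occur.

```go
package main

import (
	"fmt"
	"sync"
)

type liveness struct {
	nodeID, epoch int
}

type replica struct {
	raftMu sync.Mutex // stands in for the raft mutex held by both callers
	record liveness   // the authoritative liveness record
	gossip liveness   // what has been published to gossip
}

// maybeGossipNodeLiveness reads the current record and publishes it. Holding
// raftMu across both steps is the essence of the fix: the read and the
// publish become one atomic unit with respect to the other caller.
func (r *replica) maybeGossipNodeLiveness() {
	r.raftMu.Lock()
	defer r.raftMu.Unlock()
	rec := r.record // read ...
	r.gossip = rec  // ... and publish, with no writer able to interleave
}

// heartbeat bumps the epoch and gossips the new record, also under raftMu.
func (r *replica) heartbeat() {
	r.raftMu.Lock()
	defer r.raftMu.Unlock()
	r.record.epoch++
	r.gossip = r.record
}

func main() {
	r := &replica{record: liveness{nodeID: 1, epoch: 0}}

	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); r.maybeGossipNodeLiveness() }() // async caller
	go func() { defer wg.Done(); r.heartbeat() }()               // sync caller
	wg.Wait()

	// With both paths serialized on raftMu, gossip always ends at epoch 1
	// here; without the lock, the async publish of the stale epo:0 read could
	// land after the heartbeat's epo:1 publish and stick.
	fmt.Println(r.gossip)
}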
tbg approved these changes on Apr 22, 2021:
Nice find.
TFTR! I plan to backport this to v21.1.1. bors r+
Build succeeded.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request on May 17, 2021.