release-21.1: kv: don't clear raftRequestQueue of right-hand side of Range split #65356

Merged

Conversation

nvanbenschoten (Member) commented:

Backport 1/1 commits from #64028.

/cc @cockroachdb/release


This commit fixes a test flake of `TestLeaderAfterSplit` that I observed in CI and
that we've seen at least once in cockroachdb#43564 (comment).
I bisected the flake back to a591707, but that commit wasn't the real source of the
flakiness: the move from `multiTestContext` to `TestCluster` just changed the
transport mechanism between replicas and revealed an existing bug.

The real issue was that, upon applying a split, any previously established
`raftRequestQueue` for the right-hand side (RHS) replica was discarded. As a result,
we could see the following series of events:
```
1. r1 is created from a split
2. r1 campaigns to establish a leader for the new range
3. r1 sends MsgPreVote msgs to r2 and r3
4. s2 and s3 both receive the messages for the uninitialized r2 and r3, respectively.
5. raftRequestQueues are established for r2 and r3, and the MsgPreVotes are added
6. the split triggers to create r2 and r3 finally fire
7. the raftRequestQueues for r2 and r3 are discarded
8. the election stalls indefinitely, because the test sets RaftElectionTimeoutTicks=1000000
```

Of course, in real deployments `RaftElectionTimeoutTicks` is never set that high,
so a new election would be called after about 3 seconds. Still, roughly 3 seconds
of unavailability immediately after a split is worth avoiding even in real
deployments, so the bug seems worthwhile to fix.
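
To make the failure mode concrete, here is a minimal, hypothetical Go sketch of the
buffering pattern described above: messages for a replica that exists only as an
uninitialized placeholder are queued per range, and clearing that queue when the
split finally initializes the replica silently drops the pending MsgPreVote. The
type and method names are illustrative stand-ins, not CockroachDB's actual code.
```go
package main

import (
	"fmt"
	"sync"
)

type rangeID int64

// raftMessage stands in for an incoming Raft message such as a MsgPreVote.
type raftMessage struct {
	from, to int
	kind     string
}

type store struct {
	mu struct {
		sync.Mutex
		// Messages addressed to replicas that exist only as uninitialized
		// placeholders are buffered here until the replica is initialized.
		raftRequestQueues map[rangeID][]raftMessage
	}
}

// enqueue buffers a message for a (possibly uninitialized) replica of the range.
func (s *store) enqueue(id rangeID, m raftMessage) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.mu.raftRequestQueues == nil {
		s.mu.raftRequestQueues = make(map[rangeID][]raftMessage)
	}
	s.mu.raftRequestQueues[id] = append(s.mu.raftRequestQueues[id], m)
}

// applySplitOld models the buggy behavior: initializing the RHS replica also
// dropped any queue that had accumulated, losing the buffered MsgPreVote.
func (s *store) applySplitOld(rhs rangeID) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.mu.raftRequestQueues, rhs) // buffered messages are silently lost
}

func main() {
	s := &store{}
	s.enqueue(2, raftMessage{from: 1, to: 2, kind: "MsgPreVote"}) // step 5 above
	s.applySplitOld(2)                                            // steps 6-7 above
	fmt.Println(len(s.mu.raftRequestQueues[2]))                   // prints 0: the vote request is gone
}
```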

This commit fixes the issue by removing the logic to discard an uninitialized
replica's `raftRequestQueue` upon applying a split that initializes the replica.
That logic looks quite intentional, but if we look back at when it was added, we
see that it wasn't entirely deliberate. The code was added in d3b0e73, which
extracted everything except the call to `s.mu.replicas.Delete(int64(rangeID))`
from `unlinkReplicaByRangeIDLocked`. So the change wasn't deliberately discarding
the queue; it was just trying not to change the existing behavior.

This change is safe and does not risk leaking the `raftRequestQueue` because
we are removing from `s.mu.uninitReplicas` but will immediately call into
`addReplicaInternalLocked` to add an initialized replica.
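
For illustration, here is a hypothetical sketch of the shape of the change in the
split-apply path, with simplified stand-in bodies rather than the real CockroachDB
code: the fix keeps the `raftRequestQueue` intact and only moves the replica from
the uninitialized set to the initialized one.
```go
package main

import (
	"fmt"
	"sync"
)

type rangeID int64

type replica struct{ id rangeID }

type store struct {
	mu struct {
		sync.Mutex
		replicas          map[rangeID]*replica // initialized replicas
		uninitReplicas    map[rangeID]*replica // placeholders created by incoming Raft traffic
		raftRequestQueues map[rangeID][]string // buffered messages, e.g. "MsgPreVote"
	}
}

// initializeSplitRHSLocked models applying a split trigger to the RHS replica.
func (s *store) initializeSplitRHSLocked(rhs *replica) {
	// Stop tracking the replica as uninitialized.
	delete(s.mu.uninitReplicas, rhs.id)

	// Before the fix, the queue was also discarded here, dropping any
	// MsgPreVote that arrived before the split trigger fired:
	//   delete(s.mu.raftRequestQueues, rhs.id)

	// Immediately re-register the replica as initialized, so the kept queue
	// is not leaked: its owner is still present and will drain it.
	s.mu.replicas[rhs.id] = rhs
}

func main() {
	s := &store{}
	s.mu.replicas = map[rangeID]*replica{}
	s.mu.uninitReplicas = map[rangeID]*replica{2: {id: 2}}
	s.mu.raftRequestQueues = map[rangeID][]string{2: {"MsgPreVote"}}

	s.initializeSplitRHSLocked(s.mu.uninitReplicas[2])
	fmt.Println(s.mu.raftRequestQueues[2]) // [MsgPreVote]: the queued vote survives the split
}
```
Keeping the queue is what lets the buffered MsgPreVote reach the now-initialized
replica, which is why the stalled election in the test resolves.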

Release notes (bug fix): Fixed a rare race that could lead to a 3-second stall
before a Raft leader was elected on a Range immediately after it was split off
from its left-hand neighbor.
@nvanbenschoten requested a review from tbg on May 17, 2021 at 22:37
@cockroach-teamcity (Member) commented:

This change is Reviewable

@nvanbenschoten merged commit f0534f7 into cockroachdb:release-21.1 on May 18, 2021
@nvanbenschoten deleted the backport21.1-64028 branch on June 11, 2021