release-21.1: kv: don't allow node liveness to regress in Gossip network #65357
Backport 1/1 commits from #64032.
/cc @cockroachdb/release
In #64028, we fixed a long-standing flake in `TestLeaderAfterSplit`.
However, the test had actually gotten more flaky recently, which I bisected
back to df826cd. The problem we occasionally see with the test is that all
three replicas of a post-split Range call an election, resulting in a hung
vote. Since the test is configured with `RaftElectionTimeoutTicks=1000000`,
a follow-up election is never called, so the test times out.
After some debugging, I found that the range would occasionally split while
the non-leaseholder nodes (n2 and n3) thought that the leaseholder node (n1)
was not live. This meant that their call to `shouldCampaignOnWake` in the
split trigger considered the RHS's epoch-based lease to be invalid
(state = ERROR). So all three replicas would call an election and the test
would get stuck.
The offending commit introduced this new flake because of this change:
df826cd#diff-488a090afc4b6eaf56cd6d13b347bac67cb3313ce11c49df9ee8cd95fd73b3e8R454
Now that the call to `MaybeGossipNodeLiveness` is asynchronous on the
node-liveness range, it was possible for two calls to
`MaybeGossipNodeLiveness` to race: one asynchronously triggered by
`leasePostApplyLocked` and one synchronously triggered by
`handleReadWriteLocalEvalResult` due to a node liveness update. This allowed
for the following ordering of events:
Once this had occurred, n2 and n3 never again considered n1 live. Gossip
never recovered from this state because the liveness record was never
heartbeated again, due to the test's configuration of
`RaftElectionTimeoutTicks=1000000`.

This commit fixes the bug by ensuring that all calls to MaybeGossipNodeLiveness
and MaybeGossipSystemConfig hold the raft mutex. This provides the necessary
serialization to avoid data races, which was actually already documented on
MaybeGossipSystemConfig.