
kv: don't allow node liveness to regress in Gossip network #64032

Merged

Conversation

nvanbenschoten
Member

In #64028, we fixed a long-standing flake in `TestLeaderAfterSplit`. However,
the test had recently become flakier, which I bisected back to df826cd. The
problem we occasionally see with the test is that all three replicas of a
post-split Range call an election, resulting in a hung vote. Since the test is
configured with `RaftElectionTimeoutTicks=1000000`, a follow-up election is
never called, so the test times out.

After some debugging, I found that the range would occasionally split while the
non-leaseholder nodes (n2 and n3) thought that the leaseholder node (n1) was not
live. This meant that their call to `shouldCampaignOnWake` in the split trigger
considered the RHS's epoch-based lease to be invalid (state = ERROR), so all
three replicas would call an election and the test would get stuck.
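
To make the campaigning decision concrete, here is a minimal, hypothetical Go
sketch of the kind of check involved. The types and functions (`Liveness`,
`leaseStatus`, `shouldCampaignOnWake`) are illustrative stand-ins, not
CockroachDB's actual API: if the gossiped liveness record for the leaseholder
has a stale epoch or an expired expiration, the epoch-based lease cannot be
verified and the replica campaigns.

```go
package main

import (
	"fmt"
	"time"
)

// Liveness is a simplified node-liveness record as it might appear in gossip.
type Liveness struct {
	NodeID     int
	Epoch      int64
	Expiration time.Time
}

// LeaseState mirrors the validity of an epoch-based lease as seen by a
// replica: VALID if the lease epoch matches a live liveness record for the
// leaseholder, ERROR if that cannot be verified.
type LeaseState int

const (
	LeaseValid LeaseState = iota
	LeaseError
)

// leaseStatus classifies an epoch-based lease using the gossiped liveness
// record for the leaseholder. A stale epoch or an expired record means the
// lease cannot be proven valid.
func leaseStatus(leaseEpoch int64, leaseholder Liveness, now time.Time) LeaseState {
	if leaseholder.Epoch != leaseEpoch || !now.Before(leaseholder.Expiration) {
		return LeaseError
	}
	return LeaseValid
}

// shouldCampaignOnWake decides whether a replica calls a Raft election when
// it wakes up (e.g. in a split trigger): it campaigns if it cannot establish
// that another replica holds a valid lease.
func shouldCampaignOnWake(state LeaseState, isLeaseholder bool) bool {
	return state != LeaseValid && !isLeaseholder
}

func main() {
	// n2's view of n1 after the regressed gossip update: epoch 0, zero expiration.
	staleN1 := Liveness{NodeID: 1, Epoch: 0}
	state := leaseStatus(1 /* RHS lease epoch */, staleN1, time.Now())
	fmt.Println("campaign:", shouldCampaignOnWake(state, false)) // campaign: true
}
```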

The offending commit introduced this new flake because of this change:
df826cd#diff-488a090afc4b6eaf56cd6d13b347bac67cb3313ce11c49df9ee8cd95fd73b3e8R454

Now that the call to `MaybeGossipNodeLiveness` is asynchronous on the
node-liveness range, it was possible for two calls to `MaybeGossipNodeLiveness`
to race, one asynchronously triggered by `leasePostApplyLocked` and one
synchronously triggered by `handleReadWriteLocalEvalResult` due to a node
liveness update. This allowed for the following ordering of events (a hedged
Go sketch of the interleaving follows the trace):
```
- async call reads liveness(nid:1 epo:0 exp:0,0)
- sync call writes and then reads liveness(nid:1 epo:1 exp:1619645671.921265300,0)
- sync call adds liveness(nid:1 epo:1 exp:1619645671.921265300,0) to gossip
- async call adds liveness(nid:1 epo:0 exp:0,0) to gossip
```
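
The sketch below replays that interleaving deterministically: a stale read
published to gossip after a newer record is a classic lost update. Everything
here (`gossipStore`, `AddInfo`, the read/write closures) is a simplified
stand-in under the assumption that the last value added to gossip wins, not
the actual CockroachDB gossip API.

```go
package main

import (
	"fmt"
	"sync"
)

// Liveness is a simplified liveness record; only the epoch matters here.
type Liveness struct {
	NodeID int
	Epoch  int64
}

// gossipStore stands in for the gossip network: the last record added wins.
type gossipStore struct {
	mu   sync.Mutex
	info map[int]Liveness
}

func (g *gossipStore) AddInfo(l Liveness) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.info[l.NodeID] = l
}

func main() {
	gossip := &gossipStore{info: map[int]Liveness{}}

	// Shared storage for n1's liveness record.
	var mu sync.Mutex
	stored := Liveness{NodeID: 1, Epoch: 0}
	read := func() Liveness { mu.Lock(); defer mu.Unlock(); return stored }
	write := func(l Liveness) { mu.Lock(); defer mu.Unlock(); stored = l }

	// 1. The async call (from leasePostApplyLocked) reads the old record.
	async := read() // epoch 0

	// 2. The sync call (from handleReadWriteLocalEvalResult) applies the
	//    heartbeat, reads it back, and gossips the new record.
	write(Liveness{NodeID: 1, Epoch: 1})
	gossip.AddInfo(read()) // gossip now holds epoch 1

	// 3. The async call finally gossips its stale read, regressing gossip.
	gossip.AddInfo(async) // epoch 0 overwrites epoch 1

	fmt.Printf("gossiped liveness for n1: epoch %d\n", gossip.info[1].Epoch) // epoch 0
}
```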

Once this had occurred, n2 and n3 never again considered n1 live. Gossip never
recovered from this state because the liveness record was never heartbeated
again, due to the test's configuration of `RaftElectionTimeoutTicks=1000000`.

This commit fixes the bug by ensuring that all calls to `MaybeGossipNodeLiveness`
and `MaybeGossipSystemConfig` hold the raft mutex. This provides the necessary
serialization to avoid data races, a requirement that was in fact already
documented on `MaybeGossipSystemConfig`.
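
As a rough illustration of why the serialization helps, here is a
self-contained Go sketch under the assumption that the read-then-gossip step
is the critical section. `Replica`, `maybeGossipNodeLiveness`, and the gossip
map are simplified stand-ins, not the actual CockroachDB types: once both the
asynchronous and the synchronous paths take the same raft mutex around the
read and the gossip update, a stale read can no longer be published after a
newer record.

```go
package main

import (
	"fmt"
	"sync"
)

// Liveness is a simplified node-liveness record; only the epoch matters here.
type Liveness struct {
	NodeID int
	Epoch  int64
}

// Replica is a toy stand-in for a replica of the node-liveness range.
type Replica struct {
	raftMu sync.Mutex // serializes below-Raft work, including liveness gossip

	mu     sync.Mutex
	stored Liveness // current contents of n1's liveness record

	gossip map[int]Liveness // stand-in for the gossip network; last add wins
}

// maybeGossipNodeLiveness reads the stored liveness record and publishes it
// to gossip. The caller must hold raftMu, so a read and its corresponding
// gossip update can never interleave with another call's read and update.
func (r *Replica) maybeGossipNodeLiveness() {
	r.mu.Lock()
	l := r.stored
	r.mu.Unlock()
	r.gossip[l.NodeID] = l
}

func main() {
	r := &Replica{stored: Liveness{NodeID: 1, Epoch: 0}, gossip: map[int]Liveness{}}

	var wg sync.WaitGroup
	wg.Add(1)
	// Async path (triggered from lease application): takes raftMu before gossiping.
	go func() {
		defer wg.Done()
		r.raftMu.Lock()
		defer r.raftMu.Unlock()
		r.maybeGossipNodeLiveness()
	}()

	// Sync path (liveness heartbeat applied below Raft): holds raftMu across
	// the write and the gossip, so the async call runs either entirely before
	// or entirely after it.
	r.raftMu.Lock()
	r.mu.Lock()
	r.stored = Liveness{NodeID: 1, Epoch: 1}
	r.mu.Unlock()
	r.maybeGossipNodeLiveness()
	r.raftMu.Unlock()

	wg.Wait()
	// Either ordering leaves gossip at epoch 1; it can no longer regress to 0.
	fmt.Println("gossiped epoch for n1:", r.gossip[1].Epoch)
}
```

The key property is not the mutex itself but that the read and the publish
form one atomic step relative to every other attempt to gossip the record.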

nvanbenschoten requested a review from tbg on April 21, 2021 23:58
@cockroach-teamcity
Member

This change is Reviewable

Member

tbg left a comment


Thank you for tracking this down!

Reviewed 3 of 3 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)

@nvanbenschoten
Member Author

TFTR! Do you have thoughts about backporting this to release-21.1? I'm a little hesitant to do so for the v21.1.0 release because we've never seen real issues from this (that I know of), but think it's a good candidate for v21.1.1.

bors r+

nvanbenschoten added the backport-21.1.x label on Apr 22, 2021
@craig
Contributor

craig bot commented Apr 22, 2021

Build succeeded:

craig bot merged commit 5f40d69 into cockroachdb:master on Apr 22, 2021
@tbg
Member

tbg commented Apr 23, 2021 via email
