kv: don't allow node liveness to regress in Gossip network #64032

Merged

Commits on Apr 21, 2021

  1. kv: don't allow node liveness to regress in Gossip network

    In cockroachdb#64028, we fixed a long-standing flake in `TestLeaderAfterSplit`. However,
    the test had actually gotten more flaky recently, which I bisected back to
    df826cd. The problem we occasionally see with the test is that all three
    replicas of a post-split Range call an election, resulting in a hung vote. Since
    the test is configured with RaftElectionTimeoutTicks=1000000, a follow-up
    election is never called, so the test times out.
    
    After some debugging, I found that the range would occasionally split while the
    non-leaseholder nodes (n2 and n3) thought that the leaseholder node (n1) was not
    live. This meant that their call to `shouldCampaignOnWake` in the split trigger
    considered the RHS's epoch-based lease to be invalid (state = ERROR). So all
    three replicas would call an election and the test would get stuck.
    
    The offending commit introduced this new flake because of this change:
    cockroachdb@df826cd#diff-488a090afc4b6eaf56cd6d13b347bac67cb3313ce11c49df9ee8cd95fd73b3e8R454
    
    Now that the call to `MaybeGossipNodeLiveness` is asynchronous on the
    node-liveness range, it was possible for two calls to `MaybeGossipNodeLiveness`
    to race, one asynchronously triggered by `leasePostApplyLocked` and one
    synchronously triggered by `handleReadWriteLocalEvalResult` due to a node
    liveness update. This allowed for the following ordering of events:
    ```
    - async call reads liveness(nid:1 epo:0 exp:0,0)
    - sync call writes and then reads liveness(nid:1 epo:1 exp:1619645671.921265300,0)
    - sync call adds liveness(nid:1 epo:1 exp:1619645671.921265300,0) to gossip
    - async call adds liveness(nid:1 epo:0 exp:0,0) to gossip
    ```
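
The ordering above can be replayed deterministically in a small sketch. Everything here is a hypothetical stand-in, not CockroachDB code: `Liveness`, `Gossip`, and `ReplayRace` are illustrative names, the gossip store is assumed to be last-writer-wins, and the expiration is assumed to be a wall time in nanoseconds. It shows only why publishing a stale read last leaves gossip holding the expired record.

```go
package main

import "fmt"

// Liveness is a simplified, hypothetical stand-in for a node liveness record.
type Liveness struct {
	NodeID     int
	Epoch      int64
	Expiration int64 // assumed: wall time in nanoseconds; 0 means never heartbeated
}

// Gossip is a toy last-writer-wins store keyed by node ID (an assumption of
// this sketch, not a description of the real gossip network).
type Gossip struct {
	records map[int]Liveness
}

// NewGossip returns an empty toy gossip store.
func NewGossip() *Gossip {
	return &Gossip{records: make(map[int]Liveness)}
}

// Add unconditionally overwrites the record for the node.
func (g *Gossip) Add(l Liveness) {
	g.records[l.NodeID] = l
}

// IsLive reports whether the gossiped record for nodeID expires after now.
func (g *Gossip) IsLive(nodeID int, now int64) bool {
	l, ok := g.records[nodeID]
	return ok && l.Expiration > now
}

// ReplayRace replays the four steps from the commit message in order.
func ReplayRace() *Gossip {
	g := NewGossip()
	stored := Liveness{NodeID: 1} // initial record: epo:0 exp:0

	asyncView := stored // async call reads the old record

	// sync call writes the new record and then reads it back
	stored = Liveness{NodeID: 1, Epoch: 1, Expiration: 1619645671921265300}
	syncView := stored

	g.Add(syncView)  // sync call adds the fresh record to gossip
	g.Add(asyncView) // async call adds its stale read, clobbering the fresh one
	return g
}

func main() {
	g := ReplayRace()
	now := int64(1619645670000000000) // just before the fresh expiration
	fmt.Println("n1 live according to gossip:", g.IsLive(1, now))
}
```

With this interleaving the final gossiped record is the epoch-0 one, so any node relying on gossip treats n1 as not live until another update is published.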
    
    Once this had occurred, n2 and n3 never again considered n1 live. Gossip never
    recovered from this state because the liveness record was never heartbeated
    again, due to the test's configuration of `RaftElectionTimeoutTicks=1000000`.
    
    This commit fixes the bug by ensuring that all calls to MaybeGossipNodeLiveness
    and MaybeGossipSystemConfig hold the raft mutex. This provides the necessary
    serialization to avoid data races, which was actually already documented on
    MaybeGossipSystemConfig.
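
A minimal sketch of the serialization idea, under stated assumptions: the `replica` type, its fields, and the method names below are hypothetical, and a plain `sync.Mutex` stands in for the raft mutex. The point is only that when the read of the stored record and the publish to gossip happen under the same lock as the write, neither ordering of the two callers can leave a stale record in gossip.

```go
package main

import (
	"fmt"
	"sync"
)

// Liveness is a simplified, hypothetical liveness record.
type Liveness struct {
	Epoch      int64
	Expiration int64
}

// replica is a hypothetical stand-in; mu plays the role of the raft mutex.
type replica struct {
	mu       sync.Mutex
	stored   Liveness // guarded by mu
	gossiped Liveness // guarded by mu; what the toy gossip network holds
}

// maybeGossipLocked reads the stored record and publishes it.
// The caller must hold r.mu, so the read and the publish are atomic
// with respect to other callers.
func (r *replica) maybeGossipLocked() {
	r.gossiped = r.stored
}

// gossipAsync models the asynchronous caller (the lease-application path).
func (r *replica) gossipAsync() {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.maybeGossipLocked()
}

// updateAndGossip models the synchronous caller: the liveness write and
// the gossip call happen under the same lock.
func (r *replica) updateAndGossip(l Liveness) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.stored = l
	r.maybeGossipLocked()
}

// runOnce races the two callers and reports whether gossip ended up
// matching the stored record.
func runOnce() bool {
	r := &replica{}
	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		r.gossipAsync()
	}()
	go func() {
		defer wg.Done()
		r.updateAndGossip(Liveness{Epoch: 1, Expiration: 1619645671921265300})
	}()
	wg.Wait()

	r.mu.Lock()
	defer r.mu.Unlock()
	return r.gossiped == r.stored
}

func main() {
	consistent := true
	for i := 0; i < 1000; i++ {
		consistent = consistent && runOnce()
	}
	fmt.Println("gossip always matched stored record:", consistent)
}
```

If the async caller runs first, it publishes the old record and the sync caller then overwrites it; if it runs second, it reads the already-updated record. Either way the last publish reflects the latest write.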
    nvanbenschoten committed Apr 21, 2021
    3e4adfc