kvserver: avoid race in preSplitApply
When `splitPreApply` has to handle a right-hand side replica that is
newer than the split, the split needs to throw away the "snapshot" it
was going to install into the right-hand side. It does so by deleting
all data in the RHS and replacing the raft state bits. It uses the RHS
replica's stateloader to that end, but did not actually hold the
raftMu to make this safe. This commit adds the missing mutex
acquisition.

Fixes #86669.

No release note since the bug shouldn't be visible to end users (it is
very rare in the first place, and rarer still for it to have a
noticeable effect), and if it did, it would likely look like
unspecific Raft corruption that would be hard to trace back to this
race.

Release justification: this will merge on master only after branch cut.
Release note: None
tbg committed Aug 24, 2022
1 parent 0f3bf61 commit f19bb2a
Showing 2 changed files with 34 additions and 20 deletions.
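
To make the race concrete before the diffs, here is a minimal,
self-contained Go sketch of the locking pattern the commit introduces.
Replica, rewriteRaftState, and splitPreApplySketch are simplified
stand-ins for illustration, not the actual kvserver types or functions.

    package splitsketch

    import "sync"

    // Replica is a stand-in for kvserver.Replica, reduced to the one
    // field relevant here: raftMu guards the replica's on-disk Raft
    // state.
    type Replica struct {
    	raftMu sync.Mutex
    }

    // rewriteRaftState is a placeholder for the stateloader writes that
    // splitPreApply performs on the RHS (clearing its data and writing
    // back HardState and tombstone).
    func rewriteRaftState(r *Replica) {
    	// ... delete RHS data, write back raft state bits ...
    }

    // splitPreApplySketch mirrors the shape of the fix: before touching
    // the newer RHS replica's Raft state, take its raftMu. Prior to the
    // fix, the rewriteRaftState step ran without the lock, racing with
    // Raft processing on that replica.
    func splitPreApplySketch(rightRepl *Replica) {
    	if rightRepl != nil {
    		rightRepl.raftMu.Lock()
    		defer rightRepl.raftMu.Unlock()
    		rewriteRaftState(rightRepl)
    	}
    }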
44 changes: 24 additions & 20 deletions pkg/kv/kvserver/replica.go
@@ -1705,31 +1705,35 @@ func (r *Replica) shouldWaitForPendingMergeRLocked(
 
 // isNewerThanSplit is a helper used in split(Pre|Post)Apply to
 // determine whether the Replica on the right hand side of the split must
-// have been removed from this store after the split. There is one
-// false negative where false will be returned but the hard state may
-// be due to a newer replica which is outlined below. It should be safe.
+// have been removed from this store after the split.
 //
-// TODO(ajwerner): Ideally if this store had ever learned that the replica
-// created by the split were removed it would not forget that fact.
-// There exists one edge case where the store may learn that it should house
-// a replica of the same range with a higher replica ID and then forget.
-// If the first raft message this store ever receives for this range
-// contains a replica ID higher than the replica ID in the split trigger
-// then an in-memory replica at that higher replica ID will be created and
-// no tombstone at a lower replica ID will be written. If the server then
-// crashes it will forget that it had ever been the higher replica ID. The
-// server may then proceed to process the split and initialize a replica at
-// the replica ID implied by the split. This is potentially problematic as
+// TODO(tbg): the below is true as of 22.2: we persist any Replica's ReplicaID
+// under RaftReplicaIDKey, so the below caveats should be addressed now and we
+// should be able to simplify isNewerThanSplit to just compare replicaIDs.
+//
+// TODO(ajwerner): There is one false negative where false will be returned but
+// the hard state may be due to a newer replica which is outlined below. It
+// should be safe.
+// Ideally if this store had ever learned that the replica created by the split
+// were removed it would not forget that fact. There exists one edge case where
+// the store may learn that it should house a replica of the same range with a
+// higher replica ID and then forget. If the first raft message this store ever
+// receives for this range contains a replica ID higher than the replica ID in
+// the split trigger then an in-memory replica at that higher replica ID will
+// be created and no tombstone at a lower replica ID will be written. If the
+// server then crashes it will forget that it had ever been the higher replica
+// ID. The server may then proceed to process the split and initialize a replica
+// at the replica ID implied by the split. This is potentially problematic as
 // the replica may have voted as this higher replica ID and when it rediscovers
 // the higher replica ID it will delete all of the state corresponding to the
 // older replica ID including its hard state which may have been synthesized
-// with votes as the newer replica ID. This case tends to be handled safely
-// in practice because the replica should only be receiving messages as the
-// newer replica ID after it has been added to the range as a learner.
+// with votes as the newer replica ID. This case tends to be handled safely in
+// practice because the replica should only be receiving messages as the newer
+// replica ID after it has been added to the range as a learner.
 //
-// Despite the safety due to the change replicas protocol explained above
-// it'd be good to know for sure that a replica ID for a range on a store
-// is always monotonically increasing, even across restarts.
+// Despite the safety due to the change replicas protocol explained above it'd
+// be good to know for sure that a replica ID for a range on a store is always
+// monotonically increasing, even across restarts.
 //
 // See TestProcessSplitAfterRightHandSideHasBeenRemoved.
 func (r *Replica) isNewerThanSplit(split *roachpb.SplitTrigger) bool {
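
The TODO(tbg) above suggests that, with every Replica's ID persisted
under RaftReplicaIDKey as of 22.2, the check could reduce to a plain
replica ID comparison. A hypothetical sketch of that simplification,
using illustrative stand-in types rather than the real roachpb
definitions:

    package splitsketch

    // ReplicaID and SplitTrigger are illustrative stand-ins for the
    // roachpb types; only the fields needed for the comparison are
    // modeled, and RHSReplicaID is an invented field name.
    type ReplicaID int32

    type SplitTrigger struct {
    	// RHSReplicaID is the replica ID the split would assign to the
    	// right-hand side replica on this store.
    	RHSReplicaID ReplicaID
    }

    // isNewerThanSplitSimplified returns true when the store's current
    // replica ID for the range exceeds the ID implied by the split,
    // i.e. the RHS replica was removed and re-added after the split,
    // so the split must not install its data into it.
    func isNewerThanSplitSimplified(currentID ReplicaID, split SplitTrigger) bool {
    	return currentID > split.RHSReplicaID
    }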
10 changes: 10 additions & 0 deletions pkg/kv/kvserver/store_split.go
@@ -62,6 +62,8 @@ func splitPreApply(
 // We're in the rare case where we know that the RHS has been removed
 // and re-added with a higher replica ID (and then maybe removed again).
 //
+// If rightRepl is not nil, we are *not* holding raftMu.
+//
 // To apply the split, we need to "throw away" the data that would belong to
 // the RHS, i.e. we clear the user data the RHS would have inherited from the
 // LHS due to the split and additionally clear all of the range ID local state
@@ -80,8 +82,16 @@
 // the HardState and tombstone. Note that we only do this if rightRepl
 // exists; if it doesn't, there's no Raft state to massage (when rightRepl
 // was removed, a tombstone was written instead).
+//
+// TODO(tbg): it would be cleaner to teach clearRangeData to only remove
+// the replicated state if rightRepl != nil, as opposed to writing back
+// internal raft state. As is, it's going to break with any new piece of
+// local state that we add, and it introduces locking between two Replicas
+// that don't ever need to interact.
 var hs raftpb.HardState
 if rightRepl != nil {
+	rightRepl.raftMu.Lock()
+	defer rightRepl.raftMu.Unlock()
 	// Assert that the rightRepl is not initialized. We're about to clear out
 	// the data of the RHS of the split; we cannot have already accepted a
 	// snapshot to initialize this newer RHS.
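
The TODO(tbg) in this hunk points at a possible refactor: instead of
clearing all RHS state and then writing the Raft internals back, teach
clearRangeData to skip the unreplicated keys in the first place. A
rough sketch of that direction, with ClearOpts and its field invented
for illustration (the real function's signature differs):

    package splitsketch

    // ClearOpts is a hypothetical option struct for clearRangeData.
    type ClearOpts struct {
    	// ClearUnreplicated selects whether range-ID local unreplicated
    	// state (HardState, raft log, replica tombstone) is removed too.
    	ClearUnreplicated bool
    }

    // clearRangeDataSketch shows the proposed split of responsibilities:
    // with a newer RHS replica present, splitPreApply could pass
    // ClearOpts{ClearUnreplicated: false} and never rewrite that
    // replica's Raft state (and thus never need its raftMu).
    func clearRangeDataSketch(opts ClearOpts) {
    	// Always clear replicated state: user data and replicated
    	// range-ID local keys inherited from the LHS.
    	// ...
    	if opts.ClearUnreplicated {
    		// Only in the common case (no newer RHS replica) also clear
    		// the unreplicated, Raft-internal keys.
    		// ...
    	}
    }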
