kv: write replica tombstone on replica creation instead of removal #64966

nvanbenschoten · 2021-05-10T22:46:57Z

In the following TODO, we detail a hazard that can allow for the ID of a replica on a node to regress:

Lines 1510 to 1537 in 79fe0c7

    
           // TODO(ajwerner): Ideally if this store had ever learned that the replica 
        
           // created by the split were removed it would not forget that fact. 
        
           // There exists one edge case where the store may learn that it should house 
        
           // a replica of the same range with a higher replica ID and then forget. 
        
           // If the first raft message this store ever receives for the this range 
        
           // contains a replica ID higher than the replica ID in the split trigger 
        
           // then an in-memory replica at that higher replica ID will be created and 
        
           // no tombstone at a lower replica ID will be written. If the server then 
        
           // crashes it will forget that it had ever been the higher replica ID. The 
        
           // server may then proceed to process the split and initialize a replica at 
        
           // the replica ID implied by the split. This is potentially problematic as 
        
           // the replica may have voted as this higher replica ID and when it rediscovers 
        
           // the higher replica ID it will delete all of the state corresponding to the 
        
           // older replica ID including its hard state which may have been synthesized 
        
           // with votes as the newer replica ID. This case tends to be handled safely 
        
           // in practice because the replica should only be receiving messages as the 
        
           // newer replica ID after it has been added to the range. Prior to learner 
        
           // replicas we would only add a store to a range after we've successfully 
        
           // applied a pre-emptive snapshot. If the store were to split between the 
        
           // preemptive snapshot and the addition then the addition would fail due to 
        
           // the conditional put logic. If the store were to then enable learners then 
        
           // we're still okay because we won't promote a learner unless we succeed in 
        
           // sending a learner snapshot. If we fail to send the replica never becomes 
        
           // a voter then its votes don't matter and are safe to discard. 
        
           // 
        
           // Despite the safety due to the change replicas protocol explained above 
        
           // it'd be good to know for sure that a replica ID for a range on a store 
        
           // is always monotonically increasing, even across restarts.

The solution to this is described in the TODO and also in discussion in #40892. In that PR, @ajwerner suggested that:

I'm also curious about how y'all feel re: writing a tombstone in getOrCreateReplica when creating an uninitialized replica with a non-zero replica ID. That would provide the invariant that a store will always have a monotonically increasing set of non-zero replica IDs for a given range even across restart.

@bdarnell and @nvanbenschoten both liked the idea but felt it should be deferred to a later change. The time for that change has come.

The plan here is to write a tombstone (see setTombstoneKey) to box out replicas with IDs lower than a given replica when that replica is created, rather than (or in addition to?) when it is destroyed. This ensures that even if the replica is never initialized before a node restarts, a replica with a smaller ID (which is known to be old) cannot be created on the node upon a restart.

Fixing this will cause TestProcessSplitAfterRightHandSideHasBeenRemoved/(4)_initial_replica_RHS_partition,_with_restart to break, because the split will avoid creating a RHS replica with an ID below the ID in the tombstone.

Jira issue: CRDB-7367

The text was updated successfully, but these errors were encountered:

erikgrinaker · 2022-10-20T11:32:32Z

This should be fixed by the RaftReplicaID work in #75761:

cockroach/pkg/kv/kvserver/store_create_replica.go

Lines 198 to 277 in e1c24ed

    
           // Initialize the Replica with the replicaID. 
        
           if err := func() error { 
        
           	// Check for a tombstone again now that we've inserted into the Range 
        
           	// map. This double-checked locking ensures that we avoid a race where a 
        
           	// replica is created and destroyed between the initial unsynchronized 
        
           	// tombstone check and the Range map linearization point. By checking 
        
           	// again now, we make sure to synchronize with any goroutine that wrote 
        
           	// a tombstone and then removed an old replica from the Range map. 
        
           	if ok, err := storage.MVCCGetProto( 
        
           		ctx, s.Engine(), tombstoneKey, hlc.Timestamp{}, &tombstone, storage.MVCCGetOptions{}, 
        
           	); err != nil { 
        
           		return err 
        
           	} else if ok && replicaID < tombstone.NextReplicaID { 
        
           		return &roachpb.RaftGroupDeletedError{} 
        
           	} 
        
           	// An uninitialized replica should have an empty HardState.Commit at 
        
           	// all times. Failure to maintain this invariant indicates corruption. 
        
           	// And yet, we have observed this in the wild. See #40213. 
        
           	if hs, err := repl.mu.stateLoader.LoadHardState(ctx, s.Engine()); err != nil { 
        
           		return err 
        
           	} else if hs.Commit != 0 { 
        
           		log.Fatalf(ctx, "found non-zero HardState.Commit on uninitialized replica %s. HS=%+v", repl, hs) 
        
           	} 
        
           	// Write the RaftReplicaID for this replica. This is the only place in the 
        
           	// CockroachDB code that we are creating a new *uninitialized* replica. 
        
           	// Note that it is possible that we have already created the HardState for 
        
           	// an uninitialized replica, then crashed, and on recovery are receiving a 
        
           	// raft message for the same or later replica. 
        
           	// - Same replica: we are overwriting the RaftReplicaID with the same 
        
           	//   value, which is harmless. 
        
           	// - Later replica: there may be an existing HardState for the older 
        
           	//   uninitialized replica with Commit=0 and non-zero Term and Vote. Using 
        
           	//   the Term and Vote values for that older replica in the context of 
        
           	//   this newer replica is harmless since it just limits the votes for 
        
           	//   this replica. 
        
           	// 
        
           	// 
        
           	// Compatibility: 
        
           	// - v21.2 and v22.1: v22.1 unilaterally introduces RaftReplicaID (an 
        
           	//   unreplicated range-id local key). If a v22.1 binary is rolled back at 
        
           	//   a node, the fact that RaftReplicaID was written is harmless to a 
        
           	//   v21.2 node since it does not read it. When a v21.2 drops an 
        
           	//   initialized range, the RaftReplicaID will also be deleted because the 
        
           	//   whole range-ID local key space is deleted. 
        
           	// 
        
           	// - v22.2: we will start relying on the presence of RaftReplicaID, and 
        
           	//   remove any unitialized replicas that have a HardState but no 
        
           	//   RaftReplicaID. This removal will happen in ReplicasStorage.Init and 
        
           	//   allow us to tighten invariants. Additionally, knowing the ReplicaID 
        
           	//   for an unitialized range could allow a node to somehow contact the 
        
           	//   raft group (say by broadcasting to all nodes in the cluster), and if 
        
           	//   the ReplicaID is stale, would allow the node to remove the HardState 
        
           	//   and RaftReplicaID. See 
        
           	//   https://github.com/cockroachdb/cockroach/issues/75740. 
        
           	// 
        
           	//   There is a concern that there could be some replica that survived 
        
           	//   from v21.2 to v22.1 to v22.2 in unitialized state and will be 
        
           	//   incorrectly removed in ReplicasStorage.Init causing the loss of the 
        
           	//   HardState.{Term,Vote} and lead to a "split-brain" wrt leader 
        
           	//   election. 
        
           	// 
        
           	//   Even though this seems theoretically possible, it is considered 
        
           	//   practically impossible, and not just because a replica's vote is 
        
           	//   unlikely to stay relevant across 2 upgrades. For one, we're always 
        
           	//   going through learners and don't promote until caught up, so 
        
           	//   uninitialized replicas generally never get to vote. Second, even if 
        
           	//   their vote somehow mattered (perhaps we sent a learner a snap which 
        
           	//   was not durably persisted - which we also know is impossible, but 
        
           	//   let's assume it - and then promoted the node and it immediately 
        
           	//   power-cycled, losing the snapshot) the fire-and-forget way in which 
        
           	//   raft votes are requested (in the same raft cycle) makes it extremely 
        
           	//   unlikely that the restarted node would then receive it. 
        
           	if err := repl.mu.stateLoader.SetRaftReplicaID(ctx, s.Engine(), replicaID); err != nil { 
        
           		return err 
        
           	} 
        
           	return repl.loadRaftMuLockedReplicaMuLocked(uninitializedDesc) 
        
           }(); err != nil {

nvanbenschoten added C-cleanup Tech debt, refactors, loose ends, etc. Solution not expected to significantly change behavior. A-kv-replication Relating to Raft, consensus, and coordination. labels May 10, 2021

jlinder added the T-kv KV Team label Jun 16, 2021

erikgrinaker added T-kv-replication and removed T-kv KV Team labels May 31, 2022

erikgrinaker closed this as completed Oct 20, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

kv: write replica tombstone on replica creation instead of removal #64966

kv: write replica tombstone on replica creation instead of removal #64966

nvanbenschoten commented May 10, 2021 •

edited by cockroach-jira-scripts

Loading

erikgrinaker commented Oct 20, 2022

kv: write replica tombstone on replica creation instead of removal #64966

kv: write replica tombstone on replica creation instead of removal #64966

Comments

nvanbenschoten commented May 10, 2021 • edited by cockroach-jira-scripts Loading

erikgrinaker commented Oct 20, 2022

nvanbenschoten commented May 10, 2021 •

edited by cockroach-jira-scripts

Loading