
storage: unify replica addition and removal paths #39640

Merged
merged 3 commits into cockroachdb:master from atomic/addreplica
Aug 14, 2019

Conversation

@tbg (Member) commented Aug 13, 2019

This continues the reworking of the various replication change APIs with
the goal of allowing a) testing of general atomic replication changes b)
issuing replica swaps from the replicate queue (in 19.2).

For previous steps, see:

#39485
#39611

This change is not a pure plumbing PR. Instead, it unifies
(*Replica).addReplica and (*Replica).removeReplica into a method that
can do both, (*Replica).addAndRemoveReplicas.

Given a slice of ReplicationChanges, this method first adds learner
replicas corresponding to the desired new voters. After having sent
snapshots to all of them, the method issues a configuration change that
atomically

  • upgrades all learners to voters
  • removes any undesired replicas.
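
As a rough sketch of that flow (apart from the method name, the helpers and exact signature below are assumptions, not the code added in this PR):

```go
// Illustrative outline of addAndRemoveReplicas; helper names and the
// signature are assumptions for the sketch.
func (r *Replica) addAndRemoveReplicas(
	ctx context.Context,
	desc *roachpb.RangeDescriptor,
	reason storagepb.RangeLogEventReason,
	details string,
	chgs roachpb.ReplicationChanges,
) (*roachpb.RangeDescriptor, error) {
	// Stage 1: add a learner replica for each desired new voter and send each
	// learner an initial snapshot, one by one.
	desc, err := r.addLearnerReplicas(ctx, desc, chgs.Additions()) // assumed helper
	if err != nil {
		return nil, err
	}
	// Stage 2: a single configuration change promotes all learners to voters
	// and removes the undesired replicas atomically.
	return r.promoteLearnersAndRemoveReplicas(ctx, desc, reason, details, chgs) // assumed helper
}
```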

Note that no atomic membership changes are actually carried out yet. This
is because the callers of addAndRemoveReplicas pass in only a single
change (i.e. an addition or removal), which the method also verifies.

Three pieces are missing after this PR: First, we need to be able to
instruct raft to carry out atomic configuration changes:

added, removed := crt.Added(), crt.Removed()
if len(added)+len(removed) != 1 {
	log.Fatalf(context.TODO(), "atomic replication changes not supported yet")
}

which in particular requires being able to store the ConfState
corresponding to a joint configuration in the unreplicated local state
(under a new key).
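
For reference, at the raft level etcd/raft models such an atomic change as a raftpb.ConfChangeV2 that enters a joint configuration. A minimal sketch of what a voter swap could look like at that layer (this assumes the go.etcd.io/etcd/raft/raftpb API and is not code from this PR):

```go
package sketch

import "go.etcd.io/etcd/raft/raftpb"

// jointSwap builds one ConfChangeV2 that adds a voter and removes another in
// a single step; because more than one voter changes, raft enters a joint
// configuration and, with the auto transition, leaves it on its own.
func jointSwap(addID, removeID uint64) raftpb.ConfChangeV2 {
	return raftpb.ConfChangeV2{
		Transition: raftpb.ConfChangeTransitionAuto,
		Changes: []raftpb.ConfChangeSingle{
			{Type: raftpb.ConfChangeAddNode, NodeID: addID},
			{Type: raftpb.ConfChangeRemoveNode, NodeID: removeID},
		},
	}
}
```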

Second, we must pass the slice of changes handed to
AdminChangeReplicas through to addAndRemoveReplicas without unrolling
it first, see:

func (r *Replica) ChangeReplicas(
	ctx context.Context,
	changeType roachpb.ReplicaChangeType,
	target roachpb.ReplicationTarget,
	desc *roachpb.RangeDescriptor,
	reason storagepb.RangeLogEventReason,
	details string,
) (updatedDesc *roachpb.RangeDescriptor, _ error) {
	if desc == nil {
		return nil, errors.Errorf("%s: the current RangeDescriptor must not be nil", r)
	}
	switch changeType {
	case roachpb.ADD_REPLICA:
		return r.addReplica(ctx, target, desc, SnapshotRequest_REBALANCE, reason, details)
	case roachpb.REMOVE_REPLICA:
		return r.removeReplica(ctx, target, desc, SnapshotRequest_REBALANCE, reason, details)
	default:
		return nil, errors.Errorf(`unknown change type: %s`, changeType)
	}
}

and

case *roachpb.AdminChangeReplicasRequest:
	var err error
	expDesc := &tArgs.ExpDesc
	for _, chg := range tArgs.Changes() {
		// Update expDesc to the outcome of the previous run to enable detection
		// of concurrent updates while applying a series of changes.
		expDesc, err = r.ChangeReplicas(
			ctx, chg.ChangeType, chg.Target, expDesc, storagepb.ReasonAdminRequest, "")
		if err != nil {
			break
		}
	}
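
Purely as an illustration of the intended end state (the signature below is an assumption, not code from this PR), the per-change unrolling above would collapse into a single call that accepts the whole slice:

```go
// Hypothetical future signature: take all changes at once so a replica swap
// can be issued as one atomic configuration change instead of being unrolled
// into separate calls.
func (r *Replica) ChangeReplicas(
	ctx context.Context,
	desc *roachpb.RangeDescriptor,
	reason storagepb.RangeLogEventReason,
	details string,
	chgs roachpb.ReplicationChanges,
) (*roachpb.RangeDescriptor, error) {
	return r.addAndRemoveReplicas(ctx, desc, reason, details, chgs)
}
```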

Third, we must teach the replicate queue to issue the "atomic swaps";
this is the reason we're introducing atomic membership changes in the first
place.

Release note: None

@cockroach-teamcity (Member):

This change is Reviewable

@tbg force-pushed the atomic/addreplica branch 2 times, most recently from 887e4b5 to 06dde3c on August 13, 2019 21:32
@nvanbenschoten (Member) left a comment

Reviewed 1 of 1 files at r1, 1 of 1 files at r2, 6 of 6 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)


pkg/storage/replica_command.go, line 913 at r3 (raw file):

		chg, ok := byNodeID[rDesc.NodeID]
		delete(byNodeID, rDesc.NodeID)
		if !ok || chg.ChangeType == roachpb.REMOVE_REPLICA {

nit here and below: chg.ChangeType != roachpb.ADD_REPLICA would be a little easier to read because it would spell out the kinds of changes we care about in these loops.


pkg/storage/replica_command.go, line 947 at r3 (raw file):

}

func (r *Replica) addAndRemoveReplicas(

This function could use a comment about the process we go through when adding and removing replicas. Something that mentions that a learner replica is created for each added replica one-by-one, they are sent a snapshot one-by-one, and then a single configuration change is run to promote all learners to voting replicas and remove all removed replicas in a single atomic step.
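
For illustration, such a comment might read roughly as follows (the wording is a sketch of what is being asked for, not the comment that was ultimately added):

```go
// addAndRemoveReplicas performs the given replication changes: it first adds
// a learner replica for each replica to be added (one by one), then sends
// each learner an initial snapshot (again one by one), and finally issues a
// single configuration change that promotes all learners to voting replicas
// and removes all replicas slated for removal in one atomic step.
```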


pkg/storage/replica_command.go, line 970 at r3 (raw file):

	if !useLearners {
		// NB: we will never use atomic replication changes while learners are not
		// also active.

Add a TODO to return an error here once we remove the one above.


pkg/storage/replica_command.go, line 985 at r3 (raw file):

	// Now move it to be a full voter (waiting on it to get a raft snapshot first,
	// so it's not immediately way behind).

This comment no longer lines up with the code.


pkg/storage/replica_command.go, line 1009 at r3 (raw file):

) (*roachpb.RangeDescriptor, error) {
	if len(chgs) == 0 {
		// If there's nothing to do, return early to avoid redundant work.

I'd either rename this function to maybeAddLearnerReplicas/addAnyLearnerReplicas or move this check to the caller. It would make the previous function's stages more clear.


pkg/storage/replica_command.go, line 1078 at r3 (raw file):

			// switching between StateSnapshot and StateProbe. Even if we worked through
			// these, it would be susceptible to future similar issues.
			if err := r.sendSnapshot(ctx, rDesc, SnapshotRequest_LEARNER, priority); err != nil {

Do you see a world in which we parallelize this across all learners added in a single step?


pkg/storage/replica_command.go, line 1096 at r3 (raw file):

	updatedDesc.SetReplicas(roachpb.MakeReplicaDescriptors(&newReplicas))
	for _, chg := range removes {
		if _, found := updatedDesc.RemoveReplica(chg.Target.NodeID, chg.Target.StoreID); !found {

Can we modify newReplicas instead using ReplicaDescriptors.RemoveReplica and then only call updatedDesc.SetReplicas once?


pkg/storage/replica_command.go, line 1221 at r3 (raw file):

}

func execChangeReplicasTxn(

Something about how we update the roachpb.RangeDescriptor above this function call but then also need to pass in the roachpb.ReplicationChanges slice so we can construct var added, removed []roachpb.ReplicaDescriptor from the updated descriptor seems weird to me. Would it be easier for both the caller and the function itself if the function took added, removed []roachpb.ReplicaDescriptor arguments and performed the descriptor modification internally?
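
To make the suggestion concrete, the refactored shape might be something like this (the signature and the AddReplica helper are assumptions, not the code that actually landed; only RemoveReplica's shape is taken from the excerpt above):

```go
// Hypothetical: execChangeReplicasTxn receives the replica descriptors to add
// and remove and mutates the descriptor internally, so callers no longer keep
// a pre-mutated descriptor in sync with the ReplicationChanges slice.
func execChangeReplicasTxn(
	ctx context.Context,
	desc *roachpb.RangeDescriptor,
	reason storagepb.RangeLogEventReason,
	details string,
	added, removed []roachpb.ReplicaDescriptor,
) (*roachpb.RangeDescriptor, error) {
	updatedDesc := *desc
	for _, rd := range added {
		updatedDesc.AddReplica(rd) // assumed helper
	}
	for _, rd := range removed {
		if _, found := updatedDesc.RemoveReplica(rd.NodeID, rd.StoreID); !found {
			return nil, errors.Errorf("replica %v not found in %v", rd, &updatedDesc)
		}
	}
	// ... build the change-replicas trigger from added/removed and run the
	// distributed transaction that installs updatedDesc ...
	return &updatedDesc, nil
}
```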

It looked suspicious that we were sending a snapshot using a descriptor
that was marked as a voter. Ultimately, this info wasn't used, but for
clarity make sure we only mutate the descriptor for the change replicas
txn.

Release note: None
Declutter the main method.

Release note: None
This continues the reworking of the various replication change APIs with
the goal of allowing
a) testing of general atomic replication changes
b) issuing replica swaps from the replicate queue (in 19.2).

For previous steps, see:

cockroachdb#39485
cockroachdb#39611

This change is not a pure plumbing PR. Instead, it unifies
`(*Replica).addReplica` and `(*Replica).removeReplica` into a method
that can do both, `(*Replica).addAndRemoveReplicas`.

Given a slice of ReplicationChanges, this method first adds learner
replicas corresponding to the desired new voters. After having sent
snapshots to all of them, the method issues a configuration change that
atomically
- upgrades all learners to voters
- removes any undesired replicas.

Note that no atomic membership changes are *actually* carried out yet.
This is because the callers of `addAndRemoveReplicas` pass in only a
single change (i.e. an addition or removal), which the method also
verifies.

Three pieces are missing after this PR: First, we need to be able to
instruct raft to carry out atomic configuration changes:

https://github.com/cockroachdb/cockroach/blob/2e8db6ca53c59d3d281e64939f79d937195403d4/pkg/storage/replica_proposal_buf.go#L448-L451

which in particular requires being able to store the ConfState
corresponding to a joint configuration in the unreplicated local state
(under a new key).

Second, we must pass the slice of changes handed to
`AdminChangeReplicas` through to `addAndRemoveReplicas` without
unrolling it first, see:

https://github.com/cockroachdb/cockroach/blob/3b316bac6ef342590ddc68d2989714d6e126371a/pkg/storage/replica_command.go#L870-L891

and

https://github.com/cockroachdb/cockroach/blob/3b316bac6ef342590ddc68d2989714d6e126371a/pkg/storage/replica.go#L1314-L1325

Third, we must teach the replicate queue to issue the "atomic
swaps"; this is the reason we're introducing atomic membership changes
in the first place.

Release note: None
@tbg (Member, Author) left a comment

Thanks for the fast review! Addressed your comments and made some more cleanups, PTAL.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/storage/replica_command.go, line 947 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This function could use a comment about the process we go through when adding and removing replicas. Something that mentions that a learner replica is created for each added replica one-by-one, they are sent a snapshot one-by-one, and then a single configuration change is run to promote all learners to voting replicas and remove all removed replicas in a single atomic step.

Done. I'm planning (not in this PR) to fold addAndRemoveReplicas into ChangeReplicas and will rework that larger comment there when I do, too.


pkg/storage/replica_command.go, line 970 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Add a TODO to return an error here once we remove the one above.

I just added the check now.


pkg/storage/replica_command.go, line 1009 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I'd either rename this function to maybeAddLearnerReplicas/addAnyLearnerReplicas or move this check to the caller. It would make the previous function's stages more clear.

Pulled this to the caller, PTAL.


pkg/storage/replica_command.go, line 1078 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do you see a world in which we parallelize this across all learners added in a single step?

Potentially at some point, not anytime soon though.
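
For what it's worth, the parallel version might look roughly like the sketch below (a hypothetical using golang.org/x/sync/errgroup, not something this PR implements):

```go
// Hypothetical: send the learner snapshots concurrently rather than one-by-one.
// import "golang.org/x/sync/errgroup"
g, gCtx := errgroup.WithContext(ctx)
for _, rDesc := range learners {
	rDesc := rDesc // capture the loop variable for the closure
	g.Go(func() error {
		return r.sendSnapshot(gCtx, rDesc, SnapshotRequest_LEARNER, priority)
	})
}
if err := g.Wait(); err != nil {
	return nil, err
}
```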


pkg/storage/replica_command.go, line 1096 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Can we modify newReplicas instead using ReplicaDescriptors.RemoveReplica and then only call updatedDesc.SetReplicas once?

I went the other way and completely removed newReplicas. Instead we just maintain updatedDesc from the beginning of the method. The result is simpler code.


pkg/storage/replica_command.go, line 1221 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Something about how we update the roachpb.RangeDescriptor above this function call but then also need to pass in the roachpb.ReplicationChanges slice so we can construct var added, removed []roachpb.ReplicaDescriptor from the updated descriptor seems weird to me. Would it be easier for both the caller and the function itself if the function took added, removed []roachpb.ReplicaDescriptor arguments and performed the descriptor modification internally?

Good idea, done. I still have an itch to provide a better abstraction around the mutation of a replica descriptor via a sequence of ReplicationChanges (to avoid manually keeping the two in sync) but in the interest of time I'll leave that to future me.
Not passing changes here also made it possible to streamline the learner addition (which now takes a slice of targets only).

@nvanbenschoten (Member) left a comment

:lgtm:

Reviewed 6 of 6 files at r6.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained

@tbg (Member, Author) commented Aug 14, 2019

TFTR!
bors r=nvanbenschoten

@craig (bot) commented Aug 14, 2019

Build failed (retrying...)

@tbg (Member, Author) commented Aug 14, 2019

^- flaked on #39670

craig bot pushed a commit that referenced this pull request Aug 14, 2019
39640: storage: unify replica addition and removal paths r=nvanbenschoten a=tbg

This continues the reworking of the various replication change APIs with
the goal of allowing a) testing of general atomic replication changes b)
issuing replica swaps from the replicate queue (in 19.2).

For previous steps, see:

#39485
#39611

This change is not a pure plumbing PR. Instead, it unifies
`(*Replica).addReplica` and `(*Replica).removeReplica` into a method that
can do both, `(*Replica).addAndRemoveReplicas`.

Given a slice of ReplicationChanges, this method first adds learner
replicas corresponding to the desired new voters. After having sent
snapshots to all of them, the method issues a configuration change that
atomically
- upgrades all learners to voters
- removes any undesired replicas.

Note that no atomic membership changes are *actually* carried out yet. This
is because the callers of `addAndRemoveReplicas` pass in only a single
change (i.e. an addition or removal), which the method also verifies.

Three pieces are missing after this PR: First, we need to be able to
instruct raft to carry out atomic configuration changes:

https://github.com/cockroachdb/cockroach/blob/2e8db6ca53c59d3d281e64939f79d937195403d4/pkg/storage/replica_proposal_buf.go#L448-L451

which in particular requires being able to store the ConfState
corresponding to a joint configuration in the unreplicated local state
(under a new key).

Second, we must pass the slice of changes handed to
`AdminChangeReplicas` through to `addAndRemoveReplicas` without unrolling
it first, see:

https://github.com/cockroachdb/cockroach/blob/3b316bac6ef342590ddc68d2989714d6e126371a/pkg/storage/replica_command.go#L870-L891

and

https://github.com/cockroachdb/cockroach/blob/3b316bac6ef342590ddc68d2989714d6e126371a/pkg/storage/replica.go#L1314-L1325

Third, we must teach the replicate queue to issue the "atomic swaps";
this is the reason we're introducing atomic membership changes in the first
place.

Release note: None

39656: kv: init heartbeat txn log tag later r=nvanbenschoten a=tbg

At init() time, the txn proto has not been populated yet.
Found while investigating #39652.

This change strikes me as clunky, but I don't have the bandwidth to dig deeper
right now.

Release note: None

39666: testutils/lint/passes: disable under nightly stress r=mjibson a=mjibson

Under stress these error with "go build a: failed to cache compiled Go files".

Fixes #39616
Fixes #39541
Fixes #39479

Release note: None

39669: rpc: use gRPC enforced minimum keepalive timeout r=knz a=ajwerner

Before this commit, gRPC would log the following annoying message every time
we created a new connection, telling us that our setting was being ignored.

```
Adjusting keepalive ping interval to minimum period of 10s
```

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
Co-authored-by: Matt Jibson <matt.jibson@gmail.com>
Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
@craig (bot) commented Aug 14, 2019

Build succeeded

@craig (bot) merged commit 329d825 into cockroachdb:master on Aug 14, 2019
@danhhz (Contributor) left a comment

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @tbg)


pkg/storage/replica_command.go, line 972 at r6 (raw file):

	useLearners := useLearnerReplicas.Get(&settings.SV)
	useLearners = useLearners && settings.Version.IsActive(cluster.VersionLearnerReplicas)
	if !useLearners {

How does this work with replica removal when useLearners is false? If it's something I'm not seeing, this definitely deserves a comment.

@tbg deleted the atomic/addreplica branch on August 15, 2019 22:01
tbg added a commit to tbg/cockroach that referenced this pull request Aug 15, 2019
I introduced a bug in cockroachdb#39640 which would fail removal of replicas
whenever preemptive snapshots were used, since we'd accidentally
send a preemptive snapshot which would end up with a "node already
in descriptor" error.

See:

cockroachdb#39640 (review)

Release note: None
@tbg (Member, Author) commented Aug 15, 2019

Thanks for spotting this, here's a fix: #39697

tbg added a commit to tbg/cockroach that referenced this pull request Aug 16, 2019
I introduced a bug in cockroachdb#39640 which would fail removal of replicas
whenever preemptive snapshots were used, since we'd accidentally
send a preemptive snapshot which would end up with a "node already
in descriptor" error.

See:

cockroachdb#39640 (review)

Release note: None
craig bot pushed a commit that referenced this pull request Aug 16, 2019
39697: storage: fix replica removal when learners not active r=danhhz a=tbg

I introduced a bug in #39640 which would fail removal of replicas
whenever preemptive snapshots were used, since we'd accidentally
send a preemptive snapshot which would end up with a "node already
in descriptor" error.

See:

#39640 (review)

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>