storage: unify replica addition and removal paths #39640
Conversation
887e4b5 to 06dde3c
Reviewed 1 of 1 files at r1, 1 of 1 files at r2, 6 of 6 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @tbg)
pkg/storage/replica_command.go, line 913 at r3 (raw file):
```go
chg, ok := byNodeID[rDesc.NodeID]
delete(byNodeID, rDesc.NodeID)
if !ok || chg.ChangeType == roachpb.REMOVE_REPLICA {
```
nit here and below: `chg.ChangeType != roachpb.ADD_REPLICA` would be a little easier to read because it would spell out the kinds of changes we care about in these loops.
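For illustration, here is a minimal, self-contained sketch of the suggested form; the types and values below are simplified stand-ins, not the actual roachpb definitions:

```go
package main

import "fmt"

// Simplified stand-ins for the roachpb types referenced in the comment; the
// real definitions live in pkg/roachpb.
type ReplicaChangeType int

const (
	ADD_REPLICA ReplicaChangeType = iota
	REMOVE_REPLICA
)

type ReplicationChange struct {
	ChangeType ReplicaChangeType
	NodeID     int
}

func main() {
	byNodeID := map[int]ReplicationChange{
		1: {ChangeType: ADD_REPLICA, NodeID: 1},
		2: {ChangeType: REMOVE_REPLICA, NodeID: 2},
	}
	// The loop in question, written with the suggested condition: it spells
	// out that only additions are of interest here.
	for _, nodeID := range []int{1, 2, 3} {
		chg, ok := byNodeID[nodeID]
		delete(byNodeID, nodeID)
		if !ok || chg.ChangeType != ADD_REPLICA {
			continue
		}
		fmt.Printf("handling addition on node %d\n", chg.NodeID)
	}
}
```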
pkg/storage/replica_command.go, line 947 at r3 (raw file):
```go
}

func (r *Replica) addAndRemoveReplicas(
```
This function could use a comment about the process we go through when adding and removing replicas. Something that mentions that a learner replica is created for each added replica one-by-one, they are sent a snapshot one-by-one, and then a single configuration change is run to promote all learners to voting replicas and remove all removed replicas in a single atomic step.
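To make the requested comment concrete, the overall flow being described might be sketched as follows. All type and helper names here (addLearner, sendLearnerSnapshot, promoteLearnersAndRemove) are hypothetical stand-ins, not the real CockroachDB APIs:

```go
package sketch

// Hypothetical stand-ins for the real types in pkg/roachpb and pkg/storage.
type replicationTarget struct{ NodeID, StoreID int }
type rangeDescriptor struct{ voters, learners []replicationTarget }

// addAndRemoveReplicasSketch illustrates the process the comment asks to
// document: learners are added and caught up one by one, then a single
// configuration change promotes all of them and drops all removals atomically.
func addAndRemoveReplicasSketch(
	desc rangeDescriptor, adds, removes []replicationTarget,
) (rangeDescriptor, error) {
	// Phase 1: add a learner replica for each desired new voter, one by one.
	for _, tgt := range adds {
		desc = addLearner(desc, tgt)
	}
	// Phase 2: send each learner a snapshot, one by one, so it is not far
	// behind when it becomes a voter.
	for _, tgt := range adds {
		if err := sendLearnerSnapshot(desc, tgt); err != nil {
			return desc, err
		}
	}
	// Phase 3: a single configuration change promotes all learners to voters
	// and removes all replicas slated for removal in one atomic step.
	return promoteLearnersAndRemove(desc, adds, removes)
}

// The helpers below are placeholders so the sketch is self-contained; the
// real work happens in replica_command.go.
func addLearner(d rangeDescriptor, t replicationTarget) rangeDescriptor {
	d.learners = append(d.learners, t)
	return d
}

func sendLearnerSnapshot(rangeDescriptor, replicationTarget) error { return nil }

func promoteLearnersAndRemove(
	d rangeDescriptor, adds, removes []replicationTarget,
) (rangeDescriptor, error) {
	return d, nil
}
```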
pkg/storage/replica_command.go, line 970 at r3 (raw file):
```go
if !useLearners {
	// NB: we will never use atomic replication changes while learners are not
	// also active.
```
Add a TODO to return an error here once we remove the one above.
pkg/storage/replica_command.go, line 985 at r3 (raw file):
```go
// Now move it to be a full voter (waiting on it to get a raft snapshot first,
// so it's not immediately way behind).
```
This comment no longer lines up with the code.
pkg/storage/replica_command.go, line 1009 at r3 (raw file):
```go
) (*roachpb.RangeDescriptor, error) {
	if len(chgs) == 0 {
		// If there's nothing to do, return early to avoid redundant work.
```
I'd either rename this function to `maybeAddLearnerReplicas`/`addAnyLearnerReplicas` or move this check to the caller. It would make the previous function's stages more clear.
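A rough sketch of the "move the check to the caller" option, under the assumption that the caller drives the learner phase explicitly; all names below are placeholders:

```go
package sketch

// Placeholder types standing in for the real replication-change machinery.
type change struct{}
type descriptor struct{}

// Caller-side version of the emptiness check: the learner-addition helper is
// only invoked when there is something to add, which keeps the stages of the
// surrounding function visible at a glance.
func changeReplicasSketch(desc *descriptor, adds []change) (*descriptor, error) {
	if len(adds) > 0 {
		var err error
		if desc, err = addLearnerReplicas(desc, adds); err != nil {
			return nil, err
		}
	}
	// ... send snapshots, then run the single atomic configuration change ...
	return desc, nil
}

// addLearnerReplicas no longer needs its own len(adds) == 0 early return.
func addLearnerReplicas(desc *descriptor, adds []change) (*descriptor, error) {
	return desc, nil
}
```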
pkg/storage/replica_command.go, line 1078 at r3 (raw file):
```go
// switching between StateSnapshot and StateProbe. Even if we worked through
// these, it would be susceptible to future similar issues.
if err := r.sendSnapshot(ctx, rDesc, SnapshotRequest_LEARNER, priority); err != nil {
```
Do you see a world in which we parallelize this across all learners added in a single step?
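If such parallelization were ever done, one plausible shape is a fan-out via golang.org/x/sync/errgroup, as in the hedged sketch below; sendLearnerSnapshot and target are stand-ins for the real snapshot-sending path, not the actual code:

```go
package sketch

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// Stand-in for a replication target.
type target struct{ NodeID, StoreID int }

// Stand-in for the real snapshot-sending code path.
func sendLearnerSnapshot(ctx context.Context, tgt target) error { return nil }

// sendSnapshotsInParallel fans the per-learner snapshots out concurrently and
// fails fast if any one of them errors out.
func sendSnapshotsInParallel(ctx context.Context, learners []target) error {
	g, ctx := errgroup.WithContext(ctx)
	for _, tgt := range learners {
		tgt := tgt // capture the loop variable for the goroutine
		g.Go(func() error {
			return sendLearnerSnapshot(ctx, tgt)
		})
	}
	return g.Wait()
}
```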
pkg/storage/replica_command.go, line 1096 at r3 (raw file):
```go
updatedDesc.SetReplicas(roachpb.MakeReplicaDescriptors(&newReplicas))
for _, chg := range removes {
	if _, found := updatedDesc.RemoveReplica(chg.Target.NodeID, chg.Target.StoreID); !found {
```
Can we modify `newReplicas` instead using `ReplicaDescriptors.RemoveReplica` and then only call `updatedDesc.SetReplicas` once?
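For reference, a minimal sketch of the suggested shape, assuming a RemoveReplica-like helper on the slice and a single SetReplicas call at the end; the types are simplified stand-ins:

```go
package sketch

// Simplified stand-ins for roachpb.ReplicaDescriptor and ReplicationTarget.
type replicaDescriptor struct{ NodeID, StoreID int }
type replicationTarget struct{ NodeID, StoreID int }

// removeReplica filters one replica out of the slice, mimicking what a
// ReplicaDescriptors.RemoveReplica-style helper would do on the real type.
func removeReplica(replicas []replicaDescriptor, nodeID, storeID int) ([]replicaDescriptor, bool) {
	for i, r := range replicas {
		if r.NodeID == nodeID && r.StoreID == storeID {
			return append(replicas[:i], replicas[i+1:]...), true
		}
	}
	return replicas, false
}

// applyRemovals mutates the working slice for every removal; the caller then
// installs the result with a single updatedDesc.SetReplicas call instead of
// touching the descriptor once per change.
func applyRemovals(newReplicas []replicaDescriptor, removes []replicationTarget) ([]replicaDescriptor, bool) {
	for _, chg := range removes {
		var found bool
		if newReplicas, found = removeReplica(newReplicas, chg.NodeID, chg.StoreID); !found {
			return newReplicas, false
		}
	}
	return newReplicas, true
}
```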
pkg/storage/replica_command.go, line 1221 at r3 (raw file):
```go
}

func execChangeReplicasTxn(
```
Something about how we update the `roachpb.RangeDescriptor` above this function call but then also need to pass in the `roachpb.ReplicationChanges` slice so we can construct `var added, removed []roachpb.ReplicaDescriptor` from the updated descriptor seems weird to me. Would it be easier for both the caller and the function itself if the function took `added, removed []roachpb.ReplicaDescriptor` arguments and performed the descriptor modification internally?
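Roughly the signature being proposed, sketched with simplified local types; the real execChangeReplicasTxn takes additional arguments (descriptor, reason, details) that are elided here, so this is only an illustration:

```go
package sketch

import "context"

// Simplified stand-ins for the roachpb types.
type replicaDescriptor struct{ NodeID, StoreID int }
type rangeDescriptor struct{ Replicas []replicaDescriptor }

// Proposed shape: the txn helper receives the added and removed descriptors
// directly and performs the descriptor mutation internally, so the caller no
// longer has to keep a pre-mutated descriptor and the ReplicationChanges
// slice in sync by hand.
func execChangeReplicasTxnSketch(
	ctx context.Context,
	desc *rangeDescriptor,
	added, removed []replicaDescriptor,
) (*rangeDescriptor, error) {
	// Work on a copy so the input descriptor is left untouched.
	updated := rangeDescriptor{Replicas: append([]replicaDescriptor(nil), desc.Replicas...)}
	updated.Replicas = append(updated.Replicas, added...)
	for _, r := range removed {
		for i, cur := range updated.Replicas {
			if cur.NodeID == r.NodeID && cur.StoreID == r.StoreID {
				updated.Replicas = append(updated.Replicas[:i], updated.Replicas[i+1:]...)
				break
			}
		}
	}
	// ... run the transaction that writes `updated` and triggers the
	// replication change ...
	return &updated, nil
}
```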
It looked suspicious that we were sending a snapshot using a descriptor that was marked as a voter. Ultimately, this info wasn't used, but for clarity make sure we only mutate the descriptor for the change replicas txn. Release note: None
Declutter the main method. Release note: None
This continues the reworking of the various replication change APIs with the goal of allowing a) testing of general atomic replication changes b) issuing replica swaps from the replicate queue (in 19.2).

For previous steps, see:
cockroachdb#39485
cockroachdb#39611

This change is not a pure plumbing PR. Instead, it unifies `(*Replica).addReplica` and `(*Replica).removeReplica` into a method that can do both, `(*Replica).addAndRemoveReplicas`.

Given a slice of ReplicationChanges, this method first adds learner replicas corresponding to the desired new voters. After having sent snapshots to all of them, the method issues a configuration change that atomically
- upgrades all learners to voters
- removes any undesired replicas.

Note that no atomic membership changes are *actually* carried out yet. This is because the callers of `addAndRemoveReplicas` pass in only a single change (i.e. an addition or removal), which the method also verifies.

Three pieces are missing after this PR:

First, we need to be able to instruct raft to carry out atomic configuration changes:
https://github.com/cockroachdb/cockroach/blob/2e8db6ca53c59d3d281e64939f79d937195403d4/pkg/storage/replica_proposal_buf.go#L448-L451
which in particular requires being able to store the ConfState corresponding to a joint configuration in the unreplicated local state (under a new key).

Second, we must pass the slice of changes handed to `AdminChangeReplicas` through to `addAndRemoveReplicas` without unrolling it first, see:
https://github.com/cockroachdb/cockroach/blob/3b316bac6ef342590ddc68d2989714d6e126371a/pkg/storage/replica_command.go#L870-L891
and
https://github.com/cockroachdb/cockroach/blob/3b316bac6ef342590ddc68d2989714d6e126371a/pkg/storage/replica.go#L1314-L1325

Third, we must teach the replicate queue to issue the "atomic swaps"; this is the reason we're introducing atomic membership changes in the first place.

Release note: None
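For context on the "unrolling" mentioned in the second point, a hedged sketch of the difference between today's per-change calls and the intended single call; function and type names are placeholders for the real AdminChangeReplicas plumbing:

```go
package sketch

import "context"

// Placeholder types; the real code deals in roachpb.ReplicationChange and
// roachpb.RangeDescriptor.
type replicationChange struct{}
type rangeDescriptor struct{}

// Stand-in for the unified method introduced by this PR.
func addAndRemoveReplicasSketch(
	ctx context.Context, desc *rangeDescriptor, chgs []replicationChange,
) (*rangeDescriptor, error) {
	return desc, nil
}

// Today: the incoming slice is unrolled and each change is applied on its
// own, so no single call ever sees more than one change.
func adminChangeReplicasUnrolled(
	ctx context.Context, desc *rangeDescriptor, chgs []replicationChange,
) (*rangeDescriptor, error) {
	for _, chg := range chgs {
		var err error
		if desc, err = addAndRemoveReplicasSketch(ctx, desc, []replicationChange{chg}); err != nil {
			return nil, err
		}
	}
	return desc, nil
}

// Goal: hand the whole slice down in one call so that a swap (one addition
// plus one removal) can be carried out as a single atomic membership change.
func adminChangeReplicasAtomic(
	ctx context.Context, desc *rangeDescriptor, chgs []replicationChange,
) (*rangeDescriptor, error) {
	return addAndRemoveReplicasSketch(ctx, desc, chgs)
}
```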
Thanks for the fast review! Addressed your comments and made some more cleanups, PTAL.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/storage/replica_command.go, line 947 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
This function could use a comment about the process we go through when adding and removing replicas. Something that mentions that a learner replica is created for each added replica one-by-one, they are sent a snapshot one-by-one, and then a single configuration change is run to promote all learners to voting replicas and remove all removed replicas in a single atomic step.
Done. I'm planning (not in this PR) to fold `addAndRemoveReplicas` into `ChangeReplicas` and will rework that larger comment there when I do, too.
pkg/storage/replica_command.go, line 970 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Add a TODO to return an error here once we remove the one above.
I just added the check now.
pkg/storage/replica_command.go, line 1009 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I'd either rename this function to `maybeAddLearnerReplicas`/`addAnyLearnerReplicas` or move this check to the caller. It would make the previous function's stages more clear.
Pulled this to the caller, PTAL.
pkg/storage/replica_command.go, line 1078 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Do you see a world in which we parallelize this across all learners added in a single step?
Potentially at some point, not anytime soon though.
pkg/storage/replica_command.go, line 1096 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Can we modify `newReplicas` instead using `ReplicaDescriptors.RemoveReplica` and then only call `updatedDesc.SetReplicas` once?
I went the other way and completely removed `newReplicas`. Instead we just maintain `updatedDesc` from the beginning of the method. The result is simpler code.
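A minimal sketch of that approach, assuming simplified stand-in types: the working descriptor is copied once up front and mutated directly for each change, with no separate newReplicas slice:

```go
package sketch

// Simplified stand-ins for the roachpb types.
type replicaDescriptor struct{ NodeID, StoreID int }
type rangeDescriptor struct{ replicas []replicaDescriptor }

func (d *rangeDescriptor) addReplica(nodeID, storeID int) {
	d.replicas = append(d.replicas, replicaDescriptor{NodeID: nodeID, StoreID: storeID})
}

func (d *rangeDescriptor) removeReplica(nodeID, storeID int) bool {
	for i, r := range d.replicas {
		if r.NodeID == nodeID && r.StoreID == storeID {
			d.replicas = append(d.replicas[:i], d.replicas[i+1:]...)
			return true
		}
	}
	return false
}

// applyChanges copies the descriptor once and then mutates updatedDesc
// directly for every change; there is no separate newReplicas slice to keep
// in sync and no repeated SetReplicas call.
func applyChanges(desc rangeDescriptor, adds, removes []replicaDescriptor) (rangeDescriptor, bool) {
	updatedDesc := rangeDescriptor{
		replicas: append([]replicaDescriptor(nil), desc.replicas...),
	}
	for _, a := range adds {
		updatedDesc.addReplica(a.NodeID, a.StoreID)
	}
	for _, r := range removes {
		if !updatedDesc.removeReplica(r.NodeID, r.StoreID) {
			return updatedDesc, false
		}
	}
	return updatedDesc, true
}
```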
pkg/storage/replica_command.go, line 1221 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Something about how we update the `roachpb.RangeDescriptor` above this function call but then also need to pass in the `roachpb.ReplicationChanges` slice so we can construct `var added, removed []roachpb.ReplicaDescriptor` from the updated descriptor seems weird to me. Would it be easier for both the caller and the function itself if the function took `added, removed []roachpb.ReplicaDescriptor` arguments and performed the descriptor modification internally?
Good idea, done. I still have an itch to provide a better abstraction around the mutation of a replica descriptor via a sequence of ReplicationChanges (to avoid manually keeping the two in sync) but in the interest of time I'll leave that to future me.
Not passing changes here also made it possible to streamline the learner addition (which now takes a slice of targets only).
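Roughly what the streamlined learner phase might look like; the name addLearnerReplicas and the types are hypothetical, the real helper lives in replica_command.go:

```go
package sketch

import "context"

// Hypothetical stand-ins; the real types are roachpb.ReplicationTarget and
// roachpb.RangeDescriptor.
type replicationTarget struct{ NodeID, StoreID int }
type rangeDescriptor struct{ learners []replicationTarget }

// With the descriptor mutation moved into execChangeReplicasTxn, the learner
// phase only needs to know which targets to add, not the full change slice.
func addLearnerReplicas(
	ctx context.Context, desc rangeDescriptor, targets []replicationTarget,
) (rangeDescriptor, error) {
	for _, tgt := range targets {
		// In the real code each learner addition runs its own descriptor
		// update; here we just record the target.
		desc.learners = append(desc.learners, tgt)
	}
	return desc, nil
}
```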
Reviewed 6 of 6 files at r6.
Reviewable status: complete! 1 of 0 LGTMs obtained
TFTR!
Build failed (retrying...)
^- flaked on #39670
39640: storage: unify replica addition and removal paths r=nvanbenschoten a=tbg

This continues the reworking of the various replication change APIs with the goal of allowing a) testing of general atomic replication changes b) issuing replica swaps from the replicate queue (in 19.2).

For previous steps, see:
#39485
#39611

This change is not a pure plumbing PR. Instead, it unifies `(*Replica).addReplica` and `(*Replica).removeReplica` into a method that can do both, `(*Replica).addAndRemoveReplicas`.

Given a slice of ReplicationChanges, this method first adds learner replicas corresponding to the desired new voters. After having sent snapshots to all of them, the method issues a configuration change that atomically
- upgrades all learners to voters
- removes any undesired replicas.

Note that no atomic membership changes are *actually* carried out yet. This is because the callers of `addAndRemoveReplicas` pass in only a single change (i.e. an addition or removal), which the method also verifies.

Three pieces are missing after this PR:

First, we need to be able to instruct raft to carry out atomic configuration changes:
https://github.com/cockroachdb/cockroach/blob/2e8db6ca53c59d3d281e64939f79d937195403d4/pkg/storage/replica_proposal_buf.go#L448-L451
which in particular requires being able to store the ConfState corresponding to a joint configuration in the unreplicated local state (under a new key).

Second, we must pass the slice of changes handed to `AdminChangeReplicas` through to `addAndRemoveReplicas` without unrolling it first, see:
https://github.com/cockroachdb/cockroach/blob/3b316bac6ef342590ddc68d2989714d6e126371a/pkg/storage/replica_command.go#L870-L891
and
https://github.com/cockroachdb/cockroach/blob/3b316bac6ef342590ddc68d2989714d6e126371a/pkg/storage/replica.go#L1314-L1325

Third, we must teach the replicate queue to issue the "atomic swaps"; this is the reason we're introducing atomic membership changes in the first place.

Release note: None

39656: kv: init heartbeat txn log tag later r=nvanbenschoten a=tbg

At init() time, the txn proto has not been populated yet. Found while investigating #39652. This change strikes me as clunky, but I don't have the bandwidth to dig deeper right now.

Release note: None

39666: testutils/lint/passes: disable under nightly stress r=mjibson a=mjibson

Under stress these error with "go build a: failed to cache compiled Go files".

Fixes #39616
Fixes #39541
Fixes #39479

Release note: None

39669: rpc: use gRPC enforced minimum keepalive timeout r=knz a=ajwerner

Before this commit we'd experience the following annoying log message from gRPC every time we create a new connection telling us that our setting is being ignored.

```
Adjusting keepalive ping interval to minimum period of 10s
```

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
Co-authored-by: Matt Jibson <matt.jibson@gmail.com>
Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
Build succeeded
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @tbg)
pkg/storage/replica_command.go, line 972 at r6 (raw file):
```go
useLearners := useLearnerReplicas.Get(&settings.SV)
useLearners = useLearners && settings.Version.IsActive(cluster.VersionLearnerReplicas)
if !useLearners {
```
How does this work with replica removal when useLearners is false? If it's something I'm not seeing, this definitely deserves a comment.
I introduced a bug in cockroachdb#39640 which would fail removal of replicas whenever preemptive snapshots were used, since we'd accidentally send a preemptive snapshot which would end up with a "node already in descriptor" error. See: cockroachdb#39640 (review) Release note: None
Thanks for spotting this, here's a fix: #39697
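A hedged guess at the shape of the fix (the actual change is in #39697): the preemptive-snapshot fallback should only run for additions, so a removal-only change skips it entirely. All names below are stand-ins, not the real code:

```go
package sketch

import "context"

// Stand-in for a replication target.
type target struct{ NodeID, StoreID int }

// Stand-in for the real preemptive-snapshot path.
func sendPreemptiveSnapshot(ctx context.Context, tgt target) error { return nil }

// When learners are not active we fall back to preemptive snapshots, but only
// for replicas being added. A removal-only change must skip this phase
// entirely; otherwise the target, which is already in the descriptor, rejects
// the snapshot with a "node already in descriptor"-style error.
func maybeSendPreemptiveSnapshots(
	ctx context.Context, adds []target, useLearners bool,
) error {
	if useLearners {
		return nil // learner snapshots are handled in the learner phase
	}
	for _, tgt := range adds {
		if err := sendPreemptiveSnapshot(ctx, tgt); err != nil {
			return err
		}
	}
	return nil
}
```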
39697: storage: fix replica removal when learners not active r=danhhz a=tbg

I introduced a bug in #39640 which would fail removal of replicas whenever preemptive snapshots were used, since we'd accidentally send a preemptive snapshot which would end up with a "node already in descriptor" error.

See: #39640 (review)

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
This continues the reworking of the various replication change APIs with
the goal of allowing a) testing of general atomic replication changes b)
issuing replica swaps from the replicate queue (in 19.2).
For previous steps, see:
#39485
#39611
This change is not a pure plumbing PR. Instead, it unifies `(*Replica).addReplica` and `(*Replica).removeReplica` into a method that can do both, `(*Replica).addAndRemoveReplicas`.

Given a slice of ReplicationChanges, this method first adds learner replicas corresponding to the desired new voters. After having sent snapshots to all of them, the method issues a configuration change that atomically
- upgrades all learners to voters
- removes any undesired replicas.

Note that no atomic membership changes are actually carried out yet. This is because the callers of `addAndRemoveReplicas` pass in only a single change (i.e. an addition or removal), which the method also verifies.
Three pieces are missing after this PR: First, we need to be able to
instruct raft to carry out atomic configuration changes:
https://github.com/cockroachdb/cockroach/blob/2e8db6ca53c59d3d281e64939f79d937195403d4/pkg/storage/replica_proposal_buf.go#L448-L451
which in particular requires being able to store the ConfState
corresponding to a joint configuration in the unreplicated local state
(under a new key).
Second, we must pass the slice of changes handed to `AdminChangeReplicas` through to `addAndRemoveReplicas` without unrolling it first, see:
https://github.com/cockroachdb/cockroach/blob/3b316bac6ef342590ddc68d2989714d6e126371a/pkg/storage/replica_command.go#L870-L891
and
https://github.com/cockroachdb/cockroach/blob/3b316bac6ef342590ddc68d2989714d6e126371a/pkg/storage/replica.go#L1314-L1325
Third, we must teach the replicate queue to issue the "atomic swaps";
this is the reason we're introducing atomic membership changes in the first
place.
Release note: None