storage: use atomic replication changes in RelocateRange #41084
Conversation
pkg/storage/replica_command.go
Outdated
transferLease(*leaseTarget)
}

ctx = context.WithValue(ctx, "testing", "testing")
TODO: Need to fix this hack; without it AdminChangeReplicas will run the changes one by one.
Ok, I did something.
Force-pushed 58f3841 to 801cc5c
more straightforward than I was thinking it would be
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @danhhz and @tbg)
pkg/storage/client_relocate_range_test.go, line 36 at r1 (raw file):
require.NoError(t, tc.Servers[0].DB().AdminRelocateRange(ctx, startKey.AsRawKey(), targets))
var desc roachpb.RangeDescriptor
require.NoError(t, tc.Servers[0].DB().GetProto(
why not tc.Servers[0].LookupRange?
pkg/storage/client_relocate_range_test.go, line 168 at r1 (raw file):
// A grab bag.
{
	// s3 -(add)-> s3 s2 -(swap)-> s4 s2 -(add)-> s4 s2 s1
nit: the last part of this comment doesn't match
pkg/storage/client_relocate_range_test.go, line 172 at r1 (raw file):
relocateAndCheck(t, tc, k, tc.Targets(1, 3, 0))
})
// s2 s4 s1 -(add)-> s2 s4 s1 s6
nit: the last part of this comment doesn't match
pkg/storage/replica_command.go, line 2020 at r1 (raw file):
}
for _, substr := range whitelist {
	if strings.Contains(err.Error(), substr) {
nit: probably better to be defensive and call err.Error() once
pkg/storage/replica_command.go, line 2064 at r1 (raw file):
}
ops, leaseTarget, err := s.relocateOne(ctx, &rangeDesc, targets)
check this err
pkg/storage/replica_command.go, line 2077 at r1 (raw file):
ctx = context.WithValue(ctx, "testing", "testing")
newDesc, err := s.DB().AdminChangeReplicas(
nit: seems like this should fit on one line
pkg/storage/replica_command.go, line 2085 at r1 (raw file):
}
if every.ShouldLog() {
	log.Warning(ctx, returnErr)
Warning seems a bit high for this given the "descriptor changed" stuff
pkg/storage/replica_command.go, line 2087 at r1 (raw file):
	log.Warning(ctx, returnErr)
}
re.Next()
nit: move this into the for {?
pkg/storage/replica_command.go, line 2105 at r1 (raw file):
rangeReplicas := desc.Replicas().All()
if len(rangeReplicas) != len(desc.Replicas().Voters()) {
	return nil, nil, crdberrors.AssertionFailedf(
looks like you lost the comment pointing out why this is safe to assert
pkg/storage/replica_command.go, line 2247 at r1 (raw file):
liReq.Key = desc.StartKey.AsRawKey()
b.AddRawRequest(liReq)
if err := s.DB().Run(ctx, &b); err != nil {
we don't already have a helper for this?
pkg/storage/replica_command.go, line 2012 at r2 (raw file):
//
// TODO(tbg): remove in 20.1.
ctx = context.WithValue(ctx, "testing", "testing")
we should update this value name now that it's not just testing
pkg/storage/replica_command.go, line 2091 at r2 (raw file):
}
// When a swap is in order but we don't have atomic replication changes,
// run the ops one by one.
do we need this? ChangeReplicas seems to unroll them
Force-pushed 801cc5c to b317bd4
TFTR! This is basically ready to be merged I think, but I made enough changes to wait for another pass. I've also noticed that the test I added sometimes takes ~1.5s but other times 11s. Looking into why, it seems that there are unexplained pauses sometimes, such as this one:
I190926 09:07:22.951158 4200 storage/replica_command.go:2075 [n4,s4,r25/9:/{Table/Max-Max}] TBG ====== LOOP START
I190926 09:07:22.951197 4200 storage/replica_command.go:2077 [n4,s4,r25/9:/{Table/Max-Max}] TBG ====== ATTEMPT START
I190926 09:07:22.952095 4303 storage/replica_command.go:1583 [n3,s3,r25/12:/{Table/Max-Max}] change replicas (add [] remove [(n2,s2):8]): existing descriptor r25:/{Table/Max-Max} [(n5,s5):13, (n2,s2):8, (n3,s3):12, (n6,s6):11, next=14, gen=34, sticky=9223372036.854775807,2147483647]
I190926 09:07:25.843486 4303 storage/replica_raft.go:291 [n3,s3,r25/12:/{Table/Max-Max}] proposing REMOVE_REPLICA[(n2,s2):8]: after=[(n5,s5):13 (n6,s6):11 (n3,s3):12] next=14
I190926 09:07:25.844452 1120 storage/store.go:2660 [n2,s2,r25/8:/{Table/Max-Max}] removing replica r25/8
I'm going to spend some time later today figuring out what those are about, there's probably an opportunity to make replication more snappy across the board.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @danhhz)
pkg/storage/client_relocate_range_test.go, line 168 at r1 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
nit: the last part of this comment doesn't match
Added the reordered forms (I intentionally didn't reorder so that each step would be easier to follow along with).
pkg/storage/replica_command.go, line 2085 at r1 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
Warning seems a bit high for this given the "descriptor changed" stuff
That's the old code, but I agree. Downgraded.
pkg/storage/replica_command.go, line 2087 at r1 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
nit: move this into the for {?
I rearranged this a bit -- the for loop was both "do a thing at a time until no more things need doing" and "retry the current thing with a backoff". Pulled those apart so we now have one more layer of for loops and more idiomatic-looking backoff code.
pkg/storage/replica_command.go, line 2247 at r1 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
we don't already have a helper for this?
Not that I'm aware, I've seen helpers only in the testcluster machinery.
pkg/storage/replica_command.go, line 2091 at r2 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
do we need this? ChangeReplicas seems to unroll them
No, added a comment
// When a swap is in order but we're not sure that all nodes are running
// 19.2+ (in which the AdminChangeReplicas RPC was extended to support
// mixing additions and removals), don't send such requests but unroll
// the ops here, running them one by one. 19.2 nodes that don't have the
// cluster version enabled would unroll them internally, but 19.1 nodes
// don't even understand the RPC and would interpret all ops as
// additions (which would fail since we can't add an existing node).
Ok, I figured it out. TL;DR is that we sometimes end up removing the raft leader and then we have to wait out the 3s election timeout. Filed #41122 to fix this.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @danhhz)
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @danhhz and @tbg)
pkg/storage/replica_command.go, line 2091 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
No, added a comment
// When a swap is in order but we're not sure that all nodes are running
// 19.2+ (in which the AdminChangeReplicas RPC was extended to support
// mixing additions and removals), don't send such requests but unroll
// the ops here, running them one by one. 19.2 nodes that don't have the
// cluster version enabled would unroll them internally, but 19.1 nodes
// don't even understand the RPC and would interpret all ops as
// additions (which would fail since we can't add an existing node).
Aha, yes, good to have this comment
pkg/storage/replica_command.go, line 2075 at r3 (raw file):
every := log.Every(time.Minute)
for {
	for re := retry.StartWithCtx(ctx, retry.Options{MaxBackoff: 5 * time.Second}); ; re.Next() {
I suspect some of the control flow gets easier to follow (breaks and success bool etc) if this inner loop moves into a method
Reviewed 1 of 3 files at r1, 5 of 5 files at r3.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @danhhz and @tbg)
pkg/internal/client/db.go, line 564 at r3 (raw file):
// ChangeReplicasCanMixAddAndRemoveContext convinces
// (*client.DB).AdminChangeReplicas of that the caller is aware that 19.1 nodes
s/of //
pkg/storage/client_relocate_range_test.go, line 108 at r3 (raw file):
intercepted = append(intercepted, intercept{
	ops:         ops,
	leaseTarget: leaseTarget,
We don't seem to ever check this. Should we be? I don't see us checking the leaseholder anywhere else, but the comments imply that we know who it will become.
pkg/storage/client_relocate_range_test.go, line 138 at r3 (raw file):
swap out replicas one by one
Isn't this doing the opposite? Or do you mean that it's performing three atomic replica changes, one by one?
pkg/storage/replica_command.go, line 2087 at r3 (raw file):
// Done.

// Step 3: Transfer the lease to the first listed target, as the API specifies.
transferLease(targets[0])
Should we just make relocateOne return this as the leaseTarget in this case?
pkg/storage/replica_command.go, line 2103 at r3 (raw file):
opss := [][]roachpb.ReplicationChange{ops}
if !useAtomic && len(ops) == 2 {
Why only if len(ops) == 2? If I'm reading the comment above, we'd want to do this whenever !useAtomic.
Force-pushed 7e5a515 to f60f07e
Got errors under stressrace related to verifying the lease placement which I fixed; seems stable on my gceworker. In 19.2 we should look into making sure the lease transfers become mandatory, right now relocate will just fail weirdly if any of them don't go through (but oddly they seem to have a 100% success rate).
Thanks for the reviews!
bors r=danhhz
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @danhhz and @nvanbenschoten)
pkg/storage/client_relocate_range_test.go, line 108 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
We don't seem to ever check this. Should we be? I don't see us checking the leaseholder anywhere else, but the comments imply that we know who it will become.
The test verifies that the lease ends up in the right place (see requireLeaseAt); it's hard to predict the exact lease transfers in between since the allocator ultimately gets to decide what order things happen in, and the allocator tends to be... not deterministic.
pkg/storage/client_relocate_range_test.go, line 138 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
swap out replicas one by one
Isn't this doing the opposite? Or do you mean that it's performing three atomic replica changes, one by one?
Yep. Clarified the comment.
pkg/storage/replica_command.go, line 2075 at r3 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
I suspect some of the control flow gets easier to follow (breaks and success bool etc) if this inner loop moves into a method
Maybe, but the return err paths get more awkward. I'm going to leave as is for now since the whole opss thing goes away in 19.2 and with it the weird control flow.
pkg/storage/replica_command.go, line 2103 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Why only if len(ops) == 2? If I'm reading the comment above, we'd want to do this whenever !useAtomic.
len(ops) is either 1 or 2, and if it's 1 then we shouldn't make opss have an extra nil slice at the end. I added an assertion just above to make sure this is clear to readers.
41084: storage: use atomic replication changes in RelocateRange r=danhhz a=tbg

I wrote this up rather quickly, but wanted to get this out for review sooner rather than later.

----

Touches #12768.

Release justification: the previous code would enter vulnerable replication configurations when it wasn't necessary, thus undermining what we wanted to achieve in #12768.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
Build failed
Touches cockroachdb#12768.

Release justification: the previous code would enter vulnerable replication configurations when it wasn't necessary, thus undermining what we wanted to achieve in cockroachdb#12768.

PS: the newly added test passes when manually overriding useAtomic with `false`, and I verified that it indeed uses no atomic replication changes.

Release note: None
Force-pushed f60f07e to d9412d3
Linters were unhappy. bors r=danhhz
41084: storage: use atomic replication changes in RelocateRange r=danhhz a=tbg

I wrote this up rather quickly, but wanted to get this out for review sooner rather than later.

----

Touches #12768.

Release justification: the previous code would enter vulnerable replication configurations when it wasn't necessary, thus undermining what we wanted to achieve in #12768.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
Build succeeded
I wrote this up rather quickly, but wanted to get this out for review sooner
rather than later.
Touches #12768.
Release justification: the previous code would enter vulnerable
replication configurations when it wasn't necessary, thus undermining
what we wanted to achieve in #12768.
Release note: None