storage: use atomic replication changes in RelocateRange #41084
Conversation
pkg/storage/replica_command.go
Outdated
transferLease(*leaseTarget)
}

ctx = context.WithValue(ctx, "testing", "testing")
TODO: Need to fix this hack; without it AdminChangeReplicas will run the changes one by one.
Ok, I did something.
Force-pushed 58f3841 to 801cc5c
more straightforward than I was thinking it would be
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @danhhz and @tbg)
pkg/storage/client_relocate_range_test.go, line 36 at r1 (raw file):
require.NoError(t, tc.Servers[0].DB().AdminRelocateRange(ctx, startKey.AsRawKey(), targets))
var desc roachpb.RangeDescriptor
require.NoError(t, tc.Servers[0].DB().GetProto(
why not tc.Servers[0].LookupRange?
pkg/storage/client_relocate_range_test.go, line 168 at r1 (raw file):
// A grab bag.
{
	// s3 -(add)-> s3 s2 -(swap)-> s4 s2 -(add)-> s4 s2 s1
nit: the last part of this comment doesn't match
pkg/storage/client_relocate_range_test.go, line 172 at r1 (raw file):
relocateAndCheck(t, tc, k, tc.Targets(1, 3, 0))
})
// s2 s4 s1 -(add)-> s2 s4 s1 s6
nit: the last part of this comment doesn't match
pkg/storage/replica_command.go, line 2020 at r1 (raw file):
}
for _, substr := range whitelist {
	if strings.Contains(err.Error(), substr) {
nit: probably better to be defensive and call err.Error() once
pkg/storage/replica_command.go, line 2064 at r1 (raw file):
}
ops, leaseTarget, err := s.relocateOne(ctx, &rangeDesc, targets)
check this err
pkg/storage/replica_command.go, line 2077 at r1 (raw file):
ctx = context.WithValue(ctx, "testing", "testing")
newDesc, err := s.DB().AdminChangeReplicas(
nit: seems like this should fit on one line
pkg/storage/replica_command.go, line 2085 at r1 (raw file):
}
if every.ShouldLog() {
	log.Warning(ctx, returnErr)
Warning seems a bit high for this given the "descriptor changed" stuff
pkg/storage/replica_command.go, line 2087 at r1 (raw file):
	log.Warning(ctx, returnErr)
}
re.Next()
nit: move this into the for {?
pkg/storage/replica_command.go, line 2105 at r1 (raw file):
rangeReplicas := desc.Replicas().All()
if len(rangeReplicas) != len(desc.Replicas().Voters()) {
	return nil, nil, crdberrors.AssertionFailedf(
looks like you lost the comment pointing out why this is safe to assert
pkg/storage/replica_command.go, line 2247 at r1 (raw file):
liReq.Key = desc.StartKey.AsRawKey()
b.AddRawRequest(liReq)
if err := s.DB().Run(ctx, &b); err != nil {
we don't already have a helper for this?
pkg/storage/replica_command.go, line 2012 at r2 (raw file):
//
// TODO(tbg): remove in 20.1.
ctx = context.WithValue(ctx, "testing", "testing")
we should update this value name now that it's not just testing
pkg/storage/replica_command.go, line 2091 at r2 (raw file):
}
// When a swap is in order but we don't have atomic replication changes,
// run the ops one by one.
do we need this? ChangeReplicas seems to unroll them
Force-pushed 801cc5c to b317bd4
TFTR! This is basically ready to be merged I think, but I made enough changes to wait for another pass. I've also noticed that the test I added sometimes takes ~1.5s but other times 11s. Looking into why, it seems that there are unexplained pauses sometimes, such as this one:
I190926 09:07:22.951158 4200 storage/replica_command.go:2075 [n4,s4,r25/9:/{Table/Max-Max}] TBG ====== LOOP START
I190926 09:07:22.951197 4200 storage/replica_command.go:2077 [n4,s4,r25/9:/{Table/Max-Max}] TBG ====== ATTEMPT START
I190926 09:07:22.952095 4303 storage/replica_command.go:1583 [n3,s3,r25/12:/{Table/Max-Max}] change replicas (add [] remove [(n2,s2):8]): existing descriptor r25:/{Table/Max-Max} [(n5,s5):13, (n2,s2):8, (n3,s3):12, (n6,s6):11, next=14, gen=34, sticky=9223372036.854775807,2147483647]
I190926 09:07:25.843486 4303 storage/replica_raft.go:291 [n3,s3,r25/12:/{Table/Max-Max}] proposing REMOVE_REPLICA[(n2,s2):8]: after=[(n5,s5):13 (n6,s6):11 (n3,s3):12] next=14
I190926 09:07:25.844452 1120 storage/store.go:2660 [n2,s2,r25/8:/{Table/Max-Max}] removing replica r25/8
I'm going to spend some time later today figuring out what those are about, there's probably an opportunity to make replication more snappy across the board.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @danhhz)
pkg/storage/client_relocate_range_test.go, line 168 at r1 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
nit: the last part of this comment doesn't match
Added the reordered forms (I intentionally didn't reorder so that each step would be easier to follow along with).
pkg/storage/replica_command.go, line 2085 at r1 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
Warning seems a bit high for this given the "descriptor changed" stuff
That's the old code, but I agree. Downgraded.
pkg/storage/replica_command.go, line 2087 at r1 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
nit: move this into the for {?
I rearranged this a bit -- the for loop was both "do a thing at a time until no more things need doing" and "retry the current thing with a backoff". Pulled those apart so we now have one more layer of for loops and more idiomatic-looking backoff code.
pkg/storage/replica_command.go, line 2247 at r1 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
we don't already have a helper for this?
Not that I'm aware, I've seen helpers only in the testcluster machinery.
pkg/storage/replica_command.go, line 2091 at r2 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
do we need this? ChangeReplicas seems to unroll them
No, added a comment
// When a swap is in order but we're not sure that all nodes are running
// 19.2+ (in which the AdminChangeReplicas RPC was extended to support
// mixing additions and removals), don't send such requests but unroll
// the ops here, running them one by one. 19.2 nodes that don't have the
// cluster version enabled would unroll them internally, but 19.1 nodes
// don't even understand the RPC and would interpret all ops as
// additions (which would fail since we can't add an existing node).
Ok, I figured it out. TL;DR is that we sometimes end up removing the raft leader and then we have to wait out the 3s election timeout. Filed #41122 to fix this.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @danhhz)
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @danhhz and @tbg)
pkg/storage/replica_command.go, line 2091 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
No, added a comment
// When a swap is in order but we're not sure that all nodes are running
// 19.2+ (in which the AdminChangeReplicas RPC was extended to support
// mixing additions and removals), don't send such requests but unroll
// the ops here, running them one by one. 19.2 nodes that don't have the
// cluster version enabled would unroll them internally, but 19.1 nodes
// don't even understand the RPC and would interpret all ops as
// additions (which would fail since we can't add an existing node).
Aha, yes, good to have this comment
pkg/storage/replica_command.go, line 2075 at r3 (raw file):
every := log.Every(time.Minute)
for {
	for re := retry.StartWithCtx(ctx, retry.Options{MaxBackoff: 5 * time.Second}); ; re.Next() {
I suspect some of the control flow gets easier to follow (breaks and success bool etc) if this inner loop moves into a method
Reviewed 1 of 3 files at r1, 5 of 5 files at r3.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @danhhz and @tbg)
pkg/internal/client/db.go, line 564 at r3 (raw file):
// ChangeReplicasCanMixAddAndRemoveContext convinces
// (*client.DB).AdminChangeReplicas of that the caller is aware that 19.1 nodes
s/of //
pkg/storage/client_relocate_range_test.go, line 108 at r3 (raw file):
intercepted = append(intercepted, intercept{
	ops:         ops,
	leaseTarget: leaseTarget,
We don't seem to ever check this. Should we be? I don't see us checking the leaseholder anywhere else, but the comments imply that we know who it will become.
pkg/storage/client_relocate_range_test.go, line 138 at r3 (raw file):
swap out replicas one by one
Isn't this doing the opposite? Or do you mean that it's performing three atomic replica changes, one by one?
pkg/storage/replica_command.go, line 2087 at r3 (raw file):
// Done.

// Step 3: Transfer the lease to the first listed target, as the API specifies.
transferLease(targets[0])
Should we just make relocateOne return this as the leaseTarget in this case?
pkg/storage/replica_command.go, line 2103 at r3 (raw file):
opss := [][]roachpb.ReplicationChange{ops}
if !useAtomic && len(ops) == 2 {
Why only if len(ops) == 2? If I'm reading the comment above, we'd want to do this whenever !useAtomic.
Force-pushed 7e5a515 to f60f07e
Got errors under stressrace related to verifying the lease placement which I fixed; seems stable on my gceworker. In 19.2 we should look into making sure the lease transfers become mandatory, right now relocate will just fail weirdly if any of them don't go through (but oddly they seem to have a 100% success rate).
Thanks for the reviews!
bors r=danhhz
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @danhhz and @nvanbenschoten)
pkg/storage/client_relocate_range_test.go, line 108 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
We don't seem to ever check this. Should we be? I don't see us checking the leaseholder anywhere else, but the comments imply that we know who it will become.
The test verifies that the lease ends up in the right place (see requireLeaseAt); it's hard to predict the exact lease transfers in between since the allocator ultimately gets to decide what order things happen in, and the allocator tends to be... not deterministic.
pkg/storage/client_relocate_range_test.go, line 138 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
swap out replicas one by one
Isn't this doing the opposite? Or do you mean that it's performing three atomic replica changes, one by one?
Yep. Clarified the comment.
pkg/storage/replica_command.go, line 2075 at r3 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
I suspect some of the control flow gets easier to follow (breaks and success bool etc) if this inner loop moves into a method
Maybe, but the return err paths get more awkward. I'm going to leave as is for now since the whole opss thing goes away in 19.2 and with it the weird control flow.
pkg/storage/replica_command.go, line 2103 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Why only if len(ops) == 2? If I'm reading the comment above, we'd want to do this whenever !useAtomic.
len(ops) is either 1 or 2, and if it's 1 then we shouldn't make opss have an extra nil slice at the end. I added an assertion just above to make sure this is clear to readers.
41084: storage: use atomic replication changes in RelocateRange r=danhhz a=tbg

I wrote this up rather quickly, but wanted to get this out for review sooner rather than later.

----

Touches #12768.

Release justification: the previous code would enter vulnerable replication configurations when it wasn't necessary, thus undermining what we wanted to achieve in #12768.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
Build failed
Touches cockroachdb#12768.

Release justification: the previous code would enter vulnerable replication configurations when it wasn't necessary, thus undermining what we wanted to achieve in cockroachdb#12768.

PS: the newly added test passes when manually overriding useAtomic with `false`, and I verified that it indeed uses no atomic replication changes.

Release note: None
Force-pushed f60f07e to d9412d3
Linters were unhappy. bors r=danhhz
41084: storage: use atomic replication changes in RelocateRange r=danhhz a=tbg

I wrote this up rather quickly, but wanted to get this out for review sooner rather than later.

----

Touches #12768.

Release justification: the previous code would enter vulnerable replication configurations when it wasn't necessary, thus undermining what we wanted to achieve in #12768.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
Build succeeded
I wrote this up rather quickly, but wanted to get this out for review sooner
rather than later.
Touches #12768.
Release justification: the previous code would enter vulnerable
replication configurations when it wasn't necessary, thus undermining
what we wanted to achieve in #12768.
Release note: None