
storage: ensure Replica objects never change replicaID #40751

Conversation


@ajwerner ajwerner commented Sep 13, 2019

WARNING: this change needs more testing to target the changes it makes
and at least some of the disabled tests should be reworked. This is a big
and scary change at this point in the cycle so I'm getting it out before
I'm really happy with it. There are some known TODOs.

On the plus side it does not seem to reproduce any crashes in hours with the
partitionccl.TestRepartitioning which readily produces crashes on master
when run under roachprod stress within ~20 minutes.

We've seen instability recently due to invariants being violated as
replicas catch up across periods of being removed and re-added to a range.
Due to learner replicas and their rollback behavior this is now a relatively
common case. Rather than handle all of these various scenarios this PR prevents
them from occurring by actively removing replicas when we determine that they
must have been removed.

Here's a high level overview of the change:

  • Once a Replica object has a non-zero Replica.mu.replicaID it will not
    change.
  • If a raft message or snapshot addressed to a higher replica ID is received
    the current replica will be removed completely.
  • If a replica sees a ChangeReplicasTrigger which removes it then it
    completely removes itself while applying that command.
  • Replica.mu.destroyStatus is used to meaningfully signify the removal state
    of a Replica. Replicas about to be synchronously removed are in
    destroyReasonRemovalPending.
  • The queues are now replica ID aware. If the replica that was added to the
    queue and the replica found when trying to pop it are not the same, and we
    knew the replica's ID when we added it, then we should not process it.

This hopefully gives us some new invariants:

  • There is only ever at most 1 *Replica which IsAlive() for a range on a Store
    at a time.
  • Once a *Replica has a non-zero ReplicaID it never changes.

The change also introduces some new complexity. Namely we now allow removal of
uninitialized replicas, including their hard state. This allows us to catch up
across a split even when we know the RHS must have been removed.
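
A toy illustration of the higher-replica-ID rule from the second bullet above (this is not code from the PR; the function and its parameters are invented for exposition):

```
// shouldRemoveLocalReplica is a stand-in for the real check: a raft message or
// snapshot addressed to a replica ID higher than the one we hold means our
// replica has been removed and re-added, so the current *Replica must be torn
// down before a replica with the higher ID can be created on this store.
func shouldRemoveLocalReplica(localReplicaID, incomingReplicaID int) bool {
	return localReplicaID != 0 && incomingReplicaID > localReplicaID
}
```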

Fixes #40367.

Release justification: This commit is safe for 19.2 because it fixes release
blockers.

Release note (bug fix): Fix crashes by preventing replica ID change.

@ajwerner
Contributor Author

Hold up on this: I introduced an issue with a recent revision. I'll post when it's RFAL.

@ajwerner ajwerner force-pushed the ajwerner/prevent-replica-id-changes branch from 39392d4 to 0b9ba31 on September 13, 2019 19:00
@cockroach-teamcity
Member

This change is Reviewable

@ajwerner ajwerner force-pushed the ajwerner/prevent-replica-id-changes branch 3 times, most recently from 1ec375a to 53562b2 on September 13, 2019 21:09
@ajwerner
Contributor Author

Okay this is RFAL. It's undoubtedly going to take some iteration. The testing work remains but I think the locking is now sound.

@tbg
Member

tbg commented Sep 14, 2019

Not reviewing due to vacation, but this sounds very exciting. Of course it's unfortunate that we have to do this under release pressure, but... well, I don't know that I'm really convinced that it's unfortunate. This is a major reduction in tech debt that I'm unapologetically happy to see.

Contributor

@bdarnell bdarnell left a comment


This is a very scary change to bring in at this late date. But I agree with @tbg that it's overall moving in the right direction, and I think it's probably better to do this than to try to go forward with learner replicas without this. I think it's a choice between this change and reverting learner replicas for this release.

My biggest concern is the transition: if we're moving to learner replicas by default, the interactions of all these changes with preemptive snapshots will almost certainly be under-tested.

Reviewed 24 of 24 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)


pkg/storage/client_merge_test.go, line 1666 at r1 (raw file):

		repl, err := store1.GetReplica(rangeID)
		if err != nil {
			continue

This looks wrong and at least needs a comment. If we're now expecting an error here we should use a testutil assertion to make sure we're getting the specific error we want.


pkg/storage/client_merge_test.go, line 2039 at r1 (raw file):

	defer leaktest.AfterTest(t)()

	t.Skip("these invariants are no longer true")

Remember to delete any tests that are skipped because they no longer make sense before merging the PR (I like the skip-with-comment pattern for ease of early code reviewing).


pkg/storage/client_merge_test.go, line 2883 at r1 (raw file):

// mtcStoreRaftMessageHandler exists to allows a store to be stopped and
// restarted while maintaining a partition using an unreliableRaftHandler.
type mtcStoreRaftMessageHandler struct {

This is only used in client_raft_test.go. Maybe move this and unreliableRaftHandler to a client_raft_helpers_test.go?


pkg/storage/client_raft_test.go, line 1196 at r1 (raw file):

	// This used to fail with IntersectingSnapshotMsg because we relied on replica GC
	// to remove the LHS and that queue is disable. Now we will detect that the LHS is

s/disable/disabled/


pkg/storage/client_replica_gc_test.go, line 151 at r1 (raw file):

func TestReplicaGCQueueDropReplicaGCOnScan(t *testing.T) {
	defer leaktest.AfterTest(t)()
	t.Skip("the range will be removed synchronously now")

This test should probably be revived to prove that replicas that aren't GC'd synchronously will eventually be found asynchronously.


pkg/storage/client_split_test.go, line 3318 at r1 (raw file):

		_, err := tc.AddReplicas(kRHS, tc.Target(1))
		if !testutils.IsError(err, `snapshot intersects existing range|was not found on`) {
			t.Fatalf(`expected snapshot intersects existing range" error got: %+v`, err)

Update the fatal message too. Are both error messages now possible or is it reliably "was not found" now?


pkg/storage/client_test.go, line 1250 at r1 (raw file):

	return retry.ForDuration(testutils.DefaultSucceedsSoonDuration, func() error {
		_, err := m.stores[dest].GetReplica(rangeID)
		if err == nil {

Assert on the specific expected error here.


pkg/storage/client_test.go, line 1270 at r1 (raw file):

			log.VEventf(context.TODO(), 1, "engine %d: missing key %s", i, key)
		} else {
			var err error

Reducing the scope of the error here seems like a good practice, but I'm curious what prompted this specific case.


pkg/storage/queue.go, line 1179 at r1 (raw file):

		}

		// Replica not found, remove from set and try again.

s/not found/not found or was recreated with a new replica ID/


pkg/storage/replica.go, line 494 at r1 (raw file):

// ReplicaID returns the ID for the Replica. It may be zero if the replica does
// not know its ID.

Add comment that the ID may change from zero to non-zero, but once it has a non-zero value it will not change again.
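
For concreteness, a sketch of what the accessor and its updated comment might look like (the getter body here is an assumption, not taken from the diff):

```
// ReplicaID returns the ID for the Replica. It may be zero if the replica does
// not know its ID. The ID may change from zero to non-zero exactly once and
// will never change again after that.
func (r *Replica) ReplicaID() roachpb.ReplicaID {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return r.mu.replicaID
}
```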


pkg/storage/replica_application_result.go, line 310 at r1 (raw file):

	}
	// TODO(ajwerner): the data should already be destroyed at this point.
	// destroying the data again seems crazy.

Then why is it here? The "as late as possible" bit from the old code no longer applies (it was there because of the descriptor lookup in shouldQueue; we want to wait for the lease holder to apply the command).


pkg/storage/replica_application_state_machine.go, line 525 at r1 (raw file):

) bool {
	_, existsInDesc := desc.GetReplicaDescriptor(storeID)
	// NB: if we're catching up from a preemptive snapshot then we won't

It'll be nice to eventually remove these last vestiges of preemptive snapshots.


pkg/storage/replica_application_state_machine.go, line 539 at r1 (raw file):

	res := cmd.replicatedResult()

	// changeReplicasPreApply detects if this command will remove us from the range.

There's nothing named changeReplicasPreApply.


pkg/storage/replica_application_state_machine.go, line 540 at r1 (raw file):

	// changeReplicasPreApply detects if this command will remove us from the range.
	// If so we delete stage the removal of all of our range data into this batch.

s/delete //


pkg/storage/replica_application_state_machine.go, line 551 at r1 (raw file):

		// accepted. The replica will be destroyed in handleChangeReplicas.
		b.r.mu.Lock()
		b.r.mu.destroyStatus.Set(

It's not obvious that raftMu is held here, as required by the new comment.


pkg/storage/replica_destroy.go, line 128 at r1 (raw file):

// removeUninitializedReplica is called when we know that an uninitialized
// replica has been removed and re-added as a different replica. We're safe

Add more details here: after the first "replica", add "(which knows its replica ID but not its key range)", and after the second "replica" add "(with a new replica ID)").


pkg/storage/replica_destroy.go, line 130 at r1 (raw file):

// replica has been removed and re-added as a different replica. We're safe
// to GC its hard state because nobody cares about our votes anymore. The
// sad thing is we aren't safe to GC the range's data because we don't know

So will this sometimes leak data? Or is it guaranteed that in the case of the split the LHS will be able to take care of it? It's bad to introduce anything that could permanently leak data (we effectively had this before, but it was at least discoverable by scanning the range-id-local keyspace. If we delete the range ID local keys but leak regular data, it would be harder to discover and track).


pkg/storage/replica_init.go, line 220 at r1 (raw file):

	r.mu.minReplicaID = replicaID + 1

	// Reset the raft group to force its recreation on next usage.

Update or remove this comment.


pkg/storage/replica_init.go, line 232 at r1 (raw file):

	r.mu.internalRaftGroup = nil

	// If there was a previous replica, repropose its pending commands under

This was here for a reason; I think we need to be careful when dropping and recreating a Replica that we handle its pending commands properly. I don't think we have to repropose them, but they should at least get an error instead of being dropped on the floor.


pkg/storage/replica_learner_test.go, line 439 at r1 (raw file):

	tc.RemoveReplicasOrFatal(t, scratchStartKey, tc.Target(1)) // remove learner
	// We need to wait until the range has been GC'd on server 1 before
	// we know that it can be safely added again.

What does "safely" mean here? Just that adding it again will fail without breaking anything? If so, s/safely/successfully/.


pkg/storage/replica_raft.go, line 1556 at r1 (raw file):

	ctx context.Context, raftCmd storagepb.RaftCommand,
) (func(), error) {
	// There's a super nasty case where we're shutting down and we need to wait

What happens in this case? Should this be a TODO(ajwerner)?


pkg/storage/store.go, line 2357 at r1 (raw file):

			log.Fatalf(ctx, "refusing to clear replicated data for an initialized range %v in SplitRange", rightRepl)
		}
		// We want to delete this range's data but not its hard state as it

How could an uninitialized range have data covered by clearReplicatedOnly?


pkg/storage/store.go, line 2708 at r1 (raw file):

	defer s.mu.Unlock()

	// We need to launch an async task to remove this uninitialized replica.

Is this a stale comment? I'm not seeing the async task. It's also strange to see these sanity checks after the call to destroyUninitializedReplicaRaftMuLocked.


pkg/storage/store.go, line 3581 at r1 (raw file):

			return roachpb.NewError(err)
		}
		// We need to make sure we actually stepped that message.

Shouldn't this be conveyed by stepRaftGroup instead of a separate method called afterwards?


pkg/storage/store.go, line 3643 at r1 (raw file):

				storeID := repl.store.StoreID()
				// TODO(ajwerner): use a better error than RangeNotFound.

This will require adding more error cases in various places, which will probably have backwards-compatibility implications.


pkg/storage/store.go, line 3801 at r1 (raw file):

			elapsed.Seconds(), stats.entriesProcessed, stats.batchesProcessed, stats.stateAssertions)
	}
	// TODO(ajwerner): When does this case happen?

Uninitialized replicas will process raft Readies in order to vote (perhaps among other things, but mainly for voting)


pkg/storage/store_snapshot.go, line 647 at r1 (raw file):

		// withReplicaForRequest ensures that requests with a non-zero replica
		// id are passed to a replica with a matching id. Given this is not a
		// preemptive snapshot we know that its id must be non-zero.

How do we know it's not preemptive? I thought this code path was reached for any kind of snapshot.


pkg/storage/store_snapshot.go, line 740 at r1 (raw file):

		return crdberrors.AssertionFailedf(`expected a raft or learner snapshot`)
	}
	r := retry.StartWithCtx(ctx, retry.Options{

This retry loop worries me because it suggests a new range of possible failure modes here. Have we marked exactly the correct set of error cases with errRetry? What are the consequences of getting this wrong in either direction?


pkg/storage/store_snapshot.go, line 762 at r1 (raw file):

}

// shouldAcceptSnapshotData is an optimization to check whether we should even

Move this comment to the method it describes.


pkg/storage/store_snapshot.go, line 801 at r1 (raw file):

	existingRepl.mu.RUnlock()

	// shouldDestroyExisting is set to true if after the switch the existingRepl

There's nothing called shouldDestroyExisting.


pkg/storage/store_snapshot.go, line 852 at r1 (raw file):

			// TODO(ajwerner): consider making this an assertion or using a better error.
			// We only ever snapshot from leaseholders. Maybe it's possible for an
			// old leaseholder to send an old snapshot. It'd be nice if we had a

Yes, I think old leaseholders are the reason for this case.


pkg/storage/store_snapshot.go, line 871 at r1 (raw file):

		// move backwards.
		existingDesc = existingRepl.mu.state.Desc
		if existingRepl.mu.destroyStatus.RemovalPending() {

Should a pending removal be an errRetry here? Comment why we're doing a second check of destroyStatus.


pkg/storage/store_test.go, line 573 at r1 (raw file):

		DestroyData: true,
	}); err != nil {
		t.Fatalf("didn't expected error re-removing same range: %v", err)

s/ed//

@ajwerner ajwerner force-pushed the ajwerner/prevent-replica-id-changes branch from 30ef8d5 to 3eada93 on September 16, 2019 13:31
Contributor Author

@ajwerner ajwerner left a comment


Thanks for the review! I took a first pass at your comments but need to keep working on testing. There are a couple scenarios that need new tests and a couple of disabled tests which definitely need to be updated.

I completely share your concern about mixed version clusters. I'm hopeful that it's generally going to be okay because in pre-learners we take care to not add replicas until we've successfully sent a pre-emptive snapshot which tends to mitigate many of these add and remove scenarios.

I'm all ears if you have more testing ideas.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @bdarnell, and @nvanbenschoten)


pkg/storage/client_merge_test.go, line 1666 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This looks wrong and at least needs a comment. If we're now expecting an error here we should use a testutil assertion to make sure we're getting the specific error we want.

Definitely it's wrong. I started mucking around with this test and then decided to skip it. Will update.


pkg/storage/client_merge_test.go, line 2039 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Remember to delete any tests that are skipped because they no longer make sense before merging the PR (i like the skip-with-comment pattern for ease of early code reviewing)

Will do, not quite there yet. I want to revisit each of these tests individually.


pkg/storage/client_merge_test.go, line 2883 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This is only used in client_raft_test.go. Maybe move this and unreliableRaftHandler to a client_raft_helpers_test.go?

Good idea. I'm also going to type up a simple abstraction to make these things easier to set up and use.


pkg/storage/client_raft_test.go, line 1196 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

s/disable/disabled/

Done.


pkg/storage/client_replica_gc_test.go, line 151 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This test should probably be revived to prove that replicas that aren't GC'd synchronously will eventually be found asynchronously.

Agreed, will get to it next.


pkg/storage/client_split_test.go, line 3318 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Update the fatal message too. Are both error messages now possible or is it reliably "was not found" now?

Updated the message and added a comment. They both now happen.


pkg/storage/client_test.go, line 1250 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Assert on the specific expected error here.

Done.


pkg/storage/client_test.go, line 1270 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Reducing the scope of the error here seems like a good practice, but I'm curious what prompted this specific case.

Reverted. During an intermediate commit I had created a readBytesFromEngine method and this code was moved to part of a helper that returned []roachpb.Value.


pkg/storage/queue.go, line 1179 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

s/not found/not found or was recreated with a new replica ID/

Done.


pkg/storage/replica.go, line 494 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Add comment that the ID may change from zero to non-zero, but once it has a non-zero value it will not change again.

Done.


pkg/storage/replica_application_result.go, line 310 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Then why is it here? The "as late as possible" bit from the old code no longer applies (it was there because of the descriptor lookup in shouldQueue; we want to wait for the lease holder to apply the command).

It's here as a hack. The applied state key gets written after the clearing of the data. Maybe other keys do too. This TODO definitely needs to get addressed before merging this change.


pkg/storage/replica_application_state_machine.go, line 525 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

It'll be nice to eventually remove these last vestiges of preemptive snapshots.

Agreed. That will need to wait until 20.1.


pkg/storage/replica_application_state_machine.go, line 539 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

There's nothing named changeReplicasPreApply.

Right, I inlined that method here; updated.


pkg/storage/replica_application_state_machine.go, line 540 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

s/delete //

Done.


pkg/storage/replica_application_state_machine.go, line 551 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

It's not obvious that raftMu is held here, as required by the new comment.

everything underneath application holds the raftMu. Added more commentary.


pkg/storage/replica_application_state_machine.go, line 610 at r1 (raw file):

		// merge transaction commits.

		// TODO(ajwerner): we know we're going to get something here for the right

re: the interactions of all these changes with preemptive snapshots will almost certainly be under-tested.

This case should only arise in the presence of a preemptive snapshot. I don't think it's handled correctly here. I suspect that the right course of action if we are trying to apply a merge while not part of the range is to completely destroy the preemptive snapshot.
I think I definitely need to write a test to exercise this specific case before this merges. The logic here remains exposed to the hazard.


pkg/storage/replica_destroy.go, line 128 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Add more details here: after the first "replica", add "(which knows its replica ID but not its key range)", and after the second "replica" add "(with a new replica ID)").

Done.


pkg/storage/replica_destroy.go, line 130 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

So will this sometimes leak data? Or is it guaranteed that in the case of the split the LHS will be able to take care of it? It's bad to introduce anything that could permanently leak data (we effectively had this before, but it was at least discoverable by scanning the range-id-local keyspace. If we delete the range ID local keys but leak regular data, it would be harder to discover and track).

Updated the comment to make it less scary. Indeed we'll always end up removing the data. I was less generally confident of that when this was first typed.


pkg/storage/replica_init.go, line 220 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Update or remove this comment.

Done.


pkg/storage/replica_init.go, line 232 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This was here for a reason; I think we need to be careful when dropping and recreating a Replica that we handle its pending commands properly. I don't think we have to repropose them, but they should at least get an error instead of being dropped on the floor.

I think they will get an error. Both replica removal paths call cancelPendingCommandsLocked()
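
A minimal sketch of what "cancel rather than drop" means here; the method name comes from the reply above, but its signature, body, and the finishWithError helper are assumptions made for illustration:

```
// cancelPendingCommandsLocked, sketched: each pending proposal is completed
// with an error so the goroutine waiting on it is unblocked rather than left
// hanging when the replica removes itself. finishWithError is a hypothetical
// stand-in for however the proposal is actually signalled.
func (r *Replica) cancelPendingCommandsLocked(err error) {
	for id, p := range r.mu.proposals {
		p.finishWithError(err)
		delete(r.mu.proposals, id)
	}
}
```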


pkg/storage/replica_learner_test.go, line 439 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

What does "safely" mean here? Just that adding it again will fail without breaking anything? If so, s/safely/successfully/.

Done.


pkg/storage/replica_raft.go, line 1556 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

What happens in this case? Should this be a TODO(ajwerner)?

This is stale. I typed this comment before I made removal synchronous in getOrCreateReplica(). In fact it was this case that motivated the synchronous removal as opposed to using the queue. If we were shutting down and called getOrCreateReplica() underneath raft we'd crash if we returned an error but nothing would ever run the queue.

Fixed.


pkg/storage/store.go, line 2357 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

How could an uninitialized range have data covered by clearReplicatedOnly?

Notice that the descriptor is from the split, not the uninitialized replica. The LHS had the data and now the RHS doesn't know about it. If we didn't clear the data then it could be leaked. I've updated the comment.

This needs a targeted test before this change merges.


pkg/storage/store.go, line 2708 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Is this a stale comment? I'm not seeing the async task. It's also strange to see these sanity checks after the call to destroyUninitializedReplicaRafMuLocked.

All around, yes I agree. Sanity checks moved up


pkg/storage/store.go, line 3581 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Shouldn't this be conveyed by stepRaftGroup instead of a separate method called afterwards?

Yes it should. I've added a return value to stepRaftGroup and also to withRaftGroup. It potentially adds some cruft to the code but it at least forces callers to think about destroyStatus.


pkg/storage/store.go, line 3643 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This will require adding more error cases in various places, which will probably have backwards-compatibility implications.

Removed the TODO, added a NB


pkg/storage/store.go, line 3801 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Uninitialized replicas will process raft Readies in order to vote (perhaps among other things, but mainly for voting)

Thanks. I left this for myself to come back to a while ago.


pkg/storage/store_snapshot.go, line 647 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

How do we know it's not preemptive? I thought this code path was reached for any kind of snapshot.

See the assertion at the top of the method. canApplyPreemptiveSnapshot is now a different method.


pkg/storage/store_snapshot.go, line 740 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This retry loop worries me because it suggests a new range of possible failure modes here. Have we marked exactly the correct set of error cases with errRetry? What are the consequences of getting this wrong in either direction?

errRetry should only be returned if the current replica to which this is being addressed has a true value for IsRemovalPending().

I'm increasingly convinced that this method could use withReplicaForRequest. The logic is very similar and the larger store mutex footprint here doesn't necessarily seem justified. It seems okay to grab the mutex before calling checkSnapshotOverlapLocked.
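
To make the retry behavior concrete, here is a self-contained sketch of the shape being discussed (the errRetry sentinel, the check callback, and the backoff are stand-ins, not the PR's code):

```
import (
	"context"
	"errors"
	"time"
)

// errRetry signals that the addressed replica has a removal pending and the
// snapshot check should be re-run once that removal finishes.
var errRetry = errors.New("retry: removal pending")

// checkWithRetry loops only on errRetry; any other result, including success,
// is returned to the sender immediately.
func checkWithRetry(ctx context.Context, check func() error) error {
	for {
		if err := check(); err != errRetry {
			return err
		}
		select {
		case <-time.After(time.Millisecond): // crude backoff for the sketch
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}
```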


pkg/storage/store_snapshot.go, line 762 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Move this comment to the method it describes.

Done.


pkg/storage/store_snapshot.go, line 801 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

There's nothing called shouldDestroyExisting.

That was stale, updated.


pkg/storage/store_snapshot.go, line 852 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Yes, I think old leaseholders are the reason for this case.

Done.


pkg/storage/store_snapshot.go, line 871 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Should a pending removal be an errRetry here? Comment why we're doing a second check of destroyStatus.

Yes, pending means another goroutine is synchronously removing this replica. I augmented the comment above to clarify why we're checking the destroy status again.


pkg/storage/store_test.go, line 573 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

s/ed//

Done.

Member

@nvanbenschoten nvanbenschoten left a comment


Still working my way through, but about to get pulled away from the review for a bit, so publishing what I've looked at so far.

Reviewed 6 of 24 files at r1, 8 of 18 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @bdarnell, and @nvanbenschoten)


pkg/storage/queue.go, line 62 at r2 (raw file):

// processing state.
type replicaItem struct {
	value     roachpb.RangeID

nit: s/value/rangeID/ in its own commit.


pkg/storage/queue.go, line 1174 at r2 (raw file):

		repl, _ := bq.getReplica(item.value)
		if repl != nil {
			if item.replicaID == 0 || item.replicaID == repl.ReplicaID() {

nit: pull this into a sameReplica helper with a little description about the invariant it's meant to uphold.
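
A minimal version of the suggested helper, matching the quoted condition (the name follows the suggestion above; the final shape in the PR may differ):

```
// sameReplica reports whether a queued item may be processed against the
// replica currently in the store. A zero queuedReplicaID means the ID was
// unknown at enqueue time, so any replica is accepted; otherwise the IDs must
// match, which keeps the queue from processing an item against a replica that
// has since been removed and re-created with a new ID.
func sameReplica(queuedReplicaID, currentReplicaID roachpb.ReplicaID) bool {
	return queuedReplicaID == 0 || queuedReplicaID == currentReplicaID
}
```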


pkg/storage/queue.go, line 1174 at r2 (raw file):

		repl, _ := bq.getReplica(item.value)
		if repl != nil {
			if item.replicaID == 0 || item.replicaID == repl.ReplicaID() {

This could use a test in queue_test.go.
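
Not the queue-level test being asked for, but as an illustration, a unit-style check of the sameReplica sketch above could look like this (test name and cases are invented):

```
func TestSameReplica(t *testing.T) {
	testCases := []struct {
		queued, current roachpb.ReplicaID
		want            bool
	}{
		{queued: 0, current: 7, want: true},  // ID unknown at enqueue time: process
		{queued: 7, current: 7, want: true},  // same replica: process
		{queued: 7, current: 9, want: false}, // removed and re-added with a new ID: skip
	}
	for _, tc := range testCases {
		if got := sameReplica(tc.queued, tc.current); got != tc.want {
			t.Errorf("sameReplica(%d, %d) = %t, want %t", tc.queued, tc.current, got, tc.want)
		}
	}
}
```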


pkg/storage/replica.go, line 494 at r2 (raw file):

// ReplicaID returns the ID for the Replica. It may be zero if the replica does
// not know its ID. One a Replica has a non-zero ReplicaID it will never change.

s/One/Once/

Also, minor nit but this definition seems a little high in this file. It's above the interface assertions and the constructor. Perhaps move above Replica.String?


pkg/storage/replica_gc_queue.go, line 117 at r2 (raw file):

func (rgcq *replicaGCQueue) shouldQueue(
	ctx context.Context, now hlc.Timestamp, repl *Replica, _ *config.SystemConfig,
) (should bool, prio float64) {

nit: ) (shouldQ bool, priority float64) { seems to be the standard naming.


pkg/storage/replica_gc_queue.go, line 226 at r2 (raw file):

	// Maybe it was deleted "under us" by being moved.
	currentDesc, currentMember := replyDesc.GetReplicaDescriptor(repl.store.StoreID())
	if desc.RangeID == replyDesc.RangeID && currentMember && currentDesc.ReplicaID == replicaID {

Does every single queue need to perform this check? I thought this was why we put it in baseQueue. Or is this special because it's checking the current replica descriptor from the meta range?


pkg/storage/replica_gc_queue.go, line 289 at r2 (raw file):

		// If we're a member of the current range descriptor then we need to put
		// down a tombstone that will allow us to get a snapshot. Otherwise put

Why will a tombstone allow us to get a snapshot? Is this corresponding to the case where if desc.RangeID == replyDesc.RangeID && currentMember && currentDesc.ReplicaID > replicaID { and you want to allow the replica in the descriptor to be added to the store once the current Replica object is GC'ed? Could you elaborate on this a little bit?


pkg/storage/replica_init.go, line 162 at r2 (raw file):

func (r *Replica) setReplicaIDRaftMuLockedMuLocked(replicaID roachpb.ReplicaID) error {
	if r.mu.replicaID == replicaID {

Now that we're calling this in fewer places, do we even need such a general interface? Can we make this panic in all cases that r.mu.replica != 0?


pkg/storage/replica_init.go, line 179 at r2 (raw file):

	}
	ctx := r.AnnotateCtx(context.TODO())
	if r.mu.replicaID != 0 && replicaID != 0 {

We already checked that replicaID != 0 above.


pkg/storage/replica_init.go, line 187 at r2 (raw file):

	}

	// if r.mu.replicaID != 0 {

We can remove this now.


pkg/storage/replica_init.go, line 268 at r2 (raw file):

	defer r.mu.Unlock()

	// If we raced on the checking the destroyStatus above that's fine as

"on the checking the"


pkg/storage/replica_proposal_buf.go, line 622 at r2 (raw file):

	// Pass true for mayCampaignOnWake because we're about to propose a command.
	// Ignore removal errors: the proposals which might be resident in this buffer
	// must also be in the Replica's proposals map and will be notified.

Why is this true? If the proposals in the buffer haven't been flushed then they won't be in the Replica's proposal map. I think FlushLockedWithoutProposing and its uses might be relevant to answering this question.

Member

@nvanbenschoten nvanbenschoten left a comment


Reviewed 7 of 18 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @bdarnell)


pkg/storage/replica_application_result.go, line 310 at r1 (raw file):

Previously, ajwerner wrote…

It's here as a hack. The applied state key gets written after the clearing of the data. Maybe other keys do too. This TODO definitely needs to get addressed before merging this change.

You addressed this by not writing the applied state key, right?


pkg/storage/replica_application_state_machine.go, line 530 at r2 (raw file):

		return false
	}
	_, existsInChange := change.Desc.GetReplicaDescriptor(storeID)

Do we need backwards compat with ChangeReplicas that only included a DeprecatedUpdatedReplicas?


pkg/storage/replica_application_state_machine.go, line 601 at r2 (raw file):

			ctx, b.batch, b.batch, merge.RightDesc.NextReplicaID, clearRangeIDLocalOnly, mustClearRange,
		); err != nil {
			return wrapWithNonDeterministicFailure(err, "unable to destroy range before merge")

Do you mind s/range/replica/?


pkg/storage/replica_application_state_machine.go, line 642 at r2 (raw file):

	}

	// Detect if this command will remove us from the range.

nit: move right below the res.Merge case.


pkg/storage/replica_application_state_machine.go, line 664 at r2 (raw file):

		if err := b.r.preDestroyRaftMuLocked(
			ctx,
			b.r.Engine(),

s/b.r.Engine()/b.batch/


pkg/storage/replica_application_state_machine.go, line 668 at r2 (raw file):

			change.Desc.NextReplicaID,
			clearAll,
			true, /* mustClearRange */

Why is this true? Maybe we should rename this to mustUseRangeDelTombstone or something to avoid confusion.


pkg/storage/replica_application_state_machine.go, line 670 at r2 (raw file):

			true, /* mustClearRange */
		); err != nil {
			return err

return wrapWithNonDeterministicFailure(err, "unable to destroy replica before removal")


pkg/storage/replica_application_state_machine.go, line 930 at r2 (raw file):

		// and the first range has a lease that there will not be a later hard-state.
		if shouldAssert {
			sm.r.mu.Lock()

I wouldn't tie this to shouldAssert. It would be better to return a new flag from handleNonTrivialReplicatedEvalResult.


pkg/storage/replica_application_state_machine.go, line 1068 at r2 (raw file):

sm.batch.changeRemovesReplica

The batch is only a member of replicaStateMachine to avoid an allocation, so this isn't really kosher. The right way to do this is to either rediscover this fact (like we were doing before, right?) or to annotate the replicatedCmd.


pkg/storage/replica_application_state_machine.go, line 1096 at r2 (raw file):

			return nil
		}
		// If we're removed then we know that must have happened due to this

But we wouldn't have even made it here before returning an ErrRemoved error, right? So can we assert that the replica is never removed?


pkg/storage/replica_destroy.go, line 137 at r2 (raw file):

// created in anticipation of a split in which case we'll clear its data when
// the split trigger is applied.
func (r *Replica) destroyUninitializedReplicaRaftMuLocked(

Should this return an error that the caller can decide what to do with?


pkg/storage/replica_destroy.go, line 148 at r2 (raw file):

		batch,
		nextReplicaID,
		clearRangeIDLocalOnly,

Give this a small comment.


pkg/storage/replica_raft.go, line 824 at r2 (raw file):

	// above with the apply.ErrRemoved.
	if isRemoved {
		return stats, expl, fmt.Errorf("replica removed while processing raft ready")

nit: errors.Errorf

Also, "replica unexpectedly removed ..."


pkg/storage/store.go, line 2166 at r2 (raw file):

		// only after we snapshot but for the sake of mixed version clusters and
		// other potential behavior changes we pass the desc for the known to be
		// old RHS and have SplitRange clear its data.

Add one more sentence about short-circuiting the rest of this function.


pkg/storage/store.go, line 2180 at r2 (raw file):

		// snapshot we couldn't possibly have received.
		if tombstone.NextReplicaID > rightDesc.ReplicaID {
			rightRng.mu.minReplicaID = 0

I don't follow from the comment above this why we only set minReplicaID to 0 when the tombstone is past the rightDesc.ReplicaID.


pkg/storage/store.go, line 2285 at r2 (raw file):

// replica has been removed and re-added before applying this split and we need
// to leave it uninitialized. It's not possible for it to have become
// initialized

Missing a period.

Also, mention that this method clears the data on the RHS in this case.


pkg/storage/store.go, line 2313 at r2 (raw file):

		// NB: We only remove from uninitReplicas and the replicaQueues maps here
		// so that we don't leave open a window where a replica is temporarily not
		// present in Store.mu.replicas.

Add to this comment about the case where oldRightDesc != nil.


pkg/storage/store.go, line 2691 at r2 (raw file):

	// TODO(peter): Could release s.mu.Lock() here.
	s.maybeGossipOnCapacityChange(ctx, rangeRemoveEvent)
	s.replicaGCQueue.MaybeRemove(rep.RangeID)

Isn't this called by s.scanner.RemoveReplica?


pkg/storage/store.go, line 3670 at r2 (raw file):

				repl.mu.Unlock()
				if !repl.IsInitialized() {
					return s.removeUninitializedReplicaRaftMuLocked(ctx, repl, tErr.ReplicaID+1)

nit: pull out nextReplicaID := tErr.ReplicaID+1 above.


pkg/storage/store.go, line 4054 at r2 (raw file):

	creatingReplica *roachpb.ReplicaDescriptor,
) (_ *Replica, created bool, _ error) {
	r := retry.Start(retry.Options{

Leave a comment about why we need a retry loop so we might eventually feel comfortable removing it.


pkg/storage/store.go, line 4113 at r2 (raw file):

			repl.mu.Unlock()
			defer repl.mu.Lock()
			if !repl.isInitializedRLocked() {

Didn't we just release the lock?


pkg/storage/store.go, line 4148 at r2 (raw file):

		var err error
		if repl.mu.replicaID == 0 {
			err = repl.setReplicaIDRaftMuLockedMuLocked(replicaID)

What's the point of this? Don't we know this to be a no-op?


pkg/storage/store_snapshot.go, line 740 at r1 (raw file):

Previously, ajwerner wrote…

errRetry should only be returned if the current replica to which this is being addressed has a true value forIsRemovalPending().

I'm increasingly convinced that this method could use withReplicaForRequest. The logic is very similar and the larger store mutex footprint here doesn't necessarily seem justified. It seems okay to grab the mutex before calling checkSnapshotOverlapLocked

Where did you land on this?


pkg/storage/store_snapshot.go, line 871 at r1 (raw file):

Previously, ajwerner wrote…

Yes, pending means another goroutine is synchronously removing this replica. I augmented the comment above to clarify why we're checking the destroy status again.

Be a bit more explicit here. We're looking at this again because we released the lock above and are re-acquiring it again.


pkg/storage/store_snapshot.go, line 807 at r2 (raw file):

	// intended for existingRepl and if not, potentially destroy it.
	switch ds.reason {
	case destroyReasonRemovalPending, destroyReasonRemoved:

Is it worth checking whether the snapshot is intended for the replica ID that is being removed?


pkg/storage/store_snapshot.go, line 816 at r2 (raw file):

		return nil
	case destroyReasonAlive:

nit: stray line


pkg/storage/store_snapshot.go, line 843 at r2 (raw file):

		// replicaID) before the snapshot arrives.
		if okReplicaID {
			if !existingDesc.IsInitialized() && okReplicaID {

No need to check okReplicaID again.


pkg/storage/store_snapshot.go, line 860 at r2 (raw file):

		}

		// We think the snapshot is addressed to a higher replica ID which

We think?


pkg/storage/store_snapshot.go, line 880 at r2 (raw file):

		}
		if existingRepl.mu.replicaID == snapReplicaDesc.ReplicaID {
			if !existingDesc.IsInitialized() {

We have the same logic above and I'm trying to make sense of it. Only if the descriptor isn't initialized do we need to check for overlaps? Why is that? Are we searching for placeholders?

@ajwerner ajwerner force-pushed the ajwerner/prevent-replica-id-changes branch from 3eada93 to 48bd0b5 on September 16, 2019 18:35
Contributor Author

@ajwerner ajwerner left a comment


I've addressed some more comments but I still have more I want to do so hold off on taking another pass.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @bdarnell, and @nvanbenschoten)


pkg/storage/queue.go, line 62 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: s/value/rangeID/ in its own commit.

I'll move all of the queue changes over to their own commit.


pkg/storage/queue.go, line 1174 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: pull this into a sameReplica helper with a little description about the invariant it's meant to uphold.

Done.


pkg/storage/queue.go, line 1174 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This could use a test in queue_test.go.

Done.


pkg/storage/replica.go, line 494 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

s/One/Once/

Also, minor nit but this definition seems a little high in this file. It's above the interface assertions and the constructor. Perhaps move above Replica.String?

Done.


pkg/storage/replica_application_result.go, line 310 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

You addressed this by not writing the applied state key, right?

Correct.


pkg/storage/replica_application_state_machine.go, line 539 at r1 (raw file):

Previously, ajwerner wrote…

right I inlined that method here, updated

Done. I moved this whole block down too.


pkg/storage/replica_application_state_machine.go, line 551 at r1 (raw file):

Previously, ajwerner wrote…

everything underneath application holds the raftMu. Added more commentary.

Done.


pkg/storage/replica_application_state_machine.go, line 642 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: move right below the res.Merge case.

Doesn't this need to come after the truncated state case? Otherwise we'll write the truncated state to the batch after.


pkg/storage/replica_application_state_machine.go, line 664 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

s/b.r.Engine()/b.batch/

Done.


pkg/storage/replica_application_state_machine.go, line 668 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Why is this true? Maybe we should rename this to mustUseRangeDelTombstone or something to avoid confusion.

I don't know. It felt right given the name. Would you be happy with false?


pkg/storage/replica_application_state_machine.go, line 670 at r2 (raw file):

err, "unable to destroy replica before removal")
Done.


pkg/storage/replica_application_state_machine.go, line 930 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I wouldn't tie this to shouldAssert. It would be better to return a new flag from handleNonTrivialReplicatedEvalResult.

Done.


pkg/storage/replica_application_state_machine.go, line 1068 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
sm.batch.changeRemovesReplica

The batch is only a member of replicaStateMachine to avoid an allocation, so this isn't really kosher. The right way to do this is to either rediscover this fact (like we were doing before, right?) or to annotate the replicatedCmd.

Done.


pkg/storage/replica_application_state_machine.go, line 1096 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

But we wouldn't have even made it here before returning a ErrRemoved error, right? So can we assert that the replica is never removed?

Sure, assertion added.


pkg/storage/replica_destroy.go, line 137 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should this return an error that the caller can decide what to do with?

Done.


pkg/storage/replica_gc_queue.go, line 117 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: ) (shouldQ bool, priority float64) { seems to be the standard naming.

Done.


pkg/storage/replica_gc_queue.go, line 226 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Does every single queue need to perform this check? I thought this was why we put it in baseQueue. Or is this special because it's checking the current replica descriptor from the meta range?

I had to think about this a bunch but at the end I concluded that we don't need this case. The case that concerned me is where the replicaID is 0. I did however go back and add logic to improve the purgatory code because it wasn't right. I'm going


pkg/storage/replica_gc_queue.go, line 289 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Why will a tombstone allow us to get a snapshot? Is this corresponding to the case where if desc.RangeID == replyDesc.RangeID && currentMember && currentDesc.ReplicaID > replicaID { and you want to allow the replica in the descriptor to be added to the store once the current Replica object is GC'ed? Could you elaborate on this a little bit?

I've reflected on this some and I think we're safe because of the sanity checks in RemoveReplica. Reverted.


pkg/storage/replica_init.go, line 162 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Now that we're calling this in fewer places, do we even need such a general interface? Can we make this panic in all cases that r.mu.replica != 0?

Done.


pkg/storage/replica_init.go, line 187 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

We can remove this now.

Done.


pkg/storage/replica_init.go, line 268 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

"on the checking the"

Done.


pkg/storage/replica_proposal_buf.go, line 622 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Why is this true? If the proposals in the buffer haven't been flushed then they won't be in the Replica's proposal map. I think FlushLockedWithoutProposing and its uses might be relevant to answering this question.

Wishful thinking.


pkg/storage/replica_raft.go, line 824 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: errors.Errorf

Also, "replica unexpectedly removed ..."

Done.


pkg/storage/store.go, line 2166 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Add one more sentence about short-circuiting the rest of this function.

Done.


pkg/storage/store.go, line 2180 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I don't follow from the comment above this why we only set minReplicaID to 0 when the tombstone is past the rightDesc.ReplicaID.

If the tombstone is at or below rightDesc.ReplicaID then we don't need to touch minReplicaID.
I added some more commentary.
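
Stated as a tiny predicate (the function and parameter names are toy stand-ins; the real values are the tombstone's NextReplicaID and the RHS descriptor's ReplicaID from the split):

```
// tombstoneBlocksSplitRHS reports whether an existing tombstone would prevent
// the replica ID named by the split's right-hand side from being instantiated.
// Only in that case does minReplicaID need to be reset to 0; a tombstone at or
// below the RHS replica ID can be left alone.
func tombstoneBlocksSplitRHS(tombstoneNextReplicaID, rhsReplicaID roachpb.ReplicaID) bool {
	return tombstoneNextReplicaID > rhsReplicaID
}
```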


pkg/storage/store.go, line 2285 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Missing a period.

Also, mention that this method clears the data on the RHS in this case.

Done.


pkg/storage/store.go, line 2313 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Add to this comment about the case where oldRightDesc != nil.

Done.


pkg/storage/store.go, line 2691 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Isn't this called by s.scanner.RemoveReplica?

Indeed. Removed.


pkg/storage/store.go, line 3670 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: pull out nextReplicaID := tErr.ReplicaID+1 above.

Done.


pkg/storage/store.go, line 4054 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Leave a comment about why we need a retry loop so we might eventually feel comfortable removing it.

The retry loop isn't new but sure, added.


pkg/storage/store.go, line 4113 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Didn't we just release the lock?

Fixed.


pkg/storage/store.go, line 4148 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

What's the point of this? Don't we know this to be a no-op?

No, this is the case where a replica learns about its replica ID. This is common in the case of a pre-emptive snapshot.


pkg/storage/store_snapshot.go, line 740 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Where did you land on this?

I want to do it.


pkg/storage/store_snapshot.go, line 807 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Is it worth checking whether the snapshot is intended for the replica ID that is being removed?

Sure, we could return the error here. This case is


pkg/storage/store_snapshot.go, line 860 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

We think?

It's possible that it's not because we're not currently holding the mutex.


pkg/storage/store_snapshot.go, line 880 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

We have the same logic above and I'm trying to make sense of it. Only if the descriptor isn't initialized do we need to check for overlaps? Why is that? Are we searching for placeholders?

If the descriptor is initialized and covers more ranges then we'll subsume them. We need to accept a wider snapshot assuming we're in the range at this point.

@ajwerner ajwerner force-pushed the ajwerner/prevent-replica-id-changes branch from 48bd0b5 to 6b42756 on September 16, 2019 20:07
ajwerner added a commit to ajwerner/cockroach that referenced this pull request Sep 16, 2019
Before this PR we would write such a tombstone when detecting that a range had
been merged via a snapshot or via the replica gc queue but curiously not when
merging the range by applying a merge.

Release Justification: Came across this oddity while working on updating tests
for cockroachdb#40751. Maybe is not justified.

Release note: None
@ajwerner ajwerner force-pushed the ajwerner/prevent-replica-id-changes branch from 6b42756 to fbb46bc on September 17, 2019 12:09
Contributor Author

@ajwerner ajwerner left a comment


Progress continues but it's still not quite there yet. On the bright side I am able to produce a number of failures by performing a version upgrade during TestRepartitioning:

15e51bb

Over a long running period this should give us some added confidence. We should also stress the version upgrade tests.

Would it be helpful to split this change into ~3 commits?

  1. Make queues replica ID aware
  2. Ensure that replica IDs don't change
  3. Ensure that splits and merges are only processed by replicas in the range

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @bdarnell, and @nvanbenschoten)


pkg/storage/replica_proposal_buf.go, line 622 at r2 (raw file):

Previously, ajwerner wrote…

Wishful thinking.

Returning the error now. This should be the right thing.


pkg/storage/store_snapshot.go, line 740 at r1 (raw file):

Previously, ajwerner wrote…

I want to do it.

This turns out to be a major simplification and I think just works.

@ajwerner ajwerner force-pushed the ajwerner/prevent-replica-id-changes branch from fbb46bc to 2358606 on September 17, 2019 14:09
@ajwerner ajwerner requested review from a team as code owners September 17, 2019 14:09
@ajwerner ajwerner force-pushed the ajwerner/prevent-replica-id-changes branch from 2358606 to 9d8507e on September 17, 2019 14:47
@ajwerner
Contributor Author

Temporarily messed the diff up.

WARNING: this change needs more testing to target the changes it makes
and at least some of the disabled tests should be reworked. This is a big
and scary change at this point in the cycle so I'm getting it out before
I'm really happy with it. There are some known TODOs.

On the plus side it does not seem to reproduce any crashes in hours with the
`partitionccl.TestRepartitioning` which readily produces crashes on master
when run under roachprod stress within ~20 minutes.

We've seen instability recently due to invariants being violated as
replicas catch up across periods of being removed and re-added to a range.
Due to learner replicas and their rollback behavior this is now a relatively
common case. Rather than handle all of these various scenarios this PR prevents
them from occurring by actively removing replicas when we determine that they
must have been removed.

Here's a high level overview of the change:

 * Once a Replica object has a non-zero Replica.mu.replicaID it will not
   change.
 * If a raft message or snapshot addressed to a higher replica ID is received
   the current replica will be removed completely.
 * If a replica sees a ChangeReplicasTrigger which removes it then it
   completely removes itself while applying that command.
 * Replica.mu.destroyStatus is used to meaningfully signify the removal state
   of a Replica. Replicas about to be synchronously removed are in
   destroyReasonRemovalPending.
 * The queues are now replica ID aware. If the replica that was added to the
   queue and the replica found when trying to pop it are not the same, and we
   knew the replica's ID when we added it, then we should not process it.

This hopefully gives us some new invariants:

 * There is only ever at most 1 *Replica which IsAlive() for a range on a store
   at a time.
 * Once a *Replica has a non-zero ReplicaID it never changes.

The change also introduces some new complexity. Namely we now allow removal of
uninitialized replicas, including their hard state. This allows us to catch up
across a split even when we know the RHS must have been removed.

Fixes cockroachdb#40367.

Release justification: This commit is safe for 19.2 because it fixes release
blockers.

Release note (bug fix): Fix crashes by preventing replica ID change.
@ajwerner ajwerner force-pushed the ajwerner/prevent-replica-id-changes branch from 9d8507e to 6f7c3f2 on September 18, 2019 00:00
ajwerner added a commit to ajwerner/cockroach that referenced this pull request Sep 18, 2019
This changes the partitioning tests to use 9 replicas rather than 3 and
specifies lease_preferences in addition to constraints to allow the same
types of scans to work.

This test was intended to help catch errors cockroachdb#40751 but somehow seems to
reproduce cockroachdb#40213 even
on master.

```
I190918 05:26:52.501218 53874 storage/replica_gc_queue.go:286  [n1,replicaGC,s1,r14/1:/Table/1{8-9}] destroying local data
I190918 05:26:52.501282 53874 storage/store.go:2613  [n1,replicaGC,s1,r14/1:/Table/1{8-9}] removing replica r14/1
I190918 05:26:52.502818 53874 storage/replica_destroy.go:152  [n1,replicaGC,s1,r14/1:/Table/1{8-9}] removed 12 (0+12) keys in 1ms [clear=0ms commit=0ms]
I190918 05:26:56.941153 80013 storage/replica_command.go:1580  [n4,replicate,s4,r14/10:/Table/1{8-9}] change replicas (add [(n1,s1):12LEARNER] remove []): existing descriptor r14:/Table/1{8-9} [(n6,s6):11, (n9,s9):2, (n8,s8):9, (n2,s2):7, (n4,s4):10, next=12, gen=26]
F190918 05:26:57.019170 154 storage/store.go:4053  [n1,s1] found non-zero HardState.Commit on uninitialized replica [n1,s1,r14/?:{-}]. HS={Term:7 Vote:10 Commit:75 XXX_unrecognized:[]}
```

This reproduction occurs readily on this branch running roachprod stress:

```
make roachprod-stress CLUSTER=$USER-stress-2 PKG=./pkg/ccl/partitionccl TESTS=TestRepartitioning
```

Release note: None
ajwerner added a commit to ajwerner/cockroach that referenced this pull request Sep 18, 2019
This changes the partitioning tests to use 9 replicas rather than 3 and
specifies lease_preferences in addition to constraints to allow the same
types of scans to work.

This test was intended to help catch errors cockroachdb#40751 but somehow seems to
reproduce cockroachdb#40213 even
on master.

```
I190918 05:26:52.501218 53874 storage/replica_gc_queue.go:286  [n1,replicaGC,s1,r14/1:/Table/1{8-9}] destroying local data
I190918 05:26:52.501282 53874 storage/store.go:2613  [n1,replicaGC,s1,r14/1:/Table/1{8-9}] removing replica r14/1
I190918 05:26:52.502818 53874 storage/replica_destroy.go:152  [n1,replicaGC,s1,r14/1:/Table/1{8-9}] removed 12 (0+12) keys in 1ms [clear=0ms commit=0ms]
I190918 05:26:56.941153 80013 storage/replica_command.go:1580  [n4,replicate,s4,r14/10:/Table/1{8-9}] change replicas (add [(n1,s1):12LEARNER] remove []): existing descriptor r14:/Table/1{8-9} [(n6,s6):11, (n9,s9):2, (n8,s8):9, (n2,s2):7, (n4,s4):10, next=12, gen=26]
F190918 05:26:57.019170 154 storage/store.go:4053  [n1,s1] found non-zero HardState.Commit on uninitialized replica [n1,s1,r14/?:{-}]. HS={Term:7 Vote:10 Commit:75 XXX_unrecognized:[]}
...
```

This reproduction occurs readily on this branch running roachprod stress:

```
make roachprod-stress CLUSTER=$USER-stress-2 PKG=./pkg/ccl/partitionccl TESTS=TestRepartitioning
```

Release note: None
…arner

This commit adds an annotation to raft request messages to indicate that the
sender believes the current replica is a learner. If the current replica on
the recipient was created as a preemptive snapshot (it's initialized but not
in the range) then we should remove that replica immediately.

This commit also contains some testing of scenarios where we upgrade from a
preemptive snapshot.

Release note: None
@ajwerner ajwerner force-pushed the ajwerner/prevent-replica-id-changes branch from 6f7c3f2 to 966bce6 on September 18, 2019 14:00
ajwerner added a commit to ajwerner/cockroach that referenced this pull request Sep 18, 2019
This changes the partitioning tests to use 9 replicas rather than 3 and
specifies lease_preferences in addition to constraints to allow the same
types of scans to work.

This test was intended to help catch errors cockroachdb#40751 but somehow seems to
reproduce cockroachdb#40213 even
on master.

```
I190918 05:26:52.501218 53874 storage/replica_gc_queue.go:286  [n1,replicaGC,s1,r14/1:/Table/1{8-9}] destroying local data
I190918 05:26:52.501282 53874 storage/store.go:2613  [n1,replicaGC,s1,r14/1:/Table/1{8-9}] removing replica r14/1
I190918 05:26:52.502818 53874 storage/replica_destroy.go:152  [n1,replicaGC,s1,r14/1:/Table/1{8-9}] removed 12 (0+12) keys in 1ms [clear=0ms commit=0ms]
I190918 05:26:56.941153 80013 storage/replica_command.go:1580  [n4,replicate,s4,r14/10:/Table/1{8-9}] change replicas (add [(n1,s1):12LEARNER] remove []): existing descriptor r14:/Table/1{8-9} [(n6,s6):11, (n9,s9):2, (n8,s8):9, (n2,s2):7, (n4,s4):10, next=12, gen=26]
F190918 05:26:57.019170 154 storage/store.go:4053  [n1,s1] found non-zero HardState.Commit on uninitialized replica [n1,s1,r14/?:{-}]. HS={Term:7 Vote:10 Commit:75 XXX_unrecognized:[]}
...
```

This reproduction occurs readily on this branch running roachprod stress:

```
make roachprod-stress CLUSTER=$USER-stress-2 PKG=./pkg/ccl/partitionccl TESTS=TestRepartitioning
```

Release note: None
@ajwerner
Contributor Author

Closed in favor of #40892.

@ajwerner ajwerner closed this Sep 19, 2019
ajwerner added a commit to ajwerner/cockroach that referenced this pull request Sep 20, 2019
Before this PR we would write such a tombstone when detecting that a range had
been merged via a snapshot or via the replica gc queue but curiously not when
merging the range by applying a merge.

Release Justification: Came across this oddity while working on updating tests
for cockroachdb#40751.

Release note: None
craig bot pushed a commit that referenced this pull request Sep 20, 2019
40814: storage: write a tombstone with math.MaxInt32 when processing a merge r=ajwerner a=ajwerner

Before this PR we would write such a tombstone when detecting that a range had
been merged via a snapshot or via the replica gc queue but curiously not when
merging the range by applying a merge.

Release Justification: Came across this oddity while working on updating tests
for #40751. Maybe is not justified.

Release note: None

Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>

Successfully merging this pull request may close these issues.

storage: must never reuse state from old replicaIDs