WARNING: this change needs more testing to target the changes it makes
and at least some of the disabled tests should be reworked. This is a big
and scary change at this point in the cycle so I'm getting it out before
I'm really happy with it. There are some known TODOs.
On the plus side it does not seem to reproduce any crashes in hours with the
`partitionccl.TestRepartitioning` which readily produces crashes on master
when run under roachprod stress within ~20 minutes.
We've seen instability recently due to invariants being violated as
replicas catch up across periods of being removed and re-added to a range.
Due to learner replicas and their rollback behavior this is now a relatively
common case. Rather than handle all of these various scenarios this PR prevents
them from occuring by actively removing replicas when we determine that they
must have been removed.
Here's a high level overview of the change:
* Once a Replica object has a non-zero Replica.mu.replicaID it will not
change.
* If a raft message or snapshot addressed to a higher replica ID is received
the current replica will be removed completely.
* If a replica sees a ChangeReplicasTrigger which removes it then it
completely removes itself while applying that command.
* Replica.mu.destroyStatus is used to meaningfully signify the removal state
of a Replica. Replicas about to be synchronously removed are in
destroyReasonRemovalPending.
* The queues are now replica ID aware. If a replica was added to the queue
and the replica found when trying to pop are not the same and we knew the
replica ID of replica when we added it then we should not process it.
This hopefully gives us some new invariants:
* There is only ever at most 1 *Replica which IsAlive() for a range on a store
at a time.
* Once a *Replica has a non-zero ReplicaID is never changes.
The change also introduces some new complexity. Namely we now allow removal of
uninitialized replicas, including their hard state. This allows us to catch up
across a split even when we know the RHS must have been removed.
Fixes cockroachdb#40367.
Release justification: This commit is safe for 19.2 because it fixes release
blockers.
Release note (bug fix): Fix crashes by preventing replica ID change.