Commit
storage: delay application of preemptive snapshots
Preemptive snapshots are sent to a Store (by another Store) as part of the process of adding a new Replica to a Range. The sequence of events is:

- send a preemptive snapshot (replicaID=0) to the target
- target creates a Replica from the preemptive snapshot (replicaID=0)
- allocate new replicaID and add the target officially under that replicaID
- success (replicaID=nonzero)

They are problematic for a variety of reasons:

1. They introduce a Replica state, namely that of Replicas that have data but don't have a replicaID. Such replicas can't serve traffic and can't even have an initialized Raft group, so they're barely Replicas at all. Every bit of code in Replica needs to know about that.
2. The above state is implemented in an ad-hoc fashion and adds significantly to the complexity of the Store/Replica codebase.
3. Preemptive snapshots are subject to accidental garbage collection. There's currently no mechanism to decide whether a preemptive snapshot is simply waiting to be upgraded or whether it's abandoned. Accidental deletion causes another snapshot (this time a Raft snapshot) to be sent.
4. Adding to 1., there are transitions between regular Replicas and preemptive snapshots that add additional complexity. For example, a regular snapshot can apply on top of a preemptive snapshot and vice versa. We try to prevent some of them but there are technical problems.
5. Preemptive snapshots have a range descriptor that doesn't include the Replica created from them. This is another gotcha that code needs to be aware of. (We cannot fix this in the first iteration, but it will be fixed when [learner replicas] are standard.)

Our answer to all but the last of these problems is that we want to remove the concept of preemptive snapshots altogether and instead rely on [learner replicas]. This is a Raft concept denoting essentially a member of a replication group without a vote. By replacing the preemptive snapshot with the addition of a learner replica (before upgrading to a full voting member), preemptive snapshots are replaced by full replicas with a flag set.

However, as is often the case, the interesting question becomes that of the migration, or, the possibility of running a mixed version cluster in which one node knows about these changes and another doesn't. The basic requirement that falls out of this is that we have to be able to send preemptive snapshots to followers even using the new code, and we have to be able to receive preemptive snapshots using the new code (though that code will go cold once the cluster setting upgrade has happened).

Fortunately, sending and receiving preemptive snapshots is not what makes them terrible. In fact, the code that creates and receives preemptive snapshots is 100% shared with that for Raft snapshots. The complexity surrounding preemptive snapshots comes from what happens when they are used to create a Replica object too early, but this is an implementation detail not visible across RPC boundaries.

This suggests investigating how we can receive preemptive snapshots without actually using any of the internal code that handles them, so that this code can be removed in 19.2.

The basic idea is that we will write the preemptive snapshot to a temporary location (instead of creating a Replica from it) and apply it as a Raft snapshot the moment we observe a local Replica for the matching RangeID created as a full member of the Raft group (i.e. with nonzero replicaID).

This is carried out in this PR. Preemptive snapshots are put into a temporary in-memory map the size of which we aggressively keep under control (and which is cleared out periodically). Replica objects with replicaID zero are no longer instantiated.

See the companion POC [learner replicas] which doesn't bother with the migration but explores actually using learner replicas. When learner replicas are standard, 5. above is also mostly addressed: the replica will always be contained in its range descriptor, even though it may be as a learner.

TODO(tbg): preemptive snapshots stored on disk before this PR need to be deleted before we instantiate a Replica from them (because after this PR that will fail).

[learner replicas]: #35787
[SST snapshots]: #25134

Release note: None
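
For illustration only, the temporary in-memory map described in the message could look roughly like the sketch below. It is not the code from this PR: `pendingSnapshots`, `stash`, `take`, `clear`, and the byte budget are hypothetical names and choices made up for this example, and the real snapshot payload is reduced to a byte slice. The sketch only shows the three behaviors the message names: snapshots are held by RangeID under a size bound, they are handed out only once a Replica with a nonzero replicaID exists, and the map can be cleared wholesale.

```go
package main

import "sync"

// RangeID and ReplicaID stand in for the identifiers named in the commit
// message; the concrete types here are assumptions.
type RangeID int64
type ReplicaID int32

// snapshot is a hypothetical, simplified stand-in for a received snapshot.
type snapshot struct {
	rangeID RangeID
	data    []byte
}

// pendingSnapshots is a hypothetical size-bounded holding area for
// preemptive snapshots that have not yet been applied.
type pendingSnapshots struct {
	mu       sync.Mutex
	maxBytes int // upper bound on the total payload held
	curBytes int // payload currently held
	byRange  map[RangeID]snapshot
}

func newPendingSnapshots(maxBytes int) *pendingSnapshots {
	return &pendingSnapshots{
		maxBytes: maxBytes,
		byRange:  make(map[RangeID]snapshot),
	}
}

// stash stores snap, replacing any older snapshot for the same range.
// It reports false, and stores nothing, if that would exceed the budget.
func (p *pendingSnapshots) stash(snap snapshot) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	old := p.byRange[snap.rangeID] // zero value if absent
	newTotal := p.curBytes - len(old.data) + len(snap.data)
	if newTotal > p.maxBytes {
		return false
	}
	p.byRange[snap.rangeID] = snap
	p.curBytes = newTotal
	return true
}

// take removes and returns the snapshot held for id, but only when the
// local Replica is a full member (nonzero replicaID). The caller would then
// apply the result the same way as a Raft snapshot.
func (p *pendingSnapshots) take(id RangeID, replicaID ReplicaID) (snapshot, bool) {
	if replicaID == 0 {
		return snapshot{}, false
	}
	p.mu.Lock()
	defer p.mu.Unlock()
	snap, ok := p.byRange[id]
	if !ok {
		return snapshot{}, false
	}
	delete(p.byRange, id)
	p.curBytes -= len(snap.data)
	return snap, true
}

// clear drops every held snapshot; a periodic task could call this.
func (p *pendingSnapshots) clear() {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.byRange = make(map[RangeID]snapshot)
	p.curBytes = 0
}
```

Dropping a held snapshot is harmless in this scheme, because, as with accidental garbage collection in item 3 above, a missing preemptive snapshot only means a Raft snapshot is sent later.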