storage: delay application of preemptive snapshots #35786

Closed

Commits on Apr 9, 2019

  1. storage: delay application of preemptive snapshots

    Currently, in-memory Replica objects can end up having a replicaID of zero.
    Roughly speaking, this is always the case when a Replica's range
    descriptor does not contain the Replica's store, though sometimes we do
    have a replicaID taken from incoming Raft messages (which then won't
    survive across a restart).
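    
    To make the problem concrete, here is a minimal sketch (illustrative
    types and names, not the actual storage package code) of how looking up
    a store in a range descriptor yields a zero replicaID when the store is
    not listed:
    
    ```go
    package sketch
    
    // ReplicaDescriptor and RangeDescriptor are simplified stand-ins for
    // the protobuf-generated types in the real code base.
    type ReplicaDescriptor struct {
    	StoreID   int64
    	ReplicaID int64
    }
    
    type RangeDescriptor struct {
    	Replicas []ReplicaDescriptor
    }
    
    // replicaIDForStore returns the replicaID the descriptor records for the
    // given store, or zero if the descriptor does not contain the store at
    // all -- the state a preemptive snapshot leaves the recipient in.
    func replicaIDForStore(desc RangeDescriptor, storeID int64) int64 {
    	for _, rd := range desc.Replicas {
    		if rd.StoreID == storeID {
    			return rd.ReplicaID
    		}
    	}
    	return 0 // no replicaID allocated for this store yet
    }
    ```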
    
    We end up in this unnatural state mostly due to preemptive snapshots,
    which are a snapshot of the Range state before adding a certain replica,
    sent to the store that will house that replica once the configuration
    change to add it has completed. The range descriptor in the snapshot
    cannot yet assign the Replica a proper replicaID because none has been
    allocated yet (and this allocation has to be done in the replica change
    transaction, which hasn't started yet).
    
    Even when the configuration change completes and the leader starts
    "catching up" the replica created from the preemptive snapshot and
    informs it of its replicaID,
    it will take a few moments until the Replica catches up to the log entry
    that actually updates the descriptor. If the node reboots before that
    log entry is durably applied, the replicaID will "restart" at zero until
    the leader contacts the Replica again.
    
    This suggests that preemptive snapshots introduce fundamental complexity
    which we'd like to avoid - as long as we use preemptive snapshots there
    will not be sanity in this department.
    
    This PR introduces a mechanism which delays the application of
    preemptive snapshots so that we apply them only when the first request
    *after* the completed configuration change comes in (at which point a
    replicaID is present).
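    
    As a rough sketch of the idea (hypothetical names, heavily simplified
    relative to the actual change): the store stashes the snapshot instead
    of applying it, and applies it only once a request carrying a nonzero
    replicaID arrives.
    
    ```go
    package sketch
    
    import "sync"
    
    // pendingSnapshot holds the raw range state from a preemptive snapshot
    // that has been received but intentionally not yet applied.
    type pendingSnapshot struct {
    	data []byte
    }
    
    // Store is a stand-in for storage.Store; it keeps un-applied preemptive
    // snapshots keyed by rangeID.
    type Store struct {
    	mu      sync.Mutex
    	pending map[int64]pendingSnapshot
    }
    
    // stashPreemptiveSnapshot records the snapshot instead of creating a
    // Replica with replicaID zero.
    func (s *Store) stashPreemptiveSnapshot(rangeID int64, snap pendingSnapshot) {
    	s.mu.Lock()
    	defer s.mu.Unlock()
    	if s.pending == nil {
    		s.pending = map[int64]pendingSnapshot{}
    	}
    	s.pending[rangeID] = snap
    }
    
    // maybeApplyPending runs on the first request that arrives after the
    // configuration change; at that point a nonzero replicaID is known and
    // the Replica can be instantiated with a proper ID from the start.
    func (s *Store) maybeApplyPending(rangeID, replicaID int64) {
    	s.mu.Lock()
    	snap, ok := s.pending[rangeID]
    	delete(s.pending, rangeID)
    	s.mu.Unlock()
    	if !ok || replicaID == 0 {
    		return
    	}
    	applySnapshot(rangeID, replicaID, snap)
    }
    
    // applySnapshot is elided: it would write the snapshot data and create
    // the in-memory Replica with the now-known replicaID.
    func applySnapshot(rangeID, replicaID int64, snap pendingSnapshot) {}
    ```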
    
    Superficially, this seems to solve the above problem (since the Replica
    will only be instantiated the moment a replicaID is known), though it
    doesn't do so across restarts.
    
    However, if we synchronously persisted (not done in this PR) the
    replicaID from incoming Raft messages whenever it changed, it seems that
    we should always be able to assign a replicaID when creating a Replica,
    even when dealing with descriptors that don't contain the replica itself
    (since there would've been a Raft message with a replicaID at some
    point, and we persist that). This roughly corresponds to persisting
    `Replica.lastToReplica`.
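    
    A sketch of what that persistence might look like, assuming a simple
    key-value engine interface and an illustrative key layout (the real
    work would tie into the replica's local state keys):
    
    ```go
    package sketch
    
    import (
    	"encoding/binary"
    	"fmt"
    )
    
    // engine is a stand-in for the storage engine interface.
    type engine interface {
    	Put(key, value []byte) error
    }
    
    // persistReplicaID durably records the replicaID learned from an
    // incoming Raft message so that it survives a restart, even when the
    // applied descriptor does not yet contain this store. This corresponds
    // roughly to persisting Replica.lastToReplica.
    func persistReplicaID(eng engine, rangeID, replicaID int64) error {
    	key := []byte(fmt.Sprintf("local/range/%d/replica-id", rangeID))
    	val := make([]byte, 8)
    	binary.BigEndian.PutUint64(val, uint64(replicaID))
    	return eng.Put(key, val)
    }
    ```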
    
    We ultimately want to switch to learner replicas instead of preemptive
    snapshots. Learner replicas have the advantage that they are always
    represented in the range descriptor, and so the snapshot that
    initializes them will be a proper Raft snapshot whose descriptor
    contains the learner Replica itself. However, we clearly need to
    continue supporting preemptive snapshots in 19.2 in order to support
    mixed 19.1/19.2 clusters.
    
    This PR, in conjunction with persisting the replicaID (and auxiliary
    work, for example on the split lock, which currently also creates a
    replica with replicaID zero and which we know [has bugs]), should allow
    us to remove replicaID zero from the code base without waiting out the
    19.1 release cycle.
    
    [has bugs]: cockroachdb#21146
    
    Release note: None
    tbg committed Apr 9, 2019
    b8fc30a