storage: delay application of preemptive snapshots #35786

Closed

Commits on Apr 9, 2019

  1. storage: delay application of preemptive snapshots

    Currently, in-memory Replica objects can end up having a replicaID of zero.
    Roughly speaking, this is always the case when a Replica's range
    descriptor does not contain the Replica's store, though sometimes we do
    have a replicaID taken from incoming Raft messages (which then won't
    survive across a restart).
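    
    To make the problem concrete, here is a minimal sketch (illustrative
    types and names, not the actual storage package code) of how looking up
    a store in a range descriptor yields a zero replicaID when the store is
    not listed:
    
    ```go
    package sketch
    
    // ReplicaDescriptor and RangeDescriptor are simplified stand-ins for
    // the protobuf-generated types in the real code base.
    type ReplicaDescriptor struct {
    	StoreID   int64
    	ReplicaID int64
    }
    
    type RangeDescriptor struct {
    	Replicas []ReplicaDescriptor
    }
    
    // replicaIDForStore returns the replicaID the descriptor records for the
    // given store, or zero if the descriptor does not contain the store at
    // all -- the state a preemptive snapshot leaves the recipient in.
    func replicaIDForStore(desc RangeDescriptor, storeID int64) int64 {
    	for _, rd := range desc.Replicas {
    		if rd.StoreID == storeID {
    			return rd.ReplicaID
    		}
    	}
    	return 0 // no replicaID allocated for this store yet
    }
    ```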
    
    We end up in this unnatural state mostly due to preemptive snapshots,
    which are a snapshot of the Range state before adding a certain replica,
    sent to the store that will house that replica once the configuration
    change to add it has completed. The range descriptor in the snapshot
    cannot yet assign the Replica a proper replicaID because none has been
    allocated yet (and this allocation has to be done in the replica change
    transaction, which hasn't started yet).
    
    Even when the configuration change completes and the leader starts
    "catching up" the replica created from the preemptive snapshot and
    informs it of its replicaID,
    it will take a few moments until the Replica catches up to the log entry
    that actually updates the descriptor. If the node reboots before that
    log entry is durably applied, the replicaID will "restart" at zero until
    the leader contacts the Replica again.
    
    This suggests that preemptive snapshots introduce fundamental complexity
    which we'd like to avoid - as long as we use preemptive snapshots there
    will not be sanity in this department.
    
    This PR introduces a mechanism which delays the application of
    preemptive snapshots so that we apply them only when the first request
    *after* the completed configuration change comes in (at which point a
    replicaID is present).
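    
    As a rough sketch of the idea (hypothetical names, heavily simplified
    relative to the actual change): the store stashes the snapshot instead
    of applying it, and applies it only once a request carrying a nonzero
    replicaID arrives.
    
    ```go
    package sketch
    
    import "sync"
    
    // pendingSnapshot holds the raw range state from a preemptive snapshot
    // that has been received but intentionally not yet applied.
    type pendingSnapshot struct {
    	data []byte
    }
    
    // Store is a stand-in for storage.Store; it keeps un-applied preemptive
    // snapshots keyed by rangeID.
    type Store struct {
    	mu      sync.Mutex
    	pending map[int64]pendingSnapshot
    }
    
    // stashPreemptiveSnapshot records the snapshot instead of creating a
    // Replica with replicaID zero.
    func (s *Store) stashPreemptiveSnapshot(rangeID int64, snap pendingSnapshot) {
    	s.mu.Lock()
    	defer s.mu.Unlock()
    	if s.pending == nil {
    		s.pending = map[int64]pendingSnapshot{}
    	}
    	s.pending[rangeID] = snap
    }
    
    // maybeApplyPending runs on the first request that arrives after the
    // configuration change; at that point a nonzero replicaID is known and
    // the Replica can be instantiated with a proper ID from the start.
    func (s *Store) maybeApplyPending(rangeID, replicaID int64) {
    	s.mu.Lock()
    	snap, ok := s.pending[rangeID]
    	delete(s.pending, rangeID)
    	s.mu.Unlock()
    	if !ok || replicaID == 0 {
    		return
    	}
    	applySnapshot(rangeID, replicaID, snap)
    }
    
    // applySnapshot is elided: it would write the snapshot data and create
    // the in-memory Replica with the now-known replicaID.
    func applySnapshot(rangeID, replicaID int64, snap pendingSnapshot) {}
    ```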
    
    Superficially, this seems to solve the above problem (since the Replica
    will only be instantiated the moment a replicaID is known), though it
    doesn't do so across restarts.
    
    However, if we synchronously persisted (not done in this PR) the
    replicaID from incoming Raft messages whenever it changed, it seems that
    we should always be able to assign a replicaID when creating a Replica,
    even when dealing with descriptors that don't contain the replica itself
    (since there would've been a Raft message with a replicaID at some
    point, and we persist that). This roughly corresponds to persisting
    `Replica.lastToReplica`.
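    
    A sketch of what that persistence might look like, assuming a simple
    key-value engine interface and an illustrative key layout (the real
    work would tie into the replica's local state keys):
    
    ```go
    package sketch
    
    import (
    	"encoding/binary"
    	"fmt"
    )
    
    // engine is a stand-in for the storage engine interface.
    type engine interface {
    	Put(key, value []byte) error
    }
    
    // persistReplicaID durably records the replicaID learned from an
    // incoming Raft message so that it survives a restart, even when the
    // applied descriptor does not yet contain this store. This corresponds
    // roughly to persisting Replica.lastToReplica.
    func persistReplicaID(eng engine, rangeID, replicaID int64) error {
    	key := []byte(fmt.Sprintf("local/range/%d/replica-id", rangeID))
    	val := make([]byte, 8)
    	binary.BigEndian.PutUint64(val, uint64(replicaID))
    	return eng.Put(key, val)
    }
    ```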
    
    We ultimately want to switch to learner replicas instead of preemptive
    snapshots. Learner replicas have the advantage that they are always
    represented in the range descriptor, and so the snapshot that
    initializes them will be a proper Raft snapshot whose descriptor
    contains the learner Replica itself. However, we clearly need to
    continue supporting preemptive snapshots in 19.2 in order to support
    mixed 19.1/19.2 clusters.
    
    This PR, in conjunction with persisting the replicaID (and auxiliary
    work, for example on the split lock, which currently also creates a
    replica with replicaID zero and which we know [has bugs]), should allow
    us to remove replicaID zero from the code base without waiting out the
    19.1 release cycle.
    
    [has bugs]: cockroachdb#21146
    
    Release note: None
    tbg committed Apr 9, 2019
    b8fc30a