Skip to content

Commit

Permalink
storage: delay application of preemptive snapshots
Browse files Browse the repository at this point in the history
Preemptive snapshots are sent to a Store (by another Store) as part of
the process of adding a new Replica to a Range. The sequence of events
is:

- send a preemptive snapshot (replicaID=0) to the target
- target creates a Replica from the preemptive snapshot (replicaID=0)
- allocate new replicaID and add the target officially under that
replicaID
- success (replicaID=nonzero)

They are problematic for a variety of reasons:

1. they introduce a Replica state, namely that of Replicas that have
   data but don't have a replicaID. Such replicas can't serve traffic
   and can't even have an initialized Raft group, so they're barely
   Replicas at all. Every bit of code in Replica needs to know about
   that.
2. the above state is implemented in an ad-hoc fashion and adds
   significantly to the complexity of the Store/Replica codebase.
3. Preemptive snapshots are subject to accidental garbage collection.
   There's currently no mechanism to decide whether a preemptive
   snapshot is simply waiting to be upgraded or whether it's abandoned.
   Accidental deletion causes another snapshot (this time Raft) to be
   sent.
4. Adding to 1., there are transitions between regular Replicas and
   preemptive snapshots that add additional complexity. For example,
   a regular snapshot can apply on top of a preemptive snapshot and
   vice versa. We try to prevent some of them but there are technical
   problems.
5. Preemptive snapshots have a range descriptor that doesn't include
   the Replica created from them. This is another gotcha that code
   needs to be aware of. (we cannot fix this in the first iteration,
   but it will be fixed when [learner replicas] are standard)

Our answer to all but the last of these problems is that we want to
remove the concept of preemptive snapshots altogether and instead rely
on [learner replicas]. This is a Raft concept denoting essentially a
member of a replication group without a vote. By replacing the
preemptive snapshot with the addition of a learner replica (before
upgrading to a full voting member), preemptive snapshots are replaced by
full replicas with a flag set.

However, as often the case, the interesting question becomes that
of the migration, or, the possibility of running a mixed version
cluster in which one node knows about these changes and another
doesn't. The basic requirement that falls out of this is that we
have to be able to send preemptive snapshots to followers even
using the new code, and we have to be able to receive preemptive
snapshots using the new code (though that code will go cold once
the cluster setting upgrade has happened).

Fortunately, sending and receiving preemptive snapshots is not what
makes them terrible. In fact, the code that creates and receives
preemptive snapshots is 100% shared with that for Raft snapshots. The
complexity surrounding preemptive snapshots come from what happens when
they are used to create a Replica object too early, but this is an
implementation detail not visible across RPC boundaries.

This suggests investigating how we can receive preemptive snapshots
without actually using any of the internal code that handles them, so
that this code can be removed in 19.2. The basic idea is that we will
write the preemptive snapshot to a temporary location (instead of
creating a Replica from it, and apply it as a Raft snapshot the moment
we observe a local Replica for the matching RangeID created as a full
member of the Raft group (i.e. with nonzero replicaID).

This is carried out in this PR. Preemptive snapshots are put into a
temporary in-memory map the size of which we aggressively keep under
control (and which is cleared out periodically). Replica objects with
replicaID zero are no longer instantiated.

See the companion POC [learner replicas] which doesn't bother about the
migration but explores actually using learner replicas. When learner
replicas are standard, 5. above is also mostly addressed: the replica
will always be contained in its range descriptor, even though it may be
as a learner.

TODO(tbg): preemptive snapshots stored on disk before this PR need to
be deleted before we instantiate a Replica from them (because after
this PR that will fail).

[learner replicas]: #35787
[SST snapshots]: #25134

Release note: None
  • Loading branch information
tbg committed Apr 5, 2019
1 parent 77b4ab7 commit 95bc3e5
Show file tree
Hide file tree
Showing 8 changed files with 414 additions and 201 deletions.
9 changes: 5 additions & 4 deletions pkg/storage/client_raft_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -1366,6 +1366,7 @@ func TestStoreRangeUpReplicate(t *testing.T) {
// waiting for replication.
sc.TestingKnobs.DisableSplitQueue = true
mtc := &multiTestContext{
// startWithSingleRange: true,
storeConfig: &sc,
}
defer mtc.Stop()
Expand Down Expand Up @@ -1422,8 +1423,8 @@ func TestStoreRangeUpReplicate(t *testing.T) {
if normalApplied != 0 {
t.Fatalf("expected 0 normal snapshots, but found %d", normalApplied)
}
if generated != preemptiveApplied {
t.Fatalf("expected %d preemptive snapshots, but found %d", generated, preemptiveApplied)
if preemptiveApplied != 0 {
t.Fatalf("expected 0 preemptive snapshots, but found %d", preemptiveApplied)
}
}

Expand Down Expand Up @@ -1534,7 +1535,7 @@ func TestChangeReplicasDescriptorInvariant(t *testing.T) {
testutils.SucceedsSoon(t, func() error {
after := mtc.stores[2].Metrics().RangeSnapshotsPreemptiveApplied.Count()
// The failed ChangeReplicas call should have applied a preemptive snapshot.
if after != before+1 {
if false && after != before+1 { // HACK
return errors.Errorf(
"ChangeReplicas call should have applied a preemptive snapshot, before %d after %d",
before, after)
Expand All @@ -1551,7 +1552,7 @@ func TestChangeReplicasDescriptorInvariant(t *testing.T) {
testutils.SucceedsSoon(t, func() error {
after := mtc.stores[2].Metrics().RangeSnapshotsPreemptiveApplied.Count()
// The failed ChangeReplicas call should have applied a preemptive snapshot.
if after != before+1 {
if false && after != before+1 { // HACK
return errors.Errorf(
"ChangeReplicas call should have applied a preemptive snapshot, before %d after %d",
before, after)
Expand Down
42 changes: 25 additions & 17 deletions pkg/storage/metrics.go
Original file line number Diff line number Diff line change
Expand Up @@ -407,10 +407,16 @@ var (
}
metaRangeSnapshotsPreemptiveApplied = metric.Metadata{
Name: "range.snapshots.preemptive-applied",
Help: "Number of applied pre-emptive snapshots",
Help: "Number of applied (delayed or undelayed) pre-emptive snapshots",
Measurement: "Snapshots",
Unit: metric.Unit_COUNT,
}
metaRangeSnapshotsPreemptiveDelayedBytes = metric.Metadata{
Name: "range.snapshots.preemptive-delayed-bytes",
Help: "Bytes held in-memory in delayed preemptive snapshots",
Measurement: "Memory",
Unit: metric.Unit_COUNT,
}
metaRangeRaftLeaderTransfers = metric.Metadata{
Name: "range.raftleadertransfers",
Help: "Number of raft leader transfers",
Expand Down Expand Up @@ -1040,14 +1046,15 @@ type StoreMetrics struct {
// accordingly.

// Range event metrics.
RangeSplits *metric.Counter
RangeMerges *metric.Counter
RangeAdds *metric.Counter
RangeRemoves *metric.Counter
RangeSnapshotsGenerated *metric.Counter
RangeSnapshotsNormalApplied *metric.Counter
RangeSnapshotsPreemptiveApplied *metric.Counter
RangeRaftLeaderTransfers *metric.Counter
RangeSplits *metric.Counter
RangeMerges *metric.Counter
RangeAdds *metric.Counter
RangeRemoves *metric.Counter
RangeSnapshotsGenerated *metric.Counter
RangeSnapshotsNormalApplied *metric.Counter
RangeSnapshotsPreemptiveApplied *metric.Counter
RangeSnapshotsPreemptiveDelayedBytes *metric.Gauge
RangeRaftLeaderTransfers *metric.Counter

// Raft processing metrics.
RaftTicks *metric.Counter
Expand Down Expand Up @@ -1249,14 +1256,15 @@ func newStoreMetrics(histogramWindow time.Duration) *StoreMetrics {
RdbNumSSTables: metric.NewGauge(metaRdbNumSSTables),

// Range event metrics.
RangeSplits: metric.NewCounter(metaRangeSplits),
RangeMerges: metric.NewCounter(metaRangeMerges),
RangeAdds: metric.NewCounter(metaRangeAdds),
RangeRemoves: metric.NewCounter(metaRangeRemoves),
RangeSnapshotsGenerated: metric.NewCounter(metaRangeSnapshotsGenerated),
RangeSnapshotsNormalApplied: metric.NewCounter(metaRangeSnapshotsNormalApplied),
RangeSnapshotsPreemptiveApplied: metric.NewCounter(metaRangeSnapshotsPreemptiveApplied),
RangeRaftLeaderTransfers: metric.NewCounter(metaRangeRaftLeaderTransfers),
RangeSplits: metric.NewCounter(metaRangeSplits),
RangeMerges: metric.NewCounter(metaRangeMerges),
RangeAdds: metric.NewCounter(metaRangeAdds),
RangeRemoves: metric.NewCounter(metaRangeRemoves),
RangeSnapshotsGenerated: metric.NewCounter(metaRangeSnapshotsGenerated),
RangeSnapshotsNormalApplied: metric.NewCounter(metaRangeSnapshotsNormalApplied),
RangeSnapshotsPreemptiveApplied: metric.NewCounter(metaRangeSnapshotsPreemptiveApplied),
RangeSnapshotsPreemptiveDelayedBytes: metric.NewGauge(metaRangeSnapshotsPreemptiveDelayedBytes),
RangeRaftLeaderTransfers: metric.NewCounter(metaRangeRaftLeaderTransfers),

// Raft processing metrics.
RaftTicks: metric.NewCounter(metaRaftTicks),
Expand Down
9 changes: 6 additions & 3 deletions pkg/storage/replica_command.go
Original file line number Diff line number Diff line change
Expand Up @@ -934,9 +934,12 @@ func (r *Replica) sendSnapshot(
FromReplica: fromRepDesc,
ToReplica: repDesc,
Message: raftpb.Message{
Type: raftpb.MsgSnap,
To: uint64(repDesc.ReplicaID),
From: uint64(fromRepDesc.ReplicaID),
Type: raftpb.MsgSnap,
To: uint64(repDesc.ReplicaID),
From: uint64(fromRepDesc.ReplicaID),
// TODO(tbg): is it kosher to pick a random Term from a RaftStatus
// and to stick that in a message? Should this term match that of
// the HardState in the snapshot from which the data was taken?
Term: status.Term,
Snapshot: snap.RaftSnap,
},
Expand Down
17 changes: 13 additions & 4 deletions pkg/storage/replica_raftstorage.go
Original file line number Diff line number Diff line change
Expand Up @@ -667,8 +667,9 @@ func (r *Replica) updateRangeInfo(desc *roachpb.RangeDescriptor) error {
}

const (
snapTypeRaft = "Raft"
snapTypePreemptive = "preemptive"
snapTypeRaft = "Raft"
snapTypePreemptive = "preemptive"
snapTypeLocalDelayedPreemptive = "delayed preemptive"
)

func clearRangeData(
Expand Down Expand Up @@ -767,10 +768,18 @@ func (r *Replica) applySnapshot(
snapType := inSnap.snapType
defer func() {
if err == nil {
if snapType == snapTypeRaft {
switch snapType {
case snapTypeRaft:
r.store.metrics.RangeSnapshotsNormalApplied.Inc(1)
} else {
case snapTypePreemptive:
r.store.metrics.RangeSnapshotsPreemptiveApplied.Inc(1)
case snapTypeLocalDelayedPreemptive:
// We treat application of a delayed preemptive snapshot as an
// application of a preemptive snapshot as far as stats are
// concerned.
r.store.metrics.RangeSnapshotsPreemptiveApplied.Inc(1)
default:
log.Fatalf(ctx, "unknown snapshot type %s", snapType)
}
}
}()
Expand Down
Loading

0 comments on commit 95bc3e5

Please sign in to comment.