storage: improve semantics of desc.Generation
by propagating the new generation of the LHS of a split to the RHS and
by taking into account the generation of the RHS on merges, we can
compare generations between overlapping replicas to decide which one
is stale.

Depending on whether we allow anyone to upgrade from 19.1-rcX to 19.1
in a production setting, we may not be able to use these semantics
without a separate migration that puts additional state on the range
descriptor (which would be nice to avoid).

See cockroachdb#36611 for context.

Release note: None
tbg committed Apr 10, 2019
1 parent 76e8e78 commit 2422c42
Showing 4 changed files with 168 additions and 19 deletions.
103 changes: 86 additions & 17 deletions pkg/roachpb/metadata.pb.go

71 changes: 70 additions & 1 deletion pkg/roachpb/metadata.proto
@@ -95,12 +95,81 @@ message RangeDescriptor {

// generation is incremented on every split and every merge, i.e., whenever
// the end_key of this range changes. It is initialized to zero when the range
// is first created. The generation counter was first introduced to allow the
// range descriptor resulting from a split and then merge to be distinguishable
// from the initial range descriptor. This is important since changes to the
// range descriptors use CPuts to ensure mutual exclusion.
//
// See #28071 for details on the above.
//
// Generations are also useful to make local replicaGC decisions when applying
// a snapshot on keyspace that has overlapping replicas (but note that we do
// not use this at the time of writing due to migration concerns; see below).
//
// We want to be able to compare the snapshot range's generation counter to
// that of the overlapping replicas to draw a conclusion about whether the
// snapshot can be applied (in which case the overlapping replicas need to be
// safely removable). To that end, on a split, not only do we increment the
// left hand side's generation, we also copy the resultant generation to the
// newly created right hand side. On merges, we update the left hand side's
// generation so that it exceeds by one the maximum of the left hand side and
// the right hand side's generations from before the merge.
//
// If two replicas (perhaps one of them represented by a raft or preemptive
// snapshot) as defined by their full range descriptor (including, notably,
// the generation) overlap, then one of them has to be stale. This is because
// the keyspace cleanly shards into non-overlapping ranges at all times (i.e.
// for all consistent snapshots). Since meta ranges (or more generally, range
// descriptors) are only ever updated transactionally, mutations to the meta
// ranges can be serialized (i.e. put into some sequential ordering). We know
// that the descriptors corresponding to both of our replicas can't be from
// the same consistent snapshot of the meta ranges, so there is a version of
// the meta ranges that includes only the first replica, and there is a
// version that includes only the second replica. Without loss of generality,
// assume that the first version is "older". This means that there is a finite
// sequence of splits and merges that were applied to the consistent snapshot
// corresponding to the first version which resulted in the second version of
// the meta ranges.
//
// Each individual operation, thanks to the generational semantics above, has
// the invariant that the resulting descriptors have a strictly larger
// generation than any descriptors from the previous version that they cover.
// For example, if a descriptor [a,c) at generation 5 is split into [a,b) and
// [b,c), both of those latter range descriptors have generation 6. If [c,d)
// is at generation 12 and [d, f) is at generation 17, then the resulting
// merged range [c,f) will have generation 18.
//
// At the end of the day, for incoming snapshots, this means that we only have
// to collect the overlapping replicas and their generations. Any replica with
// a smaller generation is stale by the above argument and can be replicaGC'ed
// right away. Any replica with a larger generation indicates that the snapshot
// is stale and should be discarded. A replica with the same generation is
// necessarily a replica of the range the snapshot is addressing (this is the
// usual case, in which a snapshot "overlaps" precisely one replica, which is
// the replica it's supposed to update, and no splits and merges have taken
// place at all).
//
// Note that the generation counter is not incremented by versions of
// Cockroach prior to v2.1. To maintain backwards compatibility with these old
// versions of Cockroach, we cannot enable the gogoproto.nullable option, as
// we need to be able to encode this message with the generation field unset.
//
// Note also that when the generation counter was first introduced, it only
// ever incremented (by one) the generation of the left hand side on merges
// and splits, so the above overlap arguments only hold if we know that the
// descriptors involved never used that code. Generations were first introduced
// in the 19.1 release, though the behavior described here was only introduced
// in a late release candidate. If we allow such a release candidate cluster
// to transition into the final 19.1 release, we will need to introduce
// additional state to mark descriptors as obeying the new rules. If we don't,
// then we are free to assume that the semantics always hold.
//
// As a third note, observe that the generational semantics above may
// possibly allow range merges without colocation, at least in the sense that
// the counterexamples in #28071 are defused. This is because the
// generational counter can answer the question whether the overlapping
// replica is gc'able or not. If it is not gc'able, then by definition the
// replica applying the merge is.
optional int64 generation = 6;
}
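
The split and merge rules described in the comment above can be sketched in isolation. This is a minimal illustration, not the code from this commit: `rangeDesc`, `split`, and `merge` are hypothetical stand-ins for `roachpb.RangeDescriptor` and the admin split/merge paths, and keys are plain strings for brevity. It reproduces the two worked examples from the comment.

```go
package main

import "fmt"

// rangeDesc is a simplified, hypothetical stand-in for roachpb.RangeDescriptor.
type rangeDesc struct {
	StartKey, EndKey string
	Generation       int64
}

// split divides d at splitKey. The left hand side's generation is incremented
// and the resulting value is copied to the newly created right hand side.
func split(d rangeDesc, splitKey string) (left, right rangeDesc) {
	gen := d.Generation + 1
	left = rangeDesc{StartKey: d.StartKey, EndKey: splitKey, Generation: gen}
	right = rangeDesc{StartKey: splitKey, EndKey: d.EndKey, Generation: gen}
	return left, right
}

// merge subsumes right into left. The merged generation exceeds by one the
// maximum of the two pre-merge generations.
func merge(left, right rangeDesc) rangeDesc {
	gen := left.Generation
	if right.Generation > gen {
		gen = right.Generation
	}
	return rangeDesc{StartKey: left.StartKey, EndKey: right.EndKey, Generation: gen + 1}
}

func main() {
	// [a,c) at generation 5 splits into [a,b) and [b,c), both at generation 6.
	l, r := split(rangeDesc{StartKey: "a", EndKey: "c", Generation: 5}, "b")
	fmt.Println(l.Generation, r.Generation)

	// [c,d) at generation 12 merged with [d,f) at generation 17 gives
	// [c,f) at generation 18.
	m := merge(
		rangeDesc{StartKey: "c", EndKey: "d", Generation: 12},
		rangeDesc{StartKey: "d", EndKey: "f", Generation: 17},
	)
	fmt.Println(m.StartKey, m.EndKey, m.Generation)
}
```

In both operations every resulting descriptor has a strictly larger generation than any descriptor it covers, which is the invariant the overlap argument relies on.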

11 changes: 10 additions & 1 deletion pkg/storage/replica_command.go
@@ -237,6 +237,11 @@ func (r *Replica) adminSplitWithDescriptor(
leftDesc.IncrementGeneration()
leftDesc.EndKey = splitKey

// Set the generation of the right hand side descriptor to match that of the
// (updated) left hand side. See the comment on the field for an explanation
// of why generations are useful.
rightDesc.Generation = leftDesc.Generation

var extra string
if delayable {
extra += maybeDelaySplitToAvoidSnapshot(ctx, (*splitDelayHelper)(r))
@@ -379,7 +384,11 @@ func (r *Replica) AdminMerge(
if err := r.store.DB().GetProto(ctx, rightDescKey, &rightDesc); err != nil {
return reply, roachpb.NewError(err)
}

// lhs.Generation = max(rhs.Generation, lhs.Generation)+1.
// See the comment on the Generation field for why generations are useful.
if updatedLeftDesc.GetGeneration() < rightDesc.GetGeneration() {
updatedLeftDesc.Generation = rightDesc.Generation
}
updatedLeftDesc.IncrementGeneration()
updatedLeftDesc.EndKey = rightDesc.EndKey
log.Infof(ctx, "initiating a merge of %s into this range (%s)", rightDesc, reason)
2 changes: 2 additions & 0 deletions pkg/storage/store_snapshot.go
@@ -419,6 +419,8 @@ func (s *Store) canApplySnapshot(
func (s *Store) canApplySnapshotLocked(
ctx context.Context, snapHeader *SnapshotRequest_Header, authoritative bool,
) (*ReplicaPlaceholder, error) {
// TODO(tbg): see the comment on desc.Generation for what seems to be a much
// saner way to handle overlap via generational semantics.
desc := *snapHeader.State.Desc

// First, check for an existing Replica.
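
The TODO above points at the generation-based overlap handling described on `desc.Generation`. This commit does not implement it, so the following is only a rough sketch of what such a check might look like. The names `verdict`, `applySnapshot`, `discardSnapshot`, and `decide` are hypothetical, and the sketch assumes every descriptor involved was written under the new generation rules (the migration concern raised in the field comment).

```go
package main

import "fmt"

// verdict is a hypothetical result type for the overlap check.
type verdict int

const (
	applySnapshot verdict = iota
	discardSnapshot
)

// decide compares the generation of an incoming snapshot's descriptor with
// the generations of the replicas it overlaps. It returns the verdict and,
// when the snapshot may be applied, the indexes of the overlapping replicas
// that are stale and could be replicaGC'ed.
func decide(snapGen int64, overlapping []int64) (verdict, []int) {
	var stale []int
	for i, g := range overlapping {
		switch {
		case g > snapGen:
			// A newer overlapping replica means the snapshot itself is stale.
			return discardSnapshot, nil
		case g < snapGen:
			// An older overlapping replica is stale.
			stale = append(stale, i)
		}
		// Equal generations: this is a replica of the range that the
		// snapshot addresses, so neither side is stale.
	}
	return applySnapshot, stale
}

func main() {
	v, stale := decide(6, []int64{5, 6})
	fmt.Println(v == applySnapshot, stale)

	v, _ = decide(6, []int64{7})
	fmt.Println(v == discardSnapshot)
}
```

A real implementation would also have to remove the stale replicas safely before applying the snapshot, which this sketch leaves out.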