
storage: widening Raft snapshot spanning merge may clobber newer range via subsumption #36611

Closed
tbg opened this issue Apr 8, 2019 · 9 comments
Labels: A-kv-replication (Relating to Raft, consensus, and coordination.), C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.)

tbg commented Apr 8, 2019

I was thinking today about the safety properties of preemptive snapshots, learner snapshots, and Raft snapshots (for #35786 and #35787), with and without merges, and I arrived at an example that I think we mishandle. The TL;DR is that I think we'll accept a Raft snapshot that extends an initialized replica and "overwrites" a newer right-hand side that was merged in but then split out again (and that may have new data).

  • initially s{1,2,3} have r1=[a,c)
  • the replica on s2 is removed
  • r1 merges its right neighbor r2=[c,d)
  • s2/r1 is re-added (no preemptive snap)
    (so it needs a Raft snapshot)
  • a Raft snapshot is generated (say on s1)
  • r1 splits into r1=[a,c) and r3=[c,d)
  • s2 automatically has a replica s2/r3 which catches up ([c,d) was empty on s2)
    (s2 still has its stale replica for r1=[a,c) and an incoming Raft snapshot to it with bounds [a,d))
  • a number of commands are proposed on r3 (in particular, they commit and apply on s2/r3).
  • the snapshot for r1 is fully received at s2. We want this snapshot to be rejected,
    for it spans [a,d) and thus overlaps the replica of r3=[c,d), which has new writes
  • but I think we would actually apply the snapshot: the existing replica s2/r1 is initialized,
    so we hit the early return in 1. Applying the snapshot, we GC the RHS (s2/r3) and write a
    tombstone with nextReplicaID=infinity, effectively taking that replica down. The LHS then
    catches up across the split trigger and likely crashes the node permanently, because the RHS
    replica can't be instantiated any more thanks to the range tombstone.

This problem would be avoided if (edit: this statement about generations is wrong) we checked the generations of all subsumed replicas before handing the snapshot to Raft (i.e. in canApplySnapshot), dropping the snapshot unless they're all smaller than the snapshot's generation. The problem would also be avoided if we didn't allow existing replicas to change their replicaID in place. That is, upon trying to apply the last snapshot, the existing replica would be gc'ed before checking whether the snapshot could be applied to a new, uninitialized replica. This uninitialized replica would then reject the snapshot based on its overlapping another replica. In both cases the result would be another snapshot being sent.
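
To make the first option concrete, here's a minimal sketch of such a check. The names (keySpan, replicaMeta, canApplyWideningSnapshot) are made up for illustration and this is not the actual CockroachDB code; also, per the edit above, the generation comparison is only sound under the improved generation semantics later proposed in #36654 (splits bump both sides, merges take the max plus one).

package snapcheck

import "bytes"

// keySpan is a simplified key range [Start, End).
type keySpan struct {
	Start, End []byte
}

func (s keySpan) overlaps(o keySpan) bool {
	return bytes.Compare(s.Start, o.End) < 0 && bytes.Compare(o.Start, s.End) < 0
}

// replicaMeta is a stand-in for the per-replica state the check needs.
type replicaMeta struct {
	RangeID    int64
	Span       keySpan
	Generation int64
}

// canApplyWideningSnapshot reports whether a snapshot for target's range,
// covering snapSpan at generation snapGen, may be applied given the other
// replicas already present on the store.
func canApplyWideningSnapshot(target replicaMeta, snapSpan keySpan, snapGen int64, existing []replicaMeta) bool {
	for _, r := range existing {
		if r.RangeID == target.RangeID {
			continue // the replica the snapshot is addressed to
		}
		if !r.Span.overlaps(snapSpan) {
			continue
		}
		// An overlapping replica whose generation is at least the snapshot's
		// may hold writes the snapshot does not reflect (s2/r3 in the example
		// above); refuse, and wait for a snapshot that postdates the current
		// range boundaries.
		if r.Generation >= snapGen {
			return false
		}
	}
	return true
}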

I'd appreciate it if someone (@bdarnell or @nvanbenschoten?) gave this a close reading to check my understanding.

tbg added the C-bug and A-kv-replication labels Apr 8, 2019
tbg self-assigned this Apr 8, 2019
tbg commented Apr 8, 2019

On a meta level, I think the comment that "Raft snapshots to initialized replicas can't be refused", which first originated in https://github.com/cockroachdb/cockroach/pull/28683/files, isn't quite right. We can always refuse a Raft snapshot; the question is whether there will ever be a snapshot for that range that we won't drop.

I think there's really only one situation in which refusing the snapshot would put us in a loop, and it's the situation outlined in b728fcd:

Raft snapshots can "widen" an existing replica. Consider a merge where
one of the LHS replicas goes offline immediately before the merge
commits. Before this replica comes back online, the leader truncates the
log. When this replica comes back online, it will receive a Raft snapshot
that postdates the merge application.

To apply this snapshot, the receiving replica needs to subsume its copy
of the RHS during snapshot application. Note that it is guaranteed to
have a copy of the RHS, as merges cannot commit unless the ranges are
collocated and the replica GC queue is careful to never GC a replica
that could be required for subsumption.

Note that in this scenario the replicaID didn't change.
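
For illustration, here is a rough sketch of the coverage invariant this relies on (hypothetical types and function names, not the real maybeAcquireSnapshotMergeLock): the keyspace the widening snapshot adds must already be covered, without gaps, by local replicas of the merged-in ranges, and those are the replicas that get subsumed while applying the snapshot.

package subsume

import (
	"bytes"
	"fmt"
)

// keySpan is a simplified key range [Start, End).
type keySpan struct {
	Start, End []byte
}

// storeReplica is a stand-in for a replica present on the local store.
type storeReplica struct {
	RangeID int64
	Span    keySpan
}

// subsumedReplicas collects the local replicas covering the keyspace the
// snapshot gains, i.e. [oldEnd, newEnd), in key order. byStartKey must be
// sorted by Span.Start. If the gained keyspace is not fully covered, the
// merge invariant quoted above (replicas collocated at merge time, GC queue
// never removing a replica that may be needed for subsumption) was violated
// and the snapshot cannot be applied.
func subsumedReplicas(oldEnd, newEnd []byte, byStartKey []storeReplica) ([]storeReplica, error) {
	if bytes.Equal(oldEnd, newEnd) {
		return nil, nil // snapshot does not widen the replica
	}
	var out []storeReplica
	cur := oldEnd
	for _, r := range byStartKey {
		if !bytes.Equal(r.Span.Start, cur) {
			continue // not adjacent to the keyspace covered so far
		}
		if bytes.Compare(r.Span.End, newEnd) > 0 {
			break // extends past the snapshot, so it can't be subsumed
		}
		out = append(out, r)
		cur = r.Span.End
		if bytes.Equal(cur, newEnd) {
			return out, nil // gained keyspace fully covered
		}
	}
	return nil, fmt.Errorf("keyspace [%q, %q) not covered by local replicas", cur, newEnd)
}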

tbg commented Apr 8, 2019

I was wrong in thinking that the generations are helpful here. They only make sense when comparing generations for the same rangeID, because the RHS of a split always starts out at generation zero. If the RHS inherited the generation of the LHS, maybe something can work but I haven't thought about it.

bdarnell commented Apr 8, 2019

That looks right to me.

My initial sense is that the in-place changes of replica ID have always been a little sketchy and we should aim to eliminate them as much as possible. We should try to go through a GC/snapshot cycle instead of reusing the data that is already there. (Much of the need for in-place replica ID changes came from a time when the GC/snapshot processes were more expensive, and replicas that needed GC would linger for longer.)

nvanbenschoten commented:

> If the RHS inherited the generation of the LHS, maybe something can work but I haven't thought about it.

I figured this was already how this worked. I would expect that both sides of a split would get the generation lhs.gen + 1 and the result of a merge would get the generation max(lhs.gen, rhs.gen)+1. This way, we could compare the generations of any key ranges that overlap to determine which is new and which is stale. We could then immediately GC stale ranges and reject stale snapshots. I haven't thought about this enough to understand the challenges here, though.
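
Roughly, in code (rangeDesc is an illustrative stand-in for the real RangeDescriptor; this mirrors the semantics described here and proposed in #36654 below):

package generation

type rangeDesc struct {
	StartKey, EndKey []byte
	Generation       int64
}

// splitDesc gives both halves of a split the generation lhs.Generation+1,
// so either half compares as newer than any stale, wider copy of the
// pre-split range.
func splitDesc(lhs rangeDesc, splitKey []byte) (left, right rangeDesc) {
	gen := lhs.Generation + 1
	left = rangeDesc{StartKey: lhs.StartKey, EndKey: splitKey, Generation: gen}
	right = rangeDesc{StartKey: splitKey, EndKey: lhs.EndKey, Generation: gen}
	return left, right
}

// mergeDesc gives the merged range max(lhs.Generation, rhs.Generation)+1,
// so it compares as newer than either input; overlapping replicas (or an
// incoming snapshot vs. an existing replica) can then be ordered by
// generation to decide which is stale.
func mergeDesc(lhs, rhs rangeDesc) rangeDesc {
	gen := lhs.Generation
	if rhs.Generation > gen {
		gen = rhs.Generation
	}
	return rangeDesc{StartKey: lhs.StartKey, EndKey: rhs.EndKey, Generation: gen + 1}
}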

tbg added a commit to tbg/cockroach that referenced this issue Apr 9, 2019
by propagating the new generation of the LHS of a split to the RHS and
by taking into account the generation of the RHS on merges, we can
compare generations between overlapping replicas to decide which one
is stale.

If we allow upgrading from 19.1-rcX into 19.1 in a production
setting, we won't be able to use these semantics without a separate
migration that puts additional state on the range descriptor (which
would be nice to avoid).

See cockroachdb#36611 for context.

Release note: None
tbg commented Apr 9, 2019

I sent PR #36654 to change the semantics -- I hope we can get this in for 19.1. Let's discuss there.

craig bot pushed a commit that referenced this issue Apr 10, 2019
36654: storage: improve semantics of desc.Generation r=bdarnell,nvanbenschoten a=tbg

by propagating the new generation of the LHS of a split to the RHS and
by taking into account the generation of the RHS on merges, we can
compare generations between overlapping replicas to decide which one
is stale.

If we allow upgrading from 19.1-rcX into 19.1 in a production
setting, we won't be able to use these semantics without a separate
migration that puts additional state on the range descriptor (which
would be nice to avoid).

@bdarnell the above supposes that we *will* backport this to 19.1
before the release. I do hope that this is possible because the
new semantics are so much better and not doing it now means we
have to fight a bit of an uphill battle to get to the point where
we know they're true. Let me know what you think.

See #36611 for context.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
tbg added a commit to tbg/cockroach that referenced this issue Apr 10, 2019
by propagating the new generation of the LHS of a split to the RHS and
by taking into account the generation of the RHS on merges, we can
compare generations between overlapping replicas to decide which one
is stale.

If we allow upgrading from 19.1-rcX into 19.1 in a production
setting, we won't be able to use these semantics without a separate
migration that puts additional state on the range descriptor (which
would be nice to avoid).

See cockroachdb#36611 for context.

Release note: None
tbg commented Jun 20, 2019

cc @danhhz, this is the issue I talked about yesterday but couldn't find on the spot

tbg changed the title from "storage: widening Raft snapshot may clobber newer range via subsumption" to "storage: widening Raft snapshot spanning merges may clobber newer range via subsumption" Jul 25, 2019
tbg changed the title from "storage: widening Raft snapshot spanning merges may clobber newer range via subsumption" to "storage: widening Raft snapshot spanning merge may clobber newer range via subsumption" Jul 25, 2019
tbg commented Jul 25, 2019

Here's the example in schematic form; I find this easier to follow along with. (S1-S6 are successive states, n1-n3 are the nodes, and the numbers in the boxes are range IDs.)

   a      b      c      d
S1
n1 |--1----------|-2----|
n2 |--1--(rmvd)--|
n3 |--1----------|-2----|

S2
n1 |--1----------|-2----|
n2 |--1--(rmvd)--|
n3 |--1----------|-2----|

S3
n1 |--1-----------------|
n2 |--1--(rmvd)--|        <-- r1@S3 snap incoming
n3 |--1-----------------|

S4
n1 |--1----------|-3----|
n2 |--1--(rmvd)--|        <-- r1@S3 snap incoming
n3 |--1----------|-3----|

S5
n1 |--1----------|-3----|
n2 |--1--(rmvd)--|-3----| <-- r1@S3 snap incoming
n3 |--1----------|-3----|

S6
n1 |--1----------|-3----|
n2 |--r1@S3-------------| <-- data lost from r3
n3 |--1----------|-3----|

The code that erroneously accepts r1@S3 is

if existingIsInitialized {
	if !snapHeader.IsPreemptive() {
		// Regular Raft snapshots can't be refused at this point,
		// even if they widen the existing replica. See the comments
		// in Replica.maybeAcquireSnapshotMergeLock for how this is
		// made safe.
		//
		// NB: we expect the replica to know its replicaID at this point
		// (i.e. !existingIsPreemptive), though perhaps it's possible
		// that this isn't true if the leader initiates a Raft snapshot
		// (that would provide a range descriptor with this replica in
		// it) but this node reboots (temporarily forgetting its
		// replicaID) before the snapshot arrives.
		return nil, nil

Looking at the example, one could argue that maybe the creation of r3@S3 should have been blocked instead, but it can only be blocked once we know that we're receiving r1@S3, which we simply may not know for a while (for the usual distributed systems reasons).
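
To tie this back to the schematic, here's a toy, self-contained version of the generation check applied to n2's state at S5 (all names are invented for illustration; keys are single letters, and generations follow the #36654 semantics sketched earlier: r1 and r2 start at 1, the merge at S3 produces gen 2, the split at S4 produces gen 3). The incoming r1@S3 snapshot carries a lower generation than n2's r3, so it would be dropped instead of hitting the early return above.

package main

import "fmt"

type rep struct {
	rangeID    int
	start, end string // key span [start, end)
	generation int
}

func overlaps(a, b rep) bool {
	return a.start < b.end && b.start < a.end
}

func main() {
	// n2's replicas at S5.
	onN2 := []rep{
		{rangeID: 1, start: "a", end: "c", generation: 1}, // stale pre-merge r1
		{rangeID: 3, start: "c", end: "d", generation: 3}, // r3, has applied new writes
	}
	// The r1@S3 snapshot: spans [a,d), generated just after the merge.
	snap := rep{rangeID: 1, start: "a", end: "d", generation: 2}

	for _, r := range onN2 {
		if r.rangeID == snap.rangeID || !overlaps(snap, r) {
			continue
		}
		if r.generation >= snap.generation {
			fmt.Printf("reject r%d snapshot [%s,%s)@gen%d: overlaps r%d@gen%d\n",
				snap.rangeID, snap.start, snap.end, snap.generation,
				r.rangeID, r.generation)
			return
		}
	}
	fmt.Println("snapshot accepted")
}

// Output: reject r1 snapshot [a,d)@gen2: overlaps r3@gen3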

nvanbenschoten commented:

I believe this issue was solved by @ajwerner somewhere in the combination of 2020115 and b2850e1. We no longer allow replicas to change their replica IDs in place (as of 19.2) and we also no longer support preemptive snapshots (as of 20.1).

Should we close this?

ajwerner commented:

Yes
