
storage: widening Raft snapshot spanning merge may clobber newer range via subsumption #36611

Closed
tbg opened this issue Apr 8, 2019 · 9 comments
Labels: A-kv-replication (Relating to Raft, consensus, and coordination.), C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.)

tbg commented Apr 8, 2019

I was thinking today about the safety properties of preemptive snapshots, learner snapshots, and Raft snapshots (for #35786 and #35787), with and without merges, and I arrived at an example that I think we mishandle. The TL;DR is that I think we'll accept a Raft snapshot that extends an initialized replica and "overwrites" a newer right-hand side that was merged in but then split out again (and that may have new data).

  • initially s{1,2,3} have r1=[a,c)
  • the replica on s2 is removed
  • r1 merges its right neighbor r2=[c,d)
  • s2/r1 is re-added (no preemptive snap)
    (so it needs a Raft snapshot)
  • a Raft snapshot is generated (say on s1)
  • r1 splits into r1=[a,c) and r3=[c,d)
  • s2 automatically has a replica s2/r3 which catches up ([c,d) was empty on s2)
    (s2 still has its stale replica for r1=[a,c) and an incoming Raft snapshot to it with bounds [a,d))
  • a number of commands are proposed on r3 (in particular, they commit and apply on s2/r3).
  • the snapshot for r1 is fully received at s2. We want this snapshot to be rejected,
    for it spans [a,d) and thus overlaps the replica of r3=[c,d), which has new writes
  • but I think we would actually apply the snapshot: the existing replica s2/r1 is initialized,
    so we hit the early return in 1. Applying the snapshot, we GC the RHS (s2/r3) and write a
    tombstone with nextReplicaID=infinity, effectively taking that replica down. The LHS then
    catches up across the split trigger and likely crashes the node permanently, because the RHS
    replica can't be instantiated any more thanks to the range tombstone.

This problem would be avoided if (edit: this statement about generations is wrong) we checked the generations of all subsumed replicas before handing the snapshot to Raft (i.e. in canApplySnapshot), dropping the snapshot unless they're all smaller than the snapshot's generation. The problem would also be avoided if we didn't allow existing replicas to change their replicaID in place. That is, upon trying to apply the last snapshot, the existing replica would be gc'ed before checking whether the snapshot could be applied to a new, uninitialized replica. This uninitialized replica would then reject the snapshot based on its overlapping another replica. In both cases the result would be another snapshot being sent.
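
To make the first option concrete, here's a minimal sketch of such a check. The names (keySpan, replicaMeta, canApplyWideningSnapshot) are made up for illustration and this is not the actual CockroachDB code; also, per the edit above, the generation comparison is only sound under the improved generation semantics later proposed in #36654 (splits bump both sides, merges take the max plus one).

package snapcheck

import "bytes"

// keySpan is a simplified key range [Start, End).
type keySpan struct {
	Start, End []byte
}

func (s keySpan) overlaps(o keySpan) bool {
	return bytes.Compare(s.Start, o.End) < 0 && bytes.Compare(o.Start, s.End) < 0
}

// replicaMeta is a stand-in for the per-replica state the check needs.
type replicaMeta struct {
	RangeID    int64
	Span       keySpan
	Generation int64
}

// canApplyWideningSnapshot reports whether a snapshot for target's range,
// covering snapSpan at generation snapGen, may be applied given the other
// replicas already present on the store.
func canApplyWideningSnapshot(target replicaMeta, snapSpan keySpan, snapGen int64, existing []replicaMeta) bool {
	for _, r := range existing {
		if r.RangeID == target.RangeID {
			continue // the replica the snapshot is addressed to
		}
		if !r.Span.overlaps(snapSpan) {
			continue
		}
		// An overlapping replica whose generation is at least the snapshot's
		// may hold writes the snapshot does not reflect (s2/r3 in the example
		// above); refuse, and wait for a snapshot that postdates the current
		// range boundaries.
		if r.Generation >= snapGen {
			return false
		}
	}
	return true
}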

I'd appreciate it if someone (@bdarnell or @nvanbenschoten?) gave this a close reading to check my understanding.

tbg added the C-bug and A-kv-replication labels Apr 8, 2019
tbg self-assigned this Apr 8, 2019
tbg commented Apr 8, 2019

On a meta level, I think the comment that "Raft snapshots to initialized replicas can't be refused", which first originated in https://github.com/cockroachdb/cockroach/pull/28683/files, isn't quite right. We can always refuse a Raft snapshot; the question is whether there will ever be a snapshot for that range that we won't drop.

I think there's really only one situation in which refusing the snapshot would put us in a loop, and it's the situation outlined in b728fcd:

Raft snapshots can "widen" an existing replica. Consider a merge where
one of the LHS replicas goes offline immediately before the merge
commits. Before this replica comes back online, the leader truncates the
log. When this replica comes back online, it will receive a Raft snapshot
that postdates the merge application.

To apply this snapshot, the receiving replica needs to subsume its copy
of the RHS during snapshot application. Note that it is guaranteed to
have a copy of the RHS, as merges cannot commit unless the ranges are
collocated and the replica GC queue is careful to never GC a replica
that could be required for subsumption.

Note that in this scenario the replicaID didn't change.
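
For illustration, here is a rough sketch of the coverage invariant this relies on (hypothetical types and function names, not the real maybeAcquireSnapshotMergeLock): the keyspace the widening snapshot adds must already be covered, without gaps, by local replicas of the merged-in ranges, and those are the replicas that get subsumed while applying the snapshot.

package subsume

import (
	"bytes"
	"fmt"
)

// keySpan is a simplified key range [Start, End).
type keySpan struct {
	Start, End []byte
}

// storeReplica is a stand-in for a replica present on the local store.
type storeReplica struct {
	RangeID int64
	Span    keySpan
}

// subsumedReplicas collects the local replicas covering the keyspace the
// snapshot gains, i.e. [oldEnd, newEnd), in key order. byStartKey must be
// sorted by Span.Start. If the gained keyspace is not fully covered, the
// merge invariant quoted above (replicas collocated at merge time, GC queue
// never removing a replica that may be needed for subsumption) was violated
// and the snapshot cannot be applied.
func subsumedReplicas(oldEnd, newEnd []byte, byStartKey []storeReplica) ([]storeReplica, error) {
	if bytes.Equal(oldEnd, newEnd) {
		return nil, nil // snapshot does not widen the replica
	}
	var out []storeReplica
	cur := oldEnd
	for _, r := range byStartKey {
		if !bytes.Equal(r.Span.Start, cur) {
			continue // not adjacent to the keyspace covered so far
		}
		if bytes.Compare(r.Span.End, newEnd) > 0 {
			break // extends past the snapshot, so it can't be subsumed
		}
		out = append(out, r)
		cur = r.Span.End
		if bytes.Equal(cur, newEnd) {
			return out, nil // gained keyspace fully covered
		}
	}
	return nil, fmt.Errorf("keyspace [%q, %q) not covered by local replicas", cur, newEnd)
}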

tbg commented Apr 8, 2019

I was wrong in thinking that the generations are helpful here. They only make sense when comparing generations for the same rangeID, because the RHS of a split always starts out at generation zero. If the RHS inherited the generation of the LHS, maybe something can work but I haven't thought about it.

bdarnell commented Apr 8, 2019

That looks right to me.

My initial sense is that the in-place changes of replica ID have always been a little sketchy and we should aim to eliminate them as much as possible. We should try to go through a GC/snapshot cycle instead of reusing the data that is already there. (Much of the need for in-place replica ID changes came from a time when the GC/snapshot processes were more expensive, and replicas that needed GC would linger for longer.)

nvanbenschoten commented:

> If the RHS inherited the generation of the LHS, maybe something can work but I haven't thought about it.

I figured this was already how this worked. I would expect that both sides of a split would get the generation lhs.gen + 1 and the result of a merge would get the generation max(lhs.gen, rhs.gen)+1. This way, we could compare the generations of any key ranges that overlap to determine which is new and which is stale. We could then immediately GC stale ranges and reject stale snapshots. I haven't thought about this enough to understand the challenges here, though.
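
Roughly, in code (rangeDesc is an illustrative stand-in for the real RangeDescriptor; this mirrors the semantics described here and proposed in #36654 below):

package generation

type rangeDesc struct {
	StartKey, EndKey []byte
	Generation       int64
}

// splitDesc gives both halves of a split the generation lhs.Generation+1,
// so either half compares as newer than any stale, wider copy of the
// pre-split range.
func splitDesc(lhs rangeDesc, splitKey []byte) (left, right rangeDesc) {
	gen := lhs.Generation + 1
	left = rangeDesc{StartKey: lhs.StartKey, EndKey: splitKey, Generation: gen}
	right = rangeDesc{StartKey: splitKey, EndKey: lhs.EndKey, Generation: gen}
	return left, right
}

// mergeDesc gives the merged range max(lhs.Generation, rhs.Generation)+1,
// so it compares as newer than either input; overlapping replicas (or an
// incoming snapshot vs. an existing replica) can then be ordered by
// generation to decide which is stale.
func mergeDesc(lhs, rhs rangeDesc) rangeDesc {
	gen := lhs.Generation
	if rhs.Generation > gen {
		gen = rhs.Generation
	}
	return rangeDesc{StartKey: lhs.StartKey, EndKey: rhs.EndKey, Generation: gen + 1}
}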

tbg added a commit to tbg/cockroach that referenced this issue Apr 9, 2019
by propagating the new generation of the LHS of a split to the RHS and
by taking into account the generation of the RHS on merges, we can
compare generations between overlapping replicas to decide which one
is stale.

If we allow upgrading from 19.1-rcX into 19.1 in a production
setting, we won't be able to use these semantics without a separate
migration that puts additional state on the range descriptor (which
would be nice to avoid).

See cockroachdb#36611 for context.

Release note: None
tbg commented Apr 9, 2019

I sent PR #36654 to change the semantics -- I hope we can get this in for 19.1. Let's discuss there.

craig bot pushed a commit that referenced this issue Apr 10, 2019
36654: storage: improve semantics of desc.Generation r=bdarnell,nvanbenschoten a=tbg

by propagating the new generation of the LHS of a split to the RHS and
by taking into account the generation of the RHS on merges, we can
compare generations between overlapping replicas to decide which one
is stale.

If we allow upgrading from 19.1-rcX into 19.1 in a production
setting, we won't be able to use these semantics without a separate
migration that puts additional state on the range descriptor (which
would be nice to avoid).

@bdarnell the above supposes that we *will* backport this to 19.1
before the release. I do hope that this is possible because the
new semantics are so much better and not doing it now means we
have to fight a bit of an uphill battle to get to the point where
we know they're true. Let me know what you think.

See #36611 for context.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
tbg added a commit to tbg/cockroach that referenced this issue Apr 10, 2019
by propagating the new generation of the LHS of a split to the RHS and
by taking into account the generation of the RHS on merges, we can
compare generations between overlapping replicas to decide which one
is stale.

If we allow upgrading from 19.1-rcX into 19.1 in a production
setting, we won't be able to use these semantics without a separate
migration that puts additional state on the range descriptor (which
would be nice to avoid).

See cockroachdb#36611 for context.

Release note: None
tbg commented Jun 20, 2019

cc @danhhz, this is the issue I talked about yesterday but couldn't find on the spot

tbg changed the title from "storage: widening Raft snapshot may clobber newer range via subsumption" to "storage: widening Raft snapshot spanning merges may clobber newer range via subsumption" Jul 25, 2019
tbg changed the title from "storage: widening Raft snapshot spanning merges may clobber newer range via subsumption" to "storage: widening Raft snapshot spanning merge may clobber newer range via subsumption" Jul 25, 2019
tbg commented Jul 25, 2019

Here's the example in schematic form; I find this easier to follow along with. (S1-S6 are successive states, n1-n3 are the nodes, and the numbers in the boxes are range IDs.)

   a      b      c      d
S1
n1 |--1----------|-2----|
n2 |--1--(rmvd)--|
n3 |--1----------|-2----|

S2
n1 |--1----------|-2----|
n2 |--1--(rmvd)--|
n3 |--1----------|-2----|

S3
n1 |--1-----------------|
n2 |--1--(rmvd)--|        <-- r1@S3 snap incoming
n3 |--1-----------------|

S4
n1 |--1----------|-3----|
n2 |--1--(rmvd)--|        <-- r1@S3 snap incoming
n3 |--1----------|-3----|

S5
n1 |--1----------|-3----|
n2 |--1--(rmvd)--|-3----| <-- r1@S3 snap incoming
n3 |--1----------|-3----|

S6
n1 |--1----------|-3----|
n2 |--r1@S3-------------| <-- data lost from r3
n3 |--1----------|-3----|

The code that erroneously accepts r1@S3 is

if existingIsInitialized {
	if !snapHeader.IsPreemptive() {
		// Regular Raft snapshots can't be refused at this point,
		// even if they widen the existing replica. See the comments
		// in Replica.maybeAcquireSnapshotMergeLock for how this is
		// made safe.
		//
		// NB: we expect the replica to know its replicaID at this point
		// (i.e. !existingIsPreemptive), though perhaps it's possible
		// that this isn't true if the leader initiates a Raft snapshot
		// (that would provide a range descriptor with this replica in
		// it) but this node reboots (temporarily forgetting its
		// replicaID) before the snapshot arrives.
		return nil, nil

Looking at the example, one could argue that maybe the creation of r3@S3 should have been blocked instead, but it can only be blocked once we know that we're receiving r1@S3, which we simply may not know for a while (for the usual distributed systems reasons).
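
To tie this back to the schematic, here's a toy, self-contained version of the generation check applied to n2's state at S5 (all names are invented for illustration; keys are single letters, and generations follow the #36654 semantics sketched earlier: r1 and r2 start at 1, the merge at S3 produces gen 2, the split at S4 produces gen 3). The incoming r1@S3 snapshot carries a lower generation than n2's r3, so it would be dropped instead of hitting the early return above.

package main

import "fmt"

type rep struct {
	rangeID    int
	start, end string // key span [start, end)
	generation int
}

func overlaps(a, b rep) bool {
	return a.start < b.end && b.start < a.end
}

func main() {
	// n2's replicas at S5.
	onN2 := []rep{
		{rangeID: 1, start: "a", end: "c", generation: 1}, // stale pre-merge r1
		{rangeID: 3, start: "c", end: "d", generation: 3}, // r3, has applied new writes
	}
	// The r1@S3 snapshot: spans [a,d), generated just after the merge.
	snap := rep{rangeID: 1, start: "a", end: "d", generation: 2}

	for _, r := range onN2 {
		if r.rangeID == snap.rangeID || !overlaps(snap, r) {
			continue
		}
		if r.generation >= snap.generation {
			fmt.Printf("reject r%d snapshot [%s,%s)@gen%d: overlaps r%d@gen%d\n",
				snap.rangeID, snap.start, snap.end, snap.generation,
				r.rangeID, r.generation)
			return
		}
	}
	fmt.Println("snapshot accepted")
}

// Output: reject r1 snapshot [a,d)@gen2: overlaps r3@gen3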

nvanbenschoten commented:

I believe this issue was solved by @ajwerner somewhere in the combination of 2020115 and b2850e1. We no longer allow replicas to change their replica IDs in place (as of 19.2) and we also no longer support preemptive snapshots (as of 20.1).

Should we close this?

ajwerner commented:

Yes
