storage: widening Raft snapshot spanning merge may clobber newer range via subsumption #36611
On a meta level, I think the comment that "Raft snapshots to initialized replicas can't be refused", which originated in https://github.com/cockroachdb/cockroach/pull/28683/files, isn't quite right. We can always refuse a Raft snapshot; the question is whether there will ever be a snapshot for that range that we won't drop. I think there's really only one situation in which refusing the snapshot would leave us in a loop, and it's the situation outlined in b728fcd:
Note that in this scenario the replicaID didn't change.
I was wrong in thinking that the generations are helpful here. They generally only make sense when comparing generations for the same rangeID, because the RHS of a split always starts out at generation zero. If the RHS inherited the generation of the LHS, maybe something could work, but I haven't thought about it.
That looks right to me. My initial sense is that the in-place changes of replica ID have always been a little sketchy and we should aim to eliminate them as much as possible. We should try to go through a GC/snapshot cycle instead of reusing the data that is already there. (Much of the need for in-place replica ID changes came from a time when the GC/snapshot processes were more expensive and replicas that needed GC would linger for longer.)
I figured this was already how this worked. I would expect that both sides of a split would get the new generation.
By propagating the new generation of the LHS of a split to the RHS, and by taking into account the generation of the RHS on merges, we can compare generations between overlapping replicas to decide which one is stale. If we allow anyone to upgrade from 19.1-rcX to 19.1 in a production setting, we won't be able to use these semantics without a separate migration that puts additional state on the range descriptor (which would be nice to avoid). See cockroachdb#36611 for context.

Release note: None
I sent PR #36654 to change the semantics -- I hope we can get this in for 19.1. Let's discuss there.
36654: storage: improve semantics of desc.Generation r=bdarnell,nvanbenschoten a=tbg

By propagating the new generation of the LHS of a split to the RHS, and by taking into account the generation of the RHS on merges, we can compare generations between overlapping replicas to decide which one is stale. If we allow anyone to upgrade from 19.1-rcX to 19.1 in a production setting, we won't be able to use these semantics without a separate migration that puts additional state on the range descriptor (which would be nice to avoid).

@bdarnell the above supposes that we *will* backport this to 19.1 before the release. I do hope that this is possible, because the new semantics are so much better, and not doing it now means we have to fight a bit of an uphill battle to get to the point where we know they're true. Let me know what you think.

See #36611 for context.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
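To make the proposed semantics concrete, here is a minimal Go sketch. The types and functions are invented for illustration and stand in for the real descriptor code; in particular, bumping the merge generation by exactly one past the max of both sides is an assumption here.

```go
package main

import "fmt"

// rangeDesc is an illustrative stand-in for the range descriptor.
type rangeDesc struct {
	startKey, endKey string
	generation       int64
}

// split bumps the generation and propagates it to the RHS as well
// (previously the RHS always restarted at generation zero).
func split(d rangeDesc, key string) (lhs, rhs rangeDesc) {
	gen := d.generation + 1
	return rangeDesc{d.startKey, key, gen}, rangeDesc{key, d.endKey, gen}
}

// merge takes the RHS's generation into account, so the merged range's
// generation exceeds that of everything it now covers.
func merge(lhs, rhs rangeDesc) rangeDesc {
	gen := lhs.generation
	if rhs.generation > gen {
		gen = rhs.generation
	}
	return rangeDesc{lhs.startKey, rhs.endKey, gen + 1}
}

func main() {
	r1 := rangeDesc{"a", "c", 4}
	r2 := rangeDesc{"c", "d", 9}
	merged := merge(r1, r2)        // generation 10, above both inputs
	lhs, rhs := split(merged, "c") // both sides at generation 11
	fmt.Println(merged.generation, lhs.generation, rhs.generation)
}
```

With these rules, whenever two descriptors overlap, the one with the higher generation is the newer one, which is what lets overlapping replicas decide which of them is stale.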
cc @danhhz, this is the issue I talked about yesterday but couldn't find on the spot
Here's the example in schematic form; I find it easier to follow along with.
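Roughly, with S3 denoting the incoming snapshot as in the paragraph below (this sketch is reconstructed from the issue text, so the exact notation and layout are approximate):

```
s2, before applying the snapshot:
  r1 = [a,c)   <- stale, awaiting a Raft snapshot
  r3 = [c,d)   <- split back out after the merge; has new writes

incoming r1@S3 = [a,d)   <- generated after the merge, spans [a,d)

r1@S3 overlaps r3 = [c,d); accepting it subsumes and GC's r3,
clobbering its new writes.
```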
The code that erroneously accepts r1@S3 is pkg/storage/store_snapshot.go, lines 459 to 472 at f8be509.
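In essence, that code path accepts any snapshot addressed to an already-initialized replica without an overlap check. A self-contained toy model of that shape (all names here are illustrative, not CockroachDB's actual API):

```go
package main

import "fmt"

// Illustrative types; not CockroachDB's actual API.
type span struct{ start, end string }

func overlaps(a, b span) bool { return a.start < b.end && b.start < a.end }

type replica struct {
	rangeID     int
	bounds      span
	initialized bool
}

// canApplySnapshot models the early return: a snapshot addressed to an
// initialized replica is accepted without checking whether its (possibly
// widened) bounds overlap any other replica on the store.
func canApplySnapshot(store []replica, target *replica, snapBounds span) bool {
	if target.initialized {
		return true // the early return the issue flags as erroneous
	}
	for i := range store {
		if r := &store[i]; r != target && overlaps(r.bounds, snapBounds) {
			return false // only uninitialized replicas get the overlap check
		}
	}
	return true
}

func main() {
	store := []replica{
		{rangeID: 1, bounds: span{"a", "c"}, initialized: true}, // stale r1
		{rangeID: 3, bounds: span{"c", "d"}, initialized: true}, // r3, new writes
	}
	// The widening snapshot [a,d) overlaps r3 but is accepted anyway.
	fmt.Println(canApplySnapshot(store, &store[0], span{"a", "d"})) // true
}
```

In this toy, the widening snapshot [a,d) for r1 is accepted even though it overlaps r3 = [c,d), because the target replica is initialized and the overlap check is never reached.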
Looking at the example, one could argue that maybe the creation of r3@S3 should have been blocked instead, but it can only be blocked once we know that we're receiving r1@S3, which we simply may not know for a while (for the usual distributed-systems reasons).
Yes.
I was thinking today about the safety properties of preemptive snapshots, learner snapshots, and Raft snapshots (for #35786 and #35787), with and without merges, and I arrived at an example that I think we mishandle. The TL;DR is that I think we'll accept a Raft snapshot that extends an initialized replica and "overwrites" a newer right-hand side that was merged in but then split out again (and that may have new data).
In outline:
- s2's replica of r1 = [a,c) falls behind (so it needs a Raft snapshot).
- r1 merges in its right neighbor, widening to [a,d), and a Raft snapshot with bounds [a,d) is generated for s2.
- r1 = [a,d) then splits at c again, carving out r3 = [c,d), and s2 ends up with a replica of r3 that has new writes. At this point, s2 still has its stale replica for r1 = [a,c) and an incoming Raft snapshot to it with bounds [a,d).
- The snapshot ought to be refused, for it spans [a,d) and thus overlaps the replica of r3 = [c,d) which has new writes, and yet we hit the early return (the canApplySnapshot code referenced above). Applying the snapshot, we GC the RHS and write a tombstone with nextReplicaID=infinity, effectively taking that replica down. And then the LHS catches up across the split trigger and likely crashes the node permanently, because the RHS replica can't be instantiated any more thanks to the range tombstone.
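A toy model of the tombstone mechanics in that last step (illustrative names only; nextReplicaID=infinity is modeled as math.MaxInt32):

```go
package main

import (
	"fmt"
	"math"
)

// Illustrative model of the range tombstone.
type tombstone struct{ nextReplicaID int32 }

var tombstones = map[int]tombstone{} // rangeID -> tombstone

// gcSubsumedReplica models GC'ing r3 when the widening snapshot subsumes
// it: the tombstone blocks every future replicaID.
func gcSubsumedReplica(rangeID int) {
	tombstones[rangeID] = tombstone{nextReplicaID: math.MaxInt32}
}

// canCreateReplica models the check that later refuses to instantiate the
// RHS when the LHS catches up across the split trigger.
func canCreateReplica(rangeID int, replicaID int32) error {
	if ts, ok := tombstones[rangeID]; ok && replicaID < ts.nextReplicaID {
		return fmt.Errorf("r%d: replicaID %d below tombstone nextReplicaID %d",
			rangeID, replicaID, ts.nextReplicaID)
	}
	return nil
}

func main() {
	gcSubsumedReplica(3)
	// The split trigger tries to recreate r3 and fails, permanently.
	fmt.Println(canCreateReplica(3, 7))
}
```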
This problem would be avoided if (edit: this statement about generations is wrong) we checked the generations of all subsumed replicas before handing the snapshot to Raft (i.e. in canApplySnapshot), dropping the snapshot unless they're all smaller than the snapshot's generation. The problem would also be avoided if we didn't allow existing replicas to change their replicaID in place: if the replicaID changed since the last snapshot, the existing replica would be gc'ed before checking whether the snapshot could be applied to a new, uninitialized replica. This uninitialized replica would then reject the snapshot based on its overlapping another replica. In both cases the result would be another snapshot being sent.
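A sketch of the second avoidance strategy, under the same toy model as above (again, all names are invented for illustration):

```go
package main

import "fmt"

// Same illustrative types as the earlier sketch.
type span struct{ start, end string }

func overlaps(a, b span) bool { return a.start < b.end && b.start < a.end }

type replica struct {
	rangeID   int
	replicaID int32
	bounds    span
	init      bool
}

// admitSnapshot models the second strategy: if the snapshot addresses a
// different replicaID than the one on disk, GC the existing replica first,
// so the snapshot is evaluated against a fresh, uninitialized replica and
// therefore goes through the overlap check.
func admitSnapshot(store []replica, target *replica, snapReplicaID int32, snapBounds span) bool {
	if target.init && target.replicaID != snapReplicaID {
		target.init = false // model GC'ing the stale replica
	}
	if target.init {
		return true // same replicaID: a plain catch-up snapshot
	}
	for i := range store {
		if r := &store[i]; r != target && overlaps(r.bounds, snapBounds) {
			return false // overlaps r3 = [c,d): refuse; another snapshot is sent
		}
	}
	return true
}

func main() {
	store := []replica{
		{rangeID: 1, replicaID: 2, bounds: span{"a", "c"}, init: true},
		{rangeID: 3, replicaID: 5, bounds: span{"c", "d"}, init: true},
	}
	// A widening snapshot under a new replicaID is now refused.
	fmt.Println(admitSnapshot(store, &store[0], 7, span{"a", "d"})) // false
}
```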
I'd appreciate it if someone (@bdarnell or @nvanbenschoten?) gave this a close reading to check my understanding.