kv: set destroy status before destroying data on subsumed replicas #55477
Conversation
🎉 new invariant!
Reviewed 1 of 1 files at r1, 6 of 6 files at r2.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15)
especially now that we allow follower reads through on subsumed ranges as of #51594.
Note that #51594 only allows historical (i.e. old enough to be outside the uncertainty window of the subsumption) reads on the subsumed leaseholder.
To clarify, were we trying to solve the following hazard with this PR (comment snippet lifted from clearSubsumedReplicaDiskData):
// a b c d
// |---1---|-------2-------| S1
// |---1-------------------| S2
// |---1-----------|---3---| S3
//
// Since the merge is the first operation to happen, a follower could be down
// before it completes. It is reasonable for a snapshot for r1 from S3 to
// subsume both r1 and r2 in S1.
- S3 sends a snapshot for r1 to S1.
- We clear out the disk data for r2 on S1, but haven't applied the snapshot yet.
- S1 serves incorrect follower reads for r2 before the snapshot is applied.
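A minimal sketch of the ordering this hazard demands (illustrative Go under simplified assumptions, not CockroachDB's actual types — `replica`, `destroyStatus`, and the in-memory `data` map here are stand-ins): the destroy status must be set before any data of the subsumed replica is cleared, so a concurrent follower read observes the status and refuses to serve rather than reading partially deleted data.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// destroyReason models the replica's destroy status.
type destroyReason int

const (
	destroyReasonAlive destroyReason = iota
	destroyReasonRemoved
)

type replica struct {
	mu     sync.Mutex
	status destroyReason
	data   map[string]string
}

// serveFollowerRead models the read path: it checks the destroy status
// under the same lock the destroyer takes, so it can never observe
// cleared data without also observing the destroyed status.
func (r *replica) serveFollowerRead(key string) (string, error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.status == destroyReasonRemoved {
		return "", errors.New("replica destroyed; cannot serve read")
	}
	return r.data[key], nil
}

// destroy sets the status first, then clears data. Reversing these two
// steps reintroduces the window where a read sees cleared data.
func (r *replica) destroy() {
	r.mu.Lock()
	r.status = destroyReasonRemoved // step 1: mark destroyed
	r.data = nil                    // step 2: only now clear the data
	r.mu.Unlock()
}

func main() {
	r := &replica{data: map[string]string{"a": "1"}}
	v, err := r.serveFollowerRead("a")
	fmt.Println(v, err) // 1 <nil>
	r.destroy()
	_, err = r.serveFollowerRead("a")
	fmt.Println(err != nil) // true
}
```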
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/kv/kvserver/store_merge.go, line 55 at r2 (raw file):
if err := s.removeInitializedReplicaRaftMuLocked(ctx, rightRepl, rightDesc.NextReplicaID, RemoveOptions{
	// The replica was destroyed when the merge applied by to the tombstones
	// added to the batch in runPreApplyTriggersAfterStagingWriteBatch.
garbled sentence?
The TODO discussed a merge followed by a split and alluded to an optimization that could be used to avoid a snapshot. This doesn't make sense to me. I think that's because the comment was written back when a range merged into and then back out of its left-hand neighbor would be the same range with the same ID. That is no longer the case, so I don't see how we could ever (reasonably) avoid a snapshot in such a case.
This PR cleans up some handling around replica destruction that scared me when working on cockroachdb#46329 and cockroachdb#55293. Specifically, there were cases during merges where the destroy status on a replica would not be set until after that replica's data was destroyed. This was true of merges applied through entry application, although in those cases, no user data was removed, so it seems mostly harmless. More concerning is that it was true of merges applied through snapshot application. This was extra concerning because snapshot application can delete user data if a subsumed range is only partially covered (see `clearSubsumedReplicaDiskData`). So we were hypothetically risking user-visible correctness problems, especially now that we allow follower reads through on subsumed ranges as of cockroachdb#51594.

This PR patches up these issues and then adds a few extra assertions that enforce stricter preconditions for the functions at play during replica destruction. Specifically, we now assert that the destroy status is set to `destroyReasonRemoved` _before_ calling `preDestroyRaftMuLocked`. We also assert that if `RemoveOptions.DestroyData` is false when passed to `removeInitializedReplicaRaftMuLocked`, then the destroy status is also set.

This unification allows us to remove the `ignoreDestroyStatus` flag on `RemoveOptions`, whose meaning is now exactly the inverse of `DestroyData`. In hindsight, pushing on why this extra flag was needed and what new information it was conveying to the function could have caught these issues a little earlier.

I think we'll want to backport this to v20.2, though probably in the first patch release, because there is some risk here (especially without sufficient time to bake on master) and we aren't aware of any correctness issues arising from the combination of the bug fixed here and cockroachdb#51594.
Release note (bug fix): A hypothesized bug that could allow a follower read to miss data on a range in the middle of being merged away into its left-hand neighbor was fixed. The bug seems exceedingly unlikely to have materialized in practice.
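The stricter preconditions described above can be sketched roughly as follows (a hypothetical, simplified stand-in: the names mirror the PR's identifiers, but the types and bodies are illustrative, not CockroachDB's implementation):

```go
package main

import "fmt"

type destroyReason int

const (
	destroyReasonAlive destroyReason = iota
	destroyReasonRemoved
)

type replica struct{ status destroyReason }

// RemoveOptions no longer needs an ignoreDestroyStatus flag:
// DestroyData == false now implies the caller has already set the
// destroy status (its meaning is exactly the inverse of DestroyData).
type RemoveOptions struct{ DestroyData bool }

// preDestroyRaftMuLocked asserts that the destroy status was set
// before any data is touched.
func preDestroyRaftMuLocked(r *replica) {
	if r.status != destroyReasonRemoved {
		panic("destroy status not set before preDestroyRaftMuLocked")
	}
	// ... clear replicated data, write the tombstone, etc.
}

func removeInitializedReplicaRaftMuLocked(r *replica, opts RemoveOptions) {
	if !opts.DestroyData && r.status != destroyReasonRemoved {
		// A caller that skips data destruction is claiming the data was
		// already destroyed, which is only legal if it also set the
		// destroy status first.
		panic("DestroyData is false but destroy status is unset")
	}
	if opts.DestroyData {
		r.status = destroyReasonRemoved // set status before destroying data
		preDestroyRaftMuLocked(r)
	}
	// ... unlink the replica from the store ...
}

func main() {
	r := &replica{}
	removeInitializedReplicaRaftMuLocked(r, RemoveOptions{DestroyData: true})
	fmt.Println(r.status == destroyReasonRemoved) // true
}
```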
62e1eb6 to 8142440
To clarify, were we trying to solve the following hazard with this PR (comment snippet lifted from clearSubsumedReplicaDiskData):
Yes, that's exactly the hazard we are trying to solve.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15 and @ajwerner)
pkg/kv/kvserver/store_merge.go, line 55 at r2 (raw file):
Previously, aayushshah15 (Aayush Shah) wrote…
garbled sentence?
Done.
Thanks for clarifying.
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @aayushshah15 and @ajwerner)
TFTRs! bors r+
Build succeeded.