kv: set destroy status before destroying data on subsumed replicas #55477
Conversation
🎉 new invariant!
Reviewed 1 of 1 files at r1, 6 of 6 files at r2.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15)
especially now that we allow follower reads through on subsumed ranges as of #51594.
Note that #51594 only allows historical (i.e. old enough to be outside the uncertainty window of the subsumption) reads on the subsumed leaseholder.
To clarify, were we trying to solve the following hazard with this PR (comment snippet lifted from clearSubsumedReplicaDiskData):
// a b c d
// |---1---|-------2-------| S1
// |---1-------------------| S2
// |---1-----------|---3---| S3
//
// Since the merge is the first operation to happen, a follower could be down
// before it completes. It is reasonable for a snapshot for r1 from S3 to
// subsume both r1 and r2 in S1.
- S3 sends a snapshot for r1 to S1.
- We clear out the disk data for r2 on S1, but haven't applied the snapshot yet.
- S1 serves incorrect follower reads for r2 before the snapshot is applied.
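A minimal sketch of the ordering this hazard demands (illustrative Go under simplified assumptions, not CockroachDB's actual types — `replica`, `destroyStatus`, and the in-memory `data` map here are stand-ins): the destroy status must be set before any data of the subsumed replica is cleared, so a concurrent follower read observes the status and refuses to serve rather than reading partially deleted data.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// destroyReason models the replica's destroy status.
type destroyReason int

const (
	destroyReasonAlive destroyReason = iota
	destroyReasonRemoved
)

type replica struct {
	mu     sync.Mutex
	status destroyReason
	data   map[string]string
}

// serveFollowerRead models the read path: it checks the destroy status
// under the same lock the destroyer takes, so it can never observe
// cleared data without also observing the destroyed status.
func (r *replica) serveFollowerRead(key string) (string, error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.status == destroyReasonRemoved {
		return "", errors.New("replica destroyed; cannot serve read")
	}
	return r.data[key], nil
}

// destroy sets the status first, then clears data. Reversing these two
// steps reintroduces the window where a read sees cleared data.
func (r *replica) destroy() {
	r.mu.Lock()
	r.status = destroyReasonRemoved // step 1: mark destroyed
	r.data = nil                    // step 2: only now clear the data
	r.mu.Unlock()
}

func main() {
	r := &replica{data: map[string]string{"a": "1"}}
	v, err := r.serveFollowerRead("a")
	fmt.Println(v, err) // 1 <nil>
	r.destroy()
	_, err = r.serveFollowerRead("a")
	fmt.Println(err != nil) // true
}
```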
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/kv/kvserver/store_merge.go, line 55 at r2 (raw file):
if err := s.removeInitializedReplicaRaftMuLocked(ctx, rightRepl, rightDesc.NextReplicaID, RemoveOptions{
	// The replica was destroyed when the merge applied by to the tombstones
	// added to the batch in runPreApplyTriggersAfterStagingWriteBatch.
garbled sentence?
The TODO discussed a merge followed by a split and alluded to an optimization that could be used to avoid a snapshot. This doesn't make sense to me. I think that's because the comment was written back when a range merged into and then back out of its left-hand neighbor would be the same range with the same ID. That is no longer the case, so I don't see how we could ever (reasonably) avoid a snapshot in such a case.
This PR cleans up some handling around replica destruction that scared me when working on cockroachdb#46329 and cockroachdb#55293. Specifically, there were cases during merges where the destroy status on a replica would not be set until after that replica's data was destroyed. This was true of merges applied through entry application, although in those cases, no user data was removed, so it seems mostly harmless. More concerning is that it was true of merges applied through snapshot application. This was extra concerning because snapshot application can delete user data if a subsumed range is only partially covered (see `clearSubsumedReplicaDiskData`). So we were hypothetically risking user-visible correctness problems, especially now that we allow follower reads through on subsumed ranges as of cockroachdb#51594.

This PR patches up these issues and then adds a few extra assertions that enforce stricter preconditions for the functions at play during replica destruction. Specifically, we now assert that the destroy status is set to `destroyReasonRemoved` _before_ calling `preDestroyRaftMuLocked`. We also assert that if `RemoveOptions.DestroyData` is false when passed to `removeInitializedReplicaRaftMuLocked`, then the destroy status is also set.

This unification allows us to remove the `ignoreDestroyStatus` flag on `RemoveOptions`, whose meaning is now exactly the inverse of `DestroyData`. In hindsight, pushing on why this extra flag was needed and what new information it was conveying to the function could have caught these issues a little earlier.

I think we'll want to backport this to v20.2, though probably in the first patch release, because there is some risk here (especially without sufficient time to bake on master) and we aren't aware of any correctness issues arising from the combination of the bug fixed here and cockroachdb#51594.
Release note (bug fix): A hypothesized bug that could allow a follower read to miss data on a range in the middle of being merged away into its left-hand neighbor was fixed. The bug seems exceedingly unlikely to have materialized in practice.
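The stricter preconditions described above can be sketched roughly as follows (a hypothetical, simplified stand-in: the names mirror the PR's identifiers, but the types and bodies are illustrative, not CockroachDB's implementation):

```go
package main

import "fmt"

type destroyReason int

const (
	destroyReasonAlive destroyReason = iota
	destroyReasonRemoved
)

type replica struct{ status destroyReason }

// RemoveOptions no longer needs an ignoreDestroyStatus flag:
// DestroyData == false now implies the caller has already set the
// destroy status (its meaning is exactly the inverse of DestroyData).
type RemoveOptions struct{ DestroyData bool }

// preDestroyRaftMuLocked asserts that the destroy status was set
// before any data is touched.
func preDestroyRaftMuLocked(r *replica) {
	if r.status != destroyReasonRemoved {
		panic("destroy status not set before preDestroyRaftMuLocked")
	}
	// ... clear replicated data, write the tombstone, etc.
}

func removeInitializedReplicaRaftMuLocked(r *replica, opts RemoveOptions) {
	if !opts.DestroyData && r.status != destroyReasonRemoved {
		// A caller that skips data destruction is claiming the data was
		// already destroyed, which is only legal if it also set the
		// destroy status first.
		panic("DestroyData is false but destroy status is unset")
	}
	if opts.DestroyData {
		r.status = destroyReasonRemoved // set status before destroying data
		preDestroyRaftMuLocked(r)
	}
	// ... unlink the replica from the store ...
}

func main() {
	r := &replica{}
	removeInitializedReplicaRaftMuLocked(r, RemoveOptions{DestroyData: true})
	fmt.Println(r.status == destroyReasonRemoved) // true
}
```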
62e1eb6 to 8142440
To clarify, were we trying to solve the following hazard with this PR (comment snippet lifted from clearSubsumedReplicaDiskData):
Yes, that's exactly the hazard we are trying to solve.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15 and @ajwerner)
pkg/kv/kvserver/store_merge.go, line 55 at r2 (raw file):
Previously, aayushshah15 (Aayush Shah) wrote…
garbled sentence?
Done.
Thanks for clarifying.
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @aayushshah15 and @ajwerner)
TFTRs! bors r+
Build succeeded.