Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv: set destroy status before destroying data on subsumed replicas #55477

Merged

Conversation

nvanbenschoten
Copy link
Member

This PR cleans up some handling around replica destruction that scared me when working on #46329 and #55293. Specifically, there were cases during merges where the destroy status on a replica would not be set until after that replicas data was destroyed. This was true of merges applied through entry application, although in those cases, no user data was removed so it seems mostly harmless. More concerning is that is was true of merges applied through snapshot application. This was extra concerning because snapshot application can delete user data if a subsumed range is only partially covered (see clearSubsumedReplicaDiskData). So we were hypothetically risking user-visible correctness problems, especially now that we allow follower reads through on subsumed ranges as of #51594.

This PR patches up these issues and then adds a few extra assertions that enforce stricter preconditions for the functions at play during replica destruction. Specifically, we now assert that the destroy status is set to destroyReasonRemoved before calling preDestroyRaftMuLocked. We also assert that if RemoveOptions.DestroyData is false when passed to removeInitializedReplicaRaftMuLocked, then the destroy status is also set.

This unification allows us to remove the ignoreDestroyStatus flag on RemoveOptions, whose meaning is now exactly the inverse of DestroyData. In hindsight, pushing on why this extra flag was needed and what new information it was conveying to the function could have potentially caught these issues a little earlier.

I think we'll want to backport this to v20.2, though probably in the first patch release because there is some risk here (especially without sufficient time to bake on master) and we aren't aware of seeing any correctness issues from the combination of the bug fixed here and #51594.

Release note (bug fix): A hypothesized bug that could allow a follower read to miss data on a range in the middle of being merged away into its left-hand neighbor was fixed. The bug seems exceedingly unlikely to have materialized in practice.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm_strong:

🎉 new invariant!

Reviewed 1 of 1 files at r1, 6 of 6 files at r2.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15)

Copy link
Contributor

@aayushshah15 aayushshah15 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

especially now that we allow follower reads through on subsumed ranges as of #51594.

Note that #51594 only allows historical (i.e. old enough to be outside the uncertainty window of the subsumption) reads on the subsumed leaseholder.

To clarify, were we trying to solve the following hazard with this PR (comment snippet lifted from clearSubsumedReplicaDiskData):

	// a       b       c       d
	// |---1---|-------2-------|  S1
	// |---1-------------------|  S2
	// |---1-----------|---3---|  S3
	//
	// Since the merge is the first operation to happen, a follower could be down
	// before it completes. It is reasonable for a snapshot for r1 from S3 to
	// subsume both r1 and r2 in S1.
  1. S3 sends a snapshot for r1 to S1.
  2. We clear out the disk data for r2 on S1, but haven't applied the snapshot yet
  3. S1 serves incorrect follower-reads for r2 before snapshot is applied

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/kv/kvserver/store_merge.go, line 55 at r2 (raw file):

	if err := s.removeInitializedReplicaRaftMuLocked(ctx, rightRepl, rightDesc.NextReplicaID, RemoveOptions{
		// The replica was destroyed when the merge applied by to the tombstones
		// added to the batch in runPreApplyTriggersAfterStagingWriteBatch.

garbled sentence?

The TODO discussed a merge followed by a split and alluded to an
optimization that could be used to avoid a snapshot. This doesn't make
sense to me. I think that's because the comment was written back in a
time where a range merged into and then back out of its left hand
neighbor would be the same range with the same ID. This is not the case
anymore, so I don't see how we could ever (reasonably) avoid a snapshot
in such a case.
This PR cleans up some handling around replica destruction that scared me
when working on cockroachdb#46329 and cockroachdb#55293. Specifically, there were cases during
merges where the destroy status on a replica would not be set until after
that replicas data was destroyed. This was true of merges applied through
entry application, although in those cases, no user data was removed so it
seems mostly harmless. More concerning is that is was true of merges applied
through snapshot application. This was extra concerning because snapshot
application can delete user data if a subsumed range is only partially
covered (see `clearSubsumedReplicaDiskData`). So we were hypothetically
risking user-visible correctness problems, especially now that we allow
follower reads through on subsumed ranges as of cockroachdb#51594.

This PR patches up these issues and then adds a few extra assertions that
enforce stricter preconditions for the functions at play during replica
destruction. Specifically, we now assert that the destroy status is set to
`destroyReasonRemoved` _before_ calling `preDestroyRaftMuLocked`. We also
assert that if `RemoveOptions.DestroyData` is false when passed to
`removeInitializedReplicaRaftMuLocked`, then the destroy status is also set.

This unification allows us to remove the `ignoreDestroyStatus` flag on
`RemoveOptions`, whose meaning is now exactly the inverse of `DestroyData`.
In hindsight, pushing on why this extra flag was needed and what new
information it was conveying to the function could have potentially caught
these issues a little earlier.

I think we'll want to backport this to v20.2, though probably in the first
patch release because there is some risk here (especially without sufficient
time to bake on master) and we aren't aware of seeing any correctness issues
from the combination of the bug fixed here and cockroachdb#51594.

Release note (bug fix): A hypothesized bug that could allow a follower
read to miss data on a range in the middle of being merged away into its
left-hand neighbor was fixed. The bug seems exceedingly unlikely to have
materialized in practice.
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/mergeDestroyStatus branch from 62e1eb6 to 8142440 Compare October 14, 2020 23:29
Copy link
Member Author

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify, were we trying to solve the following hazard with this PR (comment snippet lifted from clearSubsumedReplicaDiskData):

Yes, that's exactly the hazard we are trying to solve.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15 and @ajwerner)


pkg/kv/kvserver/store_merge.go, line 55 at r2 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

garbled sentence?

Done.

Copy link
Contributor

@aayushshah15 aayushshah15 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Thanks for clarifying.

Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @aayushshah15 and @ajwerner)

@nvanbenschoten
Copy link
Member Author

TFTRs!

bors r+

@craig
Copy link
Contributor

craig bot commented Oct 15, 2020

Build succeeded:

@craig craig bot merged commit 80e7127 into cockroachdb:master Oct 15, 2020
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/mergeDestroyStatus branch October 15, 2020 18:34
@rafiss rafiss added this to the 20.2 milestone Apr 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants