Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: apply lease change side-effects on snapshot recipients #34548

Merged

Conversation

nvanbenschoten
Copy link
Member

@nvanbenschoten nvanbenschoten commented Feb 4, 2019

Fixes #34025.
Fixes #33624.
Fixes #33335.
Fixes #33151.
Fixes #33149.
Fixes #34159.
Fixes #34293.
Fixes #32813.
Fixes #30886.
Fixes #34228.
Fixes #34321.

It is rare but possible for a replica to become a leaseholder but not learn about this until it applies a snapshot. Immediately upon the snapshot application's ReplicaState update, the replica will begin operating as a standard leaseholder.

Before this change, leases acquired in this way would not trigger in-memory side-effects to be performed. This could result in a regression in the new leaseholder's timestamp cache compared to the previous leaseholder's cache, allowing write-skew like we saw in #34025. This could presumably result in other anomalies as well, because all of the steps in leasePostApply were skipped (as theorized by #34025 (comment)).

This PR fixes this bug by detecting lease updates when applying snapshots and making sure to react correctly to them. It also likely fixes the referenced issue. The new test demonstrates that without this fix, the serializable violation speculated about in the issue was possible.

@nvanbenschoten nvanbenschoten requested a review from a team February 4, 2019 23:25
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@nvanbenschoten nvanbenschoten changed the title [DNM] storage: apply lease change side-effects on snapshot recipients storage: apply lease change side-effects on snapshot recipients Feb 5, 2019
@nvanbenschoten
Copy link
Member Author

This has been updated with a test which reproduces the issue and which is fixed by the patch. It's ready for a look.

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm_strong: thank you for getting that in place so quickly. The code changes LGTM, and the comments on the test are minor. (Please check that it's not flaky though if you haven't already).

Reviewed 4 of 4 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/storage/client_replica_test.go, line 1855 at r1 (raw file):

		t.Fatal(pErr)
	}
	incArgs = incrementArgs(keyC, 2)

Mind renaming this incC so that the later use at the bottom is more obvious?


pkg/storage/client_replica_test.go, line 1876 at r1 (raw file):

that will


pkg/storage/client_replica_test.go, line 1908 at r1 (raw file):

	// ensures that that when node 2 comes back up it will require a snapshot
	// from Raft.
	mtc.transport.GetCircuitBreaker(mtc.idents[2].NodeID).Break()

Won't it possibly reset? I think the merge tests do something with the Raft transport to achieve the desired effect. Mind looking into whether that is more appropriate? There's a chance what you're doing is actually better. Not sure.


pkg/storage/replica_proposal.go, line 203 at r1 (raw file):
prevLease isn't necessarily the previous lease. I would like this to be obvious, maybe somePreviousLease with a comment

if permitJump is true, we only know that prevLease is some previous lease on the range, but it may not be the one in place before newLease became known to this replica. This typically happens when a snapshot is applied.

Open to better ideas.

Copy link
Contributor

@bdarnell bdarnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Should we also optimistically mark this as fixing #34321?

Reviewed 4 of 4 files at r1.
Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @nvanbenschoten)

Fixes cockroachdb#34025.
Fixes cockroachdb#33624.
Fixes cockroachdb#33335.
Fixes cockroachdb#33151.
Fixes cockroachdb#33149.
Fixes cockroachdb#34159.
Fixes cockroachdb#34293.
Fixes cockroachdb#32813.
Fixes cockroachdb#30886.
Fixes cockroachdb#34228.
Fixes cockroachdb#34321.

It is rare but possible for a replica to become a leaseholder but not
learn about this until it applies a snapshot. Immediately upon the
snapshot application's `ReplicaState` update, the replica will begin
operating as a standard leaseholder.

Before this change, leases acquired in this way would not trigger
in-memory side-effects to be performed. This could result in a regression
in the new leaseholder's timestamp cache compared to the previous
leaseholder, allowing write-skew like we saw in cockroachdb#34025. This could
presumably result in other anomalies as well, because all of the
steps in `leasePostApply` were skipped.

This PR fixes this bug by detecting lease updates when applying
snapshots and making sure to react correctly to them. It also likely
fixes the referenced issue. The new test demonstrated that without
this fix, the serializable violation speculated about in the issue
was possible.

Release note (bug fix): Fix bug where lease transfers passed through
Snapshots could forget to update in-memory state on the new leaseholder,
allowing write-skew between read-modify-write operations.
Copy link
Member Author

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check that it's not flaky though if you haven't already

I stressed for about 10 minutes without a flake.

Should we also optimistically mark this as fixing #34321?

Done. And for a few more issues.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @bdarnell and @tbg)


pkg/storage/client_replica_test.go, line 1855 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Mind renaming this incC so that the later use at the bottom is more obvious?

Done.


pkg/storage/client_replica_test.go, line 1876 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

that will

Done.


pkg/storage/client_replica_test.go, line 1908 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Won't it possibly reset? I think the merge tests do something with the Raft transport to achieve the desired effect. Mind looking into whether that is more appropriate? There's a chance what you're doing is actually better. Not sure.

I don't think it could reset because it's not just tripped, but the unreliableRaftHandler is much better! Done here and in another test that was using breakers.


pkg/storage/replica_proposal.go, line 203 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

prevLease isn't necessarily the previous lease. I would like this to be obvious, maybe somePreviousLease with a comment

if permitJump is true, we only know that prevLease is some previous lease on the range, but it may not be the one in place before newLease became known to this replica. This typically happens when a snapshot is applied.

Open to better ideas.

Done.

@nvanbenschoten
Copy link
Member Author

bors r+

@nvanbenschoten
Copy link
Member Author

bors crashed

bors r-

@nvanbenschoten
Copy link
Member Author

bors r+

craig bot pushed a commit that referenced this pull request Feb 5, 2019
34548: storage: apply lease change side-effects on snapshot recipients r=nvanbenschoten a=nvanbenschoten

Fixes #34025.
Fixes #33624.
Fixes #33335.
Fixes #33151.
Fixes #33149.
Fixes #34159.
Fixes #34293.
Fixes #32813.
Fixes #30886.
Fixes #34228.
Fixes #34321.

It is rare but possible for a replica to become a leaseholder but not learn about this until it applies a snapshot. Immediately upon the snapshot application's `ReplicaState` update, the replica will begin operating as a standard leaseholder.

Before this change, leases acquired in this way would not trigger in-memory side-effects to be performed. This could result in a regression in the new leaseholder's timestamp cache compared to the previous leaseholder's cache, allowing write-skew like we saw in #34025. This could presumably result in other anomalies as well, because all of the steps in `leasePostApply` were skipped (as theorized by #34025 (comment)).

This PR fixes this bug by detecting lease updates when applying snapshots and making sure to react correctly to them. It also likely fixes the referenced issue. The new test demonstrates that without this fix, the serializable violation speculated about in the issue was possible.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@craig
Copy link
Contributor

craig bot commented Feb 5, 2019

Build succeeded

@craig craig bot merged commit 2bfea76 into cockroachdb:master Feb 5, 2019
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/fixLeaseSnap branch February 11, 2019 18:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment