Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: check whether the acquiring replica can receive the lease #42068

Conversation

ajwerner
Copy link
Contributor

Before this PR, when evaluating lease transfers or requests we'd check whether
the evaluating replica could receive the lease rather than the acquiring
replica. This patch fixes that bug by checking the state of the correct
replica.

Fixes #38720.

Release note (bug fix): The system no longer erroneously transfers leases
to replicas which are in the process of being removed which can lead to
ranges being effectively unavailable due to an invalid lease.

@ajwerner
Copy link
Contributor Author

This needs a test.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@andy-kimball andy-kimball mentioned this pull request Oct 31, 2019
53 tasks
@ajwerner ajwerner force-pushed the ajwerner/prevent-voter-outgoing-from-getting-the-lease branch from 9632f2c to 1ef1954 Compare October 31, 2019 18:12
@ajwerner ajwerner requested a review from danhhz October 31, 2019 18:13
Copy link
Contributor Author

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @danhhz)


pkg/storage/client_lease_test.go, line 352 at r1 (raw file):

	defer tc.Stopper().Stop(ctx)

	scratchStartKey := tc.ScratchRange(t)

I'll add some more commentary here about what's going on.

@ajwerner ajwerner force-pushed the ajwerner/prevent-voter-outgoing-from-getting-the-lease branch 3 times, most recently from 339cc06 to c26aafe Compare October 31, 2019 18:24
Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: thanks for getting this out so fast.

Do you mind writing up an issue about the other issue we hypothesized with RequestLease?

Reviewed 3 of 4 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @danhhz, and @nvanbenschoten)


pkg/storage/client_lease_test.go, line 365 at r3 (raw file):

that at least

"that the test at least"?


pkg/storage/client_lease_test.go, line 395 at r3 (raw file):

					// The error generated prior to evaluation which we try to avoid
					// by sleeping a bit and running the go scheduler but still can
					// happen.

Do you mind explaining why we can get this in a little more detail? Is the problem that we occasionally haven't begun acquiring latches by the time we close(ch) and let the ChangeReplicas through?


pkg/storage/batcheval/cmd_lease.go, line 46 at r3 (raw file):

func checkCanReceiveLease(newLease *roachpb.Lease, rec EvalContext) error {
	repDesc, ok := rec.Desc().GetReplicaDescriptorByID(newLease.Replica.ReplicaID)

Is there a reason to switch from GetReplicaDescriptor to GetReplicaDescriptorByID?

Copy link
Contributor Author

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mind writing up an issue about the other issue we hypothesized with RequestLease?

Yep, doing that right now.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @danhhz, and @nvanbenschoten)

@ajwerner ajwerner force-pushed the ajwerner/prevent-voter-outgoing-from-getting-the-lease branch from c26aafe to 4e00918 Compare October 31, 2019 18:47
Copy link
Contributor Author

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @danhhz and @nvanbenschoten)


pkg/storage/client_lease_test.go, line 395 at r3 (raw file):

Is the problem that we occasionally haven't begun acquiring latches by the time we close(ch) and let the ChangeReplicas through?

Yes, done.


pkg/storage/batcheval/cmd_lease.go, line 46 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Is there a reason to switch from GetReplicaDescriptor to GetReplicaDescriptorByID?

Using the replica ID seems better. Imagine a case where n1/1 and now is n1/4. Imagine an attempt to transfer the lease to n1/1 The code here as well as evalNewLease only uses the store ID and I believe would permit that transfer which would be generally non-sensical.

It's not clear how such a request could come to exist but I don't see a reason to let a lease request for a replica which is not in the range but a store which is to succeed.

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still LGTM

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @danhhz and @nvanbenschoten)

@ajwerner ajwerner force-pushed the ajwerner/prevent-voter-outgoing-from-getting-the-lease branch from 4e00918 to 12e6018 Compare October 31, 2019 19:14
Before this PR, when evaluating lease transfers or requests we'd check whether
the evaluating replica could receive the lease rather than the acquiring
replica. This patch fixes that bug by checking the state of the correct
replica.

Release note (bug fix): The system no longer erroneously transfers leases
to replicas which are in the process of being removed which can lead to
ranges being effectively unavailable due to an invalid lease.
@ajwerner
Copy link
Contributor Author

bors r+

@craig
Copy link
Contributor

craig bot commented Oct 31, 2019

Build failed

@ajwerner
Copy link
Contributor Author

TeamCity builder ran out of space.

bors r+

craig bot pushed a commit that referenced this pull request Oct 31, 2019
42068: storage: check whether the acquiring replica can receive the lease r=ajwerner a=ajwerner

Before this PR, when evaluating lease transfers or requests we'd check whether
the evaluating replica could receive the lease rather than the acquiring
replica. This patch fixes that bug by checking the state of the correct
replica.

Fixes #38720.

Release note (bug fix): The system no longer erroneously transfers leases
to replicas which are in the process of being removed which can lead to
ranges being effectively unavailable due to an invalid lease.

Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
@craig
Copy link
Contributor

craig bot commented Oct 31, 2019

Build succeeded

@craig craig bot merged commit 23cbf6c into cockroachdb:master Oct 31, 2019
@solongordon solongordon mentioned this pull request Nov 1, 2019
18 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

roachtest: clearrange/checks=true failed
3 participants