-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storage: check whether the acquiring replica can receive the lease #42068
storage: check whether the acquiring replica can receive the lease #42068
Conversation
This needs a test. |
9632f2c
to
1ef1954
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @danhhz)
pkg/storage/client_lease_test.go, line 352 at r1 (raw file):
defer tc.Stopper().Stop(ctx) scratchStartKey := tc.ScratchRange(t)
I'll add some more commentary here about what's going on.
339cc06
to
c26aafe
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for getting this out so fast.
Do you mind writing up an issue about the other issue we hypothesized with RequestLease
?
Reviewed 3 of 4 files at r1.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @danhhz, and @nvanbenschoten)
pkg/storage/client_lease_test.go, line 365 at r3 (raw file):
that at least
"that the test at least"?
pkg/storage/client_lease_test.go, line 395 at r3 (raw file):
// The error generated prior to evaluation which we try to avoid // by sleeping a bit and running the go scheduler but still can // happen.
Do you mind explaining why we can get this in a little more detail? Is the problem that we occasionally haven't begun acquiring latches by the time we close(ch)
and let the ChangeReplicas through?
pkg/storage/batcheval/cmd_lease.go, line 46 at r3 (raw file):
func checkCanReceiveLease(newLease *roachpb.Lease, rec EvalContext) error { repDesc, ok := rec.Desc().GetReplicaDescriptorByID(newLease.Replica.ReplicaID)
Is there a reason to switch from GetReplicaDescriptor
to GetReplicaDescriptorByID
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mind writing up an issue about the other issue we hypothesized with RequestLease?
Yep, doing that right now.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @danhhz, and @nvanbenschoten)
c26aafe
to
4e00918
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @danhhz and @nvanbenschoten)
pkg/storage/client_lease_test.go, line 395 at r3 (raw file):
Is the problem that we occasionally haven't begun acquiring latches by the time we close(ch) and let the ChangeReplicas through?
Yes, done.
pkg/storage/batcheval/cmd_lease.go, line 46 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Is there a reason to switch from
GetReplicaDescriptor
toGetReplicaDescriptorByID
?
Using the replica ID seems better. Imagine a case where n1/1
and now is n1/4
. Imagine an attempt to transfer the lease to n1/1
The code here as well as evalNewLease
only uses the store ID and I believe would permit that transfer which would be generally non-sensical.
It's not clear how such a request could come to exist but I don't see a reason to let a lease request for a replica which is not in the range but a store which is to succeed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still LGTM
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @danhhz and @nvanbenschoten)
4e00918
to
12e6018
Compare
Before this PR, when evaluating lease transfers or requests we'd check whether the evaluating replica could receive the lease rather than the acquiring replica. This patch fixes that bug by checking the state of the correct replica. Release note (bug fix): The system no longer erroneously transfers leases to replicas which are in the process of being removed which can lead to ranges being effectively unavailable due to an invalid lease.
12e6018
to
23cbf6c
Compare
bors r+ |
Build failed |
TeamCity builder ran out of space. bors r+ |
42068: storage: check whether the acquiring replica can receive the lease r=ajwerner a=ajwerner Before this PR, when evaluating lease transfers or requests we'd check whether the evaluating replica could receive the lease rather than the acquiring replica. This patch fixes that bug by checking the state of the correct replica. Fixes #38720. Release note (bug fix): The system no longer erroneously transfers leases to replicas which are in the process of being removed which can lead to ranges being effectively unavailable due to an invalid lease. Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
Build succeeded |
Before this PR, when evaluating lease transfers or requests we'd check whether
the evaluating replica could receive the lease rather than the acquiring
replica. This patch fixes that bug by checking the state of the correct
replica.
Fixes #38720.
Release note (bug fix): The system no longer erroneously transfers leases
to replicas which are in the process of being removed which can lead to
ranges being effectively unavailable due to an invalid lease.