storage: campaign eagerly upon removing raft leader #41122
Conversation
Nice job tracking down these stalls. The fix here seems reasonable, although I think it probably belongs in `handleChangeReplicasResult` to mirror the handling of removing our own replica.
I'm curious about the attempt at avoiding having all replicas campaign by only having the first replica in the descriptor campaign. Did you see thrashing in practice? I thought the Raft leader election protocol converges quickly even when multiple peers campaign at the same time. If you didn't see any thrashing then I'd lean towards having anyone who sees the leader get removed campaign for simplicity.
What's your plan here? Do you need someone to carry this over the finish line?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)
> What's your plan here? Do you need someone to carry this over the finish line?
Would happily hand this off; I probably won't be able to do anything else here until 19.2 is out.
> Did you see thrashing in practice?
Yes, two replicas can both make it through pre-vote and then tie their election. The result is then, you guessed it, another (3s?) stall as they both back off. This should repro readily for whoever picks up this PR: just have all replicas campaign and run the `--count=100` invocation above. Definitely worth giving this another look - with three replicas this stalemate shouldn't be able to happen between just two voters (you'd need three, but I remember seeing two). The membership changes in the test are nontrivial though, so it probably wasn't in a three-member config. Possibly two members? Anyway, all things working, I'd also prefer symmetry here, though it looked like that wasn't working well.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)
Force-pushed from ed95726 to dacb8d2.
I finally got around to dusting this off, PTAL.
Reviewed 1 of 2 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @nvanbenschoten, and @tbg)
pkg/storage/replica_raft.go, line 769 at r1 (raw file):
```go
const expl = "during advance"
r.mu.Lock()
```
Why did you make this locking change? It doesn't seem necessary.
pkg/storage/replica_raft.go, line 774 at r1 (raw file):
```go
if stats.numConfChangeEntries > 0 {
	// If a config change was carried out, it's possible that the Raft
```
nit: Is there a method this could live in to avoid bloating handleRaftReady further?
pkg/storage/replica_raft.go, line 783 at r1 (raw file):
```go
// changes.
// We can't (or rather shouldn't) campaign on all remaining voters
// because that can lead to a stalemate.
```
If you do end up pulling this out such that you have more room for a comment, consider elaborating on this stalemate as you did in the PR discussion.
pkg/storage/replica_raft.go, line 786 at r1 (raw file):
```go
desc := r.descRLocked()
st := raftGroup.BasicStatus()
if st.Lead == 0 || !desc.IsInitialized() {
```
`st.Lead` has a comment that says "must use atomic operations to access; keep 64-bit aligned." We'll need to do that here and below.

Also, consider splitting these cases up and giving each a comment. It's better to be extra verbose with all of this stuff while we have it paged in.
pkg/storage/replica_raft.go, line 787 at r1 (raw file):
```go
st := raftGroup.BasicStatus()
if st.Lead == 0 || !desc.IsInitialized() {
	return true, nil
```
Do you want to return here? Aren't we missing the `enqueueRaftUpdateCheck` check? A separate function would make this easier to avoid.
pkg/storage/replica_raft.go, line 791 at r1 (raw file):
```go
_, leaderStillThere := desc.GetReplicaDescriptorByID(roachpb.ReplicaID(st.Lead))
if !leaderStillThere && sm.r.store.StoreID() == desc.Replicas().Voters()[0].StoreID {
	_ = raftGroup.Campaign()
```
Is it worth logging like we do in `maybeCampaignOnWakeLocked`?
Prior to this change,

```
make test PKG=./pkg/storage/ TESTS='TestAdminRelocateRange$$' \
  TESTFLAGS='-v -count 100 --vmodule=raft=1 -failfast'
```

would reliably fail with the following patch

```diff
diff --git a/pkg/storage/client_relocate_range_test.go b/pkg/storage/client_relocate_range_test.go
index db7ffa2c0d..65cdf12405 100644
--- a/pkg/storage/client_relocate_range_test.go
+++ b/pkg/storage/client_relocate_range_test.go
@@ -140,6 +142,8 @@ func TestAdminRelocateRange(t *testing.T) {
 	tc := testcluster.StartTestCluster(t, 6, args)
 	defer tc.Stopper().Stop(ctx)
 
+	tBegin := timeutil.Now()
+
 	// s1 (LH) ---> s2 (LH) s1 s3
 	// Pure upreplication.
 	k := keys.MustAddr(tc.ScratchRange(t))
@@ -198,4 +202,8 @@ func TestAdminRelocateRange(t *testing.T) {
 		return relocateAndCheck(t, tc, k, tc.Targets(2, 4))
 	})
 }
+
+	if dur := timeutil.Since(tBegin); dur > 4*time.Second {
+		t.Errorf("test took %.2fs", dur.Seconds())
+	}
 }
```

because whenever the raft leader would end up getting removed, the range would be unavailable for a few seconds (until the raft election timeout fired and a new leader was elected).

This patch detects removal of the leader in the raft handling loop, and lets the first remaining voter campaign.

Release note (bug fix): fixed a short (3s) period of unavailability that could occur on a range after removing a replica.
Force-pushed from dacb8d2 to 324c3d6.
Sorry about the delay on this, PTAL.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)
pkg/storage/replica_raft.go, line 769 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
> Why did you make this locking change? It doesn't seem necessary.
Just for clarity - if you run `r.withRaftGroup` and, in the closure passed to it, call `r.Desc()`, you deadlock. It's more obvious this way that `r.mu` is locked.
pkg/storage/replica_raft.go, line 786 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
> `st.Lead` has a comment that says "must use atomic operations to access; keep 64-bit aligned." We'll need to do that here and below. Also, consider splitting these cases up and giving each a comment. It's better to be extra verbose with all of this stuff while we have it paged in.
That comment on `st.Lead` is nonsense, and this particular `st.Lead` is a copy of the internal one.
Reviewed 3 of 4 files at r2.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)
Reviewed 4 of 4 files at r2.
Reviewable status: complete! 2 of 0 LGTMs obtained
bors r=ajwerner,nvanbenschoten

Build failed (retrying...)

Build failed

missing release justification

bors r=ajwerner,nvanbenschoten

Build succeeded
Prior to this change, the test invocation shown above would reliably fail with the timing patch shown above, because whenever the raft leader would end up getting removed, the range would be unavailable for a few seconds (until the raft election timeout fired and a new leader was elected).

This patch detects removal of the leader in the raft handling loop, and lets the first remaining voter campaign.

Release justification: fixes a bug.

Release note (bug fix): fixed a short (3s) period of unavailability that could occur on a range after removing a replica.