storage: campaign eagerly upon removing raft leader #41122

Merged · 1 commit · Apr 13, 2020

Conversation

@tbg tbg (Member) commented Sep 26, 2019

Prior to this change,

```
make test PKG=./pkg/storage/ TESTS='TestAdminRelocateRange$$' \
    TESTFLAGS='-v -count 100 --vmodule=raft=1 -failfast'
```

would reliably fail with the following patch

```diff
diff --git a/pkg/storage/client_relocate_range_test.go b/pkg/storage/client_relocate_range_test.go
index db7ffa2c0d..65cdf12405 100644
--- a/pkg/storage/client_relocate_range_test.go
+++ b/pkg/storage/client_relocate_range_test.go
@@ -140,6 +142,8 @@ func TestAdminRelocateRange(t *testing.T) {
 	tc := testcluster.StartTestCluster(t, 6, args)
 	defer tc.Stopper().Stop(ctx)

+	tBegin := timeutil.Now()
+
 	// s1 (LH) ---> s2 (LH) s1 s3
 	// Pure upreplication.
 	k := keys.MustAddr(tc.ScratchRange(t))
@@ -198,4 +202,8 @@ func TestAdminRelocateRange(t *testing.T) {
 			return relocateAndCheck(t, tc, k, tc.Targets(2, 4))
 		})
 	}
+
+	if dur := timeutil.Since(tBegin); dur > 4*time.Second {
+		t.Errorf("test took %.2fs", dur.Seconds())
+	}
 }
```

because whenever the raft leader would end up getting removed, the range
would be unavailable for a few seconds (until the raft election timeout
fired and a new leader was elected).

This patch detects removal of the leader in the raft handling loop, and
lets the first remaining voter campaign.
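
For illustration, here is a rough sketch of that approach, assembled from the snippets discussed in the review below; the surrounding names (`stats`, `raftGroup`, `r`) are state inside the raft handling loop, so this is a fragment rather than a standalone program:

```go
// Sketch only: after applying a command that carried a config change,
// check whether the raft leader is still part of the range descriptor.
// If it was removed, let exactly one surviving voter campaign so a new
// leader is elected immediately rather than after the election timeout.
if stats.numConfChangeEntries > 0 {
	st := raftGroup.BasicStatus()
	desc := r.descRLocked()
	if st.Lead != 0 && desc.IsInitialized() {
		_, leaderStillThere := desc.GetReplicaDescriptorByID(roachpb.ReplicaID(st.Lead))
		// Only the first voter in the descriptor campaigns; having every
		// replica campaign can tie the election and cause another stall.
		if !leaderStillThere && r.store.StoreID() == desc.Replicas().Voters()[0].StoreID {
			_ = raftGroup.Campaign()
		}
	}
}
```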

Release justification: fixes a bug.
Release note (bug fix): fixed a short (3s) period of unavailability that
could occur on a range after removing a replica.


@nvanbenschoten nvanbenschoten (Member) left a comment

Nice job tracking down these stalls. The fix here seems reasonable, although I think it probably belongs in handleChangeReplicasResult to mirror the handling of removing our own replica.

I'm curious about the attempt to avoid having all replicas campaign by only letting the first replica in the descriptor campaign. Did you see thrashing in practice? I thought the Raft leader election protocol converges quickly even when multiple peers campaign at the same time. If you didn't see any thrashing, then I'd lean towards having anyone who sees the leader get removed campaign, for simplicity.

What's your plan here? Do you need someone to carry this over the finish line?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)

@tbg tbg (Member, Author) left a comment

> What's your plan here? Do you need someone to carry this over the finish line?

Would happily hand this off; I probably won't be able to do anything else here until 19.2 is out.

> Did you see thrashing in practice?

Yes, two replicas can both make it through pre-vote and then tie their election. The result is then, you guessed it, another (3s?) stall as they both back off. This should repro readily for whoever picks up this PR by just having all replicas campaign and running the --count=100 invocation above. This definitely deserves another look: with just three replicas, this stalemate shouldn't be able to happen between only two voters (you'd need three, but I remember seeing two). The membership changes in the test are nontrivial, though, so it probably wasn't a three-member config. Possibly two members? Anyway, all else being equal I'd also prefer symmetry here, though it looked like that wasn't working well.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)

@tbg tbg changed the title from "[wip] storage: campaign eagerly upon removing raft leader" to "storage: campaign eagerly upon removing raft leader" on Feb 25, 2020
@tbg tbg (Member, Author) commented Feb 27, 2020

I finally got around to dusting this off, PTAL.

@nvanbenschoten nvanbenschoten (Member) left a comment

Reviewed 1 of 2 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @nvanbenschoten, and @tbg)


pkg/storage/replica_raft.go, line 769 at r1 (raw file):

	const expl = "during advance"

	r.mu.Lock()

Why did you make this locking change? It doesn't seem necessary.


pkg/storage/replica_raft.go, line 774 at r1 (raw file):

		if stats.numConfChangeEntries > 0 {
			// If a config change was carried out, it's possible that the Raft

nit: Is there a method this could live in to avoid bloating handleRaftReady further?


pkg/storage/replica_raft.go, line 783 at r1 (raw file):

			// changes.
			// We can't (or rather shouldn't) campaign on all remaining voters
			// because that can lead to a stalemate.

If you do end up pulling this out such that you have more room for a comment, consider elaborating on this stalemate as you did in the PR discussion.


pkg/storage/replica_raft.go, line 786 at r1 (raw file):

			desc := r.descRLocked()
			st := raftGroup.BasicStatus()
			if st.Lead == 0 || !desc.IsInitialized() {

st.Lead has a comment that says "must use atomic operations to access; keep 64-bit aligned." We'll need to do that here and below.

Also, consider splitting these cases up and giving each a comment. It's better to be extra verbose with all of this stuff while we have it paged in.


pkg/storage/replica_raft.go, line 787 at r1 (raw file):

			st := raftGroup.BasicStatus()
			if st.Lead == 0 || !desc.IsInitialized() {
				return true, nil

Do you want to return here? Aren't we missing the enqueueRaftUpdateCheck call? A separate function would make this easier to avoid.


pkg/storage/replica_raft.go, line 791 at r1 (raw file):

			_, leaderStillThere := desc.GetReplicaDescriptorByID(roachpb.ReplicaID(st.Lead))
			if !leaderStillThere && sm.r.store.StoreID() == desc.Replicas().Voters()[0].StoreID {
				_ = raftGroup.Campaign()

Is it worth logging like we do in maybeCampaignOnWakeLocked?
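
(As a hedged illustration only, such a log line might look roughly like the following, assuming a `ctx` is in scope; the verbosity level and message here are made up rather than taken from the PR:)

```go
log.VEventf(ctx, 3, "campaigning because raft leader (replica ID %d) was removed", st.Lead)
```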

@tbg tbg (Member, Author) left a comment

Sorry about the delay on this, PTAL.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)


pkg/storage/replica_raft.go, line 769 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Why did you make this locking change? It doesn't seem necessary.

Just for clarity: if you call r.Desc() from within the closure passed to r.withRaftGroup, you deadlock. Writing it this way makes it more obvious that r.mu is locked.
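
(To illustrate the hazard in a hedged way: inside the closure passed to `r.withRaftGroup`, `r.mu` is already held, so the descriptor has to be read through the locked accessor.)

```go
// Inside the closure, r.mu is held by withRaftGroup, so:
desc := r.descRLocked() // fine: relies on the already-held lock
// desc := r.Desc()     // would try to re-acquire r.mu and self-deadlock
```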


pkg/storage/replica_raft.go, line 786 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

st.Lead has a comment that says "must use atomic operations to access; keep 64-bit aligned." We'll need to do that here and below.

Also, consider splitting these cases up and giving each a comment. It's better to be extra verbose with all of this stuff while we have it paged in.

That comment on st.Lead is nonsense, and this particular st.Lead is a copy of the internal one.

@ajwerner ajwerner (Contributor) left a comment

:lgtm:

Reviewed 3 of 4 files at r2.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)

@nvanbenschoten nvanbenschoten (Member) left a comment

:lgtm:

Reviewed 4 of 4 files at r2.
Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained

@tbg tbg (Member, Author) commented Mar 16, 2020

bors r=ajwerner,nvanbenschoten

@craig craig bot (Contributor) commented Mar 16, 2020

Build failed (retrying...)

@craig craig bot (Contributor) commented Mar 16, 2020

Build failed

@tbg tbg (Member, Author) commented Mar 16, 2020

missing release justification

bors r=ajwerner,nvanbenschoten

@tbg tbg (Member, Author) commented Apr 13, 2020

bors r=ajwerner,nvanbenschoten

@craig craig bot (Contributor) commented Apr 13, 2020

Build succeeded
