
[wip] storage: allow atomic removal/addition of multiple LEARNERs #40268

Closed
wants to merge 1 commit

Conversation


@tbg tbg commented Aug 27, 2019

@nvanbenschoten this doesn't even compile because it needs to sit on top of
#40234. The meat is in the commentary anyway, I don't think it's clear enough
yet, lmk how I can improve it.


When we add/remove more than one voter "atomically", we end up with a
range descriptor that has the affected voters as
VOTER_{INCOMING,OUTGOING}.

Adding/removing multiple learners does not need joint consensus, but
etcd/raft requires it on a technicality: it can't prove that the
learner changes don't correspond to voter demotions.

We don't have LEARNER_{INCOMING,OUTGOING}, nor do we want to clutter the
range descriptor further. As a result, the range descriptor can't
accurately reflect a joint configuration in which only learners changed,
and so we aren't able to mutate only learners in an atomic replication
change.

We could argue that this isn't needed, but it's actually a very awkward
papercut: for removals, we don't distinguish between learners and
voters, so anyone running atomic replication changes needs to make sure
they're not accidentally trying to remove multiple learners at once.

To lift this restriction, we simply work around raft's requirement. We
adopt the convention that a transition into a joint raft config that
takes place under a descriptor that doesn't reflect a joint config is
left again right when it is entered (i.e. in the same raft command).
This essentially amounts to etcd/raft trusting us that we are not
running demotions, which we are not.
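
To illustrate, here is a minimal Go sketch of that convention against
etcd/raft's ConfChangeV2 API; the function and its descIsJoint parameter
are made up for illustration and are not the code in this PR. After
applying the change, if raft ended up in a joint config while the
descriptor isn't joint, an empty ConfChangeV2 is applied immediately,
which etcd/raft treats as a request to leave the joint state:

```go
package example

import (
	"go.etcd.io/etcd/raft"
	"go.etcd.io/etcd/raft/raftpb"
)

// maybeApplyAndLeaveJoint is a sketch of the convention above: apply a
// ConfChangeV2 and, if raft entered a joint configuration even though the
// range descriptor does not describe one (i.e. only learners changed),
// leave the joint config again as part of the same command application.
func maybeApplyAndLeaveJoint(
	rn *raft.RawNode, cc raftpb.ConfChangeV2, descIsJoint bool,
) *raftpb.ConfState {
	cs := rn.ApplyConfChange(cc)
	// etcd/raft signals a joint config via a non-empty outgoing voter set.
	if len(cs.VotersOutgoing) > 0 && !descIsJoint {
		// An empty ConfChangeV2 is how etcd/raft encodes "leave the joint
		// configuration", so the joint state never escapes this command.
		cs = rn.ApplyConfChange(raftpb.ConfChangeV2{})
	}
	return cs
}
```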

Release note: None



@nvanbenschoten nvanbenschoten left a comment


Reviewed 2 of 4 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)


pkg/roachpb/data.go, line 1456 at r1 (raw file):

	// LEARNER_{INCOMING,OUTGOING} replica types).
	//
	// Unfortunately, etcd/raft *always requires* a joint consensus transition

Did you consider fixing this in etcd/raft instead of here? We've done a good job thus far pushing complexity into the library itself.

I've seen "every addition of a learner could be a demotion of a voter" a few times, but I don't think I understand it.


pkg/roachpb/data.go, line 1494 at r1 (raw file):

	// A trigger that neither adds nor removes anything is how we encode a request
	// to leave a joint state. Note that this is only applicable in case 1 above,
	// i.e. we're chaning something about the voters.

changing


pkg/storage/client_atomic_membership_change_test.go, line 136 at r1 (raw file):

}

// TODO(tbg): finish this test, add comments.

Reminder to address this TODO.


pkg/storage/replica_application_state_machine.go, line 1008 at r1 (raw file):

		}
		return sm.r.withRaftGroup(true, func(raftGroup *raft.RawNode) (bool, error) {
			cc := cmd.confChange.ConfChangeI.AsV2()

This is all below Raft, so do we need a new cluster version for this?


pkg/storage/replica_application_state_machine.go, line 1012 at r1 (raw file):

len(newCfg.Voters[1]) > 0

This really makes me wish we made this a method on tracker.Config.
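
For reference, here is a sketch of the kind of helper that comment is asking for; the confIsJoint name (or a Joint() method on tracker.Config) is hypothetical and not part of etcd/raft, but the check itself mirrors the quoted condition:

```go
package example

import "go.etcd.io/etcd/raft/tracker"

// confIsJoint reports whether a tracker.Config describes a joint
// configuration. etcd/raft tracks voters as a pair of majority configs;
// the outgoing one (Voters[1]) is non-empty exactly while a joint state
// is active. This is the check the review suggests hoisting into
// tracker.Config itself, e.g. as a Joint() method.
func confIsJoint(cfg tracker.Config) bool {
	return len(cfg.Voters[1]) > 0
}
```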


@tbg tbg left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)


pkg/roachpb/data.go, line 1456 at r1 (raw file):

> Did you consider fixing this in etcd/raft instead of here? We've done a good job thus far pushing complexity into the library itself.

I had and thought it was a bad idea, but now I'm coming back around to fixing it there.

> but I don't think I understand it.

When we have a desc with two learners and want to remove both, we know we're removing learners. In etcd/raft, if the ConfChange to remove two nodes is received by a follower, it can't decide what the active config is so it has to assume that we're removing two voters (but they're really learners). The leader can perform the proper check when putting it in the log, but at the time I thought that it wasn't a good idea to let the app tell raft "run this as a simple change, trust me" because there's a good chance this would be used when it's not actually safe. But now I feel that changing multiple learners at once without going through a joint case is actually useful and any reasonable app interested in joint consensus will probably hit this problem.
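
To make the leader-side check concrete, here is a hypothetical sketch (the function name and the learners parameter are made up, not proposed API): given the set of replica IDs the proposer knows to be learners, it decides whether a ConfChangeV2 touches only learners and can therefore safely run as a simple, non-joint change.

```go
package example

import "go.etcd.io/etcd/raft/raftpb"

// onlyLearnerChanges reports whether every change in cc either adds a
// learner or removes a replica that the proposer knows to be a learner.
// Only the proposer can make this call: a removal in the ConfChange does
// not say whether the target is a voter or a learner, which is why
// etcd/raft conservatively insists on joint consensus for multi-changes.
func onlyLearnerChanges(cc raftpb.ConfChangeV2, learners map[uint64]struct{}) bool {
	for _, ch := range cc.Changes {
		switch ch.Type {
		case raftpb.ConfChangeAddLearnerNode:
			// Adding a learner never touches the voter quorums.
		case raftpb.ConfChangeRemoveNode:
			if _, ok := learners[ch.NodeID]; !ok {
				return false // removing a voter: joint consensus needed
			}
		default:
			// Adding or demoting voters genuinely needs joint consensus.
			return false
		}
	}
	return true
}
```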

Will ping you when this is RFAL.

tbg added a commit to tbg/cockroach that referenced this pull request Aug 30, 2019
We don't support removing multiple learners atomically just yet, though
cockroachdb#40268 will fix this (likely in raft). That PR though is obstructed by cockroachdb#40207
because we'll need that first to be able to bump raft again (assuming we
don't want to fork). Instead of dealing with all that upfront, let's
just not remove multiple learners at once right now so that we can flip
the default for atomic replication changes to on.

If anyone is still trying to remove only learners atomically, they will
fail. However, the replicate queue places a high priority on removing
stray learners whenever it finds them, so this wouldn't be a permanent
problem. Besides, we never add multiple learners at once so it's
difficult to get into that state in the first place.

Without this commit, TestLearnerAdminRelocateRange fails once atomic
replication changes are enabled.

Release note: None
tbg added a commit to tbg/cockroach that referenced this pull request Sep 3, 2019
tbg added a commit to tbg/cockroach that referenced this pull request Sep 4, 2019
craig bot pushed a commit that referenced this pull request Sep 4, 2019
40370: storage: prepare for kv.atomic_replication_changes=true r=nvanbenschoten a=tbg

First three commits are #40363.

----

This PR enables atomic replication changes by default. But most of it is
just dealing with the fallout of doing so:

1. we don't handle removal of multiple learners well at the moment. This will
   be fixed more holistically in #40268, but it's not worth waiting for that
   because it's easy for us to just avoid the problem.
2. tests that carry out splits become quite flaky because at the beginning of
   a split, we transition out of a joint config if we see one, and due to
   the initial upreplication we often do. If we lose the race against the
   replicate queue, the split catches an error for no good reason.
   I took this as an opportunity to refactor the descriptor comparisons
   and to make this specific case a noop, while also making it easier to
   avoid this general class of conflict where it's avoidable in the future.

There are probably some more problems that will only become apparent over time,
but it's quite simple to turn the cluster setting off again and to patch things
up if they do.

Release note (general change): atomic replication changes are now enabled
by default.

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
@tbg tbg closed this Jan 13, 2020