-
Notifications
You must be signed in to change notification settings - Fork 9.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
raft: use half-populated joint quorum #10779
Conversation
Can we try again removing v3 at the top of the line https://github.com/etcd-io/etcd/blob/master/go.mod? I forgot to delete that part. Sorry for delay. |
CI seems to fail for something unrelated (?): https://travis-ci.com/etcd-io/etcd/jobs/204967927 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The CI failure is a 20 minute timeout. A recent passing build took 19 minutes, so it looks like things have just gotten slow.
@@ -29,7 +29,7 @@ func TestMsgAppFlowControlFull(t *testing.T) { | |||
r.becomeCandidate() | |||
r.becomeLeader() | |||
|
|||
pr2 := r.prs.nodes[2] | |||
pr2 := r.prs.prs[2] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The repeated name is unfortunate (and I've never been a fan of the name prs
). r.tracker.progress[2]
looks better to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, that will be the ultimate name. I have some work branched off that changes how the tracker consumes (multiple) conf changes, I'll save the rename for that PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
^- I opened a PR that pulls out the tracker.
raft/quorum/datadriven_test.go
Outdated
joint = true | ||
if arg.Vals[i] == "zero" { | ||
if len(arg.Vals) != 1 { | ||
t.Fatalf("cannot mix 'none' into configuration") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/none/zero/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
raft/quorum/joint.go
Outdated
// - (pending, lost) | ||
// - (pending, won) | ||
// - (lost, pending) | ||
// - (lost, won) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It takes some thinking to get from this table of states to the logic below. An additional comment would be nice. Or I think we could simplify things:
if r1 == r2 {
// If they agree, return the agreed result.
return r1
} else if r1 == VoteLost || r2 == VoteLost {
// If either config has lost, loss is the only possible outcome
return VoteLost
} else {
// The configs differ without any losses; at least one must be pending.
return VotePending
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, done.
raft/quorum/majority.go
Outdated
|
||
var votesCast int | ||
{ | ||
// Fill the slice from the right with the indexes observed. Any unused |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why from the right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The slice will be filled up with zeroes and then sorted, so I find it puts my mind on the right track to fill from the right. We could go the other way too. I added a comment
raft/quorum/quorum.go
Outdated
|
||
// IndexLookuper allows looking up a commit index for a given ID of a voter | ||
// from a corresponding MajorityConfig. | ||
type IndexLookuper interface { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bikeshed: IndexAcks
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the "doer" interface naming convention, so we'd have to go full-on IndexAckLookuper
. Let me know if you prefer that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about AckIndexer
? I also found that from far away IndexLookuper
didn't impart much intuition about the importance of the interface and why it was being used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're following the doer
naming convention for single-method interfaces, the method should be IndexLookup
. But I'm not as committed to that pattern, and would rather avoid the "word" "lookuper" than blindly follow the convention.
None of the combinations are great, but I like "ack" or "vote" as noun here, and "index" or "query" as a verb.
raft/quorum/quorum.go
Outdated
// be committed as more voters report back. | ||
type CommitRange struct { | ||
Definitely uint64 | ||
Maybe uint64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is Maybe
interesting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the commit index case we don't care because we always have all voters report in (and even if we didn't have that, we'd probably still not care). But in the voter case this matters because only when Definitely == Maybe
is the vote over. Added a comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add one more sentence past "it's important when an election has failed"? Why is it important when an election has failed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the above comment (which maybe wasn't visible to you when you wrote the review). I also added similar verbiage inline.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If Maybe
is just for votes, it feels like we're really stretching the commit index abstraction. We map yes/no votes to numbers so we treat votes as commit indexes, but then we have to add extra complexity to the commit index handling. Maybe it's better to just implement voting and index committing separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If Maybe is just for votes, it feels like we're really stretching the commit index abstraction.
Are we stretching the abstraction? It just so happens that the one caller that wants a committed index always passes acks (even if they might be zero) for all members (and so Definitely == Maybe
is guaranteed). The library is more "powerful" than it needs to be for the commit index case, but that's OK. Mapping the process of voting into a committed index computation is straightforward (at least as I perceive it).
Btw, to give a really clean example of where this matters, consider a config with two nodes in which one replica has voted "no" and the other one hasn't voted yet. The result is VoteLost
, but nobody has reached quorum. What matters is that "yes" cannot reach quorum any more. So we need some code that tells you that.
Btw, sorry for going back and forth on this and not just "doing it". These are small things but I've found in the past that there's something to learn from them. In that vein, if you're still uncomfortable with this, please give some more detail on why.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It just so happens that the one caller that wants a committed index always passes acks (even if they might be zero) for all members (and so Definitely == Maybe is guaranteed)
This is where the semantics get muddled. You're treating the passed-in ack as a certainty and getting Definitely == Maybe, while I would say that the passed-in acks are lower bounds (with uncertainty in one direction), so Maybe is always maxint (I didn't realize that you'd get Maybe == Definitely until this comment). Indexes can always increase; there's never a case in which an entry becomes impossible to commit (except when we lose our leadership and have to step down to follower).
Votes work differently: uncertainty is a third state and you can't transition from one certain state to another. I worry that even though there are some similarities between the append and vote cases, trying to handle them both in one interface invites confusion. I know there are tests and no one's going to change this code lightly, but the way it's organized it seems like it would be attractive to try and change the implementation in a way that only makes sense in one use case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The confusion around CommittRange.Maybe
and what it takes into account is understandable. I don't share it but both you and Nathan have had it so it's definitely objectively confusing. Thanks for pushing me on this, I've updated the branch to add a simple implementation for voting and removed CommitRange and all of its vestiges. We lose a little bit of test coverage but it was abundant to begin with and still is, so I'm not worried about that.
Codecov Report
@@ Coverage Diff @@
## master #10779 +/- ##
==========================================
- Coverage 63% 63% -0.01%
==========================================
Files 391 395 +4
Lines 37327 37439 +112
==========================================
+ Hits 23518 23587 +69
- Misses 12216 12274 +58
+ Partials 1593 1578 -15
Continue to review full report at Codecov.
|
@xiang90 want to me to wait for you to review? I have follow-up work based on this so I have some interest in getting this in sooner rather than later. |
e558363
to
7b8e2c2
Compare
TestBalancerUnderBlackholeKeepAliveWatch looks pretty flaky. Does that ring a bell @gyuho? On master,
|
raft/quorum/datadriven_test.go
Outdated
// denotes an omission (i.e. no information for this voter); this is | ||
// different from 0. For example, | ||
// | ||
// cfg=(1,2) cfgj=(2,3,4) idxs=(_,5,_,7) initializes the idx for 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: for member 2 to 5 and that for member 4 to 7
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
raft/quorum/datadriven_test.go
Outdated
joint = true | ||
if arg.Vals[i] == "zero" { | ||
if len(arg.Vals) != 1 { | ||
t.Fatalf("cannot mix 'none' into configuration") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/none/zero
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Already done.
raft/quorum/joint.go
Outdated
} | ||
|
||
// CommittedIndex returns a CommitRange for the given joint quorum. An index is | ||
// jointly committed if it is committed on both constituent majorities. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
committed in?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll take your word for it.
raft/quorum/quorum.go
Outdated
|
||
// IndexLookuper allows looking up a commit index for a given ID of a voter | ||
// from a corresponding MajorityConfig. | ||
type IndexLookuper interface { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about AckIndexer
? I also found that from far away IndexLookuper
didn't impart much intuition about the importance of the interface and why it was being used.
raft/quorum/quorum.go
Outdated
// be committed as more voters report back. | ||
type CommitRange struct { | ||
Definitely uint64 | ||
Maybe uint64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add one more sentence past "it's important when an election has failed"? Why is it important when an election has failed?
@@ -0,0 +1,476 @@ | |||
## No difference between a simple majority quorum and a simple majority quorum |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: double ##
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about AckIndexer? I also found that from far away IndexLookuper didn't impart much intuition about the importance of the interface and why it was being used.
Can't reply to the original comment for some reason but here we go:
Honestly I can't relate to AckIndexer much at all. What's the ack? Why is an ack being indexed? I'm not going to say that I love IndexLookuper
but at least it tells me that it "looks up an index" which hopefully suggests that you put $something in and get an "index" out.
Since I want to play with rafttoy before merging this, let's bikeshed some more... VoterToIndexMapper
? Doesn't quite roll of the tongue but hopefully actually explains it.
idx | ||
> 13 (id=1) | ||
x> 100 (id=2) | ||
13 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why isn't this 13-100
? I think I'm still getting caught up on CommitRange.Maybe
. Is it not possible for a peer to vote on a higher index once it has already voted for a lower one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep. I added a comment and example on Maybe
.
0-∞ | ||
|
||
# 1 reports 100. | ||
committed cfg=(1,2) cfgj=(2) idx=(100) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd consider removing idx=(100)
as shorthand for idx=(100,_)
. This is all abstract enough that I fear people will need to re-read the parser each time they want to understand these tests, so it can't hurt to be as explicit and structured as possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I wrote these first tests before introducing _
. Added code to the test harness to catch these mismatches and went to fix them all up.
@@ -0,0 +1,163 @@ | |||
# Empty joint config wins all votes. This isn't used in production. | |||
vote cfgj=zero |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a semantic difference in these tests between not including cfg=
for the first majority quorum and setting cfgj=zero
for the second?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, zero is a marker value precisely to tell the test harness to use a joint quorum with an empty half. I added comments at the top of both tests to alert the reader to that fact.
raft/quorum/majority.go
Outdated
return CommitRange{Definitely: math.MaxUint64, Maybe: math.MaxUint64} | ||
} | ||
|
||
srt := slicePool.Get().([]uint64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty interested in whether any of these performance-related changes show up in rafttoy
. If so, I don't think it would be too bad to cache the slice in the MajorityConfig
struct and take it as a pointer receiver to this method. This would also allow us to avoid the reflection with sort.Slice
without then requiring an allocation to box the slice through an interface.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was unable to find a performance regression in the numbers, but I did put in some extra work to actually remove the allocation here, PTAL at the updated first commit.
I wrote the code to stash a slice on MajorityConfig but it ended up sort of nasty in that it cluttered the way in which one interacts with a MajorityConfig. Instead, I went ahead and implemented sorting on the stack. This won't allocate for <= 7 voters (as demonstrated by the benchmark, I'm happy to lower that to 5 if we want to save some stack space).
f45b1b8
to
ead5bdb
Compare
@tbg It has been flaky. Please ignore while we investigate. |
I renamed the (now) |
RFAL @bdarnell. You only need to look at the additional fixup commits (which I'll patch into their respective base commits when you've looked at them). The first one is the salient one, the rest is just cleanup. |
The quorum package contains logic to reason about committed indexes as well as vote outcomes for both majority and joint quorums. The package is oblivious to the existence of learner replicas. The plan is to hook this up to etcd/raft in subsequent commits.
Instead of having disjoint mappings of ID to *Progress for voters and learners, use a map[id]struct{} for each and share a map of *Progress among them. This is easier to handle when joint quorums are introduced, at which point a node may be a voting member of two quorums.
To ease a future transition into joint quorums, this commit removes the previous "ad-hoc" majority-based quorum and vote computations with that introduced in the `raft/quorum` package. More specifically, the progressTracker now uses a quorum.JointConfig for which the "second" majority quorum is always empty; in this case the quorum behaves like the one quorum.MajorityConfig that is actually present. Or, more briefly, this change is a no-op, but it will take the busywork out of actually starting to make use of joint quorums in the future. On a side node, I suspect that this might've fixed a bug regarding the read index though I haven't been able to explicitly come up with a counter-example. The problem was that the acks collected for the read index weren't taking into account membership changes, so they'd run the danger of using acks from nodes since removed to claim that a quorum of acks had been received. There's a chance that there isn't a counter-example (the only guarantee extracted from the "quorum" is that there isn't another leader, but even if there's another leader all that matters is that that leader doesn't have a divergent history from the stale leader in the hypothetical counter-example), but either way there is morally a bug here that is now fixed because VoteCommitted doesn't care about votes from members that are not voters known to the currently active configuration.
TFTR! I force-pushed the squashed commit which I verified has an empty diff with the previous HEAD. |
This PR introduces and uses a library to support joint quorum computations. Note that this does not actually use joint quorums except the very special configuration in which the second majority is the empty set (and automatically agrees with everything), which then agrees with just a simple majority quorum. However, introducing the library now means that there will be less plumbing in follow-up PRs that actually introduce the use of joint quorums, and more scrutiny on the basics now.