raft: use half-populated joint quorum #10779

Merged
merged 4 commits into from
Jun 20, 2019

tbg
Contributor

@tbg tbg commented Jun 3, 2019

This PR introduces and uses a library to support joint quorum computations. Note that this does not actually use joint quorums yet, except in the very special configuration in which the second majority is the empty set (and automatically agrees with everything), so that the joint quorum behaves like a simple majority quorum. However, introducing the library now means less plumbing in the follow-up PRs that actually introduce joint quorums, and more scrutiny of the basics now.

@tbg tbg added the area/raft label Jun 3, 2019
@tbg tbg requested review from bdarnell, xiang90 and gyuho June 3, 2019 13:28
@gyuho
Contributor

gyuho commented Jun 3, 2019

Can we try again after removing v3 at the top of https://github.com/etcd-io/etcd/blob/master/go.mod? I forgot to delete that part. Sorry for the delay.

@tbg
Contributor Author

tbg commented Jun 3, 2019

CI seems to fail for something unrelated (?): https://travis-ci.com/etcd-io/etcd/jobs/204967927

Contributor

@bdarnell bdarnell left a comment


The CI failure is a 20 minute timeout. A recent passing build took 19 minutes, so it looks like things have just gotten slow.

raft/progress.go (outdated)
@@ -29,7 +29,7 @@ func TestMsgAppFlowControlFull(t *testing.T) {
 	r.becomeCandidate()
 	r.becomeLeader()
 
-	pr2 := r.prs.nodes[2]
+	pr2 := r.prs.prs[2]
Contributor

The repeated name is unfortunate (and I've never been a fan of the name prs). r.tracker.progress[2] looks better to me.

Contributor Author

I agree, that will be the ultimate name. I have some work branched off that changes how the tracker consumes (multiple) conf changes; I'll save the rename for that PR.

Contributor Author

^- I opened a PR that pulls out the tracker.

#10807

raft/quorum/datadriven_test.go
joint = true
if arg.Vals[i] == "zero" {
	if len(arg.Vals) != 1 {
		t.Fatalf("cannot mix 'none' into configuration")
Contributor

s/none/zero/

Contributor Author

Done.

// - (pending, lost)
// - (pending, won)
// - (lost, pending)
// - (lost, won)
Contributor

It takes some thinking to get from this table of states to the logic below. An additional comment would be nice. Or I think we could simplify things:

if r1 == r2 {
	// If they agree, return the agreed result.
	return r1
} else if r1 == VoteLost || r2 == VoteLost {
	// If either config has lost, loss is the only possible outcome.
	return VoteLost
} else {
	// The configs differ without any losses; at least one must be pending.
	return VotePending
}

Contributor Author

Good idea, done.
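
To see how the simplified rule suggested above covers every row of the state table, here is a minimal, self-contained sketch that enumerates all nine combinations. The `VoteResult` type, its constants, and the `jointResult` helper are illustrative stand-ins defined here, not necessarily the names or signatures that ended up in `raft/quorum`.

```go
package main

import "fmt"

// VoteResult is a hypothetical stand-in for the outcome of a vote within a
// single majority config.
type VoteResult uint8

const (
	VotePending VoteResult = iota + 1
	VoteLost
	VoteWon
)

func (v VoteResult) String() string {
	switch v {
	case VotePending:
		return "pending"
	case VoteLost:
		return "lost"
	case VoteWon:
		return "won"
	}
	return "unknown"
}

// jointResult combines the outcomes of the two halves of a joint config,
// following the rule from the review comment (assumed helper name).
func jointResult(r1, r2 VoteResult) VoteResult {
	if r1 == r2 {
		// If they agree, return the agreed result.
		return r1
	}
	if r1 == VoteLost || r2 == VoteLost {
		// If either half has lost, the joint vote is lost.
		return VoteLost
	}
	// The halves differ without any loss, so one of them is still pending.
	return VotePending
}

func main() {
	all := []VoteResult{VotePending, VoteLost, VoteWon}
	for _, r1 := range all {
		for _, r2 := range all {
			fmt.Printf("(%s, %s) -> %s\n", r1, r2, jointResult(r1, r2))
		}
	}
}
```

A joint vote is only won when both halves are won; any single loss decides the outcome, and every remaining mixed case stays pending.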


var votesCast int
{
// Fill the slice from the right with the indexes observed. Any unused
Contributor

Why from the right?

Contributor Author

The slice will be filled up with zeroes and then sorted, so I find it puts my mind on the right track to fill from the right. We could go the other way too. I added a comment.


// IndexLookuper allows looking up a commit index for a given ID of a voter
// from a corresponding MajorityConfig.
type IndexLookuper interface {
Contributor

Bikeshed: IndexAcks?

Contributor Author

I like the "doer" interface naming convention, so we'd have to go full-on IndexAckLookuper. Let me know if you prefer that.

Contributor

How about AckIndexer? I also found that from far away IndexLookuper didn't impart much intuition about the importance of the interface and why it was being used.

Contributor

If we're following the doer naming convention for single-method interfaces, the method should be IndexLookup. But I'm not as committed to that pattern, and would rather avoid the "word" "lookuper" than blindly follow the convention.

None of the combinations are great, but I like "ack" or "vote" as noun here, and "index" or "query" as a verb.

// be committed as more voters report back.
type CommitRange struct {
	Definitely uint64
	Maybe      uint64
Contributor

Why is Maybe interesting?

Contributor Author

In the commit index case we don't care because we always have all voters report in (and even if we didn't have that, we'd probably still not care). But in the voter case this matters because only when Definitely == Maybe is the vote over. Added a comment.

Contributor

Could you add one more sentence past "it's important when an election has failed"? Why is it important when an election has failed?

Contributor Author

See the above comment (which maybe wasn't visible to you when you wrote the review). I also added similar verbiage inline.

Contributor

If Maybe is just for votes, it feels like we're really stretching the commit index abstraction. We map yes/no votes to numbers so we treat votes as commit indexes, but then we have to add extra complexity to the commit index handling. Maybe it's better to just implement voting and index committing separately.

Contributor Author

If Maybe is just for votes, it feels like we're really stretching the commit index abstraction.

Are we stretching the abstraction? It just so happens that the one caller that wants a committed index always passes acks (even if they might be zero) for all members (and so Definitely == Maybe is guaranteed). The library is more "powerful" than it needs to be for the commit index case, but that's OK. Mapping the process of voting into a committed index computation is straightforward (at least as I perceive it).

Btw, to give a really clean example of where this matters, consider a config with two nodes in which one replica has voted "no" and the other one hasn't voted yet. The result is VoteLost, but nobody has reached quorum. What matters is that "yes" cannot reach quorum any more. So we need some code that tells you that.

Btw, sorry for going back and forth on this and not just "doing it". These are small things but I've found in the past that there's something to learn from them. In that vein, if you're still uncomfortable with this, please give some more detail on why.
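
The two-node example above can be checked with a short sketch of a standalone vote computation. Everything here is an assumption made for illustration (the `voteResult` helper, the map-based vote record, the constant names); it only shows that a vote is lost as soon as "yes" can no longer reach quorum, even though nobody has reached quorum.

```go
package main

import "fmt"

// VoteResult is a hypothetical stand-in for the outcome of a vote.
type VoteResult uint8

const (
	VotePending VoteResult = iota + 1
	VoteLost
	VoteWon
)

// voteResult is an assumed helper. votes maps a voter ID to true ("yes") or
// false ("no"); a voter without an entry has not voted yet. Entries for IDs
// that are not in voters are ignored.
func voteResult(voters []uint64, votes map[uint64]bool) VoteResult {
	if len(voters) == 0 {
		// An empty config wins all votes.
		return VoteWon
	}
	var yes, missing int
	for _, id := range voters {
		v, ok := votes[id]
		if !ok {
			missing++
			continue
		}
		if v {
			yes++
		}
	}
	q := len(voters)/2 + 1
	if yes >= q {
		return VoteWon
	}
	if yes+missing >= q {
		// "Yes" can still reach quorum once the missing votes come in.
		return VotePending
	}
	// Even if every outstanding voter said "yes", quorum is out of reach.
	return VoteLost
}

func main() {
	voters := []uint64{1, 2}
	// One replica voted "no", the other has not voted yet.
	fmt.Println(voteResult(voters, map[uint64]bool{1: false}) == VoteLost) // true
	// One replica voted "yes", the other has not voted yet.
	fmt.Println(voteResult(voters, map[uint64]bool{1: true}) == VotePending) // true
}
```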

Contributor

It just so happens that the one caller that wants a committed index always passes acks (even if they might be zero) for all members (and so Definitely == Maybe is guaranteed)

This is where the semantics get muddled. You're treating the passed-in ack as a certainty and getting Definitely == Maybe, while I would say that the passed-in acks are lower bounds (with uncertainty in one direction), so Maybe is always maxint (I didn't realize that you'd get Maybe == Definitely until this comment). Indexes can always increase; there's never a case in which an entry becomes impossible to commit (except when we lose our leadership and have to step down to follower).

Votes work differently: uncertainty is a third state and you can't transition from one certain state to another. I worry that even though there are some similarities between the append and vote cases, trying to handle them both in one interface invites confusion. I know there are tests and no one's going to change this code lightly, but the way it's organized it seems like it would be attractive to try and change the implementation in a way that only makes sense in one use case.

Contributor Author

@tbg tbg Jun 19, 2019

The confusion around CommitRange.Maybe and what it takes into account is understandable. I don't share it but both you and Nathan have had it so it's definitely objectively confusing. Thanks for pushing me on this, I've updated the branch to add a simple implementation for voting and removed CommitRange and all of its vestiges. We lose a little bit of test coverage but it was abundant to begin with and still is, so I'm not worried about that.

@codecov-io

codecov-io commented Jun 7, 2019

Codecov Report

Merging #10779 into master will decrease coverage by <.01%.
The diff coverage is 92%.

@@            Coverage Diff             @@
##           master   #10779      +/-   ##
==========================================
- Coverage      63%      63%   -0.01%     
==========================================
  Files         391      395       +4     
  Lines       37327    37439     +112     
==========================================
+ Hits        23518    23587      +69     
- Misses      12216    12274      +58     
+ Partials     1593     1578      -15
Impacted Files                         Coverage Δ
raft/quorum/quorum.go                  100% <100%> (ø)
raft/raft.go                           91.29% <100%> (-0.7%) ⬇️
raft/quorum/joint.go                   100% <100%> (ø)
raft/quorum/majority.go                100% <100%> (ø)
raft/read_only.go                      88.09% <100%> (ø) ⬆️
raft/quorum/voteresult_string.go       30% <30%> (ø)
raft/progress.go                       92.59% <85.41%> (-1.13%) ⬇️
auth/options.go                        67.5% <0%> (-25%) ⬇️
clientv3/balancer/grpc1.7-health.go    15.98% <0%> (-17.16%) ⬇️
auth/range_perm_cache.go               45.56% <0%> (-15.19%) ⬇️
... and 31 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9ff7628...57def66.

@tbg
Contributor Author

tbg commented Jun 8, 2019

@xiang90 want me to wait for you to review? I have follow-up work based on this so I have some interest in getting this in sooner rather than later.

@tbg tbg force-pushed the jointq-pr branch 3 times, most recently from e558363 to 7b8e2c2 Compare June 8, 2019 20:23
@tbg
Contributor Author

tbg commented Jun 9, 2019

TestBalancerUnderBlackholeKeepAliveWatch looks pretty flaky. Does that ring a bell @gyuho?

On master,

$ go test ./clientv3/integration/ -v -count 10 -failfast -run TestBalancerUnderBlackholeKeepAliveWatch
=== RUN   TestBalancerUnderBlackholeKeepAliveWatch
--- PASS: TestBalancerUnderBlackholeKeepAliveWatch (5.81s)
=== RUN   TestBalancerUnderBlackholeKeepAliveWatch
--- PASS: TestBalancerUnderBlackholeKeepAliveWatch (5.69s)
=== RUN   TestBalancerUnderBlackholeKeepAliveWatch
--- PASS: TestBalancerUnderBlackholeKeepAliveWatch (5.63s)
=== RUN   TestBalancerUnderBlackholeKeepAliveWatch
--- FAIL: TestBalancerUnderBlackholeKeepAliveWatch (6.17s)
    black_hole_test.go:83: took too long to receive watch events
FAIL
FAIL	go.etcd.io/etcd/clientv3/integration	23.319s

// denotes an omission (i.e. no information for this voter); this is
// different from 0. For example,
//
// cfg=(1,2) cfgj=(2,3,4) idxs=(_,5,_,7) initializes the idx for 2
Contributor

nit: for member 2 to 5 and that for member 4 to 7

Contributor Author

Done.

joint = true
if arg.Vals[i] == "zero" {
	if len(arg.Vals) != 1 {
		t.Fatalf("cannot mix 'none' into configuration")
Contributor

s/none/zero?

Contributor Author

Already done.

}

// CommittedIndex returns a CommitRange for the given joint quorum. An index is
// jointly committed if it is committed on both constituent majorities.
Contributor

committed in?

Contributor Author

I'll take your word for it.


@@ -0,0 +1,476 @@
## No difference between a simple majority quorum and a simple majority quorum
Contributor

nit: double ##

Contributor Author

How about AckIndexer? I also found that from far away IndexLookuper didn't impart much intuition about the importance of the interface and why it was being used.

Can't reply to the original comment for some reason but here we go:

Honestly I can't relate to AckIndexer much at all. What's the ack? Why is an ack being indexed? I'm not going to say that I love IndexLookuper but at least it tells me that it "looks up an index" which hopefully suggests that you put $something in and get an "index" out.

Since I want to play with rafttoy before merging this, let's bikeshed some more... VoterToIndexMapper? Doesn't quite roll off the tongue but hopefully actually explains it.

idx
> 13 (id=1)
x> 100 (id=2)
13
Contributor

Why isn't this 13-100? I think I'm still getting caught up on CommitRange.Maybe. Is it not possible for a peer to vote on a higher index once it has already voted for a lower one?

Contributor Author

Yep. I added a comment and example on Maybe.

0-∞

# 1 reports 100.
committed cfg=(1,2) cfgj=(2) idx=(100)
Contributor

I'd consider removing idx=(100) as shorthand for idx=(100,_). This is all abstract enough that I fear people will need to re-read the parser each time they want to understand these tests, so it can't hurt to be as explicit and structured as possible.

Contributor Author

Good point. I wrote these first tests before introducing _. Added code to the test harness to catch these mismatches and went to fix them all up.

@@ -0,0 +1,163 @@
# Empty joint config wins all votes. This isn't used in production.
vote cfgj=zero
Contributor

Is there a semantic difference in these tests between not including cfg= for the first majority quorum and setting cfgj=zero for the second?

Contributor Author

Yes, zero is a marker value precisely to tell the test harness to use a joint quorum with an empty half. I added comments at the top of both tests to alert the reader to that fact.

return CommitRange{Definitely: math.MaxUint64, Maybe: math.MaxUint64}
}

srt := slicePool.Get().([]uint64)
Contributor

I'm pretty interested in whether any of these performance-related changes show up in rafttoy. If so, I don't think it would be too bad to cache the slice in the MajorityConfig struct and take it as a pointer receiver to this method. This would also allow us to avoid the reflection with sort.Slice without then requiring an allocation to box the slice through an interface.

Contributor Author

I was unable to find a performance regression in the numbers, but I did put in some extra work to actually remove the allocation here, PTAL at the updated first commit.

I wrote the code to stash a slice on MajorityConfig, but it ended up sort of nasty in that it cluttered the way in which one interacts with a MajorityConfig. Instead, I went ahead and implemented sorting on the stack. This won't allocate for <= 7 voters (as demonstrated by the benchmark; I'm happy to lower that to 5 if we want to save some stack space).

@tbg tbg force-pushed the jointq-pr branch 2 times, most recently from f45b1b8 to ead5bdb Compare June 10, 2019 04:57
@gyuho
Contributor

gyuho commented Jun 11, 2019

TestBalancerUnderBlackholeKeepAliveWatch looks pretty flaky. Does that ring a bell @gyuho?

On master,

$ go test ./clientv3/integration/ -v -count 10 -failfast -run TestBalancerUnderBlackholeKeepAliveWatch
=== RUN   TestBalancerUnderBlackholeKeepAliveWatch
--- PASS: TestBalancerUnderBlackholeKeepAliveWatch (5.81s)
=== RUN   TestBalancerUnderBlackholeKeepAliveWatch
--- PASS: TestBalancerUnderBlackholeKeepAliveWatch (5.69s)
=== RUN   TestBalancerUnderBlackholeKeepAliveWatch
--- PASS: TestBalancerUnderBlackholeKeepAliveWatch (5.63s)
=== RUN   TestBalancerUnderBlackholeKeepAliveWatch
--- FAIL: TestBalancerUnderBlackholeKeepAliveWatch (6.17s)
    black_hole_test.go:83: took too long to receive watch events
FAIL
FAIL	go.etcd.io/etcd/clientv3/integration	23.319s

@tbg It has been flaky. Please ignore while we investigate.

@tbg
Contributor Author

tbg commented Jun 14, 2019

I renamed the (now) AckedIndexer interface and added some more commentary on the CommitRange struct. RFAL.

@tbg
Contributor Author

tbg commented Jun 19, 2019

RFAL @bdarnell. You only need to look at the additional fixup commits (which I'll patch into their respective base commits when you've looked at them). The first one is the salient one, the rest is just cleanup.

tbg added 4 commits June 19, 2019 14:19
The quorum package contains logic to reason about committed indexes as
well as vote outcomes for both majority and joint quorums. The package
is oblivious to the existence of learner replicas.

The plan is to hook this up to etcd/raft in subsequent commits.

Instead of having disjoint mappings of ID to *Progress for voters and
learners, use a map[id]struct{} for each and share a map of *Progress
among them.

This is easier to handle when joint quorums are introduced, at which
point a node may be a voting member of two quorums.

To ease a future transition into joint quorums, this commit replaces the
previous "ad-hoc" majority-based quorum and vote computations with those
introduced in the `raft/quorum` package.

More specifically, the progressTracker now uses a quorum.JointConfig for
which the "second" majority quorum is always empty; in this case the
quorum behaves like the one quorum.MajorityConfig that is actually
present. Or, more briefly, this change is a no-op, but it will take the
busywork out of actually starting to make use of joint quorums in the
future.

On a side note, I suspect that this might've fixed a bug regarding the
read index though I haven't been able to explicitly come up with a
counter-example. The problem was that the acks collected for the read
index weren't taking into account membership changes, so they'd run the
danger of using acks from nodes since removed to claim that a quorum of
acks had been received. There's a chance that there isn't a
counter-example (the only guarantee extracted from the "quorum" is that
there isn't another leader, but even if there's another leader all that
matters is that that leader doesn't have a divergent history from the
stale leader in the hypothetical counter-example), but either way there
is morally a bug here that is now fixed because VoteCommitted doesn't
care about votes from members that are not voters known to the currently
active configuration.
@tbg
Contributor Author

tbg commented Jun 20, 2019

TFTR! I force-pushed the squashed commit which I verified has an empty diff with the previous HEAD.

@tbg tbg merged commit 755aab6 into etcd-io:master Jun 20, 2019
@tbg tbg deleted the jointq-pr branch June 20, 2019 20:57

6 participants