Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

check if server is in configuration when receiving a voteRequest #526

Merged
merged 5 commits into from
Oct 3, 2022

Conversation

dhiaayachi
Copy link
Contributor

@dhiaayachi dhiaayachi commented Sep 14, 2022

This PR change the check introduced in #477.

Based on the bug reported in #524, when a non voter join the cluster with a higher term (This can happen in the case of a returning node to the cluster after a restart, that was pruned by autopilot) it causes the cluster to become unstable.

The scenario where this happen is the following:

  • Stable cluster with node0, node1,node2. All nodes are voters
  • node0 is partitioned from the cluster
  • node0 is removed from the cluster by auto-pilote CleanupDeadServersIt can't see the change (its configuration still have node0, node1, node2) and it still think it's a Voter
  • the remaning cluster config is (node1,node2) a new leader is elected on term c1
  • node0 transition to candidate and run elections until reaching c2>c1, this happen because it still think it's part of the cluster and it's a voter.
  • partion is gone, the node try to rejoin the cluster (For Consul, this would happen through joining the serf pool, which allow the current cluster leader to add it back to the cluster)
  • node0 is added by default as nonVoter, until raft see it caught up enough to transition to voter
  • cluster replicate to node0 to try to make it aware of the new configuration and catch it up by performing an appendEntries CMD
  • node0 reject the appendEntries CMD and the cluster start a new election
  • a new leader is elected (not node0)
  • cluster replicate to node0 to try to make it aware of the new configuration and catch it up by performing an appendEntries CMD
  • node0 reject the appendEntries CMD and the cluster start a new election
  • ...

At the same time, node0 is still running as candidate (it still think it's part of the cluster and it's a voter) but, because of the fix introduced in #477, the request will be rejected which have the effect of increasing node0 term and make this unstable state permanent as node0 term is inflating and the cluster term will never catch-up to it.

The fix introduced in this PR break this loop, by allowing node0 to request a vote as a non voter which have the effect of levelling the term and let the cluster move forward. This is not ideal as it destabilize the cluster while node0 is guaranteed to not win the election.

An ideal fix would be to add pre-vote, #474 in conjunction to this to avoid term inflation and keep the cluster stable.

raft_test.go Outdated
t.Fatalf("err: %v", err)
}

//set that follower term to higher term to simulate a partitioning
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to simulate a partitioning? Isn't that what c.Disconnect is doing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes disconnect would prohibit communication from the specified node to any node and from any node to that node.

raft_test.go Outdated Show resolved Hide resolved
… the bug manifest

Co-authored-by: Nick Cabatoff <ncabatoff@hashicorp.com>
@szechuen
Copy link

Thanks for the new PR!

I'd still recommend reordering the term check as proposed in #525 (comment) instead of modifying the voter/server check, though, for 2 reasons:

  1. Raft should not be dependent on an external system (Serf) to add back the disconnected server as a non-voter for the cluster to make progress. Without which, inConfiguration will continue to be false and trap us in the livelock.
  2. The cluster needs to make progress but it shouldn't have to grant vote to a non-voter in order to do so. As long as its higher term count is matched, the remaining voters can still win and maintain a stable leadership which the disconnected server would follow. Granting votes to a server that has been explicitly added as a non-voter would in fact be an unexpected behavior which defeats the purpose of the ServerStabilizationTime config provided by raft-autopilot.

Any thoughts?

@dhiaayachi
Copy link
Contributor Author

Hey @szechuen
I'm not sure I fully understand the following statement:

Raft should not be dependent on an external system (Serf) to add back the disconnected server as a non-voter for the cluster to make progress. Without which, inConfiguration will continue to be false and trap us in the livelock.

Only a leader can add a server to the configuration and is responsible to replicate that configuration to other nodes. In that case I think it would not be possible to have the scenario you describe here happening, as this check only verify that a request for vote come from a server already in the configuration, which should be always the case from the perspective of the leader. Am I missing something?

@szechuen
Copy link

Hi @dhiaayachi, sorry for the late reply.

What I meant by (1) was that we need to consider the behavior when the disconnected server is removed but not subsequently added back to the cluster as a non-voter (by Serf). This is not hypothetical since Vault does not run Serf and only adds a server as part of its initial bootstrap: https://github.com/hashicorp/vault/blob/a2d818bf0ad59bac9a44b2e2a62bc2ae4bad49dd/vault/logical_system_raft.go#L351-L356

However, I'm mistaken that this would result in a livelock since the leader would not try to replicate to the removed server, and hence does not trigger the higher term step-down. On the contrary, only updating our term in requestVote if the requesting server is in configuration is the better behavior so that the cluster does not get repeatedly disrupted, since the disconnected server will keep restarting elections because we don't replicate to it.

The scenario can be reproduced by omitting https://github.com/hashicorp/raft/pull/526/files#diff-0022e96fabdba704c82038d060a73c969456f9d3a0ae768652d03c74e9fa36c1R2679-R2683 from the test. I think the best thing we can do is to maintain a stable leadership with the remaining servers and expect a manual rejoin of the disconnected server in that case.

Point (2) is still valid though. What I'm proposing is thus:

candidateID := ServerID(req.ID)

if len(r.configurations.latest.Servers) > 0 && !inConfiguration(r.configurations.latest, candidateID) {
  r.logger.Warn("ignoring vote request since node is not in configuration", "from", candidate)
  return
}

if req.Term < r.getCurrentTerm() {
  return
}
if req.Term > r.getCurrentTerm() {
  r.logger.Debug("lost leadership because received a requestVote with a newer term")
  r.setState(Follower)
  r.setCurrentTerm(req.Term)
  resp.Term = req.Term
}

if len(r.configurations.latest.Servers) > 0 && !hasVote(r.configurations.latest, candidateID) {
  r.logger.Warn("rejecting vote request since node is not a voter", "from", candidate)
  return
}
if leaderAddr, leaderID := r.LeaderWithID(); leaderAddr != "" && leaderAddr != candidate && !req.LeadershipTransfer {
  r.logger.Warn("rejecting vote request since we have a leader", "from", candidate, "leader", leaderAddr, "leader-id", string(leaderID))
  return
}

@dhiaayachi
Copy link
Contributor Author

@szechuen sorry for the delay, I was not able to go through your answer earlier.

If understand your proposal right, you are suggesting to:

  1. if node is not in config reject the vote request , which actually what is introduced in this PR
  2. If a node have a lower term reject the vote request
  3. If a node have higher term we would step down and accept the vote to permit to the cluster to make progress and solve the issue in hand and that would be for voters or nonVoters.
  4. if the node is nonVoter reject the vote (that leave us with only when we have equal term, as lower term would have been rejected in 2. and higher term would have been accepted in 3)
  5. if we have a leader refuse the Vote req

I think that would work. That said, I think based on the checks that we already have in place that would only cover differently the one case where we receive a Vote from a nonVoter with a term = current term. Am I missing something here?

If that's the case I don't think that's a realistic case, as if a node is nonVoter from the perspective of the leader and is on the same term, that mean that node know that it's a nonVoter and would never send a request for vote in the first place. That said, I think it's a no harm to add that extra check to protect from the case where we have a node with the same term but with completely different history.

What I meant by (1) was that we need to consider the behaviour when the disconnected server is removed but not subsequently added back to the cluster as a non-voter (by Serf). This is not hypothetical since Vault does not run Serf and only adds a server as part of its initial bootstrap

I think the right behaviour from Raft perspective is the one you observer, that mean that this node stay excluded from the cluster while the cluster stay stable.

IMHO raft should not manage adding or removing nodes automatically, those changes need to be done by calling the API (AddVoter, AddNonVoter, RemoveServer) and any non leader server that is not part of the stable configuration, managed by the leader, should not be able to overwrite that configuration by hijacking the leadership.
If someone want to implement a way of automatically adding and removing servers to raft it should be implemented separately by calling the API. For example that mechanism in Consul is managed by 2 components Serf and Autopilot (Serf would trigger adding the servers, Autopilot would remove them if detected dead). I'm not sure for Vault as I'm not familiar with the code base, but I would say that Vault servers need to rely on an external mechanism to add back dead servers cleaned up by raft-autopilot (maybe raft-autopilot?).

@szechuen
Copy link

If understand your proposal right, you are suggesting to:

  1. if node is not in config reject the vote request , which actually what is introduced in this PR
  2. If a node have a lower term reject the vote request
  3. If a node have higher term we would step down and accept the vote to permit to the cluster to make progress and solve the issue in hand and that would be for voters or nonVoters.
  4. if the node is nonVoter reject the vote (that leave us with only when we have equal term, as lower term would have been rejected in 2. and higher term would have been accepted in 3)
  5. if we have a leader refuse the Vote req

Yep this is mostly accurate, with the caveat that we are not accepting the vote immediately after stepping down in (3). We would only vote for the node if it is a voter.

I think that would work. That said, I think based on the checks that we already have in place that would only cover differently the one case where we receive a Vote from a nonVoter with a term = current term. Am I missing something here?

The difference here is that we would step down for a higher-term non-voter, but not vote for it.

I think the right behaviour from Raft perspective is the one you observer, that mean that this node stay excluded from the cluster while the cluster stay stable.

IMHO raft should not manage adding or removing nodes automatically, those changes need to be done by calling the API (AddVoter, AddNonVoter, RemoveServer) and any non leader server that is not part of the stable configuration, managed by the leader, should not be able to overwrite that configuration by hijacking the leadership.
If someone want to implement a way of automatically adding and removing servers to raft it should be implemented separately by calling the API. For example that mechanism in Consul is managed by 2 components Serf and Autopilot (Serf would trigger adding the servers, Autopilot would remove them if detected dead). I'm not sure for Vault as I'm not familiar with the code base, but I would say that Vault servers need to rely on an external mechanism to add back dead servers cleaned up by raft-autopilot (maybe raft-autopilot?).

Yep fully agreed. My point is that we need to consider and make sure the remaining cluster still ends up in a stable state when that happens.

@dhiaayachi
Copy link
Contributor Author

@szechuen Thank you for the clarification, I can see the nuance now and it seems reasonable to me.
That said, I think the suggested implementation would break the case where a leader is already in place because it would make the node step down and increase term even if we have a stable leader.
I tried to implement a slightly different logic based on your input, see my last commit.

@szechuen
Copy link

Yeah I think that makes sense. Even though the leader will eventually step down anyway once it starts replicating to the higher-term node (i.e. no functional difference here), this is the right logic in preparation of the pre-vote optimization where nodes should not be disrupted by a higher-term vote request if there's a stable leader.

Copy link
Member

@banks banks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @dhiaayachi looks good to me although I have to admit despite following these PRs I still had to read through your PR description again to remember why this was needed and more correct than before!

My only inline feedback is to consider the comment on the non-obvious bit of code being added here to be even more explicit in acknowledging that this seems counter-intuitive but is there for a good reason!

Comment on lines +1597 to +1599
// if we get a request for vote from a nonVoter and the request term is higher,
// step down and update term, but reject the vote request
// This could happen when a node, previously voter, is converted to non-voter
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it's worth a note or link back to this issue here? I suspect reading this I'd be puzzled about why we allow a non-voter to "disrupt" leaders at all because the nuance involved in this issue is high! Ideally people would view blame before changing this back to ignore votes from non-voters but we could potentially use this comment to flag the non-intuitive behaviour is for a good reason?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants