server: transfer leadership in case of error #12293

Merged
merged 5 commits on Mar 17, 2022

Conversation

Contributor

@lgfa29 lgfa29 commented Mar 14, 2022

When a Nomad server becomes the Raft leader, it must perform several
actions defined in the establishLeadership function. If any of these
actions fail, Raft will think the node is the leader, but it will not
actually be able to act as a Nomad leader.

In this scenario, leadership must be revoked and transferred to another
server if possible, or the node should retry the establishLeadership
steps.

Closes #8470
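
To make the flow above concrete, here is a rough, self-contained sketch of how a failed establishLeadership could be handled. The Server stub, the function signatures, and the log messages are assumptions for illustration, not the exact code in this PR:

    package nomad

    import (
        "github.com/hashicorp/go-hclog"
        "github.com/hashicorp/raft"
    )

    // Minimal stand-in for Nomad's real Server so the sketch compiles on its own.
    type Server struct {
        logger hclog.Logger
        raft   *raft.Raft
    }

    // Stubs for the leadership setup/teardown steps discussed in this PR, and for
    // the transfer helper whose retry loop is sketched later in this thread.
    func (s *Server) establishLeadership(stopCh chan struct{}) error { return nil }
    func (s *Server) revokeLeadership() error                        { return nil }
    func (s *Server) leadershipTransfer() error                      { return nil }

    // leaderLoop sketches the recovery path: if establishLeadership fails, revoke
    // the partially established leadership and try to hand it to another server;
    // if the transfer also fails, loop and retry the establishLeadership steps.
    func (s *Server) leaderLoop(stopCh chan struct{}) {
        for {
            err := s.establishLeadership(stopCh)
            if err == nil {
                return
            }
            s.logger.Error("failed to establish leadership", "error", err)

            // Undo whatever establishLeadership managed to set up.
            if err := s.revokeLeadership(); err != nil {
                s.logger.Error("failed to revoke leadership", "error", err)
            }

            // Prefer handing leadership to another server; if that is not
            // possible, stay in the loop and retry establishLeadership.
            if err := s.leadershipTransfer(); err != nil {
                s.logger.Error("failed to transfer leadership", "error", err)
                continue
            }
            return
        }
    }

The key point of the design is that a server which cannot finish establishLeadership never keeps acting as leader: it either hands leadership away or starts the setup steps over.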


Note to reviewers: this work was basically lifted from hashicorp/consul#5247. I wasn't able to create a reliable test, so I tested it manually with a forced error in one of the servers. The log output looked like this after forcing an election by shutting down the cluster leader:

    2022-03-14T16:35:05.080-0400 [WARN]  nomad.raft: Election timeout reached, restarting election
    2022-03-14T16:35:05.080-0400 [INFO]  nomad.raft: entering candidate state: node="Node at 192.168.1.173:4647 [Candidate]" term=4
    2022-03-14T16:35:05.368-0400 [INFO]  nomad.raft: election won: tally=2
    2022-03-14T16:35:05.368-0400 [INFO]  nomad.raft: entering leader state: leader="Node at 192.168.1.173:4647 [Leader]"
    2022-03-14T16:35:05.368-0400 [INFO]  nomad.raft: added peer, starting replication: peer=302a41de-8b12-1226-a826-8203c39d34bc
    2022-03-14T16:35:05.368-0400 [INFO]  nomad.raft: added peer, starting replication: peer=7e714096-eeb6-bc65-942f-81e7457984d6
    2022-03-14T16:35:05.368-0400 [INFO]  nomad: cluster leadership acquired
    2022-03-14T16:35:05.368-0400 [INFO]  nomad.raft: pipelining replication: peer="{Voter 7e714096-eeb6-bc65-942f-81e7457984d6 192.168.1.173:7677}"
    2022-03-14T16:35:05.550-0400 [ERROR] nomad: failed to establish leadership: error="i can't be a leader"
    2022-03-14T16:35:05.553-0400 [INFO]  nomad: successfully transferred leadership: attempt=0 retry_limit=3

I also paraphrased some of the original comments to test my own understanding of what's going on. If any of them seem incorrect, it's probably my fault 😅

Member

@tgross tgross left a comment

LGTM

nomad/leader.go Outdated
    func (s *Server) leadershipTransfer() error {
        retryCount := 3
        for i := 0; i < retryCount; i++ {
            future := s.raft.LeadershipTransfer()
Contributor Author


LeadershipTransfer will fail if the Raft version is < 3, which would cause the cluster to get stuck in a leaderless infinite loop since we always retry if leadershipTransfer returns an error.

    2022-03-14T17:13:31.725-0400 [ERROR] nomad: failed to establish leadership: error="i can't be a leader"
    2022-03-14T17:13:31.725-0400 [ERROR] nomad: failed to transfer leadership attempt, will retry: attempt=0 retry_limit=3 error="operation not supported with current protocol version"
    2022-03-14T17:13:31.725-0400 [ERROR] nomad: failed to transfer leadership attempt, will retry: attempt=1 retry_limit=3 error="operation not supported with current protocol version"
    2022-03-14T17:13:31.725-0400 [ERROR] nomad: failed to transfer leadership attempt, will retry: attempt=2 retry_limit=3 error="operation not supported with current protocol version"
    2022-03-14T17:13:31.725-0400 [ERROR] nomad: failed to transfer leadership: error="failed to transfer leadership in 3 attempts"
    2022-03-14T17:13:36.822-0400 [ERROR] nomad: failed to establish leadership: error="i can't be a leader"
    2022-03-14T17:13:36.822-0400 [ERROR] nomad: failed to transfer leadership attempt, will retry: attempt=0 retry_limit=3 error="operation not supported with current protocol version"
    2022-03-14T17:13:36.822-0400 [ERROR] nomad: failed to transfer leadership attempt, will retry: attempt=1 retry_limit=3 error="operation not supported with current protocol version"
    2022-03-14T17:13:36.822-0400 [ERROR] nomad: failed to transfer leadership attempt, will retry: attempt=2 retry_limit=3 error="operation not supported with current protocol version"
    2022-03-14T17:13:36.822-0400 [ERROR] nomad: failed to transfer leadership: error="failed to transfer leadership in 3 attempts"
    2022-03-14T17:13:41.908-0400 [ERROR] nomad: failed to establish leadership: error="i can't be a leader"
    2022-03-14T17:13:41.908-0400 [ERROR] nomad: failed to transfer leadership attempt, will retry: attempt=0 retry_limit=3 error="operation not supported with current protocol version"
    2022-03-14T17:13:41.908-0400 [ERROR] nomad: failed to transfer leadership attempt, will retry: attempt=1 retry_limit=3 error="operation not supported with current protocol version"
    2022-03-14T17:13:41.908-0400 [ERROR] nomad: failed to transfer leadership attempt, will retry: attempt=2 retry_limit=3 error="operation not supported with current protocol version"
    2022-03-14T17:13:41.908-0400 [ERROR] nomad: failed to transfer leadership: error="failed to transfer leadership in 3 attempts"

Should we keep the previous (erroneous) behaviour if Raft is not v3? Or is there something else we can do?
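
For reference, pieced together from the snippet and the log output above, the retry loop at this point in the PR roughly looks like the following. It reuses the stub Server from the earlier sketch (plus an added "fmt" import) and replaces the placeholder leadershipTransfer there; it is an approximation, not the exact diff:

    func (s *Server) leadershipTransfer() error {
        retryCount := 3
        for i := 0; i < retryCount; i++ {
            future := s.raft.LeadershipTransfer()
            if err := future.Error(); err != nil {
                // On Raft protocol < 3 this error is always "operation not
                // supported with current protocol version", so every attempt
                // fails and the caller ends up retrying forever.
                s.logger.Error("failed to transfer leadership attempt, will retry",
                    "attempt", i, "retry_limit", retryCount, "error", err)
                continue
            }
            s.logger.Info("successfully transferred leadership",
                "attempt", i, "retry_limit", retryCount)
            return nil
        }
        return fmt.Errorf("failed to transfer leadership in %d attempts", retryCount)
    }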

Member


Should we keep the previous (erroneous) behaviour if Raft is not v3?

That seems fine to me. Let's add a note in the upgrade guide about this behavior change. If nothing else it's a nice concrete feature to encourage upgrading.

Contributor Author


Cool, I will work on that, thanks!

Contributor Author

@lgfa29 lgfa29 Mar 16, 2022


@schmichael After staring at this waaay too long, I realized that the previous behaviour was actually for the leader to get stuck, which is also the problem this fix targets, so there isn't much we can do about it.

In 2dc1ff1 I changed the logic a little to return early, with a more meaningful message, when the problem is an unsupported Raft version, and in 772abb6 I added the upgrade guide notice.
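
Sketched out, the early return from 2dc1ff1 would slot into the same loop shown earlier, roughly like this. raft.ErrUnsupportedProtocol is the hashicorp/raft error seen in the logs above, but the exact structure and wording in the commit may differ:

    func (s *Server) leadershipTransfer() error {
        retryCount := 3
        for i := 0; i < retryCount; i++ {
            err := s.raft.LeadershipTransfer().Error()
            if err == nil {
                s.logger.Info("successfully transferred leadership",
                    "attempt", i, "retry_limit", retryCount)
                return nil
            }

            // Retrying can never succeed on Raft protocol < 3, so fail fast
            // with an error that points at the actual problem.
            if err == raft.ErrUnsupportedProtocol {
                return fmt.Errorf("leadership transfer is not supported by the current Raft protocol version")
            }

            s.logger.Error("failed to transfer leadership attempt, will retry",
                "attempt", i, "retry_limit", retryCount, "error", err)
        }
        return fmt.Errorf("failed to transfer leadership in %d attempts", retryCount)
    }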

@lgfa29 lgfa29 added this to the 1.3.0 milestone Mar 14, 2022
