
Fix pre-vote implementation where leader's pre-vote is rejected #605

Merged
4 commits merged into hashicorp:main on Aug 22, 2024

Conversation

k-jingyang
Contributor

@k-jingyang k-jingyang commented Aug 17, 2024

Fixes #606

@k-jingyang k-jingyang requested review from a team as code owners August 17, 2024 02:37
@k-jingyang k-jingyang requested review from rboyer and removed request for a team August 17, 2024 02:37

hashicorp-cla-app bot commented Aug 17, 2024

CLA assistant check
All committers have signed the CLA.

@jmurret jmurret requested a review from dhiaayachi August 17, 2024 15:35
@dhiaayachi
Contributor

Hey @k-jingyang
Thank you for reporting this and providing a fix. I will go ahead and merge this once CI passes.

Just making sure I fully grasp the impact of this bug: before this fix, a leader that loses leadership while pre-vote is on will not be able to be re-elected, and leadership will likely transfer to another node, but this won't introduce any extra scenarios where leadership becomes unstable or no leader can be elected.

@k-jingyang
Contributor Author

k-jingyang commented Aug 19, 2024

> Hey @k-jingyang Thank you for reporting this and providing a fix. I will go ahead and merge this once CI passes.
>
> Just making sure I fully grasp the impact of this bug: before this fix, a leader that loses leadership while pre-vote is on will not be able to be re-elected, and leadership will likely transfer to another node, but this won't introduce any extra scenarios where leadership becomes unstable or no leader can be elected.

Thanks! I realised that the CI pipeline is failing. I tried to run `make test` and it failed even on the current master branch; not sure if I'm missing something here.

Yes, you are correct. However, the situation becomes more serious in the following scenario:

  • 3-node cluster
  • 1 node is down, 2 nodes are still up (node A, the leader, and node B)
  • node A somehow thinks that it lost leadership
  • node B can still heartbeat to node A, so it doesn't call for an election
  • node A tries to call for an election again, but pre-vote prevents it from doing so, as node B keeps rejecting it as per the bug

This was what we encountered. I didn't manage to dig into how node A came to think that it lost leadership.

So even though a majority of the cluster (2 out of 3) is still up, there is no actual leader.
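The failure mode in the scenario above can be sketched as a tiny pre-vote check. This is a hypothetical illustration with assumed names, not the actual hashicorp/raft code: before the fix, a follower that still tracked a known leader rejected every pre-vote request, including one coming from that leader itself, so a stepped-down leader could never regain leadership.

```go
package main

import "fmt"

// grantPreVote sketches the decision a follower makes when it receives a
// pre-vote request. All identifiers here are assumptions for illustration.
// Buggy rule: reject any pre-vote while we believe a leader exists.
// Fixed rule: still grant when the requester *is* that known leader.
func grantPreVote(candidateID, knownLeaderID string, candidateTerm, currentTerm uint64) bool {
	if knownLeaderID != "" && knownLeaderID != candidateID {
		// Some other node is campaigning while we still track a healthy
		// leader: reject, which is pre-vote working as intended.
		return false
	}
	return candidateTerm >= currentTerm
}

func main() {
	// Node B still records node A as leader; node A asks for a pre-vote.
	fmt.Println(grantPreVote("node-A", "node-A", 5, 5)) // granted after the fix
	// A third node asking while a leader is known is still rejected.
	fmt.Println(grantPreVote("node-C", "node-A", 5, 5)) // rejected
}
```

With the buggy rule (rejecting whenever `knownLeaderID != ""`), the first call would also return false, which is exactly the deadlock described above.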

@dhiaayachi
Contributor

@k-jingyang
I'm looking into the tests; it seems they are broken for some reason not related to this PR.

> Yes, you are correct. However, the situation becomes more serious in the following scenario:
>
>   • 3-node cluster
>   • 1 node is down, 2 nodes are still up (node A, the leader, and node B)
>   • node A somehow thinks that it lost leadership
>   • node B can still heartbeat to node A, so it doesn't call for an election
>   • node A tries to call for an election again, but pre-vote prevents it from doing so, as node B keeps rejecting it as per the bug
>
> This was what we encountered. I didn't manage to dig into how node A came to think that it lost leadership.
>
> So even though a majority of the cluster (2 out of 3) is still up, there is no actual leader.

I'm not sure I understand. If node A thinks it's not a leader anymore, it should stop sending heartbeats to all the other nodes, which will lead to node B timing out the heartbeat, becoming a Candidate, and eventually becoming the leader. Node B, being a follower, should not be sending heartbeats to node A. Am I missing something?

@k-jingyang
Contributor Author

> @k-jingyang I'm looking into the tests; it seems they are broken for some reason not related to this PR.
>
> > Yes, you are correct. However, the situation becomes more serious in the following scenario:
> >
> >   • 3-node cluster
> >   • 1 node is down, 2 nodes are still up (node A, the leader, and node B)
> >   • node A somehow thinks that it lost leadership
> >   • node B can still heartbeat to node A, so it doesn't call for an election
> >   • node A tries to call for an election again, but pre-vote prevents it from doing so, as node B keeps rejecting it as per the bug
> >
> > This was what we encountered. I didn't manage to dig into how node A came to think that it lost leadership.
> > So even though a majority of the cluster (2 out of 3) is still up, there is no actual leader.
>
> I'm not sure I understand. If node A thinks it's not a leader anymore, it should stop sending heartbeats to all the other nodes, which will lead to node B timing out the heartbeat, becoming a Candidate, and eventually becoming the leader. Node B, being a follower, should not be sending heartbeats to node A. Am I missing something?

I believe you're right; my understanding of the issue could be slightly wrong. Let me dig into the logs to see why node B still thought node A was the leader even though node A was a Candidate.

@k-jingyang
Contributor Author

k-jingyang commented Aug 20, 2024

@dhiaayachi,

After digging through the logs, I don't have a definitive conclusion. You're right that only the leader should be heartbeating to the followers, and once node A lost leadership, it should not heartbeat anymore.

What I can be sure of:

  1. Node B rejected node A's pre-vote because it still thought node A was the leader (the bug). One of our log lines:

```
raft: rejecting pre-vote request since we have a leader from:"" leader:"NODE_A_IP" leader-id:NODE_A
```

  2. Even as a Candidate, node A has logs about failed heartbeats to the downed node (we log Stats() every second):

```
# t=1
{..., "state":"Candidate","term":"1049"}

# t=2
raft: failed to heartbeat to peer=DOWNED_NODE backoff time:10m error:dial tcp DOWNED_NODE connect: connection refused
```

Since we don't seem to log successful heartbeats, and the heartbeat metrics have expired from our telemetry system, I'm not sure whether successful heartbeats were still ongoing to node B (even while node A was a Candidate). If so, this would explain why node B did not trigger an election, and it would also indicate some form of bug or edge case.

Member

@banks banks left a comment


@k-jingyang thanks for helping us understand this. I also think this fix is worth merging, and that should be possible soon, but it's really useful for us to understand the impact of the issue to help us decide how urgently we need to get the fix tested and merged into our products!

In your logs above, how far apart in wall-clock time are the t=1 and t=2 logs? When the old leader switches to candidate state, it shuts down the replication goroutines, which are the things sending the heartbeats and logging that error. If the two logs are milliseconds apart, I suspect this is just a timing thing: a replication thread happened to try sending a heartbeat around the time the leader changed, failed, and the error didn't get logged until after the switch to candidate. If many seconds pass while the old node is a candidate but it is still attempting multiple heartbeats, then yes, that sounds more concerning. Would you be able to post your whole logs, either here or privately if we provide an email address?

@k-jingyang
Contributor Author

k-jingyang commented Aug 20, 2024

@banks

The time between when the node became a candidate and when it failed to heartbeat was more than 10s. Regarding logs, they're a bit difficult to export, and honestly they wouldn't be very helpful without query support. Let me check if I can share an investigation doc.

I also find it hard to believe that the library has such an edge case without someone else hitting it, given how widely it is used. 😕

Let me spend some time to try to understand the issue a bit deeper.

Contributor

@dhiaayachi dhiaayachi left a comment


Thank you again @k-jingyang! The tests should be better now if you rebase the PR on top of the latest changes from main. I left a few suggestions.

@k-jingyang
Contributor Author

@dhiaayachi, you're welcome! I've addressed your comments! 😄

Contributor

@dhiaayachi dhiaayachi left a comment


LGTM! Thank you for the contribution @k-jingyang.

I will wait for @banks to resolve the remaining conversation and merge this!

@k-jingyang
Contributor Author

Hey @banks, would it be possible to provide an email address so that I can share an investigation doc?

@banks
Member

banks commented Aug 22, 2024 via email

@dhiaayachi dhiaayachi merged commit 497108f into hashicorp:main Aug 22, 2024
13 checks passed
@k-jingyang
Contributor Author

@banks Thanks! I've sent the doc over.

@dhiaayachi Do we have an estimated date for the v1.7.1 release?

Successfully merging this pull request may close these issues.

Follower rejecting leader's pre-vote