Fix pre-vote implementation where leader's pre-vote is rejected #605
Hey @k-jingyang, just making sure I fully grasp the impact of this bug. Before this fix, a leader that loses leadership while pre-vote is enabled will not be able to be re-elected, and leadership will likely transfer to another node; but this won't introduce any extra scenarios where leadership becomes unstable or no leader can be elected. Is that right?
Thanks! I realised that the CI pipeline is failing. I tried to run

Yes, you are correct. However, the situation becomes more serious in the following scenario:
This was what we encountered. I didn't manage to dig into how node A came to think it had lost leadership. So even though a majority of the cluster (2 out of 3) is still up, there is no actual leader.
@k-jingyang
I'm not sure I understand? If
I believe you're right. My understanding of the issue could be slightly wrong. Let me dig into the logs to see why node B still thought node A was the leader even though node A was a Candidate.
After digging through the logs, I don't have a definitive conclusion. You're right that only the leader should be heartbeating to the followers, and once node A lost leadership, it should not heartbeat anymore. What I can be sure of:

```
# 1 of our log lines
raft: rejecting pre-vote request since we have a leader from:"" leader:"NODE_A_IP" leader-id:NODE_A

# we log Stats() every second
# t=1
{..., "state":"Candidate","term":"1049"}

# t=2
raft: failed to heartbeat to peer=DOWNED_NODE backoff time:10m error:dial tcp DOWNED_NODE connect: connection refused
```
@k-jingyang thanks for helping us understand this. I also think this fix is worth merging and that should be possible soon, but understanding the impact of the issue is really useful to us: it helps us decide how urgently we need to get the fix tested and merged into our products, etc.
In your logs above, how far apart in wall-clock time are the t=1 and t=2 entries? When the old leader switches to candidate state, it shuts down its replication goroutines, which are what send the heartbeats and log that error. If the entries are milliseconds apart, I suspect this is just a timing thing: the replication threads happened to try sending a heartbeat around the time the leader changed, failed, and the error didn't get logged until after the switch to candidate. If many seconds pass while the old node is a candidate but it is still attempting to send heartbeats, then yes, that sounds more concerning. Would you be able to post your whole logs, either here or privately if we provide an email address?
The time between when the node became a candidate and when it failed to heartbeat was more than 10s. Regarding logs, they're a bit difficult to export, and to be honest, they wouldn't be very helpful without query support. Let me check if I can share an investigation doc. Somehow, I also find it hard to believe that the library has such an edge case without anyone having hit it before, given how widely it is used. 😕 Let me spend some time trying to understand the issue a bit deeper.
Thank you again @k-jingyang! The tests should be better now if you rebase the PR on top of the latest changes from `main`. I left a few suggestions.
@dhiaayachi, you're welcome! I've addressed your comments! 😄
LGTM! Thank you for the contribution @k-jingyang.
I will wait for @banks to resolve the remaining conversation and merge this!
Hey @banks, would it be possible to provide an email address so that I can share an investigation doc?
Hey, you can use `consul-oss-debug@` our corporate domain. Let me know if you have any issues; I checked that it's still a thing, but it hasn't been used for a little while here. In case anyone else comes across this message: this email alias is not monitored generally and is only used when HashiCorp engineers have specifically requested something that is too private to share in a public GH issue or PR.
@banks Thanks! I've sent the doc over. @dhiaayachi, do we have an estimated date for the v1.7.1 release?
Fixes #606