-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cap maximum grpc wait time when heartbeating to heartbeatTimeout/2 #494
Conversation
Co-authored-by: Brian Kassouf <briankassouf@users.noreply.github.com>
…ft into hridoyroy/heartbeat-backoff
if base > cap { | ||
return cap | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if base > cap { | |
return cap | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So long as the limit is less than or equal to 50 (assuming 10ms base wait time) then removing this should be fine. If we use a limit higher than that (which wouldn't make any logical sense as the wait time would be astronomically high) then we could run into integer overflow issues with the next multiplication.
This is practically safe but results in code with technically undefined behavior.
hi, I saw this behavior as well. When a voter stops, the leader's backoff mechanism waits for more than 10 seconds (10ms*1024) to send heartbeats and replicate messages to the voter. When the voter comes back up, it times out and triggers a new election. This could cause a leadership change or flap. Does it make sense to disable backoff for heartbeats? |
@mechpen We normally heartbeat every 1/10th of the heartbeat timeout. When these fail other warnings/errors are emitted in the logs. In these disconnected scenarios we are attempting to perform exponential backoff to ensure we don't needlessly fill up logs and use more network bandwidth than necessary. The bug we have encountered is where we backoff for far too long and get into the situation where a restarted server may hit its heartbeat timeout before the leader attempts it again. The solution proposed in this PR just drastically lowers the cap on how much we can backoff of the usual rate to ensure that we always send a heartbeat within the timeout value. It reduces the cap enough that practically speaking we only allow backoff by a factor of 5x the original value as opposed to the 100x it was previously. That 5x though will mean there are 1/5 the warning logs which could make figuring out whats going on during an incident a tiny bit easier so I think the solution implemented in this PR is probably the way to go. |
This PR aims to resolve https://hashicorp.atlassian.net/browse/VAULT-5310 , and the associated GH vault issue hashicorp/vault#14153.
Context: Currently, if the a follower is shut down for some amount of time, the leader cannot heartbeat to it and does an exponential backoff. When the follower restarts, the leader waits for longer than election_timeout to send a heartbeat to it, thus the follower starts an election and increases its term, which causes leadership to change.