Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cap maximum grpc wait time when heartbeating to heartbeatTimeout/2 #494

Merged
merged 10 commits into from
May 9, 2022

Conversation

HridoyRoy
Copy link
Contributor

@HridoyRoy HridoyRoy commented Mar 9, 2022

This PR aims to resolve https://hashicorp.atlassian.net/browse/VAULT-5310 , and the associated GH vault issue hashicorp/vault#14153.

Context: Currently, if the a follower is shut down for some amount of time, the leader cannot heartbeat to it and does an exponential backoff. When the follower restarts, the leader waits for longer than election_timeout to send a heartbeat to it, thus the follower starts an election and increases its term, which causes leadership to change.

@hashicorp-cla
Copy link

hashicorp-cla commented Mar 12, 2022

CLA assistant check
All committers have signed the CLA.

util.go Outdated Show resolved Hide resolved
@HridoyRoy HridoyRoy marked this pull request as ready for review March 16, 2022 18:35
@HridoyRoy HridoyRoy changed the title cap maximum grpc wait time when heartbeating to heartbeatTimeout/2 Cap maximum grpc wait time when heartbeating to heartbeatTimeout/2 Mar 16, 2022
util.go Show resolved Hide resolved
@briankassouf briankassouf requested a review from mkeeler March 16, 2022 18:51
replication.go Outdated Show resolved Hide resolved
HridoyRoy and others added 3 commits March 17, 2022 11:27
Co-authored-by: Brian Kassouf <briankassouf@users.noreply.github.com>
@HridoyRoy HridoyRoy requested a review from briankassouf March 17, 2022 18:36
Comment on lines +152 to +154
if base > cap {
return cap
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if base > cap {
return cap
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So long as the limit is less than or equal to 50 (assuming 10ms base wait time) then removing this should be fine. If we use a limit higher than that (which wouldn't make any logical sense as the wait time would be astronomically high) then we could run into integer overflow issues with the next multiplication.

This is practically safe but results in code with technically undefined behavior.

raft_test.go Outdated Show resolved Hide resolved
@mechpen
Copy link

mechpen commented Apr 8, 2022

hi, I saw this behavior as well.

When a voter stops, the leader's backoff mechanism waits for more than 10 seconds (10ms*1024) to send heartbeats and replicate messages to the voter. When the voter comes back up, it times out and triggers a new election. This could cause a leadership change or flap.

Does it make sense to disable backoff for heartbeats?

@mkeeler
Copy link
Member

mkeeler commented Apr 11, 2022

@mechpen We normally heartbeat every 1/10th of the heartbeat timeout. When these fail other warnings/errors are emitted in the logs. In these disconnected scenarios we are attempting to perform exponential backoff to ensure we don't needlessly fill up logs and use more network bandwidth than necessary. The bug we have encountered is where we backoff for far too long and get into the situation where a restarted server may hit its heartbeat timeout before the leader attempts it again.

The solution proposed in this PR just drastically lowers the cap on how much we can backoff of the usual rate to ensure that we always send a heartbeat within the timeout value. It reduces the cap enough that practically speaking we only allow backoff by a factor of 5x the original value as opposed to the 100x it was previously. That 5x though will mean there are 1/5 the warning logs which could make figuring out whats going on during an incident a tiny bit easier so I think the solution implemented in this PR is probably the way to go.

@HridoyRoy HridoyRoy requested review from mkeeler and briankassouf May 9, 2022 18:50
@HridoyRoy HridoyRoy merged commit 9174562 into main May 9, 2022
@HridoyRoy HridoyRoy deleted the hridoyroy/heartbeat-backoff branch May 9, 2022 18:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants