I'm working on a procedure to upgrade the nodes in the raft cluster to a new version of the software. This means the leader also needs to be replaced. Right now we're doing this by simply killing the leader and letting raft elect a new one. Is this the right way of doing it?
If so, the timeouts in the default configuration seem quite high. With a heartbeat timeout and an election timeout of 1 second, it's quite common for elections to take 2-3 seconds. We're thinking about lowering both to 100ms to make elections happen more quickly.
Do you see any problems with this? And what LeaderLeaseTimeout would you recommend: 50ms, 100ms as well, or something else entirely?
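For concreteness, here's a minimal sketch of the change we're considering, using hashicorp/raft's Config (the package and function names are just for illustration, and the 100ms values are the ones in question, not something we've settled on):

```go
package upgrade

import (
	"time"

	"github.com/hashicorp/raft"
)

// proposedConfig starts from the library defaults (1s heartbeat and election
// timeouts, 500ms leader lease) and lowers them as described above.
func proposedConfig() *raft.Config {
	cfg := raft.DefaultConfig()

	// Lower both timeouts so a new leader is elected faster after the old
	// leader is killed during the rolling upgrade.
	cfg.HeartbeatTimeout = 100 * time.Millisecond
	cfg.ElectionTimeout = 100 * time.Millisecond

	// The open question: 50ms, 100ms, or something else?
	cfg.LeaderLeaseTimeout = 100 * time.Millisecond

	return cfg
}
```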
Oh, probably important information: we're running the nodes in different AWS availability zones within the same region. The measured latency between the nodes is usually below 1ms.
@i0rek is working on a PR for adding leadership transfer. For now, your approach is correct, but you might want to look at the transfer leadership stuff when it lands (should be soon).
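For anyone finding this later: assuming the transfer API lands in roughly the shape that later shipped in hashicorp/raft (a LeadershipTransfer method on *raft.Raft), an upgrade step might look something like this sketch instead of killing the leader outright (the package and function names are illustrative only):

```go
package upgrade

import "github.com/hashicorp/raft"

// stepDownBeforeUpgrade asks the node to hand over leadership before it is
// stopped, so the cluster doesn't have to wait out an election timeout.
// Sketch only; assumes the LeadershipTransfer API that later shipped in
// hashicorp/raft.
func stepDownBeforeUpgrade(r *raft.Raft) error {
	if r.State() != raft.Leader {
		// Followers can simply be restarted.
		return nil
	}
	// Hand leadership to another voter and wait for the result.
	return r.LeadershipTransfer().Error()
}
```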
Hey @JelteF , thank you so much for bringing this up!
The way you're doing it currently is correct, and @i0rek is working on a PR here to allow a leader to be chosen before transfer.
For suggestions on timeouts, that really depends on your setup. Since you're measuring your latency to be quite low, I would recommend trying a LeaderLeaseTimeout of 100ms before going down to 50ms.
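One thing to double-check with those values: as far as I can tell, the library's config validation rejects a LeaderLeaseTimeout that is larger than the HeartbeatTimeout (and an ElectionTimeout smaller than the HeartbeatTimeout), so the 100ms lease needs to go together with the lowered 100ms heartbeat. A quick sanity-check sketch (the package, function name, and node ID are placeholders):

```go
package upgrade

import (
	"time"

	"github.com/hashicorp/raft"
)

// checkTimeouts verifies the lowered timeouts against the library's own
// validation before they are rolled out.
func checkTimeouts() error {
	cfg := raft.DefaultConfig()
	cfg.LocalID = raft.ServerID("node1") // placeholder ID; validation wants one set

	cfg.HeartbeatTimeout = 100 * time.Millisecond
	cfg.ElectionTimeout = 100 * time.Millisecond    // must be >= HeartbeatTimeout
	cfg.LeaderLeaseTimeout = 100 * time.Millisecond // must be <= HeartbeatTimeout

	// ValidateConfig enforces the relationships above plus the minimum values.
	return raft.ValidateConfig(cfg)
}
```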
One key thing to watch is your resource utilization when you do this: you'll be sending heartbeats more frequently, which can put more stress on the machines.
I'd love to hear your findings to see if decreasing the timeout gave you any noticeable difference in your resource utilization.
I'll be closing this issue, but feel free to keep the discussion going.
As Sarah pointed out, when you're thinking about resource utilization, consider our Lifeguard work. We specifically looked at the effects of having a CPU-bound process that was alive, but couldn't respond to heartbeats (in the form of membership gossip) because the CPU was already saturated, producing a false positive in the failure detector. I suspect that as you lower the heartbeat timeout (and so send and expect heartbeats more frequently), you may encounter more of this thrashing if your nodes are doing enough work.