Optimize failover time when the new primary node is down again #782

Merged
3 commits merged into valkey-io:unstable from optimize_failover on Jul 19, 2024

Conversation

enjoy-binbin (Member):

We do not reset failover_auth_time after setting it, since it is used
to check auth_timeout and auth_retry_time, but we should at least
reset it after a successful failover.

Let's assume the following scenario:

  1. Two replicas initiate an election.
  2. Replica 1 is elected as the new primary, and replica 2 does not get
    enough votes.
  3. Replica 1 goes down, i.e. the new primary goes down again within a short time.
  4. Replica 2 knows that the new primary is down and wants to initiate
    a failover, but because the failover_auth_time of the previous round
    has not been reset, it has to wait for that round to time out and then
    wait for the next retry window, which takes about cluster-node-timeout * 4
    and adds a lot of delay (see the sketch after this list).
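
To make the delay in step 4 concrete, here is a toy model of how an election attempt is gated by failover_auth_time. It loosely follows the failover election handling in src/cluster_legacy.c, but the names and exact thresholds here are an illustration, not the actual source:

```c
#include <stdlib.h>

typedef long long mstime_t;

/* Toy model (not the exact source) of how an election attempt is gated by
 * failover_auth_time. With a stale failover_auth_time left over from the
 * previous round, auth_age must first exceed auth_retry_time
 * (cluster-node-timeout * 4) before a fresh round can even be scheduled. */
static int can_start_election(mstime_t now, mstime_t *failover_auth_time,
                              mstime_t node_timeout) {
    mstime_t auth_timeout = node_timeout * 2;    /* validity window of one round */
    mstime_t auth_retry_time = auth_timeout * 2; /* = cluster-node-timeout * 4 */
    mstime_t auth_age = now - *failover_auth_time;

    if (auth_age > auth_retry_time) {
        /* Only here is a new election round scheduled. */
        *failover_auth_time = now + 500 + rand() % 500;
        return 0;
    }
    if (now < *failover_auth_time) return 0; /* too early for this round */
    if (auth_age > auth_timeout) return 0;   /* round expired; wait for the retry window */
    return 1;                                /* OK to send FAILOVER_AUTH_REQUEST */
}
```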

There is another problem. We add extra delays to failover_auth_time,
such as a random delay of up to 500ms and 1s per position in the replica
ranking. If replica 2 receives a PONG from the new primary before sending
its FAILOVER_AUTH_REQUEST, that is, before failover_auth_time is reached,
it turns itself back into a replica. If the new primary then goes down
again, replica 2 will reuse the previous failover_auth_time to initiate
an election instead of going through the random 500ms and replica-ranking
1s logic again, which may lead to unexpected consequences (for example,
a low-ranking replica initiates an election and becomes the new primary).
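
For reference, the delay itself is composed roughly like this when a round is scheduled; again a sketch built from the values mentioned above (random up to 500ms, 1s per rank), not a copy of the source:

```c
#include <stdlib.h>

typedef long long mstime_t;

/* Sketch of how the election delay is composed: a short fixed delay plus up
 * to 500ms of random jitter so replicas do not all fire at once, plus 1s per
 * rank so better-ranked (more up-to-date) replicas request votes first. */
static mstime_t schedule_election(mstime_t now, int replica_rank) {
    mstime_t failover_auth_time = now + 500 + rand() % 500;
    return failover_auth_time + (mstime_t)replica_rank * 1000; /* rank 0 = most up-to-date */
}
```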

That is, we need to reset failover_auth_time at the appropriate time:
when a replica switches to a new primary, we reset it, because the
existing failover_auth_time is out of date at that point.
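
The shape of the fix is then just to clear that stale value at the switch point. A minimal sketch of the idea; the hook name below is an assumption for illustration, not the exact function the patch touches:

```c
typedef long long mstime_t;

/* Idea of the fix (hook point and names are assumptions, not the actual
 * patch in src/cluster_legacy.c): once this node starts replicating a new
 * primary, any previously scheduled election round is obsolete. */
static void on_switch_to_new_primary(mstime_t *failover_auth_time) {
    *failover_auth_time = 0; /* the next failover schedules a fresh round,
                              * with new random and ranking delays */
}
```

This addresses both problems above: replica 2 no longer has to wait out the old round's retry window, and it never reuses a failover_auth_time that skipped the random and ranking delays.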


Signed-off-by: Binbin <binloveplay1314@qq.com>
@enjoy-binbin (Member, Author) commented on Jul 13, 2024:

For the failover2 test case, with the default node-timeout of 15s, I ran it in a loop on both the old and new code.

unstable (the 90s+ runs are case 1, and the 40s+ runs are case 2):

Execution time of different units:
  46 seconds - unit/cluster/failover2
  46 seconds - unit/cluster/failover2
  90 seconds - unit/cluster/failover2
  46 seconds - unit/cluster/failover2
  91 seconds - unit/cluster/failover2
  47 seconds - unit/cluster/failover2
  46 seconds - unit/cluster/failover2
  90 seconds - unit/cluster/failover2
  45 seconds - unit/cluster/failover2
  47 seconds - unit/cluster/failover2
  46 seconds - unit/cluster/failover2

this branch:

Execution time of different units:
  48 seconds - unit/cluster/failover2
  48 seconds - unit/cluster/failover2
  47 seconds - unit/cluster/failover2
  47 seconds - unit/cluster/failover2
  47 seconds - unit/cluster/failover2
  47 seconds - unit/cluster/failover2
  46 seconds - unit/cluster/failover2
  46 seconds - unit/cluster/failover2
  47 seconds - unit/cluster/failover2
  47 seconds - unit/cluster/failover2
  47 seconds - unit/cluster/failover2

Using node-timeout 5000:

unstable:
Execution time of different units:
  25 seconds - unit/cluster/failover2
  26 seconds - unit/cluster/failover2
  25 seconds - unit/cluster/failover2
  40 seconds - unit/cluster/failover2
  25 seconds - unit/cluster/failover2
  25 seconds - unit/cluster/failover2
  24 seconds - unit/cluster/failover2
  26 seconds - unit/cluster/failover2
  26 seconds - unit/cluster/failover2
  42 seconds - unit/cluster/failover2
  40 seconds - unit/cluster/failover2

this branch:

Execution time of different units:
  27 seconds - unit/cluster/failover2
  26 seconds - unit/cluster/failover2
  26 seconds - unit/cluster/failover2
  25 seconds - unit/cluster/failover2
  25 seconds - unit/cluster/failover2
  26 seconds - unit/cluster/failover2
  26 seconds - unit/cluster/failover2
  26 seconds - unit/cluster/failover2
  27 seconds - unit/cluster/failover2
  26 seconds - unit/cluster/failover2
  27 seconds - unit/cluster/failover2

codecov bot commented Jul 13, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.25%. Comparing base (b4ac2c4) to head (1355c06).
Report is 5 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable     #782      +/-   ##
============================================
+ Coverage     70.23%   70.25%   +0.02%     
============================================
  Files           112      112              
  Lines         60602    60592      -10     
============================================
+ Hits          42563    42571       +8     
+ Misses        18039    18021      -18     
Files Coverage Δ
src/cluster_legacy.c 85.86% <100.00%> (+0.07%) ⬆️

... and 15 files with indirect coverage changes

@madolson (Member) left a comment:

Nice find! Some minor comments on the test.

3 review threads on tests/unit/cluster/failover2.tcl (outdated, resolved)
Signed-off-by: Binbin <binloveplay1314@qq.com>
@madolson (Member) left a comment:

Nice! LGTM.

@madolson added the release-notes label (This issue should get a line item in the release notes) on Jul 15, 2024
@PingXie (Member) left a comment:

LGTM.

Review threads on tests/support/util.tcl and tests/unit/cluster/failover2.tcl (outdated, resolved)
Signed-off-by: Binbin <binloveplay1314@qq.com>
@hwware merged commit 15a8290 into valkey-io:unstable on Jul 19, 2024
20 checks passed
@enjoy-binbin deleted the optimize_failover branch on July 20, 2024 at 03:59
hwware pushed a commit to hwware/valkey referencing this pull request on Jul 25, 2024, with the same commit message as the PR description above.