Optimize failover time when the new primary node is down again #782

Merged
3 commits merged into valkey-io:unstable from optimize_failover on Jul 19, 2024

Conversation

enjoy-binbin (Member):

We do not reset failover_auth_time after setting it, since it is used
to check auth_timeout and auth_retry_time, but we should at least
reset it after a successful failover.

Let's assume the following scenario:

  1. Two replicas initiate an election.
  2. Replica 1 is elected as the new primary, and replica 2 does not get
    enough votes.
  3. Replica 1 goes down, i.e. the new primary goes down again within a short time.
  4. Replica 2 knows that the new primary is down and wants to initiate
    a failover, but because the failover_auth_time of the previous round
    has not been reset, it has to wait for that round to time out and then
    wait for the next retry window, which takes about cluster-node-timeout * 4
    and adds a lot of delay (see the sketch after this list).
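
To make the delay in step 4 concrete, here is a toy model of how an election attempt is gated by failover_auth_time. It loosely follows the failover election handling in src/cluster_legacy.c, but the names and exact thresholds here are an illustration, not the actual source:

```c
#include <stdlib.h>

typedef long long mstime_t;

/* Toy model (not the exact source) of how an election attempt is gated by
 * failover_auth_time. With a stale failover_auth_time left over from the
 * previous round, auth_age must first exceed auth_retry_time
 * (cluster-node-timeout * 4) before a fresh round can even be scheduled. */
static int can_start_election(mstime_t now, mstime_t *failover_auth_time,
                              mstime_t node_timeout) {
    mstime_t auth_timeout = node_timeout * 2;    /* validity window of one round */
    mstime_t auth_retry_time = auth_timeout * 2; /* = cluster-node-timeout * 4 */
    mstime_t auth_age = now - *failover_auth_time;

    if (auth_age > auth_retry_time) {
        /* Only here is a new election round scheduled. */
        *failover_auth_time = now + 500 + rand() % 500;
        return 0;
    }
    if (now < *failover_auth_time) return 0; /* too early for this round */
    if (auth_age > auth_timeout) return 0;   /* round expired; wait for the retry window */
    return 1;                                /* OK to send FAILOVER_AUTH_REQUEST */
}
```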

There is another problem. We add extra delays to failover_auth_time,
such as a random delay of up to 500ms and 1s per position in the replica
ranking. If replica 2 receives a PONG from the new primary before sending
its FAILOVER_AUTH_REQUEST, that is, before failover_auth_time is reached,
it turns itself back into a replica. If the new primary then goes down
again, replica 2 will reuse the previous failover_auth_time to initiate
an election instead of going through the random 500ms and replica-ranking
1s logic again, which may lead to unexpected consequences (for example,
a low-ranking replica initiates an election and becomes the new primary).
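
For reference, the delay itself is composed roughly like this when a round is scheduled; again a sketch built from the values mentioned above (random up to 500ms, 1s per rank), not a copy of the source:

```c
#include <stdlib.h>

typedef long long mstime_t;

/* Sketch of how the election delay is composed: a short fixed delay plus up
 * to 500ms of random jitter so replicas do not all fire at once, plus 1s per
 * rank so better-ranked (more up-to-date) replicas request votes first. */
static mstime_t schedule_election(mstime_t now, int replica_rank) {
    mstime_t failover_auth_time = now + 500 + rand() % 500;
    return failover_auth_time + (mstime_t)replica_rank * 1000; /* rank 0 = most up-to-date */
}
```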

That is, we need to reset failover_auth_time at the appropriate time:
when a replica switches to a new primary, we reset it, because the
existing failover_auth_time is out of date at that point.
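
The shape of the fix is then just to clear that stale value at the switch point. A minimal sketch of the idea; the hook name below is an assumption for illustration, not the exact function the patch touches:

```c
typedef long long mstime_t;

/* Idea of the fix (hook point and names are assumptions, not the actual
 * patch in src/cluster_legacy.c): once this node starts replicating a new
 * primary, any previously scheduled election round is obsolete. */
static void on_switch_to_new_primary(mstime_t *failover_auth_time) {
    *failover_auth_time = 0; /* the next failover schedules a fresh round,
                              * with new random and ranking delays */
}
```

This addresses both problems above: replica 2 no longer has to wait out the old round's retry window, and it never reuses a failover_auth_time that skipped the random and ranking delays.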


Signed-off-by: Binbin <binloveplay1314@qq.com>
@enjoy-binbin (Member, Author) commented on Jul 13, 2024:

For the failover2 test case, with the default node-timeout of 15s, I ran it in a loop on both the old and new code.

unstable (the 90s+ runs are case 1, and the 40s+ runs are case 2):

Execution time of different units:
  46 seconds - unit/cluster/failover2
  46 seconds - unit/cluster/failover2
  90 seconds - unit/cluster/failover2
  46 seconds - unit/cluster/failover2
  91 seconds - unit/cluster/failover2
  47 seconds - unit/cluster/failover2
  46 seconds - unit/cluster/failover2
  90 seconds - unit/cluster/failover2
  45 seconds - unit/cluster/failover2
  47 seconds - unit/cluster/failover2
  46 seconds - unit/cluster/failover2

this branch:

Execution time of different units:
  48 seconds - unit/cluster/failover2
  48 seconds - unit/cluster/failover2
  47 seconds - unit/cluster/failover2
  47 seconds - unit/cluster/failover2
  47 seconds - unit/cluster/failover2
  47 seconds - unit/cluster/failover2
  46 seconds - unit/cluster/failover2
  46 seconds - unit/cluster/failover2
  47 seconds - unit/cluster/failover2
  47 seconds - unit/cluster/failover2
  47 seconds - unit/cluster/failover2

Using node-timeout 5000:

unstable:
Execution time of different units:
  25 seconds - unit/cluster/failover2
  26 seconds - unit/cluster/failover2
  25 seconds - unit/cluster/failover2
  40 seconds - unit/cluster/failover2
  25 seconds - unit/cluster/failover2
  25 seconds - unit/cluster/failover2
  24 seconds - unit/cluster/failover2
  26 seconds - unit/cluster/failover2
  26 seconds - unit/cluster/failover2
  42 seconds - unit/cluster/failover2
  40 seconds - unit/cluster/failover2

this branch:

Execution time of different units:
  27 seconds - unit/cluster/failover2
  26 seconds - unit/cluster/failover2
  26 seconds - unit/cluster/failover2
  25 seconds - unit/cluster/failover2
  25 seconds - unit/cluster/failover2
  26 seconds - unit/cluster/failover2
  26 seconds - unit/cluster/failover2
  26 seconds - unit/cluster/failover2
  27 seconds - unit/cluster/failover2
  26 seconds - unit/cluster/failover2
  27 seconds - unit/cluster/failover2

codecov bot commented Jul 13, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.25%. Comparing base (b4ac2c4) to head (1355c06).
Report is 5 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable     #782      +/-   ##
============================================
+ Coverage     70.23%   70.25%   +0.02%     
============================================
  Files           112      112              
  Lines         60602    60592      -10     
============================================
+ Hits          42563    42571       +8     
+ Misses        18039    18021      -18     
Files Coverage Δ
src/cluster_legacy.c 85.86% <100.00%> (+0.07%) ⬆️

... and 15 files with indirect coverage changes

@madolson (Member) left a comment:

Nice find! Some minor comments on the test.

3 review threads on tests/unit/cluster/failover2.tcl (outdated, resolved)
Signed-off-by: Binbin <binloveplay1314@qq.com>
@madolson (Member) left a comment:

Nice! LGTM.

@madolson added the release-notes label (This issue should get a line item in the release notes) on Jul 15, 2024
@PingXie (Member) left a comment:

LGTM.

Review threads on tests/support/util.tcl and tests/unit/cluster/failover2.tcl (outdated, resolved)
Signed-off-by: Binbin <binloveplay1314@qq.com>
@hwware merged commit 15a8290 into valkey-io:unstable on Jul 19, 2024
20 checks passed
@enjoy-binbin deleted the optimize_failover branch on July 20, 2024 at 03:59
hwware pushed a commit to hwware/valkey referencing this pull request on Jul 25, 2024, with the same commit message as the PR description above.