Improve cluster cant failover log conditions #780

enjoy-binbin · 2024-07-13T10:13:40Z

This PR adjusts the logging conditions of clusterLogCantFailover
in two ways.

1. For the same cant_failover_reason, we print the log once per
CLUSTER_CANT_FAILOVER_RELOG_PERIOD. Its value is 10s, which is a
bit long, so shorten it to 1s to better track the state. We get to
see the system making progress by watching the message. Using 1s
also covers pretty much all cases, as I don't see a reason for
using a <1s node timeout, in test or prod.
2. We do not print logs before nolog_fail_time, whose value is
cluster-node-timeout+5000. This may cause us to lose some logs.
For example, if cluster-node-timeout is small, auth_timeout will
be 2000 and auth_retry_time will be 4000. In this case, we lose
all the reasons during the election if the failover times out.
So remove the nolog_fail_time logic. Since we still have the
CLUSTER_CANT_FAILOVER_RELOG_PERIOD logic, we won't print too many
logs.
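
To make the two points concrete, here is a minimal C sketch, not the actual cluster_legacy.c code. All function and type names are hypothetical. The factor of 2 in the timing helpers is an assumption about how the election timings are derived; the description above only states the resulting 2000 and 4000 values for a small node timeout. The cluster-node-timeout of 500 used in the note below is just an illustrative value.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for CLUSTER_CANT_FAILOVER_RELOG_PERIOD after this PR. */
#define SKETCH_RELOG_PERIOD_SEC 1

/* Assumption: auth_timeout is twice the node timeout with a 2000 ms floor. */
static int64_t sketch_auth_timeout_ms(int64_t node_timeout_ms) {
    int64_t t = node_timeout_ms * 2;
    return t < 2000 ? 2000 : t;
}

/* Assumption: auth_retry_time is twice auth_timeout. */
static int64_t sketch_auth_retry_time_ms(int64_t node_timeout_ms) {
    return sketch_auth_timeout_ms(node_timeout_ms) * 2;
}

/* The silent window that this PR removes: cluster-node-timeout+5000. */
static int64_t sketch_old_nolog_fail_time_ms(int64_t node_timeout_ms) {
    return node_timeout_ms + 5000;
}

/* Remaining rate limit: same reason is logged at most once per period. */
typedef struct {
    int last_reason;      /* last reason that was logged */
    int64_t last_log_sec; /* wall-clock second of the last log line */
} sketch_relog_state;

static bool sketch_should_log(sketch_relog_state *st, int reason, int64_t now_sec) {
    if (reason == st->last_reason &&
        now_sec - st->last_log_sec < SKETCH_RELOG_PERIOD_SEC) {
        return false; /* same reason, still inside the relog period */
    }
    st->last_reason = reason;
    st->last_log_sec = now_sec;
    return true;
}
```

With a cluster-node-timeout of 500, the old silent window (5500) is longer than auth_retry_time (4000), so under these assumptions a whole election attempt could pass without a single log line. With the window removed, only the per-reason relog period limits the output.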

This PR adjusts the logging conditions of clusterLogCantFailover in two ways. 1. For the same cant_failover_reason, we print the log once per CLUSTER_CANT_FAILOVER_RELOG_PERIOD. Its value is 10s, which is a bit long, so shorten it to 5s to better track the state. 2. We do not print logs before nolog_fail_time, whose value is cluster-node-timeout+5000. This may cause us to lose some logs. For example, if cluster-node-timeout is small, auth_timeout will be 2000 and auth_retry_time will be 4000. In this case, we lose all the reasons during the election if the failover times out. So remove the nolog_fail_time logic. Since we still have the CLUSTER_CANT_FAILOVER_RELOG_PERIOD logic, we won't print too many logs. Signed-off-by: Binbin <binloveplay1314@qq.com>

codecov · 2024-07-13T10:25:07Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.05%. Comparing base (a4ee8da) to head (9e050f8).
Report is 72 commits behind head on unstable.

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable     #780      +/-   ##
============================================
+ Coverage     70.04%   70.05%   +0.01%     
============================================
  Files           112      112              
  Lines         60602    60587      -15     
============================================
- Hits          42447    42445       -2     
+ Misses        18155    18142      -13

Files	Coverage Δ
src/cluster_legacy.c	`85.97% <ø> (+0.18%)`	⬆️

... and 15 files with indirect coverage changes

src/cluster_legacy.h

Signed-off-by: Binbin <binloveplay1314@qq.com>

This PR adjusts the logging conditions of clusterLogCantFailover in two ways. 1. For the same cant_failover_reason, we print the log once per CLUSTER_CANT_FAILOVER_RELOG_PERIOD. Its value is 10s, which is a bit long, so shorten it to 1s to better track the state. We get to see the system making progress by watching the message. Using 1s also covers pretty much all cases, as I don't see a reason for using a <1s node timeout, in test or prod. 2. We do not print logs before nolog_fail_time, whose value is cluster-node-timeout+5000. This may cause us to lose some logs. For example, if cluster-node-timeout is small, auth_timeout will be 2000 and auth_retry_time will be 4000. In this case, we lose all the reasons during the election if the failover times out. So remove the nolog_fail_time logic. Since we still have the CLUSTER_CANT_FAILOVER_RELOG_PERIOD logic, we won't print too many logs. Signed-off-by: Binbin <binloveplay1314@qq.com> Signed-off-by: mwish <maplewish117@gmail.com>

If a replica enters the data_age too old state, it cannot trigger the failover, and currently it cannot recover automatically, so we print a log every CLUSTER_CANT_FAILOVER_RELOG_PERIOD, which is every second. If the primary has not recovered and there is no manual failover, this log will flood the log file. In this case, limit its frequency to 10 times the period, which is 10 seconds in our code. In this data_age too old state, the repeated logs can also stand for the progress of the failover. See also #780 for more details. Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Ping Xie <pingxie@outlook.com>
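
The follow-up described in this commit message can be sketched as a per-reason relog period. This is a hedged illustration in C rather than the merged code: the enum values, struct, and function names below are hypothetical, and only the "10 times the period for data_age" rule comes from the message itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for CLUSTER_CANT_FAILOVER_RELOG_PERIOD (1 second). */
#define SKETCH_PERIOD_SEC 1

/* Hypothetical reason codes; only the data_age case is treated specially. */
enum sketch_reason {
    SKETCH_REASON_NONE = 0,
    SKETCH_REASON_DATA_AGE,
    SKETCH_REASON_WAITING_VOTES
};

typedef struct {
    int last_reason;      /* last reason that was logged */
    int64_t last_log_sec; /* wall-clock second of the last log line */
} sketch_log_state;

/* data_age cannot recover on its own, so it gets 10 times the period. */
static int64_t sketch_period_for(int reason) {
    return reason == SKETCH_REASON_DATA_AGE ? SKETCH_PERIOD_SEC * 10
                                            : SKETCH_PERIOD_SEC;
}

static bool sketch_should_log_reason(sketch_log_state *st, int reason, int64_t now_sec) {
    if (reason == st->last_reason &&
        now_sec - st->last_log_sec < sketch_period_for(reason)) {
        return false; /* same reason, still inside its relog period */
    }
    st->last_reason = reason;
    st->last_log_sec = now_sec;
    return true;
}
```

A change of reason always logs immediately in this sketch, so progress out of the data_age state would still show up without delay.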

enjoy-binbin requested a review from PingXie July 15, 2024 07:02

PingXie reviewed Jul 16, 2024

src/cluster_legacy.h (outdated)

change CLUSTER_CANT_FAILOVER_RELOG_PERIOD to 1s

9e050f8

Signed-off-by: Binbin <binloveplay1314@qq.com>

PingXie approved these changes Jul 17, 2024

enjoy-binbin requested review from zuiderkwast and madolson July 19, 2024 15:31

enjoy-binbin merged commit 380f700 into valkey-io:unstable Aug 6, 2024
20 checks passed

enjoy-binbin deleted the remove_nolog_fail_time branch August 6, 2024 13:14

enjoy-binbin added the release-notes label (This issue should get a line item in the release notes) Aug 22, 2024

enjoy-binbin mentioned this pull request Oct 18, 2024

Limit CLUSTER_CANT_FAILOVER_DATA_AGE log to 10 times period #1189

Merged
