
lease: check etcd leader healthy by KeepAliveOnce #7670

Closed · wants to merge 2 commits

Conversation

HuSharp (Member) commented Jan 5, 2024

What problem does this PR solve?

Issue Number: Ref #7499

What is changed and how does it work?

  • Since the etcd KeepAliveOnce function writes to disk, we can record its operation time to check etcd's health (see the sketch after this list).

    The call stack is: KeepAliveOnce -> ls.hdr.fill(resp.Header) -> rev(readView.rev) -> tr.End -> metricsTxnWrite.End -> storeTxnWrite.End -> saveIndex -> UnsafePut

  • This check is more fine-grained than the leaderCampaignTimes check.
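
As a rough illustration of this mechanism (not the PR's actual code: the threshold value and the recordUnhealthyTime hook below are assumptions, and the etcd client import path depends on the client version in use), the check amounts to timing each KeepAliveOnce round trip:

```go
package election

import (
    "context"
    "time"

    "github.com/pingcap/log"
    "go.etcd.io/etcd/clientv3" // v3.4 client path; newer clients use go.etcd.io/etcd/client/v3
    "go.uber.org/zap"
)

// Illustrative threshold; the review below debates 1s vs. the 3s lease timeout.
const unhealthyTimesRecordTimeout = time.Second

// keepAliveOnceTimed times one KeepAliveOnce round trip. Because the call ends
// up writing to disk on the etcd leader (... -> saveIndex -> UnsafePut), a slow
// round trip points at an unhealthy leader rather than a pure network blip.
// recordUnhealthyTime is a hypothetical hook standing in for the PR's
// unhealthy-times bookkeeping.
func keepAliveOnceTimed(ctx context.Context, lease clientv3.Lease, id clientv3.LeaseID,
    recordUnhealthyTime func(start time.Time, cost time.Duration)) error {
    start := time.Now()
    _, err := lease.KeepAliveOnce(ctx, id)
    if err != nil {
        log.Warn("lease keep alive failed", zap.Error(err))
        return err
    }
    if cost := time.Since(start); cost > unhealthyTimesRecordTimeout {
        recordUnhealthyTime(start, cost)
    }
    return nil
}
```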

Check List

Tests

  • Integration test
  • Manual test (add detailed scripts or steps below)

Release note

None.

Signed-off-by: husharp <jinhao.hu@pingcap.com>
ti-chi-bot (Contributor) commented Jan 5, 2024

[REVIEW NOTIFICATION]

This pull request has not been approved.

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot bot added the release-note-none Denotes a PR that doesn't merit a release note. label Jan 5, 2024
@ti-chi-bot ti-chi-bot bot requested review from HunDunDM and rleungx January 5, 2024 02:11
@ti-chi-bot ti-chi-bot bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 5, 2024
@HuSharp HuSharp requested review from JmPotato and lhy1024 and removed request for HunDunDM January 5, 2024 02:11

codecov bot commented Jan 5, 2024

Codecov Report

Merging #7670 (ceb0c7c) into master (335bd1e) will decrease coverage by 0.13%.
Report is 12 commits behind head on master.
The diff coverage is 55.07%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7670      +/-   ##
==========================================
- Coverage   73.89%   73.76%   -0.13%     
==========================================
  Files         429      429              
  Lines       47317    47413      +96     
==========================================
+ Hits        34965    34976      +11     
- Misses       9360     9429      +69     
- Partials     2992     3008      +16     
Flag       Coverage Δ
unittests  73.76% <55.07%> (-0.13%) ⬇️

Flags with carried forward coverage won't be shown.

    res, err := l.lease.KeepAliveOnce(ctx1, leaseID)
    if err != nil {
        log.Warn("lease keep alive failed", zap.String("purpose", l.Purpose), zap.Time("start", start), errs.ZapError(err))
        return
    }

    if time.Since(start) < unhealthyTimesRecordTimeout {
        l.removeUnHealthyTimesLock(start)
A reviewer (Member) commented:

Could we only add the unhealthy time when it's greater than unhealthyTimesRecordTimeout rather than adding it first and then deleting it?

HuSharp (Member, Author) replied:

Yes, we can. But that would rely on the upper-layer timeout mechanism, which only kicks in at 3s (or more).

Adding and then removing is more fine-grained.

That said, I am not sure that staying fine-grained below 3s is necessary.
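
For comparison, a minimal sketch of the alternative discussed here, reusing the assumed names from the earlier sketch (addUnhealthyTime is hypothetical): record a sample only once the call has already proven slow or failed, accepting that a hung KeepAliveOnce is only noticed after the caller's own timeout fires.

```go
// Record-only-when-slow variant (hypothetical names). A hung KeepAliveOnce is
// only observed here once ctx's own deadline (~3s or more) expires.
func keepAliveOnceRecordOnSlow(ctx context.Context, lease clientv3.Lease, id clientv3.LeaseID,
    addUnhealthyTime func(start time.Time)) {
    start := time.Now()
    _, err := lease.KeepAliveOnce(ctx, id)
    if err != nil || time.Since(start) > unhealthyTimesRecordTimeout {
        addUnhealthyTime(start)
    }
}
```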

server/server.go Outdated
@@ -1781,6 +1783,13 @@ func (s *Server) campaignLeader() {
log.Info("etcd leader changed, resigns pd leadership", zap.String("old-pd-leader-name", s.Name()))
return
}
// check healthy status of etcd leader.
if s.member.GetLeadership().GetUnHealthyTimesNum() > unhealthyLeaderLeaseTimes {
A reviewer (Member) commented:

What about doing this check at the beginning of the tick to make sure we could resign the etcd leader as soon as possible?

HuSharp (Member, Author) replied:

I think the 50ms delay can probably be ignored.
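
For reference, a sketch of the ordering the reviewer suggests, with hypothetical hooks (unhealthyCount, resignEtcdLeader, otherTickChecks) standing in for the real campaignLeader internals; the constant's value is illustrative:

```go
// Illustrative threshold; the real value comes from the PR, not from here.
const unhealthyLeaderLeaseTimes = 3

// campaignTickSketch checks the etcd leader's lease health at the top of each
// tick, before the other per-tick checks, so an unhealthy etcd leader can be
// resigned as early as possible.
func campaignTickSketch(ctx context.Context, tickInterval time.Duration,
    unhealthyCount func() int, resignEtcdLeader func() error, otherTickChecks func() bool) {
    ticker := time.NewTicker(tickInterval)
    defer ticker.Stop()
    for {
        select {
        case <-ticker.C:
            // Health check first, ahead of the PD-leadership and etcd-leader-change checks.
            if unhealthyCount() >= unhealthyLeaderLeaseTimes {
                if err := resignEtcdLeader(); err != nil {
                    log.Warn("failed to resign etcd leader", zap.Error(err))
                }
                return
            }
            if !otherTickChecks() {
                return
            }
        case <-ctx.Done():
            return
        }
    }
}
```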

Signed-off-by: husharp <jinhao.hu@pingcap.com>
Comment on lines +103 to +106
    l := ls.lease.Load()
    if l == nil {
        return 0
    }
A reviewer (Member) commented:

What about using ls.getLease() directly?
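
A sketch of what that simplification might look like; getLease() is the existing Leadership helper the reviewer refers to, while getUnHealthyTimesNum() on the lease is a hypothetical accessor name:

```go
// GetUnHealthyTimesNum, sketched with ls.getLease() instead of repeating the
// atomic Load plus nil check at each call site.
func (ls *Leadership) GetUnHealthyTimesNum() int {
    l := ls.getLease()
    if l == nil {
        return 0
    }
    return l.getUnHealthyTimesNum() // hypothetical accessor on the lease type
}
```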

Comment on lines +112 to +115
    l := ls.lease.Load()
    if l == nil {
        return
    }
A reviewer (Member) commented:

Ditto.

@@ -1781,6 +1783,14 @@ func (s *Server) campaignLeader() {
log.Info("etcd leader changed, resigns pd leadership", zap.String("old-pd-leader-name", s.Name()))
return
}
// check healthy status of etcd leader.
if s.member.GetLeadership().GetUnHealthyTimesNum() >= unhealthyLeaderLeaseTimes {
if err := s.member.ResignEtcdLeader(ctx, s.member.Name(), ""); err != nil {
A reviewer (Member) commented:

Better add a log here.
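
For illustration, the hunk above with logging added around the resignation could look like the following; the log messages are illustrative wording, and whatever the real code does after resigning is omitted:

```go
// check healthy status of etcd leader.
if s.member.GetLeadership().GetUnHealthyTimesNum() >= unhealthyLeaderLeaseTimes {
    log.Warn("etcd leader lease is unhealthy, resigning etcd leadership",
        zap.String("member", s.member.Name()))
    if err := s.member.ResignEtcdLeader(ctx, s.member.Name(), ""); err != nil {
        log.Error("failed to resign etcd leader", errs.ZapError(err))
    }
}
```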

Comment on lines +38 to +39
    unhealthyTimesRecordTimeout = 1 * time.Second
    unhealthyTTLGCInterval      = 5 * time.Second
A reviewer (Member) commented:

I prefer using a more conservative timeout value, e.g. the lease timeout of 3 seconds, since as long as the lease keeps alive successfully inside a lease timeout, it means the etcd leader is just fine. Based on this, manually controlling the GC might be a better option so we can clear the cache each time the lease is kept alive.
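
A rough sketch of that proposal, with hypothetical helper names (addUnhealthyTime, clearUnhealthyTimes): align the threshold with the 3-second lease timeout and reset the recorded samples on every healthy keep-alive, instead of running a periodic unhealthyTTLGCInterval sweep.

```go
// Threshold aligned with the lease timeout, as suggested; the value is
// illustrative and would come from the real lease configuration.
const leaseAlignedRecordTimeout = 3 * time.Second

func keepAliveOnceWithManualGC(ctx context.Context, lease clientv3.Lease, id clientv3.LeaseID,
    addUnhealthyTime func(time.Time), clearUnhealthyTimes func()) {
    start := time.Now()
    _, err := lease.KeepAliveOnce(ctx, id)
    if err == nil && time.Since(start) < leaseAlignedRecordTimeout {
        // A keep-alive that completes within the lease timeout shows the etcd
        // leader is healthy, so clear previously recorded samples right here.
        clearUnhealthyTimes()
        return
    }
    addUnhealthyTime(start)
}
```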

HuSharp (Member, Author) commented Jan 9, 2024

/hold

@ti-chi-bot ti-chi-bot bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 9, 2024
HuSharp (Member, Author) commented Feb 1, 2024

Fixed by #7737.

@HuSharp HuSharp closed this Feb 1, 2024
@HuSharp HuSharp deleted the check_unhealthy_lease branch February 1, 2024 08:05