
lease: check etcd leader healthy by KeepAliveOnce #7670

Closed · wants to merge 2 commits

Conversation

HuSharp (Member) commented Jan 5, 2024

What problem does this PR solve?

Issue Number: Ref #7499

What is changed and how does it work?

  • Since the etcd KeepAliveOnce function writes to disk, we can record its operation time to check etcd's health (see the sketch after this list).

    The call stack is: KeepAliveOnce -> ls.hdr.fill(resp.Header) -> rev(readView.rev) -> tr.End -> metricsTxnWrite.End -> storeTxnWrite.End -> saveIndex -> UnsafePut

  • This check is more fine-grained than the leaderCampaignTimes check.
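
As a rough illustration of this mechanism (not the PR's actual code: the threshold value and the recordUnhealthyTime hook below are assumptions, and the etcd client import path depends on the client version in use), the check amounts to timing each KeepAliveOnce round trip:

```go
package election

import (
    "context"
    "time"

    "github.com/pingcap/log"
    "go.etcd.io/etcd/clientv3" // v3.4 client path; newer clients use go.etcd.io/etcd/client/v3
    "go.uber.org/zap"
)

// Illustrative threshold; the review below debates 1s vs. the 3s lease timeout.
const unhealthyTimesRecordTimeout = time.Second

// keepAliveOnceTimed times one KeepAliveOnce round trip. Because the call ends
// up writing to disk on the etcd leader (... -> saveIndex -> UnsafePut), a slow
// round trip points at an unhealthy leader rather than a pure network blip.
// recordUnhealthyTime is a hypothetical hook standing in for the PR's
// unhealthy-times bookkeeping.
func keepAliveOnceTimed(ctx context.Context, lease clientv3.Lease, id clientv3.LeaseID,
    recordUnhealthyTime func(start time.Time, cost time.Duration)) error {
    start := time.Now()
    _, err := lease.KeepAliveOnce(ctx, id)
    if err != nil {
        log.Warn("lease keep alive failed", zap.Error(err))
        return err
    }
    if cost := time.Since(start); cost > unhealthyTimesRecordTimeout {
        recordUnhealthyTime(start, cost)
    }
    return nil
}
```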

Check List

Tests

  • Integration test
  • Manual test (add detailed scripts or steps below)

Release note

None.

Signed-off-by: husharp <jinhao.hu@pingcap.com>
ti-chi-bot (Contributor) commented Jan 5, 2024

[REVIEW NOTIFICATION]

This pull request has not been approved.

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot bot added the release-note-none Denotes a PR that doesn't merit a release note. label Jan 5, 2024
@ti-chi-bot ti-chi-bot bot requested review from HunDunDM and rleungx January 5, 2024 02:11
@ti-chi-bot ti-chi-bot bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 5, 2024
@HuSharp HuSharp requested review from JmPotato and lhy1024 and removed request for HunDunDM January 5, 2024 02:11

codecov bot commented Jan 5, 2024

Codecov Report

Merging #7670 (ceb0c7c) into master (335bd1e) will decrease coverage by 0.13%.
Report is 12 commits behind head on master.
The diff coverage is 55.07%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7670      +/-   ##
==========================================
- Coverage   73.89%   73.76%   -0.13%     
==========================================
  Files         429      429              
  Lines       47317    47413      +96     
==========================================
+ Hits        34965    34976      +11     
- Misses       9360     9429      +69     
- Partials     2992     3008      +16     
Flag       Coverage Δ
unittests  73.76% <55.07%> (-0.13%) ⬇️

Flags with carried forward coverage won't be shown.

    res, err := l.lease.KeepAliveOnce(ctx1, leaseID)
    if err != nil {
        log.Warn("lease keep alive failed", zap.String("purpose", l.Purpose), zap.Time("start", start), errs.ZapError(err))
        return
    }

    if time.Since(start) < unhealthyTimesRecordTimeout {
        l.removeUnHealthyTimesLock(start)
A reviewer (Member) commented:

Could we only add the unhealthy time when it's greater than unhealthyTimesRecordTimeout rather than adding it first and then deleting it?

HuSharp (Member, Author) replied:

Yes, we can. But that would rely on the upper-layer timeout mechanism, which only kicks in at 3s (or more).

Adding and then removing is more fine-grained.

That said, I am not sure that staying fine-grained below 3s is necessary.
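
For comparison, a minimal sketch of the alternative discussed here, reusing the assumed names from the earlier sketch (addUnhealthyTime is hypothetical): record a sample only once the call has already proven slow or failed, accepting that a hung KeepAliveOnce is only noticed after the caller's own timeout fires.

```go
// Record-only-when-slow variant (hypothetical names). A hung KeepAliveOnce is
// only observed here once ctx's own deadline (~3s or more) expires.
func keepAliveOnceRecordOnSlow(ctx context.Context, lease clientv3.Lease, id clientv3.LeaseID,
    addUnhealthyTime func(start time.Time)) {
    start := time.Now()
    _, err := lease.KeepAliveOnce(ctx, id)
    if err != nil || time.Since(start) > unhealthyTimesRecordTimeout {
        addUnhealthyTime(start)
    }
}
```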

server/server.go Outdated
@@ -1781,6 +1783,13 @@ func (s *Server) campaignLeader() {
log.Info("etcd leader changed, resigns pd leadership", zap.String("old-pd-leader-name", s.Name()))
return
}
// check healthy status of etcd leader.
if s.member.GetLeadership().GetUnHealthyTimesNum() > unhealthyLeaderLeaseTimes {
A reviewer (Member) commented:

What about doing this check at the beginning of the tick to make sure we could resign the etcd leader as soon as possible?

HuSharp (Member, Author) replied:

I think the 50ms delay can probably be ignored.
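
For reference, a sketch of the ordering the reviewer suggests, with hypothetical hooks (unhealthyCount, resignEtcdLeader, otherTickChecks) standing in for the real campaignLeader internals; the constant's value is illustrative:

```go
// Illustrative threshold; the real value comes from the PR, not from here.
const unhealthyLeaderLeaseTimes = 3

// campaignTickSketch checks the etcd leader's lease health at the top of each
// tick, before the other per-tick checks, so an unhealthy etcd leader can be
// resigned as early as possible.
func campaignTickSketch(ctx context.Context, tickInterval time.Duration,
    unhealthyCount func() int, resignEtcdLeader func() error, otherTickChecks func() bool) {
    ticker := time.NewTicker(tickInterval)
    defer ticker.Stop()
    for {
        select {
        case <-ticker.C:
            // Health check first, ahead of the PD-leadership and etcd-leader-change checks.
            if unhealthyCount() >= unhealthyLeaderLeaseTimes {
                if err := resignEtcdLeader(); err != nil {
                    log.Warn("failed to resign etcd leader", zap.Error(err))
                }
                return
            }
            if !otherTickChecks() {
                return
            }
        case <-ctx.Done():
            return
        }
    }
}
```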

Signed-off-by: husharp <jinhao.hu@pingcap.com>
Comment on lines +103 to +106
    l := ls.lease.Load()
    if l == nil {
        return 0
    }
A reviewer (Member) commented:

What about using ls.getLease() directly?
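
A sketch of what that simplification might look like; getLease() is the existing Leadership helper the reviewer refers to, while getUnHealthyTimesNum() on the lease is a hypothetical accessor name:

```go
// GetUnHealthyTimesNum, sketched with ls.getLease() instead of repeating the
// atomic Load plus nil check at each call site.
func (ls *Leadership) GetUnHealthyTimesNum() int {
    l := ls.getLease()
    if l == nil {
        return 0
    }
    return l.getUnHealthyTimesNum() // hypothetical accessor on the lease type
}
```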

Comment on lines +112 to +115
    l := ls.lease.Load()
    if l == nil {
        return
    }
A reviewer (Member) commented:

Ditto.

@@ -1781,6 +1783,14 @@ func (s *Server) campaignLeader() {
log.Info("etcd leader changed, resigns pd leadership", zap.String("old-pd-leader-name", s.Name()))
return
}
// check healthy status of etcd leader.
if s.member.GetLeadership().GetUnHealthyTimesNum() >= unhealthyLeaderLeaseTimes {
if err := s.member.ResignEtcdLeader(ctx, s.member.Name(), ""); err != nil {
A reviewer (Member) commented:

Better add a log here.
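
For illustration, the hunk above with logging added around the resignation could look like the following; the log messages are illustrative wording, and whatever the real code does after resigning is omitted:

```go
// check healthy status of etcd leader.
if s.member.GetLeadership().GetUnHealthyTimesNum() >= unhealthyLeaderLeaseTimes {
    log.Warn("etcd leader lease is unhealthy, resigning etcd leadership",
        zap.String("member", s.member.Name()))
    if err := s.member.ResignEtcdLeader(ctx, s.member.Name(), ""); err != nil {
        log.Error("failed to resign etcd leader", errs.ZapError(err))
    }
}
```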

Comment on lines +38 to +39
    unhealthyTimesRecordTimeout = 1 * time.Second
    unhealthyTTLGCInterval      = 5 * time.Second
A reviewer (Member) commented:

I prefer using a more conservative timeout value, e.g. the lease timeout of 3 seconds, since as long as the lease keeps alive successfully inside a lease timeout, it means the etcd leader is just fine. Based on this, manually controlling the GC might be a better option so we can clear the cache each time the lease is kept alive.
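
A rough sketch of that proposal, with hypothetical helper names (addUnhealthyTime, clearUnhealthyTimes): align the threshold with the 3-second lease timeout and reset the recorded samples on every healthy keep-alive, instead of running a periodic unhealthyTTLGCInterval sweep.

```go
// Threshold aligned with the lease timeout, as suggested; the value is
// illustrative and would come from the real lease configuration.
const leaseAlignedRecordTimeout = 3 * time.Second

func keepAliveOnceWithManualGC(ctx context.Context, lease clientv3.Lease, id clientv3.LeaseID,
    addUnhealthyTime func(time.Time), clearUnhealthyTimes func()) {
    start := time.Now()
    _, err := lease.KeepAliveOnce(ctx, id)
    if err == nil && time.Since(start) < leaseAlignedRecordTimeout {
        // A keep-alive that completes within the lease timeout shows the etcd
        // leader is healthy, so clear previously recorded samples right here.
        clearUnhealthyTimes()
        return
    }
    addUnhealthyTime(start)
}
```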

HuSharp (Member, Author) commented Jan 9, 2024

/hold

@ti-chi-bot ti-chi-bot bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 9, 2024
HuSharp (Member, Author) commented Feb 1, 2024

Fixed by #7737.

@HuSharp HuSharp closed this Feb 1, 2024
@HuSharp HuSharp deleted the check_unhealthy_lease branch February 1, 2024 08:05