lease: check etcd leader healthy by KeepAliveOnce #7670
Codecov Report
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7670      +/-   ##
==========================================
- Coverage   73.89%   73.76%   -0.13%
==========================================
  Files         429      429
  Lines       47317    47413      +96
==========================================
+ Hits        34965    34976      +11
- Misses       9360     9429      +69
- Partials     2992     3008      +16
res, err := l.lease.KeepAliveOnce(ctx1, leaseID)
if err != nil {
	log.Warn("lease keep alive failed", zap.String("purpose", l.Purpose), zap.Time("start", start), errs.ZapError(err))
	return
}

if time.Since(start) < unhealthyTimesRecordTimeout {
	l.removeUnHealthyTimesLock(start)
Could we add the unhealthy time only when the elapsed time is greater than unhealthyTimesRecordTimeout, rather than adding it first and then deleting it?
Yes, we can. But then it would rely on the upper-layer timeout mechanism, which only triggers at 3s (or greater).
Adding and then removing is more fine-grained.
That said, I am not sure that granularity finer than 3s is necessary.
server/server.go
Outdated
@@ -1781,6 +1783,13 @@ func (s *Server) campaignLeader() {
			log.Info("etcd leader changed, resigns pd leadership", zap.String("old-pd-leader-name", s.Name()))
			return
		}
		// check healthy status of etcd leader.
		if s.member.GetLeadership().GetUnHealthyTimesNum() > unhealthyLeaderLeaseTimes {
What about doing this check at the beginning of the tick to make sure we could resign the etcd leader as soon as possible?
I think the 50ms can probably be ignored.
Signed-off-by: husharp <jinhao.hu@pingcap.com>
l := ls.lease.Load()
if l == nil {
	return 0
}
What about using ls.getLease() directly?
l := ls.lease.Load()
if l == nil {
	return
}
Ditto.
@@ -1781,6 +1783,14 @@ func (s *Server) campaignLeader() {
			log.Info("etcd leader changed, resigns pd leadership", zap.String("old-pd-leader-name", s.Name()))
			return
		}
		// check healthy status of etcd leader.
		if s.member.GetLeadership().GetUnHealthyTimesNum() >= unhealthyLeaderLeaseTimes {
			if err := s.member.ResignEtcdLeader(ctx, s.member.Name(), ""); err != nil {
Better add a log here.
unhealthyTimesRecordTimeout = 1 * time.Second
unhealthyTTLGCInterval = 5 * time.Second
I prefer using a more conservative timeout value, e.g. the lease timeout of 3 seconds, since as long as the lease keeps alive successfully inside a lease timeout, it means the etcd leader is just fine. Based on this, manually controlling the GC might be a better option so we can clear the cache each time the lease is kept alive.
/hold
Fixed by #7737.
What problem does this PR solve?
Issue Number: Ref #7499
What is changed and how does it work?
Etcd.KeepAliveOnce writes to disk, so we can record how long the call takes and use that to check whether etcd is healthy.
leaderCampaignTimes check
Check List
Tests
Release note