etcdutil: consider the latency while patrolling the healthy endpoints #7737
[REVIEW NOTIFICATION] This pull request has been approved by:
To complete the pull request process, please ask the reviewers in the list to review by filling
The full list of commands accepted by this bot can be found here.
Reviewer can indicate their review by submitting an approval review.
Codecov Report
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7737      +/-   ##
==========================================
+ Coverage   73.54%   73.57%   +0.02%
==========================================
  Files         430      430
  Lines       47645    47740      +95
==========================================
+ Hits        35042    35126      +84
- Misses       9605     9606       +1
- Partials     2998     3008      +10
Flags with carried forward coverage won't be shown.
Do we need to add some metrics to indicate healthy status?
Added. PTAL.
Can you paste the metric picture in the PR description? :)
rest LGTM!
pkg/utils/etcdutil/health_checker.go
var (
	lastEps   = checker.client.Endpoints()
	pickedEps = checker.pickEps(probeCh)
)
if len(pickedEps) > 0 {
	checker.updateEvictedEps(lastEps, pickedEps)
	pickedEps = checker.filterEps(pickedEps)
}
return lastEps, pickedEps, !typeutil.AreStringSlicesEquivalent(lastEps, pickedEps)
These are the key modifications. Please review this section in particular.
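The decision made at the end of the snippet under review can be sketched in isolation. This is a minimal illustration, not the PR's code: `sameEndpoints` stands in for `typeutil.AreStringSlicesEquivalent` and is assumed to ignore ordering, and `nextEndpoints` is a hypothetical helper name that also encodes the rule that an empty pick never replaces the current endpoints.

```go
package main

import "sort"

// sameEndpoints reports whether a and b contain the same endpoints,
// ignoring order. It is a stand-in (assumption) for
// typeutil.AreStringSlicesEquivalent and does not mutate its inputs.
func sameEndpoints(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	x := append([]string(nil), a...)
	y := append([]string(nil), b...)
	sort.Strings(x)
	sort.Strings(y)
	for i := range x {
		if x[i] != y[i] {
			return false
		}
	}
	return true
}

// nextEndpoints (hypothetical name) decides which endpoints the client
// should use after one patrol round and whether an update is needed.
func nextEndpoints(lastEps, pickedEps []string) (eps []string, changed bool) {
	// An empty pick means no endpoint looked usable in this round,
	// so keep the current endpoints instead of replacing them with nothing.
	if len(pickedEps) == 0 {
		return lastEps, false
	}
	// The same set in a different order is not a real change.
	if sameEndpoints(lastEps, pickedEps) {
		return lastEps, false
	}
	return pickedEps, true
}
```

Comparing the two lists as sets avoids updating the client's endpoints when a patrol round merely returns them in a different order.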
pickedEps := make([]string, 0, len(eps))
for _, ep := range eps {
	if count, ok := checker.evictedEps.Load(ep); ok {
		if count.(int) < pickedCountThreshold {
What if the picked counts of all the eps are less than pickedCountThreshold?
If the final list of picked endpoints is empty, we will not use it to replace the last endpoints.
Then, we might need to reset all, right?
Do you mean we should reset all endpoints when there are no available endpoints to improve availability, considering the situation can't get any worse?
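One way the threshold filter and the fallback discussed in this thread could fit together is sketched below. It is an assumption-laden illustration rather than the merged code: the `checker` type, the value of `pickedCountThreshold`, the removal of an endpoint from `evictedEps` once it passes the threshold, and the fallback to the unfiltered list are all hypothetical. Only the `sync.Map` lookup and the `count.(int) < pickedCountThreshold` comparison come from the snippet under review.

```go
package main

import "sync"

// pickedCountThreshold is a hypothetical value; the thread does not state it.
const pickedCountThreshold = 3

// checker is a stripped-down stand-in for the health checker.
type checker struct {
	// evictedEps maps an endpoint (string) to the number of times (int)
	// it has been picked since it was evicted.
	evictedEps sync.Map
}

// filterEps drops evicted endpoints that have not yet been picked
// pickedCountThreshold times.
func (c *checker) filterEps(eps []string) []string {
	picked := make([]string, 0, len(eps))
	for _, ep := range eps {
		if count, ok := c.evictedEps.Load(ep); ok {
			if count.(int) < pickedCountThreshold {
				continue // Still on probation; skip it this round.
			}
			// Assumption: an endpoint that passes the threshold is
			// no longer treated as evicted.
			c.evictedEps.Delete(ep)
		}
		picked = append(picked, ep)
	}
	// Assumption: if every candidate is still on probation, fall back to
	// the unfiltered list, since the situation cannot get any worse.
	if len(picked) == 0 {
		return eps
	}
	return picked
}
```

With the fallback in place, a round in which every picked endpoint is below the threshold still yields a non-empty list, which is the availability concern raised in the question above.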
Signed-off-by: JmPotato <ghzpotato@gmail.com>
/merge
@JmPotato: It seems you want to merge this PR, I will help you trigger all the tests: /run-all-tests You only need to trigger
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.
This pull request has been accepted and is ready to merge. Commit hash: ade43b6
What problem does this PR solve?
Issue Number: ref #7730, #7499.
What is changed and how does it work?
Check List
Tests
Inject IO latency into the etcd leader like:
The period during which QPS is affected is shorter, and QPS does not necessarily drop all the way to the bottom. It is only impacted by the switch of the etcd leader.
Before:
The etcd leader change finishes within a single term.
Before:
Endpoint updates are stable and occur only at the IO hang injection and recovery points.
Before:
Release note