Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

etcdutil: consider the latency while patrolling the healthy endpoints #7737

Merged
merged 8 commits into from
Jan 30, 2024

Conversation

JmPotato
Copy link
Member

@JmPotato JmPotato commented Jan 19, 2024

What problem does this PR solve?

Issue Number: ref #7730, #7499.

What is changed and how does it work?

Consider the latency while patrolling the healthy endpoints to reduce the effect of slow nodes.
Now, there are the following strategies to select and remove unhealthy endpoints:

- Choose only the healthy endpoint within the lowest acceptable latency range.
- The evicted endpoint can only rejoin if it is selected again for three consecutive times.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)

Inject the etcd leader IO latency like:

image image

The duration of QPS being affected is reduced, and it may not necessarily hit rock bottom completely. It will only be impacted by the switch of the etcd leader.

image

Before:

image

Etcd leader changing will be finished in just one term.

image

Before:

image

Endpoint updating will be stabilized and only occur during IO hang injection and recovery points.

image

image

Before:

img_v3_0276_d4aef8d7-fb9e-4ae5-a0de-638c0905dccg

Release note

Enhance PD availability in the event of IO hang causing node unavailability.

Copy link
Contributor

ti-chi-bot bot commented Jan 19, 2024

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • HuSharp
  • rleungx

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jan 19, 2024
Copy link

codecov bot commented Jan 19, 2024

Codecov Report

Merging #7737 (ade43b6) into master (e8da033) will increase coverage by 0.02%.
The diff coverage is 95.27%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7737      +/-   ##
==========================================
+ Coverage   73.54%   73.57%   +0.02%     
==========================================
  Files         430      430              
  Lines       47645    47740      +95     
==========================================
+ Hits        35042    35126      +84     
- Misses       9605     9606       +1     
- Partials     2998     3008      +10     
Flag Coverage Δ
unittests 73.57% <95.27%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

@ti-chi-bot ti-chi-bot bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jan 22, 2024
Copy link
Member

@HuSharp HuSharp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to add some metrics to indicate healthy status?

@ti-chi-bot ti-chi-bot bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Jan 22, 2024
@JmPotato
Copy link
Member Author

Do we need to add some metrics to indicate healthy status?

Added. PTAL.

@HuSharp
Copy link
Member

HuSharp commented Jan 23, 2024

Do we need to add some metrics to indicate healthy status?

Added. PTAL.

can you paste the metric pic in the pr description :)

@ti-chi-bot ti-chi-bot bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jan 23, 2024
metrics/grafana/pd.json Outdated Show resolved Hide resolved
Copy link
Member

@HuSharp HuSharp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rest LGTM!

pkg/utils/etcdutil/health_checker.go Show resolved Hide resolved
Comment on lines 210 to 218
var (
lastEps = checker.client.Endpoints()
pickedEps = checker.pickEps(probeCh)
)
if len(pickedEps) > 0 {
checker.updateEvictedEps(lastEps, pickedEps)
pickedEps = checker.filterEps(pickedEps)
}
return lastEps, pickedEps, !typeutil.AreStringSlicesEquivalent(lastEps, pickedEps)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Key modifications. You can specifically review this section.

pickedEps := make([]string, 0, len(eps))
for _, ep := range eps {
if count, ok := checker.evictedEps.Load(ep); ok {
if count.(int) < pickedCountThreshold {
Copy link
Member

@rleungx rleungx Jan 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if all the eps are less than pickedCountThreshold?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then, we might need to reset all, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean we should reset all endpoints when there are no available endpoints to improve availability, considering the situation can't get any worse?

Signed-off-by: JmPotato <ghzpotato@gmail.com>
Signed-off-by: JmPotato <ghzpotato@gmail.com>
Signed-off-by: JmPotato <ghzpotato@gmail.com>
Signed-off-by: JmPotato <ghzpotato@gmail.com>
Signed-off-by: JmPotato <ghzpotato@gmail.com>
Signed-off-by: JmPotato <ghzpotato@gmail.com>
Signed-off-by: JmPotato <ghzpotato@gmail.com>
Signed-off-by: JmPotato <ghzpotato@gmail.com>
@ti-chi-bot ti-chi-bot bot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Jan 30, 2024
@JmPotato
Copy link
Member Author

/merge

Copy link
Contributor

ti-chi-bot bot commented Jan 30, 2024

@JmPotato: It seems you want to merge this PR, I will help you trigger all the tests:

/run-all-tests

You only need to trigger /merge once, and if the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

If you have any questions about the PR merge process, please refer to pr process.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

Copy link
Contributor

ti-chi-bot bot commented Jan 30, 2024

This pull request has been accepted and is ready to merge.

Commit hash: ade43b6

@ti-chi-bot ti-chi-bot bot added the status/can-merge Indicates a PR has been approved by a committer. label Jan 30, 2024
@ti-chi-bot ti-chi-bot bot merged commit 1c54865 into tikv:master Jan 30, 2024
25 of 26 checks passed
@JmPotato JmPotato deleted the cal branch January 30, 2024 08:22
ti-chi-bot bot pushed a commit that referenced this pull request Feb 4, 2024
close #7251

Cherry-pick the etcd client health checker improvements from #7725. #7727, #7743,  #7737 and #7779.

Signed-off-by: JmPotato <ghzpotato@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants