Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

etcdutil: remove client when etcd server is unhealthy #7729

Closed
wants to merge 4 commits into from

Conversation

HuSharp
Copy link
Member

@HuSharp HuSharp commented Jan 18, 2024

What problem does this PR solve?

Issue Number: Ref #7730

What is changed and how does it work?

Question summary:

  • We have a etcd endpoints updated mechanism which will be balanced in etcd client's balance mechanism
  • When the new PD leader sends an etcd request:
    • Write requests can only be handled by the etcd leader, at this time, the etcd leader does not have io hang problem, so there is no problem to write, and it can be elected as the PD leader successfully.
    • Read requests can be handled by all nodes and default to ReadIndex, at this point if the request is sent to a machine that has been injected with io hang, the read index is likely to get stuck and fail with a timeout because the application can't advance normally.

PR Summary:

  • When meeting frequently error, will remove client endpoints
  • client endpoints will be added after healthy check

TODO

using trend to update ref #7682

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)

Release note

None.

Copy link
Contributor

ti-chi-bot bot commented Jan 18, 2024

[REVIEW NOTIFICATION]

This pull request has not been approved.

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

Copy link
Contributor

ti-chi-bot bot commented Jan 18, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@ti-chi-bot ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Jan 18, 2024
@ti-chi-bot ti-chi-bot bot requested review from nolouch and rleungx January 18, 2024 01:46
@ti-chi-bot ti-chi-bot bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 18, 2024
@HuSharp HuSharp force-pushed the check_frequent_client branch 3 times, most recently from dcde812 to 61d2b4d Compare January 18, 2024 03:20
Copy link

codecov bot commented Jan 18, 2024

Codecov Report

Merging #7729 (008c3d4) into master (1a79fb0) will increase coverage by 0.01%.
The diff coverage is 97.22%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7729      +/-   ##
==========================================
+ Coverage   73.79%   73.80%   +0.01%     
==========================================
  Files         429      429              
  Lines       47568    47595      +27     
==========================================
+ Hits        35101    35126      +25     
+ Misses       9479     9478       -1     
- Partials     2988     2991       +3     
Flag Coverage Δ
unittests 73.80% <97.22%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

@HuSharp HuSharp force-pushed the check_frequent_client branch 2 times, most recently from ecab0a8 to a383beb Compare January 18, 2024 03:43
Signed-off-by: husharp <jinhao.hu@pingcap.com>
Signed-off-by: husharp <jinhao.hu@pingcap.com>
@HuSharp HuSharp marked this pull request as ready for review January 18, 2024 04:54
@ti-chi-bot ti-chi-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 18, 2024
Signed-off-by: husharp <jinhao.hu@pingcap.com>
@HuSharp
Copy link
Member Author

HuSharp commented Jan 18, 2024

/hold

@ti-chi-bot ti-chi-bot bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 18, 2024
@@ -1684,6 +1689,12 @@ func (s *Server) campaignLeader() {
log.Info(fmt.Sprintf("campaign %s leader meets error due to txn conflict, another PD/API server may campaign successfully", s.mode),
zap.String("campaign-leader-name", s.Name()))
} else {
// for frequently changed etcd leader, we should set client status to unhealthy
if strings.Contains(err.Error(), "frequently changed") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only this one kind of error?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will being checked by trend

if status, ok := checker.isUnHealthy.Load(ep); ok && status.(bool) {
log.Info("some etcd server is unhealthy", zap.String("endpoint", ep))
// set unhealthy etcd client to healthy
checker.isUnHealthy.Delete(ep)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we remove isUnHealthy here, will we receive "frequently changed" error again?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trend check will make sure this node not participate in campaigning leader

@HuSharp
Copy link
Member Author

HuSharp commented Jan 18, 2024

Since we need to get the state of all the pd's and then temporarily remove them in current PD own etcd client checker, I'll implement it in the trend.

Close this pr now.

@HuSharp HuSharp closed this Jan 18, 2024
@HuSharp HuSharp deleted the check_frequent_client branch January 21, 2024 01:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants