etcdutil: add dial keep alive params to switch connect as soon as possible #6059

Merged
merged 8 commits into from
Mar 7, 2023

Contributor

lhy1024 commented Feb 28, 2023

What problem does this PR solve?

Issue Number: Close #6053

What is changed and how does it work?

After #6046, the client supports multiple endpoints, but it cannot switch connections promptly when an endpoint hangs. This PR adds dial keep-alive timeout parameters so that it can.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)

Results from the same network isolation test:

Before #6046: [screenshot 907b8efd-bf7f-4f78-9561-631ff631bc26]

After #6046: [screenshot 622ee73c-bc82-4c7a-92fc-5ec9df6c119a]

After #6059: [screenshot img_v2_fbcc94ee-d64e-421e-ad70-9b78bb82e25g]

Release note

None.

Member

ti-chi-bot commented Feb 28, 2023

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • bufferflies
  • nolouch

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

codecov bot commented Feb 28, 2023

Codecov Report

Patch coverage: 100.00% and no project coverage change

Comparison is base (aed8a88) 73.96% compared to head (7556e8e) 73.97%.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #6059   +/-   ##
=======================================
  Coverage   73.96%   73.97%           
=======================================
  Files         385      385           
  Lines       37973    37982    +9     
=======================================
+ Hits        28087    28096    +9     
- Misses       7397     7399    +2     
+ Partials     2489     2487    -2     
| Flag | Coverage Δ |
|---|---|
| unittests | 73.97% <100.00%> (+<0.01%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
|---|---|
| pkg/utils/etcdutil/etcdutil.go | 80.91% <100.00%> (+0.42%) ⬆️ |
| pkg/utils/etcdutil/testutil.go | 100.00% <100.00%> (ø) |
| pkg/tso/tso.go | 66.85% <0.00%> (-8.99%) ⬇️ |
| client/resource_group/controller/limiter.go | 61.25% <0.00%> (-6.25%) ⬇️ |
| pkg/id/id.go | 83.05% <0.00%> (-3.39%) ⬇️ |
| pkg/tso/allocator_manager.go | 62.88% <0.00%> (-2.27%) ⬇️ |
| server/server.go | 73.97% <0.00%> (-0.64%) ⬇️ |
| pkg/mcs/resource_manager/server/manager.go | 81.13% <0.00%> (-0.63%) ⬇️ |
| client/resource_group/controller/controller.go | 61.68% <0.00%> (-0.24%) ⬇️ |

... and 19 more

AutoSyncInterval: autoSyncInterval,
TLS: tlsConfig,
LogConfig: &lgc,
DialKeepAliveTime: 10 * time.Second,
Contributor

Did this config item exist before?

Contributor Author

These settings are added so that the client can switch away from a failed endpoint; see etcd-io/etcd#7941 (comment).

Contributor

bufferflies commented Feb 28, 2023

How about adding a unit test for the case where the leader's network is isolated?

@@ -41,6 +41,13 @@ const (
// defaultAutoSyncInterval is the interval to sync etcd cluster.
defaultAutoSyncInterval = 60 * time.Second

// defaultDialKeepAliveTime is the time after which client pings the server to see if transport is alive.
defaultDialKeepAliveTime = 10 * time.Second
Member

What is the default value if we don't set them?

Contributor Author

zero

Contributor

bufferflies commented Mar 2, 2023

In your experiment results, the client sends etcd requests to all the followers, whereas in the old behavior only the leader received client requests.
[screenshot 907b8efd-bf7f-4f78-9561-631ff631bc26]

Contributor Author

lhy1024 commented Mar 2, 2023

In your experiment results, the client sends etcd requests to all the followers, whereas in the old behavior only the leader received client requests. [screenshot 907b8efd-bf7f-4f78-9561-631ff631bc26]

Yes, the behavior has changed. I have added a more detailed description.

Contributor Author

lhy1024 commented Mar 2, 2023

How about adding a unit test for the case where the leader's network is isolated?

Added a test that injects delay through a TCP reverse proxy.

@lhy1024 lhy1024 requested review from nolouch and rleungx March 2, 2023 09:42
@lhy1024 lhy1024 changed the title from "etcdutil: fix dial keep alive" to "etcdutil: add dial keep alive params to switch connect as soon as possible" Mar 2, 2023
…sible

Signed-off-by: lhy1024 <admin@liudos.us>
Contributor

binshi-bing left a comment

Left a few comments.

pkg/utils/etcdutil/etcdutil_test.go (outdated, resolved)
pkg/utils/etcdutil/etcdutil_test.go (resolved)
@ti-chi-bot
Member

@binshi-bing: Thanks for your review. The bot only counts approvals from reviewers and higher roles in list, but you're still welcome to leave your comments.

In response to this:

Left a few comments.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

Signed-off-by: lhy1024 <admin@liudos.us>
Signed-off-by: lhy1024 <admin@liudos.us>
Contributor

binshi-bing left a comment

left a critical comment

require.NoError(t, failpoint.Disable("github.com/tikv/pd/pkg/utils/etcdutil/autoSyncInterval"))
func ioCopy(dst io.Writer, src io.Reader, enableDiscard *atomic.Bool) (err error) {
buffer := make([]byte, 32*1024)
for {
Contributor

This becomes an infinite loop when src.Read(buffer) returns (non-zero, EOF) and the next Read returns (0, EOF).

Contributor Author

Changed it to return directly.

Signed-off-by: lhy1024 <admin@liudos.us>
@ti-chi-bot ti-chi-bot added the status/LGT1 Indicates that a PR has LGTM 1. label Mar 3, 2023
Contributor Author

lhy1024 commented Mar 3, 2023

cc @nolouch @rleungx would you like to take a look?

Contributor Author

lhy1024 commented Mar 6, 2023

There seems to be a file permission problem in CI.

2023-03-03T10:41:29.0683530Z {"level":"fatal","ts":"2023-03-03T10:41:28.949Z","caller":"etcdserver/server.go:859","msg":"failed to purge wal file","error":"open /tmp/TestEtcdWithHangLeader1350266089/003/member/wal: no such file or directory","stacktrace":"go.etcd.io/etcd/etcdserver.(*EtcdServer).purgeFile\n\t/home/runner/go/pkg/mod/go.etcd.io/etcd@v0.5.0-alpha.5.0.20220915004622-85b640cee793/etcdserver/server.go:859\ngo.etcd.io/etcd/etcdserver.(*EtcdServer).goAttach.func1\n\t/home/runner/go/pkg/mod/go.etcd.io/etcd@v0.5.0-alpha.5.0.20220915004622-85b640cee793/etcdserver/server.go:2698"}

I ran the test in my dev environment and it always succeeded.

for num in {1..10}; do
    go clean -testcache
    go test -timeout 60s -run ^TestEtcdWithHangLeader$ github.com/tikv/pd/pkg/utils/etcdutil
done
ok  	github.com/tikv/pd/pkg/utils/etcdutil	18.191s
ok  	github.com/tikv/pd/pkg/utils/etcdutil	17.576s
ok  	github.com/tikv/pd/pkg/utils/etcdutil	17.291s
ok  	github.com/tikv/pd/pkg/utils/etcdutil	17.499s
ok  	github.com/tikv/pd/pkg/utils/etcdutil	18.181s
ok  	github.com/tikv/pd/pkg/utils/etcdutil	17.978s
ok  	github.com/tikv/pd/pkg/utils/etcdutil	17.601s
ok  	github.com/tikv/pd/pkg/utils/etcdutil	17.297s
ok  	github.com/tikv/pd/pkg/utils/etcdutil	17.486s
ok  	github.com/tikv/pd/pkg/utils/etcdutil	17.385s

Signed-off-by: lhy1024 <admin@liudos.us>
Signed-off-by: lhy1024 <admin@liudos.us>
Signed-off-by: lhy1024 <admin@liudos.us>
Contributor Author

lhy1024 commented Mar 6, 2023

It is stable in the CI environment after adding a random filename and no longer running the test in parallel.

ci list: 1 2 3

@ti-chi-bot ti-chi-bot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Mar 7, 2023
Contributor Author

lhy1024 commented Mar 7, 2023

/merge

@ti-chi-bot
Member

@lhy1024: It seems you want to merge this PR, I will help you trigger all the tests:

/run-all-tests

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@ti-chi-bot
Member

This pull request has been accepted and is ready to merge.

Commit hash: 975a400

@ti-chi-bot ti-chi-bot added the status/can-merge Indicates a PR has been approved by a committer. label Mar 7, 2023
@ti-chi-bot
Member

@lhy1024: Your PR was out of date, I have automatically updated it for you.

If the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@ti-chi-bot ti-chi-bot merged commit 253c798 into tikv:master Mar 7, 2023
Labels
release-note-none status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

qps dorp to zero after pdleader network partition
6 participants