
kv/client: fix gRPC connection pool, don't close conn when meeting error (#1196) #1198

Merged
merged 1 commit into pingcap:release-4.0 on Dec 11, 2020

Conversation

ti-srebot (Contributor)

cherry-pick #1196 to release-4.0


What problem does this PR solve?

In an internal test, TiCDC can't recover when a TiKV server crashes or comes back after a crash.

After killing a TiKV server, we see the following errors endlessly, where store=172.16.4.197:21160 is the killed TiKV server.

[2020/12/10 17:05:01.000 +08:00] [WARN] [client.go:723] ["get grpc stream client failed"] [regionID=11544] [requestID=7814] [storeID=7] [error="[CDC:ErrTiKVEventFeed]rpc error: code = Canceled desc = grpc: the client connection is closing"]
[2020/12/10 17:05:01.000 +08:00] [INFO] [region_cache.go:600] ["mark store's regions need be refill"] [store=172.16.4.197:21160]
[2020/12/10 17:05:01.000 +08:00] [INFO] [region_cache.go:414] ["invalidate current region, because others failed on same store"] [region=12557] [store=172.16.4.197:21160]
[2020/12/10 17:05:01.000 +08:00] [INFO] [client.go:656] ["cannot get rpcCtx, retry span"] [regionID=12557] [span="[7480000000000001ff2d5f728000000000ff4bd6de0000000000fa, 7480000000000001ff2d5f728000000000ff52290a0000000000fa)"]

On the other side, when the TiKV server recovers, we meet the following error:

[2020/12/10 17:23:12.931 +08:00] [INFO] [client.go:363] ["establish stream to store failed, retry later"] [addr=172.16.4.197:21160] [error="[CDC:ErrTiKVEventFeed]rpc error: code = Canceled desc = grpc: the client connection is closing"] [errorVerbose="[CDC:ErrTiKVEventFeed]rpc error: code = Canceled desc = grpc: the client connection is closing\ngit.luolix.top/pingcap/errors.AddStack\n\tgit.luolix.top/pingcap/errors@v0.11.5-0.20201029093017-5a7df2af2ac7/errors.go:174\ngit.luolix.top/pingcap/errors.(*Error).GenWithStackByCause\n\tgit.luolix.top/pingcap/errors@v0.11.5-0.20201029093017-5a7df2af2ac7/normalize.go:279\ngit.luolix.top/pingcap/ticdc/pkg/errors.WrapError\n\tgit.luolix.top/pingcap/ticdc/pkg/errors/helper.go:28\ngit.luolix.top/pingcap/ticdc/cdc/kv.(*CDCClient).newStream.func1\n\tgit.luolix.top/pingcap/ticdc/cdc/kv/client.go:362\ngit.luolix.top/pingcap/ticdc/pkg/retry.Run.func1\n\tgit.luolix.top/pingcap/ticdc/pkg/retry/retry.go:32\ngit.luolix.top/cenkalti/backoff.RetryNotify\n\tgit.luolix.top/cenkalti/backoff@v2.2.1+incompatible/retry.go:37\ngit.luolix.top/cenkalti/backoff.Retry\n\tgit.luolix.top/cenkalti/backoff@v2.2.1+incompatible/retry.go:24\ngit.luolix.top/pingcap/ticdc/pkg/retry.Run\n\tgit.luolix.top/pingcap/ticdc/pkg/retry/retry.go:31\ngit.luolix.top/pingcap/ticdc/cdc/kv.(*CDCClient).newStream\n\tgit.luolix.top/pingcap/ticdc/cdc/kv/client.go:346\ngit.luolix.top/pingcap/ticdc/cdc/kv.(*eventFeedSession).dispatchRequest\n\tgit.luolix.top/pingcap/ticdc/cdc/kv/client.go:720\ngit.luolix.top/pingcap/ticdc/cdc/kv.(*eventFeedSession).eventFeed.func1\n\tgit.luolix.top/pingcap/ticdc/cdc/kv/client.go:477\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.0.0-20200625203802-6e8e738ad208/errgroup/errgroup.go:57\nruntime.goexit\n\truntime/asm_amd64.s:1374"]
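
For context, "grpc: the client connection is closing" is the error grpc-go returns for any call made on a *grpc.ClientConn that has already been closed. A minimal repro sketch, assuming the cdcpb-generated client that TiCDC uses for EventFeed streams (the address is a placeholder):

package main

import (
	"context"
	"fmt"

	"github.com/pingcap/kvproto/pkg/cdcpb"
	"google.golang.org/grpc"
)

func main() {
	// grpc.Dial is non-blocking by default, so no live server is needed
	// to reproduce the error.
	conn, err := grpc.Dial("172.16.4.197:21160", grpc.WithInsecure())
	if err != nil {
		panic(err)
	}
	conn.Close() // the old error path closed the conn but left it in the pool

	// Any stream created from the closed conn now fails immediately with:
	// rpc error: code = Canceled desc = grpc: the client connection is closing
	_, err = cdcpb.NewChangeDataClient(conn).EventFeed(context.Background())
	fmt.Println(err)
}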

What is changed and how it works?

We don't close the gRPC conn here; instead we let it go into the TransientFailure state. If the store recovers, the gRPC conn can be reused. Closing a conn that is still cached in the connection pool would make every later stream created from it fail with "the client connection is closing", which is exactly the endless error loop shown above.
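
Below is a minimal sketch of that pooling behavior; the connPool/getConn names are illustrative, not TiCDC's actual identifiers. The idea: dial each store address once, hand the shared conn out on every request, and never close it on a stream error.

package kv // illustrative package, not TiCDC's actual code

import (
	"context"
	"sync"

	"google.golang.org/grpc"
)

// connPool caches one shared *grpc.ClientConn per TiKV store address.
type connPool struct {
	mu    sync.Mutex
	conns map[string]*grpc.ClientConn
}

// getConn returns the cached conn for addr, dialing only on first use.
// Callers must NOT close the returned conn on a stream error: grpc-go keeps
// redialing a broken conn in the background (TransientFailure -> Connecting
// -> Ready), so the same conn becomes usable again once the store recovers.
func (p *connPool) getConn(ctx context.Context, addr string) (*grpc.ClientConn, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if conn, ok := p.conns[addr]; ok {
		return conn, nil
	}
	conn, err := grpc.DialContext(ctx, addr, grpc.WithInsecure()) // plaintext for the sketch
	if err != nil {
		return nil, err
	}
	if p.conns == nil {
		p.conns = make(map[string]*grpc.ClientConn)
	}
	p.conns[addr] = conn
	return conn, nil
}

A caller that hits a stream error simply retries with backoff on the same pooled conn; no explicit re-dial is needed after the store recovers.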

TODO:

  • Add an integration test later
  • Refine the gRPC connection pool

Check List

Tests

  • Unit test
  • Integration test

Release note

  • Fix a bug that TiCDC could fail to continue replication when a TiKV server crashes or recovers from a crash. The bug exists in v4.0.8 only.

Signed-off-by: ti-srebot <ti-srebot@pingcap.com>
@ti-srebot (Contributor, Author)

/run-all-tests

@ti-srebot ti-srebot added type/bugfix This PR fixes a bug. priority/P0 The issue has P0 priority. status/ptal Could you please take a look? type/4.0-cherry-pick labels Dec 11, 2020
@ti-srebot ti-srebot added this to the v4.0.10 milestone Dec 11, 2020
@amyangfei (Contributor) left a comment:

LGTM

@ti-srebot ti-srebot added the status/LGT1 Indicates that a PR has LGTM 1. label Dec 11, 2020
@amyangfei (Contributor)

/merge

@ti-srebot ti-srebot added the status/can-merge Indicates a PR has been approved by a committer. label Dec 11, 2020
@ti-srebot (Contributor, Author)

/run-all-tests

@amyangfei amyangfei modified the milestones: v4.0.10, v4.0.9 Dec 11, 2020
@codecov-io

Codecov Report

Merging #1198 (78c0e74) into release-4.0 (402379a) will decrease coverage by 0.0868%.
The diff coverage is n/a.

@@                 Coverage Diff                 @@
##           release-4.0      #1198        +/-   ##
===================================================
- Coverage      39.7839%   39.6971%   -0.0869%     
===================================================
  Files              112        112                
  Lines            11756      11754         -2     
===================================================
- Hits              4677       4666        -11     
- Misses            6609       6617         +8     
- Partials           470        471         +1     

@ti-srebot ti-srebot merged commit 1cfde41 into pingcap:release-4.0 Dec 11, 2020
Labels
  • priority/P0 The issue has P0 priority.
  • status/can-merge Indicates a PR has been approved by a committer.
  • status/LGT1 Indicates that a PR has LGTM 1.
  • status/ptal Could you please take a look?
  • type/bugfix This PR fixes a bug.