The timeout should consider the transfer latency especially in big cluster #5596

Closed
bufferflies opened this issue Oct 13, 2022 · 0 comments
Labels: affects-6.1, affects-6.4, affects-6.6, affects-7.0, affects-7.1, severity/major, type/bug

bufferflies commented Oct 13, 2022

Bug Report

In a large cluster, a region heartbeat can lag by more than 10s. For example:
each store reports about 10,000 region heartbeats per second;
a store holds 100,000 regions, so one full round of heartbeats takes 100,000 / 10,000 = 10s.
However, some short operator steps, such as promote, demote, and transfer leader, have a 10s timeout, so the operator fails whenever the heartbeat delay exceeds that limit.

Sample log:

[2022/10/13 00:52:23.560 +00:00] [INFO] [operator_controller.go:590] ["operator timeout"] [region-id=1820635949] [takes=1m20.047056718s] [operator="\"balance-region {mv peer: store [14057] to [14053]} (kind:region, region:1820635949(540, 1763), createAt:2022-10-13 00:51:03.513592708 +0000 UTC m=+461704.590481929, startAt:2022-10-13 00:51:03.513683079 +0000 UTC m=+461704.590572320, currentStep:1, size:16, steps:[add learner peer 1832839605 on store 14053, use joint consensus, promote learner peer 1832839605 on store 14053 to voter, demote voter peer 1820635950 on store 14057 to learner, leave joint state, promote learner peer 1832839605 on store 14053 to voter, demote voter peer 1820635950 on store 14057 to learner, remove peer on store 14057]) timeout\""] [additional-info="{\"sourceScore\":\"5538250.44\",\"targetScore\":\"5519343.11\"}"]

Each store holds 50,000 regions.

Operator step duration: [image]
Region heartbeat duration: [image]

What did you do?

What did you expect to see?

Operators succeed.

What did you see instead?

Operators fail with a timeout.

What version of PD are you using (pd-server -V)?

master, v6.1.0

ti-chi-bot added a commit that referenced this issue Nov 4, 2022
…parated (#5600)

ref #5596

Signed-off-by: bufferflies <1045931706@qq.com>

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
ti-chi-bot pushed a commit to ti-chi-bot/pd that referenced this issue Nov 4, 2022
ref tikv#5596

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
VelocityLight added the affects-6.5 label Dec 2, 2022
ti-chi-bot added a commit that referenced this issue Jan 18, 2023
…parated (#5600) (#5679)

ref #5596, ref #5600

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
Signed-off-by: bufferflies <1045931706@qq.com>

Co-authored-by: buffer <1045931706@qq.com>
Co-authored-by: bufferflies <1045931706@qq.com>
VelocityLight added the affects-6.6 label and removed the affects-6.5 label Feb 6, 2023
VelocityLight added the affects-7.1 label Apr 20, 2023
nolouch closed this as completed May 22, 2023