Fix tikv scale in failure in some cases #726

onlymellb · 2019-08-01T19:37:42Z

What problem does this PR solve?

This PR fixes #725

What is changed and how does it work?

Check List

Tests

Unit test
Manual test (add detailed scripts or steps below)

Code changes

Has Go code change

Side effects

Related changes

Need to cherry-pick to the release branch

Does this PR introduce a user-facing change?:

Fix tikv scale in failure in some cases after tikv failover

onlymellb · 2019-08-02T03:24:12Z

/run-e2e-tests

onlymellb · 2019-08-02T04:02:28Z

/run-e2e-tests

onlymellb · 2019-08-02T08:45:36Z

/run-e2e-tests

weekface · 2019-08-02T13:07:40Z

pkg/manager/member/tikv_scaler.go

+	//
+	// 2. This can happen when TiKV pod has not been successfully registered in the cluster, such as always pending.
+	//    In this situation we should delete this TiKV pod immediately to avoid blocking the subsequent operations.
+	if !podutil.IsPodReady(pod) {


I think we should only handle Pending pods rather than all not-ready pods.

onlymellb · 2019-08-03

Not only Pending pods. For example, consider a newly launched pod that keeps crashing and never actually joins the TiDB cluster. In that case we can also scale in safely.
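To make the case above concrete, here is a minimal Go sketch of the decision being discussed. It is an illustration, not the code in `tikv_scaler.go`: the `Pod` struct, the `registeredStores` map, and the `canDeleteImmediately` helper are hypothetical stand-ins for the Kubernetes pod object, the store list reported by PD, and the `podutil.IsPodReady` check in the diff.

```go
package main

import "fmt"

// Pod is a simplified, hypothetical stand-in for a Kubernetes pod.
type Pod struct {
	Name  string
	Phase string // e.g. "Pending" or "Running"
	Ready bool   // stand-in for the PodReady condition
}

// canDeleteImmediately reports whether a TiKV pod can be removed during
// scale-in without first taking a store offline. It assumes that a pod with
// no registered store which is also not ready never joined the cluster,
// whether it is stuck Pending or crash-looping.
func canDeleteImmediately(pod Pod, registeredStores map[string]bool) bool {
	if registeredStores[pod.Name] {
		// A store exists for this pod, so the normal offline flow applies.
		return false
	}
	return !pod.Ready
}

func main() {
	stores := map[string]bool{"demo-tikv-0": true}

	pods := []Pod{
		{Name: "demo-tikv-0", Phase: "Running", Ready: true},
		{Name: "demo-tikv-3", Phase: "Pending", Ready: false},
		{Name: "demo-tikv-4", Phase: "Running", Ready: false}, // crash-looping
	}
	for _, p := range pods {
		fmt.Println(p.Name, canDeleteImmediately(p, stores))
	}
}
```

Checking readiness rather than only the Pending phase is what lets the crash-looping pod in this sketch be treated the same way as the pending one.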

gregwebs · 2019-08-02T18:52:14Z

This logic is not so much Scale-In as it is updating the status of TiKV pods (and then when in Scale-In mode deciding to do something with that status). One reason the logic is placed here is to avoid returning an error and halting the sync process. However, in my PR #581 by raising a RequeueError one does not stop the sync loop. It is possible to do something similar here.

Note that my PR incidentally fixed a few minor bugs showing up in test cases due to dealing with halting sync at the right time.

onlymellb · 2019-08-03T02:19:39Z

Here we need to scale in immediately when it is safe to do so, rather than returning a RequeueError and waiting for the next round of sync. Consider the case where the TiKV pod has been pending or unhealthy and has never actually joined the TiDB cluster during this period: waiting leaves us unable to scale in, no matter how many sync loops pass. Refer to issue #725.

weekface

LGTM

@tennix @xiaojingchen PTAL

onlymellb · 2019-08-08T06:37:08Z

/run-e2e-tests

sre-bot · 2019-08-08T07:38:40Z

cherry pick to release-1.0 in PR #742

luolibin added 3 commits August 2, 2019 03:28

fix typo

698a72a

fix tikv scale in failure in some cases

dd53d0f

update test cases

3db2c6e

onlymellb requested review from weekface, xiaojingchen and tennix August 1, 2019 19:37

onlymellb added the needs-cherry-pick-1.0 label Aug 2, 2019

Merge remote-tracking branch 'src/master' into fix-tikv-scale-in-failed

1221e19

fix CI

a23895b

weekface reviewed Aug 2, 2019

weekface approved these changes Aug 5, 2019

Merge remote-tracking branch 'src/master' into fix-tikv-scale-in-failed

5f78e81

xiaojingchen approved these changes Aug 8, 2019

xiaojingchen merged commit 799930d into pingcap:master Aug 8, 2019

sre-bot mentioned this pull request Aug 8, 2019

Fix tikv scale in failure in some cases (#726) #742

Merged

onlymellb deleted the fix-tikv-scale-in-failed branch August 8, 2019 07:39

weekface pushed a commit that referenced this pull request Aug 9, 2019

Fix tikv scale in failure in some cases (#726) (#742)

a5a995f
