-
Notifications
You must be signed in to change notification settings - Fork 500
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix tikv scale in failure in some cases #726
Fix tikv scale in failure in some cases #726
Conversation
/run-e2e-tests |
/run-e2e-tests |
1 similar comment
/run-e2e-tests |
// | ||
// 2. This can happen when TiKV pod has not been successfully registered in the cluster, such as always pending. | ||
// In this situation we should delete this TiKV pod immediately to avoid blocking the subsequent operations. | ||
if !podutil.IsPodReady(pod) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should only handle Pending
pods other than all not ready pods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not only pending, for example, considering this situation, the newly launched pod has been crashing and never really joined the tidb cluster. In this case, we can also scale in safely.
This logic is not so much Scale-In as it is updating the status of TiKV pods (and then when in Scale-In mode deciding to do something with that status). One reason the logic is placed here is to avoid returning an error and halting the sync process. However, in my PR #581 by raising a Note that my PR incidentally fixed a few minor bugs showing up in test cases due to dealing with halting sync at the right time. |
Here we need to scale in immediately when it can be reduced safely, instead of generating a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@tennix @xiaojingchen PTAL
/run-e2e-tests |
cherry pick to release-1.0 in PR #742 |
What problem does this PR solve?
This PR fixes #725
What is changed and how does it work?
Check List
Tests
Code changes
Side effects
Related changes
Does this PR introduce a user-facing change?: