Fix: destroy monitor if no more components in host #857

Merged
14 commits merged into pingcap:master on Nov 4, 2020

Conversation

@9547 (Contributor) commented on Oct 21, 2020

What problem does this PR solve?

Fixes #842

What is changed and how it works?

After scaling in a component, check whether the host still runs any other components; if not, stop and destroy the monitoring components on that host.
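
A rough sketch of that check, for illustration only: the Instance type, the helper name, and the assumption that node_exporter and blackbox_exporter are the host-level monitors are simplifications of the real logic in pkg/cluster/operation.

```go
package main

import "fmt"

// Instance is a simplified stand-in for a deployed component instance
// (hypothetical type; the real tiup spec types are richer).
type Instance struct {
	Host      string
	Component string
}

// monitorComponents are the host-level monitoring agents assumed for this
// sketch; they are deployed alongside the regular components.
var monitorComponents = map[string]bool{
	"node_exporter":     true,
	"blackbox_exporter": true,
}

// hostNeedsMonitorCleanup reports whether, after a scale-in, the host no
// longer runs any non-monitoring component, in which case its monitors can
// be stopped and destroyed as well.
func hostNeedsMonitorCleanup(remaining []Instance, host string) bool {
	for _, inst := range remaining {
		if inst.Host == host && !monitorComponents[inst.Component] {
			return false // another component still lives here, keep the monitors
		}
	}
	return true
}

func main() {
	remaining := []Instance{
		{Host: "172.19.0.102", Component: "node_exporter"},
		{Host: "172.19.0.101", Component: "tikv"},
	}
	fmt.Println(hostNeedsMonitorCleanup(remaining, "172.19.0.102")) // true: only the monitor is left
	fmt.Println(hostNeedsMonitorCleanup(remaining, "172.19.0.101")) // false: tikv still runs there
}
```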

Check List

Tests

  • Unit test
  • Integration test
  • Manual test

Code changes

  • Has exported function/method change
  • Has exported variable/fields change
  • Has interface methods change
  • Has persistent data change

Side effects

  • Possible performance regression
  • Increased code complexity
  • Breaking backward compatibility

Related changes

  • Need to cherry-pick to the release branch
  • Need to update the documentation

Release notes:

NONE

@codecov-io commented on Oct 21, 2020

Codecov Report

Merging #857 into master will increase coverage by 42.71%.
The diff coverage is 44.02%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master     #857       +/-   ##
===========================================
+ Coverage   10.76%   53.47%   +42.71%     
===========================================
  Files         130      263      +133     
  Lines        9900    19047     +9147     
===========================================
+ Hits         1066    10186     +9120     
+ Misses       8595     7286     -1309     
- Partials      239     1575     +1336     
Flag        Coverage Δ
cluster     45.45% <42.58%> (?)
dm          25.24% <4.40%> (?)
integrate   48.30% <44.02%> (+37.53%) ⬆️
playground  22.17% <ø> (?)
tiup        10.76% <ø> (+<0.01%) ⬆️
unittest    20.93% <0.00%> (?)

Flags with carried forward coverage won't be shown.

Impacted Files                         Coverage Δ
components/dm/command/scale_in.go      67.64% <0.00%> (ø)
pkg/cluster/operation/scale_in.go      59.18% <23.80%> (ø)
pkg/cluster/operation/action.go        56.36% <45.00%> (ø)
pkg/cluster/operation/destroy.go       54.89% <49.12%> (ø)
pkg/utils/mock/mock.go                 8.00% <0.00%> (ø)
components/dm/spec/bindversion.go      100.00% <0.00%> (ø)
components/playground/main.go          69.50% <0.00%> (ø)
components/cluster/command/display.go  27.41% <0.00%> (ø)
pkg/logger/audit.go                    53.33% <0.00%> (ø)
pkg/cluster/task/builder.go            97.10% <0.00%> (ø)
... and 223 more


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 0c9ac6f...cdb1dae

@9547 (Contributor, Author) commented on Oct 23, 2020

@AstroProfundis PTAL

@lucklove (Member) left a comment

Thanks for your contribution, 9547, it's really awesome!

Review threads:
  • components/dm/command/scale_in.go (outdated)
  • pkg/cluster/operation/scale_in.go (outdated)
  • pkg/cluster/operation/scale_in.go
@9547 force-pushed the fix/clean-monitor-if-no-more-components branch from d4f69cd to caef688 on October 29, 2020 16:59
@9547 (Contributor, Author) commented on Oct 29, 2020

@lucklove PTAL

Review threads:
  • pkg/cluster/operation/destroy.go (outdated)
  • pkg/cluster/operation/destroy.go (outdated)
@9547 force-pushed the fix/clean-monitor-if-no-more-components branch from caef688 to ae9c6d1 on October 30, 2020 15:46
@9547 (Contributor, Author) commented on Oct 30, 2020

@lucklove PTAL again

@lonng requested a review from AstroProfundis on November 2, 2020 02:21
@lucklove (Member) left a comment

Rest LGTM

Review threads:
  • pkg/cluster/operation/destroy.go (outdated)
  • pkg/cluster/operation/scale_in.go (outdated)
@9547 force-pushed the fix/clean-monitor-if-no-more-components branch from ae9c6d1 to 91e75c1 on November 2, 2020 14:13
@9547 (Contributor, Author) commented on Nov 3, 2020

@lucklove Please help me with the test failure. I'm curious about the reason for it (exit 255); it runs successfully on my local dev env.

@lucklove (Member) commented on Nov 3, 2020

> @lucklove Please help me with the test failure. I'm curious about the reason for it (exit 255); it runs successfully on my local dev env.

It's because of this:

fail to wait instance number reach 20, count 21, retry num: 120

But I'm not sure why the scale-in action didn't work this time, investigating.

@lucklove (Member) commented on Nov 3, 2020

It's because there is a TiKV node that is still in the offline process:

172.19.0.102:20160  tikv            172.19.0.102  20160/20180                      linux/x86_64  Pending Offline

wait_instance_num_reach $name $total_sub_one $native_ssh
tiup-cluster $client --yes scale-in $name -N $ipprefix.102:20160
echo "after sclae in(without prune) tikv, the monitors should not be destroyed"
wait_instance_num_reach $name $total_sub_one $native_ssh
Review comment (Member):

> after scale in (without prune) tikv, the monitors should not be destroyed

wait_instance_num_reach will call prune automatically; this is not what you expected.

Reply (Contributor, Author):

But why did it pass in my local dev?

Reply (Contributor, Author):

I'll dig into it later tonight, and thanks for your help 😀

Reply (Member):

It passed in your local dev; I think it may be related to timing:

  • Only when the TiKV store's state is Tombstone will the prune command clean it up.
  • After scaling in a TiKV node, it takes some time (seconds to minutes) to transition to the Tombstone state.

So my guess is that on your local host, when wait_instance_num_reach executed, the TiKV node was still in Pending Offline state, so the test passed; but in CI the TiKV node was already Tombstone, so it was cleaned up.
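
One way to make such a test independent of that timing would be to poll the store state until it reports Tombstone before pruning. The sketch below is purely illustrative; waitForTombstone and the getState callback are hypothetical and not part of tiup or its test suite.

```go
package main

import (
	"fmt"
	"time"
)

// waitForTombstone polls a state-reporting callback until the scaled-in
// store reports "Tombstone" or the timeout elapses. The callback is a
// hypothetical hook; a real test might shell out to `tiup cluster display`
// or query PD, but that wiring is not shown here.
func waitForTombstone(getState func() (string, error), timeout, interval time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		state, err := getState()
		if err == nil && state == "Tombstone" {
			return nil
		}
		time.Sleep(interval)
	}
	return fmt.Errorf("store did not reach Tombstone within %s", timeout)
}

func main() {
	// Fake callback for demonstration: flips to Tombstone after a few polls.
	polls := 0
	getState := func() (string, error) {
		polls++
		if polls < 3 {
			return "Pending Offline", nil
		}
		return "Tombstone", nil
	}
	if err := waitForTombstone(getState, 30*time.Second, time.Second); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("store is Tombstone, safe to prune")
}
```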

Reply (Contributor, Author):

  1. It seems that in GitHub CI it's hard to predict when TiKV will finish the transfer process and reach the Tombstone state, after which prune can evict it.
  2. I have reserved only one TiDB node on the 102 host for testing the issue "When you scale-in PD, node_exporter of the PD node is still present" (#842).

@9547 force-pushed the fix/clean-monitor-if-no-more-components branch from c3995e1 to 4428fa3 on November 3, 2020 16:09
@9547 force-pushed the fix/clean-monitor-if-no-more-components branch from 4428fa3 to c742444 on November 3, 2020 16:32
It's hard to predict when TiKV will be pruned in GitHub CI
@9547 force-pushed the fix/clean-monitor-if-no-more-components branch from 4c1b429 to 58e43da on November 3, 2020 23:24
@lucklove (Member) left a comment

LGTM

@ti-srebot added the status/LGT1 label on Nov 4, 2020
@lonng merged commit 4489a8d into pingcap:master on Nov 4, 2020
@lucklove added this to the v1.2.4 milestone on Nov 19, 2020
@9547 deleted the fix/clean-monitor-if-no-more-components branch on April 6, 2021 23:04
Labels: status/LGT1 (Indicates that a PR has LGTM 1.)
Projects: None yet
Development: Successfully merging this pull request may close these issues:
  • When you scale-in PD, node_exporter of the PD node is still present (#842)
7 participants