mcs: reorganize cluster start and stop process #7155

Merged
merged 5 commits into from
Oct 9, 2023

Conversation

rleungx
Member

@rleungx rleungx commented Sep 26, 2023

What problem does this PR solve?

Issue Number: Close #7140, close #7106

What is changed and how does it work?

This PR reorganizes the cluster start/stop process and fixes the data race.

Check List

Tests

  • Unit test

Release note

None.

@ti-chi-bot
Contributor

ti-chi-bot bot commented Sep 26, 2023

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • JmPotato
  • lhy1024

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-linked-issue release-note-none Denotes a PR that doesn't merit a release note. labels Sep 26, 2023
@ti-chi-bot ti-chi-bot bot requested review from JmPotato and lhy1024 September 26, 2023 07:45
@ti-chi-bot ti-chi-bot bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Sep 26, 2023
@codecov

codecov bot commented Sep 26, 2023

Codecov Report

Merging #7155 (7ff12a2) into master (849d80d) will increase coverage by 0.02%.
Report is 5 commits behind head on master.
The diff coverage is 60.65%.

@@            Coverage Diff             @@
##           master    #7155      +/-   ##
==========================================
+ Coverage   74.58%   74.61%   +0.02%     
==========================================
  Files         441      441              
  Lines       47292    47388      +96     
==========================================
+ Hits        35275    35358      +83     
+ Misses       8940     8934       -6     
- Partials     3077     3096      +19     
Flag Coverage Δ
unittests 74.61% <60.65%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

continue
}
}

log.Info("schedulers updating notifier is triggered, try to update the scheduler")
Contributor

If stop server here, is there data race?

Member Author

I think it is the same as the current PD.

Contributor

In other words, is it possible to hit a data race when a scheduler is added while the coordinator is waiting?

Member Author

I think so, but it is much less likely than before.

Member Author

@rleungx rleungx Sep 26, 2023

Another option: check the cluster status each time before adding a scheduler.

Contributor

But there is still a gap between checking the status and adding the scheduler: if the server is stopped after the cluster status check and before the scheduler is added, a data race is still possible.

Member Author

The problem is not the wait group itself but the way we use it for the scheduler controller.
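
For readers following this thread, the check-then-add gap can be closed by doing the status check and the `wg.Add` call under one lock. The sketch below is only an illustration of that general pattern, not the code in this PR: `controller`, `addScheduler`, `stop`, and `errStopped` are hypothetical names chosen for the example.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errStopped is a hypothetical error returned once the controller is stopped.
var errStopped = errors.New("controller is stopped")

// controller is a hypothetical stand-in for a scheduler controller.
type controller struct {
	mu      sync.Mutex
	stopped bool
	wg      sync.WaitGroup
}

// addScheduler checks the status and calls wg.Add while holding the same
// lock, so stop cannot slip in between the check and the Add.
func (c *controller) addScheduler(run func()) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.stopped {
		return errStopped
	}
	c.wg.Add(1)
	go func() {
		defer c.wg.Done()
		run()
	}()
	return nil
}

// stop flips the flag under the lock and then waits outside the lock,
// so no new Add can start once Wait has been called.
func (c *controller) stop() {
	c.mu.Lock()
	c.stopped = true
	c.mu.Unlock()
	c.wg.Wait()
}

func main() {
	c := &controller{}
	fmt.Println(c.addScheduler(func() {})) // <nil>
	c.stop()
	fmt.Println(c.addScheduler(func() {})) // controller is stopped
}
```

The point of the design is that `sync.WaitGroup` requires an `Add` that starts from a zero counter to happen before `Wait`; taking the mutex in both paths provides that ordering.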

@ti-chi-bot ti-chi-bot bot added status/LGT1 Indicates that a PR has LGTM 1. and removed do-not-merge/needs-linked-issue labels Sep 26, 2023
Signed-off-by: Ryan Leung <rleungx@gmail.com>
Signed-off-by: Ryan Leung <rleungx@gmail.com>
@rleungx
Member Author

rleungx commented Oct 8, 2023

@JmPotato PTAL

return
case <-ticker.C:
// retry
notifier <- struct{}{}
Member

Could this deadlock? The channel has a buffer of only 1, so if the scheduler config watcher has just sent on it, the send here could block.

Signed-off-by: Ryan Leung <rleungx@gmail.com>
Member

@JmPotato JmPotato left a comment

The rest LGTM.

Comment on lines 243 to 247
select {
case notifier <- struct{}{}:
// If the channel is not empty, it means the check is triggered.
default:
}
Member

What about wrapping this in a trySend function to reuse the code?
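
A minimal sketch of what such a helper might look like, assuming the notifier is a `chan struct{}` with a buffer of 1 as discussed above. The signature and the boolean return value are assumptions made for illustration and may differ from what was actually merged.

```go
package main

import "fmt"

// trySend is a hypothetical helper: it attempts a non-blocking send on
// notifier and reports whether the signal was queued.
func trySend(notifier chan<- struct{}) bool {
	select {
	case notifier <- struct{}{}:
		return true
	default:
		// The buffer is full, so a check is already pending; drop this signal.
		return false
	}
}

func main() {
	notifier := make(chan struct{}, 1)
	fmt.Println(trySend(notifier)) // true: the buffer was empty
	fmt.Println(trySend(notifier)) // false: a notification is already pending
	<-notifier
	fmt.Println(trySend(notifier)) // true: the buffer was drained
}
```

Because the `default` branch returns immediately, neither the ticker retry nor the config watcher can block on a full channel.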

Member Author

sounds good

Signed-off-by: Ryan Leung <rleungx@gmail.com>
@rleungx rleungx requested a review from JmPotato October 9, 2023 02:27
@ti-chi-bot ti-chi-bot bot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Oct 9, 2023
@rleungx
Member Author

rleungx commented Oct 9, 2023

/merge

@ti-chi-bot
Contributor

ti-chi-bot bot commented Oct 9, 2023

@rleungx: It seems you want to merge this PR, I will help you trigger all the tests:

/run-all-tests

You only need to trigger /merge once, and if the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

If you have any questions about the PR merge process, please refer to pr process.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@ti-chi-bot
Contributor

ti-chi-bot bot commented Oct 9, 2023

This pull request has been accepted and is ready to merge.

Commit hash: 891a322

@ti-chi-bot ti-chi-bot bot added the status/can-merge Indicates a PR has been approved by a committer. label Oct 9, 2023
@ti-chi-bot ti-chi-bot bot merged commit 2556b5b into tikv:master Oct 9, 2023
@rleungx rleungx deleted the reorg-cluster branch October 9, 2023 06:14
rleungx added a commit to rleungx/pd that referenced this pull request Dec 1, 2023
close tikv#7106, close tikv#7140

Signed-off-by: Ryan Leung <rleungx@gmail.com>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
Labels
release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2.
Development

Successfully merging this pull request may close these issues.

TestAPI/TestAPIForward is unstable
mcs: data race about scheduler controller
3 participants