[Bug][Registry] Optimizing waiting strategy #15223
Signed-off-by: Gallardot <gallardot@apache.org>
Codecov Report
Additional details and impacted files
@@            Coverage Diff            @@
##               dev   #15223   +/-   ##
=========================================
  Coverage     38.11%   38.12%
  Complexity     4697     4697
=========================================
  Files          1299     1299
  Lines         44783    44774     -9
  Branches       4798     4799     +1
=========================================
  Hits          17068    17068
+ Misses        25864    25855     -9
  Partials       1851     1851
View full report in Codecov by Sentry.
Good job. I changed the title to Bug.
LGTM. The problem was described very clearly; I learned from you.
Signed-off-by: Gallardot <gallardot@apache.org>
e8bc468 to 340565e
Signed-off-by: Gallardot <gallardot@apache.org>
340565e to fc3873a
Please retry analysis of this Pull-Request directly on SonarCloud.
LGTM
The link https://twitter.com/dolphinschedule returns a 400, which is not related to this PR.
Signed-off-by: Gallardot <gallardot@apache.org>
Quality Gate passed. Kudos, no new issues were introduced! 0 New issues
* [Improvement][Registry] Optimizing waiting strategy Signed-off-by: Gallardot <gallardot@apache.org>
[Bug][Registry] Optimizing waiting strategy (apache#15223) See merge request logan/devops/apache/dolphinscheduler!11
Purpose of the pull request
When the ZooKeeper (zk) server is under high load or the network is unstable, a worker may hit a heartbeat timeout. Increasing the SESSION-TIMEOUT alleviates this, but when a timeout does occur the reconnection event still has to be handled properly.
The source code shows that the WorkerWaitingStrategy triggers a disconnect first and a reconnect afterwards. The disconnect stops the worker's RPC service. Once the worker has successfully reconnected to zk, the reconnect is triggered and tries to restart the RPC service.
At this point, we will get the following exception information:
This means that restarting the RPC service failed, so the worker stops its service. The worker service then has to be restarted.
A better strategy is to leave the RPC service running during disconnect, so that reconnect no longer has to restart it.
In addition, the worker's dispatcher gets a check on whether the current worker is in an available state. If the worker is not available, it stops accepting tasks.
The master should apply the same optimization in its WaitingStrategy.
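The strategy described above can be sketched roughly as follows. This is a minimal illustration and not the code in this pull request: `WaitingStrategySketch`, `ServerStatus`, `RpcServer`, `WaitingStrategy`, and `TaskDispatcher` are hypothetical names chosen for the example, and the real DolphinScheduler classes have different signatures. The point is that disconnect and reconnect only flip an availability flag, while the RPC server keeps running and the dispatcher refuses tasks whenever the flag says the worker is waiting.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

public class WaitingStrategySketch {

    /** Hypothetical server states; the real project may use other names. */
    enum ServerStatus { RUNNING, WAITING }

    /** Stand-in for the worker's RPC server: started once and never stopped by the strategy. */
    static final class RpcServer {
        private final AtomicBoolean started = new AtomicBoolean(false);

        void start() {
            started.set(true);
        }

        boolean isStarted() {
            return started.get();
        }
    }

    /** Reacts to registry connection events by changing the status only. */
    static final class WaitingStrategy {
        private final AtomicReference<ServerStatus> status =
                new AtomicReference<>(ServerStatus.RUNNING);

        /** Registry connection lost: mark the worker as waiting, leave RPC untouched. */
        void disconnect() {
            status.set(ServerStatus.WAITING);
        }

        /** Registry connection restored: mark the worker as running again. */
        void reconnect() {
            status.set(ServerStatus.RUNNING);
        }

        boolean isAvailable() {
            return status.get() == ServerStatus.RUNNING;
        }
    }

    /** Dispatcher that checks availability before accepting a task. */
    static final class TaskDispatcher {
        private final WaitingStrategy strategy;

        TaskDispatcher(WaitingStrategy strategy) {
            this.strategy = strategy;
        }

        /** Returns true if the task was accepted, false if the worker is unavailable. */
        boolean dispatch(String taskName) {
            if (!strategy.isAvailable()) {
                return false; // worker is waiting for the registry, reject the task
            }
            // A real dispatcher would hand the task to an executor here.
            return true;
        }
    }

    public static void main(String[] args) {
        RpcServer rpc = new RpcServer();
        rpc.start();
        WaitingStrategy strategy = new WaitingStrategy();
        TaskDispatcher dispatcher = new TaskDispatcher(strategy);

        System.out.println("before disconnect: " + dispatcher.dispatch("task-1"));
        strategy.disconnect();
        System.out.println("while disconnected: " + dispatcher.dispatch("task-2"));
        System.out.println("rpc still started: " + rpc.isStarted());
        strategy.reconnect();
        System.out.println("after reconnect: " + dispatcher.dispatch("task-3"));
    }
}
```

Because the RPC server is never stopped, there is no restart step that can fail on reconnect; the cost is that the dispatcher must consult the status on every incoming task.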
Brief change log
Verify this pull request
This pull request is code cleanup without any test coverage.
(or)
This pull request is already covered by existing tests, such as (please describe tests).
(or)
This change added tests and can be verified as follows:
(or)
If your pull request contains an incompatible change, you should also add it to
docs/docs/en/guide/upgrede/incompatible.md