
[Bug][Registry] Optimizing waiting strategy #15223

Merged (13 commits) on Jan 2, 2024

Conversation

@Gallardot (Member) commented Nov 24, 2023

Purpose of the pull request

Under high load on the ZooKeeper server, or during network issues, a worker may experience a heartbeat timeout. Increasing the SESSION-TIMEOUT alleviates this, but when the timeout does occur, the reconnection event still needs to be handled properly.

From the source code we can see that, under the WorkerWaitingStrategy, a disconnect event is triggered first, followed by a reconnect event. The disconnect stops the worker's RPC service; after the connection to ZooKeeper is re-established, the reconnect event restarts the RPC service.

At this point, we will get the following exception information:

[WARN] 2023-11-17 13:15:11.831 +0800 org.apache.dolphinscheduler.server.worker.registry.WorkerWaitingStrategy:[125] - [WorkflowInstance-0][TaskInstance-0] - Worker server clear the tasks due to lost connection from registry
[WARN] 2023-11-17 13:15:11.831 +0800 org.apache.dolphinscheduler.server.worker.registry.WorkerWaitingStrategy:[127] - [WorkflowInstance-0][TaskInstance-0] - Worker server clear the retry message due to lost connection from registry
[INFO] 2023-11-17 13:15:11.832 +0800 org.apache.dolphinscheduler.server.worker.registry.WorkerWaitingStrategy:[69] - [WorkflowInstance-0][TaskInstance-0] - Worker disconnect from registry will try to reconnect in 100 s
[INFO] 2023-11-17 13:15:11.832 +0800 org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperConnectionStateListener:[46] - [WorkflowInstance-0][TaskInstance-0] - Registry reconnected
[INFO] 2023-11-17 13:15:11.832 +0800 org.apache.dolphinscheduler.server.worker.registry.WorkerConnectionStateListener:[42] - [WorkflowInstance-0][TaskInstance-0] - Worker received a RECONNECTED event from registry, the current server state is WAITING
[INFO] 2023-11-17 13:15:11.833 +0800 org.apache.dolphinscheduler.server.worker.rpc.WorkerRpcServer:[40] - [WorkflowInstance-0][TaskInstance-0] - WorkerRpcServer starting...
[ERROR] 2023-11-17 13:15:11.833 +0800 org.apache.dolphinscheduler.server.worker.registry.WorkerWaitingStrategy:[108] - [WorkflowInstance-0][TaskInstance-0] - Recover from waiting failed, the current server status is RUNNING, will stop the server
java.lang.IllegalStateException: group set already
	at io.netty.bootstrap.AbstractBootstrap.group(AbstractBootstrap.java:92)
	at io.netty.bootstrap.ServerBootstrap.group(ServerBootstrap.java:83)
	at org.apache.dolphinscheduler.extract.base.NettyRemotingServer.start(NettyRemotingServer.java:91)
	at org.apache.dolphinscheduler.server.worker.rpc.WorkerRpcServer.start(WorkerRpcServer.java:41)
	at org.apache.dolphinscheduler.server.worker.registry.WorkerWaitingStrategy.reStartWorkerResource(WorkerWaitingStrategy.java:133)
	at org.apache.dolphinscheduler.server.worker.registry.WorkerWaitingStrategy.reconnect(WorkerWaitingStrategy.java:100)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:344)
	at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:208)
	at com.sun.proxy.$Proxy117.reconnect(Unknown Source)
	at org.apache.dolphinscheduler.server.worker.registry.WorkerConnectionStateListener.onUpdate(WorkerConnectionStateListener.java:50)
	at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperConnectionStateListener.stateChanged(ZookeeperConnectionStateListener.java:47)
	at org.apache.curator.framework.state.ConnectionStateManager.lambda$processEvents$0(ConnectionStateManager.java:281)
	at org.apache.curator.framework.listen.MappingListenerManager.lambda$forEach$0(MappingListenerManager.java:92)
	at org.apache.curator.framework.listen.MappingListenerManager.forEach(MappingListenerManager.java:89)
	at org.apache.curator.framework.listen.StandardListenerManager.forEach(StandardListenerManager.java:89)
	at org.apache.curator.framework.state.ConnectionStateManager.processEvents(ConnectionStateManager.java:281)
	at org.apache.curator.framework.state.ConnectionStateManager.access$000(ConnectionStateManager.java:43)
	at org.apache.curator.framework.state.ConnectionStateManager$1.call(ConnectionStateManager.java:134)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)

This means the restart of the RPC service failed, and the worker stops itself. At that point the worker service has to be restarted manually.
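The failure pattern can be reproduced with a minimal stand-in class. This is a hypothetical sketch, not the DolphinScheduler or Netty code: it only mimics the one-shot semantics of Netty's `AbstractBootstrap.group(...)`, which throws if a group was already assigned and is what makes a second `WorkerRpcServer.start()` on the same bootstrap instance fail.

```java
// Hypothetical stand-in for Netty's AbstractBootstrap: group() may only
// be called once per bootstrap instance. Restarting the RPC server by
// re-running start() on the same bootstrap therefore cannot work.
public class BootstrapSketch {

    private Object group; // stands in for the EventLoopGroup field

    public BootstrapSketch group(Object group) {
        if (this.group != null) {
            // Mirrors the message seen in the stack trace above
            throw new IllegalStateException("group set already");
        }
        this.group = group;
        return this;
    }

    // Simulates: first start() succeeds, the "restart" on reconnect fails.
    public static boolean secondStartFails() {
        BootstrapSketch bootstrap = new BootstrapSketch();
        bootstrap.group(new Object()); // initial start of the RPC server
        try {
            bootstrap.group(new Object()); // restart reuses the same bootstrap
            return false;
        } catch (IllegalStateException expected) {
            return true; // "group set already"
        }
    }
}
```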

A better strategy is not to stop the RPC service on disconnect at all; then nothing needs to be restarted on reconnection.
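A minimal sketch of that strategy (all names hypothetical, not the actual DolphinScheduler classes): disconnect only flips the server state to WAITING, and reconnect flips it back to RUNNING, so the RPC service is never torn down and never has to be rebuilt.

```java
// Hypothetical sketch of the revised waiting strategy: the RPC server keeps
// running across a registry disconnect; only the server state changes.
public class WaitingStrategySketch {

    public enum ServerStatus { RUNNING, WAITING, STOPPED }

    private ServerStatus status = ServerStatus.RUNNING;

    public void disconnect() {
        // Do NOT stop the RPC server here; just mark the worker unavailable.
        status = ServerStatus.WAITING;
    }

    public void reconnect() {
        // Nothing to restart; simply resume serving.
        status = ServerStatus.RUNNING;
    }

    public ServerStatus getStatus() {
        return status;
    }
}
```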

In addition, add a check in the worker's dispatcher that determines whether the worker is currently in an available state; if it is not, the dispatcher stops accepting tasks.
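Sketched as a guard (hypothetical names again, not the actual dispatcher code), the dispatcher would consult the server state before accepting a task:

```java
// Hypothetical sketch of the dispatcher guard: accept tasks only while the
// worker is RUNNING, so a WAITING worker is not handed work it cannot
// report back on.
public class DispatcherGuardSketch {

    public enum ServerStatus { RUNNING, WAITING, STOPPED }

    public static boolean canAccept(ServerStatus status) {
        return status == ServerStatus.RUNNING;
    }
}
```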

Similarly, the master should apply the same optimization in its WaitingStrategy.

Verify this pull request

This pull request is code cleanup without any test coverage.

Signed-off-by: Gallardot <gallardot@apache.org>
codecov-commenter commented Nov 24, 2023

Codecov Report

Attention: 3 lines in your changes are missing coverage. Please review.

Comparison is base (fd74cf1) 38.11% compared to head (2154b6c) 38.12%.

❗ Current head 2154b6c differs from pull request most recent head cf01087. Consider uploading reports for the commit cf01087 to get more accurate results

Files | Patch % | Lines
...perator/TaskInstanceDispatchOperationFunction.java | 0.00% | 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##                dev   #15223   +/-   ##
=========================================
  Coverage     38.11%   38.12%           
  Complexity     4697     4697           
=========================================
  Files          1299     1299           
  Lines         44783    44774    -9     
  Branches       4798     4799    +1     
=========================================
  Hits          17068    17068           
+ Misses        25864    25855    -9     
  Partials       1851     1851           


@ruanwenjun ruanwenjun added the bug Something isn't working label Nov 24, 2023
@ruanwenjun ruanwenjun changed the title [Improvement][Registry] Optimizing waiting strategy [Bug][Registry] Optimizing waiting strategy Nov 24, 2023
@ruanwenjun (Member) commented

Good job, I changed the title to Bug

@Gallardot Gallardot marked this pull request as ready for review November 27, 2023 12:11
@fuchanghai (Member) left a comment

LGTM. The problem was described very clearly; I learned from you.

ruanwenjun previously approved these changes Dec 17, 2023
Signed-off-by: Gallardot <gallardot@apache.org>

sonarcloud bot commented Dec 19, 2023

Please retry analysis of this Pull-Request directly on SonarCloud

ruanwenjun previously approved these changes Dec 21, 2023
@ruanwenjun (Member) left a comment

LGTM. The link https://twitter.com/dolphinschedule returns a 400, but that is not related to this PR.


sonarcloud bot commented Jan 2, 2024

@Radeity Radeity added this to the 3.2.1 milestone Jan 2, 2024
@Radeity Radeity merged commit 575b89e into apache:dev Jan 2, 2024
54 checks passed
@Gallardot Gallardot deleted the RECONNECTED branch January 3, 2024 01:56
Gallardot added a commit to Gallardot/dolphinscheduler that referenced this pull request Mar 14, 2024
* [Improvement][Registry] Optimizing waiting strategy

Signed-off-by: Gallardot <gallardot@apache.org>
Gallardot pushed a commit to Gallardot/dolphinscheduler that referenced this pull request Mar 14, 2024
[Bug][Registry] Optimizing waiting strategy (apache#15223)

See merge request logan/devops/apache/dolphinscheduler!11
Labels
backend · bug (Something isn't working) · ready-to-merge

5 participants