
[Bug][Registry] Optimizing waiting strategy #15223

Merged (13 commits) on Jan 2, 2024

Conversation

@Gallardot (Member) commented Nov 24, 2023

Purpose of the pull request

Under high load on the ZooKeeper server, or during network issues, a worker may experience a heartbeat timeout. Increasing the SESSION-TIMEOUT alleviates this, but when the timeout does occur, the reconnection event still needs to be handled properly.

From the source code we can see that, under the WorkerWaitingStrategy, a disconnect event is triggered first, followed by a reconnect event. The disconnect stops the worker's RPC service; after the connection to ZooKeeper is re-established, the reconnect event restarts the RPC service.

At this point, we will get the following exception information:

[WARN] 2023-11-17 13:15:11.831 +0800 org.apache.dolphinscheduler.server.worker.registry.WorkerWaitingStrategy:[125] - [WorkflowInstance-0][TaskInstance-0] - Worker server clear the tasks due to lost connection from registry
[WARN] 2023-11-17 13:15:11.831 +0800 org.apache.dolphinscheduler.server.worker.registry.WorkerWaitingStrategy:[127] - [WorkflowInstance-0][TaskInstance-0] - Worker server clear the retry message due to lost connection from registry
[INFO] 2023-11-17 13:15:11.832 +0800 org.apache.dolphinscheduler.server.worker.registry.WorkerWaitingStrategy:[69] - [WorkflowInstance-0][TaskInstance-0] - Worker disconnect from registry will try to reconnect in 100 s
[INFO] 2023-11-17 13:15:11.832 +0800 org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperConnectionStateListener:[46] - [WorkflowInstance-0][TaskInstance-0] - Registry reconnected
[INFO] 2023-11-17 13:15:11.832 +0800 org.apache.dolphinscheduler.server.worker.registry.WorkerConnectionStateListener:[42] - [WorkflowInstance-0][TaskInstance-0] - Worker received a RECONNECTED event from registry, the current server state is WAITING
[INFO] 2023-11-17 13:15:11.833 +0800 org.apache.dolphinscheduler.server.worker.rpc.WorkerRpcServer:[40] - [WorkflowInstance-0][TaskInstance-0] - WorkerRpcServer starting...
[ERROR] 2023-11-17 13:15:11.833 +0800 org.apache.dolphinscheduler.server.worker.registry.WorkerWaitingStrategy:[108] - [WorkflowInstance-0][TaskInstance-0] - Recover from waiting failed, the current server status is RUNNING, will stop the server
java.lang.IllegalStateException: group set already
	at io.netty.bootstrap.AbstractBootstrap.group(AbstractBootstrap.java:92)
	at io.netty.bootstrap.ServerBootstrap.group(ServerBootstrap.java:83)
	at org.apache.dolphinscheduler.extract.base.NettyRemotingServer.start(NettyRemotingServer.java:91)
	at org.apache.dolphinscheduler.server.worker.rpc.WorkerRpcServer.start(WorkerRpcServer.java:41)
	at org.apache.dolphinscheduler.server.worker.registry.WorkerWaitingStrategy.reStartWorkerResource(WorkerWaitingStrategy.java:133)
	at org.apache.dolphinscheduler.server.worker.registry.WorkerWaitingStrategy.reconnect(WorkerWaitingStrategy.java:100)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:344)
	at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:208)
	at com.sun.proxy.$Proxy117.reconnect(Unknown Source)
	at org.apache.dolphinscheduler.server.worker.registry.WorkerConnectionStateListener.onUpdate(WorkerConnectionStateListener.java:50)
	at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperConnectionStateListener.stateChanged(ZookeeperConnectionStateListener.java:47)
	at org.apache.curator.framework.state.ConnectionStateManager.lambda$processEvents$0(ConnectionStateManager.java:281)
	at org.apache.curator.framework.listen.MappingListenerManager.lambda$forEach$0(MappingListenerManager.java:92)
	at org.apache.curator.framework.listen.MappingListenerManager.forEach(MappingListenerManager.java:89)
	at org.apache.curator.framework.listen.StandardListenerManager.forEach(StandardListenerManager.java:89)
	at org.apache.curator.framework.state.ConnectionStateManager.processEvents(ConnectionStateManager.java:281)
	at org.apache.curator.framework.state.ConnectionStateManager.access$000(ConnectionStateManager.java:43)
	at org.apache.curator.framework.state.ConnectionStateManager$1.call(ConnectionStateManager.java:134)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)

This means the restart of the RPC service failed, and the worker stops itself. At that point the worker service has to be restarted manually.
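The failure pattern can be reproduced with a minimal stand-in class. This is a hypothetical sketch, not the DolphinScheduler or Netty code: it only mimics the one-shot semantics of Netty's `AbstractBootstrap.group(...)`, which throws if a group was already assigned and is what makes a second `WorkerRpcServer.start()` on the same bootstrap instance fail.

```java
// Hypothetical stand-in for Netty's AbstractBootstrap: group() may only
// be called once per bootstrap instance. Restarting the RPC server by
// re-running start() on the same bootstrap therefore cannot work.
public class BootstrapSketch {

    private Object group; // stands in for the EventLoopGroup field

    public BootstrapSketch group(Object group) {
        if (this.group != null) {
            // Mirrors the message seen in the stack trace above
            throw new IllegalStateException("group set already");
        }
        this.group = group;
        return this;
    }

    // Simulates: first start() succeeds, the "restart" on reconnect fails.
    public static boolean secondStartFails() {
        BootstrapSketch bootstrap = new BootstrapSketch();
        bootstrap.group(new Object()); // initial start of the RPC server
        try {
            bootstrap.group(new Object()); // restart reuses the same bootstrap
            return false;
        } catch (IllegalStateException expected) {
            return true; // "group set already"
        }
    }
}
```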

A better strategy is not to stop the RPC service on disconnect at all; then nothing needs to be restarted on reconnection.
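A minimal sketch of that strategy (all names hypothetical, not the actual DolphinScheduler classes): disconnect only flips the server state to WAITING, and reconnect flips it back to RUNNING, so the RPC service is never torn down and never has to be rebuilt.

```java
// Hypothetical sketch of the revised waiting strategy: the RPC server keeps
// running across a registry disconnect; only the server state changes.
public class WaitingStrategySketch {

    public enum ServerStatus { RUNNING, WAITING, STOPPED }

    private ServerStatus status = ServerStatus.RUNNING;

    public void disconnect() {
        // Do NOT stop the RPC server here; just mark the worker unavailable.
        status = ServerStatus.WAITING;
    }

    public void reconnect() {
        // Nothing to restart; simply resume serving.
        status = ServerStatus.RUNNING;
    }

    public ServerStatus getStatus() {
        return status;
    }
}
```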

In addition, add a check in the worker's dispatcher that determines whether the worker is currently in an available state; if it is not, the dispatcher stops accepting tasks.
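Sketched as a guard (hypothetical names again, not the actual dispatcher code), the dispatcher would consult the server state before accepting a task:

```java
// Hypothetical sketch of the dispatcher guard: accept tasks only while the
// worker is RUNNING, so a WAITING worker is not handed work it cannot
// report back on.
public class DispatcherGuardSketch {

    public enum ServerStatus { RUNNING, WAITING, STOPPED }

    public static boolean canAccept(ServerStatus status) {
        return status == ServerStatus.RUNNING;
    }
}
```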

Similarly, the master should apply the same optimization in its WaitingStrategy.

Verify this pull request

This pull request is code cleanup without any test coverage.

Signed-off-by: Gallardot <gallardot@apache.org>
codecov-commenter commented Nov 24, 2023

Codecov Report

Attention: 3 lines in your changes are missing coverage. Please review.

Comparison is base (fd74cf1) 38.11% compared to head (2154b6c) 38.12%.

❗ Current head 2154b6c differs from pull request most recent head cf01087. Consider uploading reports for the commit cf01087 to get more accurate results

Files | Patch % | Lines
...perator/TaskInstanceDispatchOperationFunction.java | 0.00% | 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##                dev   #15223   +/-   ##
=========================================
  Coverage     38.11%   38.12%           
  Complexity     4697     4697           
=========================================
  Files          1299     1299           
  Lines         44783    44774    -9     
  Branches       4798     4799    +1     
=========================================
  Hits          17068    17068           
+ Misses        25864    25855    -9     
  Partials       1851     1851           


@ruanwenjun ruanwenjun added the bug Something isn't working label Nov 24, 2023
@ruanwenjun ruanwenjun changed the title [Improvement][Registry] Optimizing waiting strategy [Bug][Registry] Optimizing waiting strategy Nov 24, 2023
@ruanwenjun (Member) commented

Good job, I changed the title to Bug

@Gallardot Gallardot marked this pull request as ready for review November 27, 2023 12:11
@fuchanghai (Member) left a comment

LGTM. The problem was described very clearly; I learned from you.

ruanwenjun previously approved these changes Dec 17, 2023
Signed-off-by: Gallardot <gallardot@apache.org>

sonarcloud bot commented Dec 19, 2023

Please retry analysis of this Pull-Request directly on SonarCloud

ruanwenjun previously approved these changes Dec 21, 2023
@ruanwenjun (Member) left a comment

LGTM. The link https://twitter.com/dolphinschedule returns a 400, but that is not related to this PR.


sonarcloud bot commented Jan 2, 2024

@Radeity Radeity added this to the 3.2.1 milestone Jan 2, 2024
@Radeity Radeity merged commit 575b89e into apache:dev Jan 2, 2024
54 checks passed
@Gallardot Gallardot deleted the RECONNECTED branch January 3, 2024 01:56
Gallardot added a commit to Gallardot/dolphinscheduler that referenced this pull request Mar 14, 2024
* [Improvement][Registry] Optimizing waiting strategy

Signed-off-by: Gallardot <gallardot@apache.org>
Gallardot pushed a commit to Gallardot/dolphinscheduler that referenced this pull request Mar 14, 2024
[Bug][Registry] Optimizing waiting strategy (apache#15223)

See merge request logan/devops/apache/dolphinscheduler!11
Labels
backend · bug (Something isn't working) · ready-to-merge

5 participants