[Bug][Master]serial_wait strategy workflow unable to wake up #15270
Codecov Report
Additional details and impacted files
@@ Coverage Diff @@
## dev #15270 +/- ##
=========================================
Coverage 38.11% 38.11%
- Complexity 4696 4697 +1
=========================================
Files 1299 1299
Lines 44775 44777 +2
Branches 4797 4797
=========================================
+ Hits 17066 17067 +1
- Misses 25861 25863 +2
+ Partials 1848 1847 -1
View full report in Codecov by Sentry.
Signed-off-by: Gallardot <gallardot@apache.org>
Signed-off-by: Gallardot <gallardot@apache.org>
8e914b4 to 50fbffb
LGTM, this is a hot fix PR.
The serial wait notify logic should be refactored; otherwise, if the notify fails, the serial wait workflow instance will stay in the running state forever.
@ruanwenjun @Radeity @EricGao888 @SbloodyS @fuchanghai @qingwli @caishunfeng PTAL.
Agree with @ruanwenjun.
Hi @Gallardot, can you associate the issue with this PR?
Hi @Gallardot, as far as I remember, for nested calls you need to go through a proxy object for the annotation to take effect, and you need to add EnableAspectJAutoProxy on Application. cc @ruanwenjun
There are already some issues about this from the community.
Which part does it mean?
The transaction has already been opened on the handleCommand method, and I am not sure whether the new transaction in this method will take effect. AFAIK, this is a nested transaction, and `@EnableAspectJAutoProxy(exposeProxy = true)` needs to be enabled for the annotation to take effect. Maybe I'm wrong; can you test whether the database data is updated after running the upsert method? cc @EricGao888 @ruanwenjun
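The concern raised here is the general self-invocation limitation of proxy-based interception: a call made through `this` never passes through the proxy, so any behavior attached by the proxy (such as starting a new transaction) is skipped. The following is a minimal plain-JDK sketch of that effect using `java.lang.reflect.Proxy`. It does not use Spring and is not DolphinScheduler code; `CommandService`, `SelfCallingService`, and `DelegatingService` are hypothetical names that only echo the methods discussed in this thread.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

public class ProxySelfInvocationDemo {

    /** Hypothetical service interface; the method names only echo the discussion. */
    interface CommandService {
        void handleCommand();
        void saveSerialProcess();
    }

    /** Names of the calls that actually passed through a proxy. */
    static final List<String> intercepted = new ArrayList<>();

    /** Calls its own method via `this`, so the proxy never sees the inner call. */
    static class SelfCallingService implements CommandService {
        public void handleCommand() { saveSerialProcess(); }
        public void saveSerialProcess() { }
    }

    /** Calls the inner method on a separately proxied collaborator instead. */
    static class DelegatingService implements CommandService {
        CommandService statusUpdater;
        public void handleCommand() { statusUpdater.saveSerialProcess(); }
        public void saveSerialProcess() { }
    }

    /** Wraps a target in a JDK dynamic proxy that records each intercepted call. */
    static CommandService proxy(CommandService target) {
        InvocationHandler handler = (self, method, methodArgs) -> {
            intercepted.add(method.getName());
            return method.invoke(target, methodArgs);
        };
        return (CommandService) Proxy.newProxyInstance(
                CommandService.class.getClassLoader(),
                new Class<?>[] {CommandService.class},
                handler);
    }

    /** Only the outer call is intercepted. */
    static List<String> runSelfCall() {
        intercepted.clear();
        proxy(new SelfCallingService()).handleCommand();
        return new ArrayList<>(intercepted);
    }

    /** Both the outer and the inner call are intercepted. */
    static List<String> runDelegated() {
        intercepted.clear();
        DelegatingService outer = new DelegatingService();
        outer.statusUpdater = proxy(new SelfCallingService());
        proxy(outer).handleCommand();
        return new ArrayList<>(intercepted);
    }

    public static void main(String[] args) {
        System.out.println("self call:  " + runSelfCall());
        System.out.println("delegated:  " + runDelegated());
    }
}
```

In the self-calling variant only `handleCommand` is recorded, while the delegating variant also records `saveSerialProcess`. Whether the annotated method in this PR is reached through a proxy depends on how it is invoked, which is why testing that the database row is actually updated is a reasonable check.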
@Gallardot Sorry, I didn't notice EnableTransactionManagement. LGTM.
As I mentioned in the Purpose of the pull request, this issue is caused by the fact that For the usage of the |
Quality Gate failed
Failed conditions: 50.0% Coverage on New Code (required ≥ 60%)
…15270)
* fix: serial_wait strategy workflow unable to wake up
Signed-off-by: Gallardot <gallardot@apache.org>
* fix: serial_wait strategy workflow unable to wake up
Signed-off-by: Gallardot <gallardot@apache.org>
---------
Signed-off-by: Gallardot <gallardot@apache.org>
Co-authored-by: fuchanghai <changhaifu@apache.org>
[Bug][Master]serial_wait strategy workflow unable to wake up (apache#15270) See merge request logan/devops/apache/dolphinscheduler!13
Purpose of the pull request
This is an analysis of a bug related to the serial wait strategy, which causes the workflow instance to remain in a waiting state indefinitely.
When a workflow's scheduled strategy is `SERIAL_WAIT` and a workflow instance's status is WAITING, that instance can remain in the waiting state even after the previous workflow instance has finished executing. The problem occurs intermittently.
The analysis of the cause is as follows. The `MasterSchedulerBootstrap` thread processes commands through the `handleCommand` method. Note that `handleCommand` runs within a transaction. In this transaction, the `saveSerialProcess` method is used to modify the status of the workflow instance. At the same time, in the separate thread pool of `WorkflowExecuteRunnable`, the `checkSerialProcess` method checks the status of workflow instances in order to wake up an instance that is in the waiting state.
Everything seems fine, but there is one specific situation: one workflow instance is about to complete while another workflow instance is being created. Because of transaction isolation, `saveSerialProcess` in the `handleCommand` method may have just been executed but not yet committed. At that moment, `checkSerialProcess` cannot see that the status of the new workflow instance is WAITING, so that instance remains in the waiting state and is never woken up.
My solution is to update the status of the workflow instance in a new transaction rather than in the `handleCommand` transaction, so that the update is committed independently. This avoids the problem above. I have been running this in my environment for two months, and the problem has not recurred.
dolphinscheduler/dolphinscheduler-service/src/main/java/org/apache/dolphinscheduler/service/process/ProcessServiceImpl.java
Lines 291 to 316 in 0f7081b
dolphinscheduler/dolphinscheduler-service/src/main/java/org/apache/dolphinscheduler/service/process/ProcessServiceImpl.java
Lines 326 to 342 in bd48c99
dolphinscheduler/dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/runner/WorkflowExecuteRunnable.java
Lines 790 to 832 in bd48c99
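The visibility problem described above can be illustrated with a small toy model. This is a sketch under stated assumptions, not the actual DolphinScheduler code: it models a database where readers only see committed rows, `Tx` is a hypothetical stand-in for a database transaction, and the instance id `2` is invented for illustration. The interleaving is played out sequentially so the result is deterministic.

```java
import java.util.HashMap;
import java.util.Map;

public class SerialWaitVisibilityDemo {

    /** Committed rows (instance id -> state); this is all that other readers can see. */
    static final Map<Integer, String> committed = new HashMap<>();

    /** Hypothetical transaction: writes stay private until commit() is called. */
    static class Tx {
        private final Map<Integer, String> pending = new HashMap<>();

        void write(int instanceId, String state) {
            pending.put(instanceId, state);
        }

        void commit() {
            committed.putAll(pending);
            pending.clear();
        }
    }

    /**
     * Stand-in for the check done by a finishing instance: look for a committed
     * WAITING instance to wake up. Returns its id, or -1 if none is visible.
     */
    static int findWaitingInstance() {
        for (Map.Entry<Integer, String> row : committed.entrySet()) {
            if ("WAITING".equals(row.getValue())) {
                return row.getKey();
            }
        }
        return -1;
    }

    /** The status is written inside the still-open outer transaction. */
    static int checkWithSharedTransaction() {
        committed.clear();
        Tx outer = new Tx();               // plays the role of the handleCommand transaction
        outer.write(2, "WAITING");         // status saved but not yet committed
        int seen = findWaitingInstance();  // the other thread checks right now
        outer.commit();                    // too late: the check has already run
        return seen;
    }

    /** The status is written and committed in its own transaction first. */
    static int checkWithNewTransaction() {
        committed.clear();
        Tx outer = new Tx();               // outer transaction is still open
        Tx inner = new Tx();               // separate transaction for the status update
        inner.write(2, "WAITING");
        inner.commit();                    // visible to other readers immediately
        int seen = findWaitingInstance();
        outer.commit();
        return seen;
    }

    public static void main(String[] args) {
        System.out.println("shared transaction sees: " + checkWithSharedTransaction());
        System.out.println("new transaction sees:    " + checkWithNewTransaction());
    }
}
```

With the shared transaction the check finds nothing to wake (`-1`), which corresponds to the stuck WAITING instance. With the separately committed update the check finds instance `2`.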
Brief change log
Verify this pull request
This pull request is code cleanup without any test coverage.
(or)
This pull request is already covered by existing tests, such as (please describe tests).
(or)
This change added tests and can be verified as follows:
(or)
If your pull request contains an incompatible change, you should also add it to
docs/docs/en/guide/upgrade/incompatible.md