-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FINERACT-1724: Stucked COB job after restart Batch manager #3944
FINERACT-1724: Stucked COB job after restart Batch manager #3944
Conversation
efebc3a
to
71a42c2
Compare
if (areThereStuckJobs(jobName)) { | ||
final List<Long> stuckJobIds = getStuckJobIds(jobName); | ||
// If It is a Batch Manager | ||
if (properties.getMode().isBatchManagerEnabled()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dont think this is correct. It will mark complete all the business steps that are executing on the batch workers.
Checking whether we are the batch manager is useless. The stuck job executor only be called in batch manager.
Here what we need is to decide whether the stuck batch steps are started on the batch manager or on the batch worker. If it is a batch manager steps, and it is STARTED. We can mark them as FAILED hence the batch manager was restarted. If the found batch steps are batch worker business steps, the batch manager shall wait till they got finished before restarting the job.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done! It has been updated
@@ -111,4 +133,57 @@ private boolean areAllPartitionsCompleted(Long stuckJobId, String partitionerSte | |||
Long notCompletedPartitions = jobExecutionRepository.getNotCompletedPartitionsCount(stuckJobId, partitionerStepName); | |||
return notCompletedPartitions == 0L; | |||
} | |||
|
|||
private void markAsCompleted(long executionId) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We cannot mark it as COMPLETED... it was not COMPLETES. We shall mark it as FAILED!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, Status will be Failed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please kindly see my review!
36e665b
to
87f615a
Compare
PR Updated |
restartTaskletJobs(stuckJobIds); | ||
if (areThereStuckJobs(jobName)) { | ||
final List<Long> stuckJobIds = getStuckJobIds(jobName); | ||
stuckJobIds.forEach(stuckJobId -> markAsFailed(stuckJobId)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is still incorrect to immediately mark the job as failed!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original handling was correct, but there was a "flaw":
It got stuck forever if it is waiting on its own batch step to finish:
Example:
- Job got started on the Batch manager (Loan COB)
- Batch manager starts the "Loan COB partition - Step" (in DB it status goes to STARTING or STARTED), but it got interrupted and batch manager got restarted.
- When the batch manager restarted it check is there any "running" batch step and will find the "Loan COB partition - Step" and it starts waiting till it got completed (which never happens hence the batch manager was restarted).
Solution
if the batch manager was restarted and one of the batch step that is belong to the Manager (batch manager started that batch step!) is in STARTING / STARTED, mark it as failed and do not wait till got finished (which never happens)
87f615a
to
5896c1b
Compare
This pull request seems to be stale. Are you still planning to work on it? We will automatically close it in 30 days. |
Description
The problem occurs only when one of the batch step of the manager got stucked / or remains in “STARTED” state and the batch manager got restarted.
Our stuck logic should recognize whether we are waiting on a worker batch step or waiting on a manager batch step. For the first one we should wait, but for the second we might need to restart or mark it as failed and restart the job
FINERACT-1724
Checklist
Please make sure these boxes are checked before submitting your pull request - thanks!
Write the commit message as per https://github.com/apache/fineract/#pull-requests
Acknowledge that we will not review PRs that are not passing the build ("green") - it is your responsibility to get a proposed PR to pass the build, not primarily the project's maintainers.
Create/update unit or integration tests for verifying the changes made.
Follow coding conventions at https://cwiki.apache.org/confluence/display/FINERACT/Coding+Conventions.
Add required Swagger annotation and update API documentation at fineract-provider/src/main/resources/static/legacy-docs/apiLive.htm with details of any API changes
Submission is not a "code dump". (Large changes can be made "in repository" via a branch. Ask on the developer mailing list for guidance, if required.)
FYI our guidelines for code reviews are at https://cwiki.apache.org/confluence/display/FINERACT/Code+Review+Guide.