Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FINERACT-1724: Stucked COB job after restart Batch manager #3944

Conversation

alberto-art3ch
Copy link
Contributor

Description

The problem occurs only when one of the batch step of the manager got stucked / or remains in “STARTED” state and the batch manager got restarted.

Our stuck logic should recognize whether we are waiting on a worker batch step or waiting on a manager batch step. For the first one we should wait, but for the second we might need to restart or mark it as failed and restart the job

FINERACT-1724

Screenshot 2024-06-20 at 5 05 29 p m

Checklist

Please make sure these boxes are checked before submitting your pull request - thanks!

  • Write the commit message as per https://github.com/apache/fineract/#pull-requests

  • Acknowledge that we will not review PRs that are not passing the build ("green") - it is your responsibility to get a proposed PR to pass the build, not primarily the project's maintainers.

  • Create/update unit or integration tests for verifying the changes made.

  • Follow coding conventions at https://cwiki.apache.org/confluence/display/FINERACT/Coding+Conventions.

  • Add required Swagger annotation and update API documentation at fineract-provider/src/main/resources/static/legacy-docs/apiLive.htm with details of any API changes

  • Submission is not a "code dump". (Large changes can be made "in repository" via a branch. Ask on the developer mailing list for guidance, if required.)

FYI our guidelines for code reviews are at https://cwiki.apache.org/confluence/display/FINERACT/Code+Review+Guide.

@alberto-art3ch alberto-art3ch force-pushed the fix/loan_cob_loop_after_crash_manager branch from efebc3a to 71a42c2 Compare June 21, 2024 03:27
if (areThereStuckJobs(jobName)) {
final List<Long> stuckJobIds = getStuckJobIds(jobName);
// If It is a Batch Manager
if (properties.getMode().isBatchManagerEnabled()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont think this is correct. It will mark complete all the business steps that are executing on the batch workers.

Checking whether we are the batch manager is useless. The stuck job executor only be called in batch manager.

Here what we need is to decide whether the stuck batch steps are started on the batch manager or on the batch worker. If it is a batch manager steps, and it is STARTED. We can mark them as FAILED hence the batch manager was restarted. If the found batch steps are batch worker business steps, the batch manager shall wait till they got finished before restarting the job.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done! It has been updated

@@ -111,4 +133,57 @@ private boolean areAllPartitionsCompleted(Long stuckJobId, String partitionerSte
Long notCompletedPartitions = jobExecutionRepository.getNotCompletedPartitionsCount(stuckJobId, partitionerStepName);
return notCompletedPartitions == 0L;
}

private void markAsCompleted(long executionId) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We cannot mark it as COMPLETED... it was not COMPLETES. We shall mark it as FAILED!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, Status will be Failed

Copy link
Contributor

@adamsaghy adamsaghy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please kindly see my review!

@alberto-art3ch alberto-art3ch force-pushed the fix/loan_cob_loop_after_crash_manager branch 2 times, most recently from 36e665b to 87f615a Compare June 21, 2024 15:20
@alberto-art3ch
Copy link
Contributor Author

Please kindly see my review!

PR Updated

restartTaskletJobs(stuckJobIds);
if (areThereStuckJobs(jobName)) {
final List<Long> stuckJobIds = getStuckJobIds(jobName);
stuckJobIds.forEach(stuckJobId -> markAsFailed(stuckJobId));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is still incorrect to immediately mark the job as failed!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original handling was correct, but there was a "flaw":

It got stuck forever if it is waiting on its own batch step to finish:
Example:

  • Job got started on the Batch manager (Loan COB)
  • Batch manager starts the "Loan COB partition - Step" (in DB it status goes to STARTING or STARTED), but it got interrupted and batch manager got restarted.
  • When the batch manager restarted it check is there any "running" batch step and will find the "Loan COB partition - Step" and it starts waiting till it got completed (which never happens hence the batch manager was restarted).

Solution

if the batch manager was restarted and one of the batch step that is belong to the Manager (batch manager started that batch step!) is in STARTING / STARTED, mark it as failed and do not wait till got finished (which never happens)

@alberto-art3ch alberto-art3ch force-pushed the fix/loan_cob_loop_after_crash_manager branch from 87f615a to 5896c1b Compare June 24, 2024 16:56
Copy link

This pull request seems to be stale. Are you still planning to work on it? We will automatically close it in 30 days.

@github-actions github-actions bot added the stale label Jul 25, 2024
@github-actions github-actions bot closed this Aug 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants