Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add logic to recover from quarantined and completed states #13071

Merged
merged 6 commits into from
May 24, 2022

Conversation

lmossman
Copy link
Contributor

@lmossman lmossman commented May 21, 2022

What

Resolves #10932

How

Adds logic to handle quarantined and "completed" states of workflows, by detecting workflows in this state and properly reporting them as "unreachable". This PR also adds logic to attempt termination of a workflow before restarting it. This is important for cases like the quarantined case where we cannot start a fresh workflow for the connection until we have terminated the existing one.

Recommended reading order

  1. x.java
  2. y.python

@github-actions github-actions bot added area/platform issues related to the platform area/worker Related to worker labels May 21, 2022
@lmossman lmossman temporarily deployed to more-secrets May 21, 2022 00:32 Inactive
Comment on lines +184 to +186
// A non-deleted workflow being in a COMPLETED state is unexpected, and should be corrected
throw new UnreachableWorkflowException(
String.format("ConnectionManagerWorkflow for connection %s is unreachable due to having COMPLETED status.", connectionId));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this case because when testing I saw an occurrence of a workflow with COMPLETED status that was not deleted. Since it was COMPLETED, it's state could still be queried, but no signals could be executed on it. This logic corrects this by correctly treating workflows in this state as "unreachable"

@lmossman lmossman marked this pull request as ready for review May 21, 2022 00:35
@lmossman lmossman requested review from benmoriceau and alovew May 21, 2022 00:35
Copy link
Contributor

@benmoriceau benmoriceau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, nit comment

} catch (final Exception e) {
log.warn(
"Could not terminate temporal workflow due to the following error; "
+ "this may be because there is currently running workflow for this connection.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean "no running workflow" here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I did, good catch

@lmossman lmossman temporarily deployed to more-secrets May 23, 2022 21:46 Inactive
@lmossman lmossman temporarily deployed to more-secrets May 23, 2022 21:58 Inactive
@lmossman lmossman temporarily deployed to more-secrets May 24, 2022 00:47 Inactive
@lmossman lmossman temporarily deployed to more-secrets May 24, 2022 00:55 Inactive
@lmossman lmossman merged commit e7c26d6 into master May 24, 2022
@lmossman lmossman deleted the lmossman/recover-from-quarantined branch May 24, 2022 02:51
jscottpolevault pushed a commit to jscottpolevault/airbyte that referenced this pull request Jun 1, 2022
…#13071)

* add logic to recover from quarantined and completed states

* move status retrieval into try-catch

* fix typo in log

* add one more tests

* mvoe isWorkflowStateRunning into ConnectionManagerUtils to be more direct

* format
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/platform issues related to the platform area/worker Related to worker
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Recover from quarantine and version state issue
2 participants