Repair temporal state when performing manual actions #12289

lmossman · 2022-04-23T00:52:14Z

What

Resolves #10931
Resolves #11216
Resolves #11213
Resolves #12160

The main goal of this PR is to fix our interactions with the connection manager workflow in the TemporalClient. After some investigation, I found that our TemporalClient was not correctly handling the various states that a workflow can be in.

Namely, if a connection and its workflow have been deleted, the workflow and its state are still retrievable using the WorkflowClient, but signal methods cannot be executed on the workflow because it is in state "Completed" (this was the root cause of this issue). Separately, if a workflow is terminated, then its state cannot be retrieved at all.
This PR makes both of these cases more explicit through new exceptions, and updates the various manual operation methods to handle these exceptions and attempt to automatically repair workflows that are in bad states.

How

By refactoring the TemporalClient and using the new repairAndRetrieveConnectionManagerWorkflow() method, this PR makes each manual operation correctly handle the deleted case, and automatically repair the workflow in the "unexpected state" case.
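
The two cases described above can be sketched roughly as follows. This is a minimal, self-contained model and not the actual TemporalClient code: the `Status` enum, the in-memory `workflows` map, the `Workflow` class and the `describe` helper are hypothetical stand-ins for the Temporal client, and only the two exception names and the two retrieval method names come from this PR.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Toy model of the retrieval logic; every type here is a hypothetical stand-in.
final class WorkflowRetrievalSketch {

  enum Status { RUNNING, COMPLETED, TERMINATED }

  static final class Workflow {
    final Status status;
    final boolean deleted; // plays the role of workflowState.isDeleted()

    Workflow(Status status, boolean deleted) {
      this.status = status;
      this.deleted = deleted;
    }
  }

  static class DeletedWorkflowException extends Exception {
    DeletedWorkflowException(String message) {
      super(message);
    }
  }

  static class UnreachableWorkflowException extends Exception {
    UnreachableWorkflowException(String message) {
      super(message);
    }
  }

  // Stand-in for the Temporal server: connection id -> workflow.
  final Map<UUID, Workflow> workflows = new HashMap<>();

  Workflow getConnectionManagerWorkflow(UUID connectionId)
      throws DeletedWorkflowException, UnreachableWorkflowException {
    Workflow workflow = workflows.get(connectionId);
    // In this model, a terminated (or missing) workflow cannot have its state read.
    if (workflow == null || workflow.status == Status.TERMINATED) {
      throw new UnreachableWorkflowException("Cannot retrieve state for connection " + connectionId);
    }
    // A deleted connection leaves a workflow whose state is still readable.
    if (workflow.deleted) {
      throw new DeletedWorkflowException("Connection " + connectionId + " has been deleted");
    }
    return workflow;
  }

  Workflow repairAndRetrieveConnectionManagerWorkflow(UUID connectionId)
      throws DeletedWorkflowException {
    try {
      return getConnectionManagerWorkflow(connectionId);
    } catch (UnreachableWorkflowException e) {
      // Repair: restart the workflow and hand back the fresh instance.
      Workflow restarted = new Workflow(Status.RUNNING, false);
      workflows.put(connectionId, restarted);
      return restarted;
    }
  }

  // Convenience for callers that only need the outcome as a string.
  String describe(UUID connectionId) {
    try {
      return repairAndRetrieveConnectionManagerWorkflow(connectionId).status.name();
    } catch (DeletedWorkflowException e) {
      return "DELETED";
    }
  }

  public static void main(String[] args) {
    WorkflowRetrievalSketch client = new WorkflowRetrievalSketch();
    UUID deleted = UUID.randomUUID();
    UUID terminated = UUID.randomUUID();
    client.workflows.put(deleted, new Workflow(Status.COMPLETED, true));
    client.workflows.put(terminated, new Workflow(Status.TERMINATED, false));
    System.out.println(client.describe(deleted));    // deleted case is surfaced to the caller
    System.out.println(client.describe(terminated)); // unreachable case is repaired
  }
}
```

In this model the deleted case always propagates to the caller, while the unreachable case is absorbed by the repair method, which is the split the description above is aiming for.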

The plan is to follow this up with another PR to address the "start from a clean state" ticket, so that when these automatic workflow repairs are performed, they do not cause any issues with job state.
I plan to wait to merge this PR until that second PR has been reviewed and merged into this one, at which point they will be merged to master together, because we should really add both at the same time to be safe.

This PR also renames some classes and methods to be more consistent, and adds more TemporalClient tests.

Recommended reading order

TemporalClient.java
TemporalClientTest.java
The rest is mostly side effects of refactoring the above

lmossman · 2022-04-26T02:02:53Z

airbyte-workers/src/main/java/io/airbyte/workers/temporal/TemporalClient.java

+   * @throws DeletedWorkflowException if the workflow was deleted, according to the workflow state
+   * @throws UnreachableWorkflowException if the workflow is unreachable
+   */
+  private ConnectionManagerWorkflow getConnectionManagerWorkflow(final UUID connectionId)


This method and the following method are the main changes in this PR. They give the rest of this class a way to retrieve the connection manager workflow while forcing callers to handle the deleted case and the unreachable case through exceptions. The second method, repairAndRetrieveConnectionManagerWorkflow, handles the unreachable case automatically by restarting the temporal workflow.

cgardens

Nice! I like this approach a lot. Made a couple of cleanup suggestions but otherwise it looks good!

airbyte-server/src/main/java/io/airbyte/server/handlers/SchedulerHandler.java

cgardens · 2022-04-26T03:47:20Z

airbyte-workers/src/main/java/io/airbyte/workers/temporal/TemporalClient.java

+    final ConnectionManagerWorkflow connectionManagerWorkflow;
+    try {
+      connectionManagerWorkflow = repairAndRetrieveConnectionManagerWorkflow(connectionId);
+    } catch (final DeletedWorkflowException e) {
+      log.error("Can't cancel a deleted workflow");
+      return new ManualOperationResult(
+          Optional.of(e.getMessage()),
          Optional.empty());
    }


this block of code is repeated a bunch of times in this class (except for the contents of the log message). can we DRY it?

I think this is a hard piece of logic to DRY, because this logic means that this method should return a ManualOperationResult in the case of a DeletedWorkflowException. So if I try to move this logic into another method, that method will need to somehow indicate to this method that it should return a ManualOperationResult in that case, or return a ConnectionManagerWorkflow in the normal case, and this method will need to handle both cases with a conditional. So I think either way, this will be somewhat ugly if we try to DRY it.
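
One possible shape for sharing this block, sketched under assumptions rather than taken from the PR: a helper receives the per-operation logic as a lambda and translates DeletedWorkflowException into a failed result in one place. `ManualOperationResult` here is a simplified stand-in with two Optional fields (mirroring the two-argument constructor in the diff, with assumed field names), and `withWorkflow`, `WorkflowSource`, `WorkflowAction` and `Workflow` are hypothetical names.

```java
import java.util.Optional;
import java.util.UUID;

// Sketch only: every type below is a simplified, hypothetical stand-in.
final class ManualOperationSketch {

  static class DeletedWorkflowException extends Exception {
    DeletedWorkflowException(String message) {
      super(message);
    }
  }

  // Two Optional fields, mirroring the two-argument constructor in the diff.
  // The field names are assumptions.
  static final class ManualOperationResult {
    final Optional<String> failingReason;
    final Optional<Long> jobId;

    ManualOperationResult(Optional<String> failingReason, Optional<Long> jobId) {
      this.failingReason = failingReason;
      this.jobId = jobId;
    }
  }

  // Hypothetical workflow handle exposing a single value.
  interface Workflow {
    long currentJobId();
  }

  // Stands in for repairAndRetrieveConnectionManagerWorkflow(connectionId).
  interface WorkflowSource {
    Workflow retrieve(UUID connectionId) throws DeletedWorkflowException;
  }

  // The part that differs between manual operations.
  interface WorkflowAction {
    ManualOperationResult apply(Workflow workflow);
  }

  // Shared wrapper: the deleted case becomes a failed result exactly once.
  static ManualOperationResult withWorkflow(WorkflowSource source, UUID connectionId,
      String operationName, WorkflowAction action) {
    final Workflow workflow;
    try {
      workflow = source.retrieve(connectionId);
    } catch (DeletedWorkflowException e) {
      System.err.println("Can't " + operationName + " a deleted workflow");
      return new ManualOperationResult(Optional.of(e.getMessage()), Optional.empty());
    }
    return action.apply(workflow);
  }

  public static void main(String[] args) {
    UUID connectionId = UUID.randomUUID();
    WorkflowSource healthy = id -> () -> 7L;
    WorkflowSource deleted = id -> {
      throw new DeletedWorkflowException("connection was deleted");
    };
    WorkflowAction cancel =
        w -> new ManualOperationResult(Optional.empty(), Optional.of(w.currentJobId()));

    System.out.println(withWorkflow(healthy, connectionId, "cancel", cancel).jobId);
    System.out.println(withWorkflow(deleted, connectionId, "cancel", cancel).failingReason);
  }
}
```

The trade-off raised above still applies: the caller no longer sees the workflow directly, so each operation has to be expressed as a function of the workflow rather than as straight-line code.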

cgardens · 2022-04-26T03:48:47Z

...te-workers/src/main/java/io/airbyte/workers/temporal/exception/DeletedWorkflowException.java

+
+package io.airbyte.workers.temporal.exception;
+
+public class DeletedWorkflowException extends Exception {


maybe helpful for this exception and the other one to add a javadoc comment explaining what they mean and why they might occur.
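
A sketch of what such Javadoc might look like, with wording drawn from the explanation in the PR description. The constructors are assumptions, and the package declaration and `public` modifiers are dropped so that both classes fit in one file.

```java
/**
 * Thrown when the connection manager workflow belongs to a connection that has
 * been deleted. Such a workflow has completed: its state can still be read,
 * but signal methods can no longer be executed on it, so callers should treat
 * the requested operation as failed instead of retrying.
 */
class DeletedWorkflowException extends Exception {

  DeletedWorkflowException(final String message) {
    super(message);
  }
}

/**
 * Thrown when the state of the connection manager workflow cannot be retrieved
 * at all, for example because the workflow was terminated. This is not an
 * expected state, so callers may attempt to repair it by restarting the
 * workflow.
 */
class UnreachableWorkflowException extends Exception {

  UnreachableWorkflowException(final String message, final Throwable cause) {
    super(message, cause);
  }
}
```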

benmoriceau · 2022-04-26T21:33:21Z

airbyte-workers/src/main/java/io/airbyte/workers/temporal/TemporalClient.java

    }

-    final ConnectionManagerWorkflow connectionManagerWorkflow =
-        getExistingWorkflow(ConnectionManagerWorkflow.class, getConnectionManagerName(connectionId));
+    if (workflowState.isDeleted()) {


I don't understand how this block works. It means that most deleted workflows will be considered Unreachable, because those workflows are supposed to be terminated.

@benmoriceau One thing I discovered while investigating this issue is that when we delete a connection, the temporal workflow is actually not terminated. The temporal workflow in the delete case has status Completed, which is a separate status from Terminated. For these deleted workflows, we can actually still retrieve the workflow and read values from its workflowState, but we cannot call any signal methods on the workflow because it is not actively running.

I think the reason the temporal workflow is Completed and not Terminated in the deletion case is because we just return in that case:

airbyte/airbyte-workers/src/main/java/io/airbyte/workers/temporal/scheduling/ConnectionManagerWorkflowImpl.java

Lines 105 to 109 in f816946

  if (workflowState.isDeleted()) {
    log.info("Workflow deletion was requested. Calling deleteConnection activity before terminating the workflow.");
    deleteConnectionBeforeTerminatingTheWorkflow();
    return;
  }

And I think when the run method returns, temporal marks the workflow as Completed.

So my new understanding is that the "unreachable" case is never really expected, which is why we repair the workflow in that case.

I see. And for a Terminated workflow, I guess we can't query the state.

Yep that's correct
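
The conclusion of this thread can be condensed into a small table of what each status allows. This is a toy model for illustration: the `Status` enum, its two flags and the `handle` method are hypothetical and are not Temporal's own status type.

```java
// Toy model of the statuses discussed in this thread; not Temporal's own enum.
final class WorkflowStatusSketch {

  enum Status {
    RUNNING(true, true),      // state can be read, signals are accepted
    COMPLETED(true, false),   // run method returned: state readable, no signals
    TERMINATED(false, false); // state cannot be read, no signals

    final boolean stateReadable;
    final boolean signalable;

    Status(boolean stateReadable, boolean signalable) {
      this.stateReadable = stateReadable;
      this.signalable = signalable;
    }
  }

  // Maps each status to the handling described in the discussion.
  static String handle(Status status) {
    if (!status.stateReadable) {
      return "repair"; // unreachable: restart the workflow
    }
    if (!status.signalable) {
      return "reject"; // deleted connection: fail the manual operation
    }
    return "signal"; // healthy: proceed with the signal
  }

  public static void main(String[] args) {
    for (Status status : Status.values()) {
      System.out.println(status + " -> " + handle(status));
    }
  }
}
```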

benmoriceau · 2022-04-26T22:21:08Z

I plan to wait to merge this PR until that second PR has been reviewed and merged into this one, at which point they will be merged to master together, because we should really add both at the same time to be safe.

This looks fine to me with this assumption. I have one more concern: since we are starting the workflow asynchronously, I am wondering if we should use https://www.javadoc.io/static/io.temporal/temporal-sdk/1.0.3/io/temporal/client/WorkflowClient.html#newSignalWithStartRequest-- and https://www.javadoc.io/static/io.temporal/temporal-sdk/1.0.3/io/temporal/client/WorkflowClient.html#signalWithStart-io.temporal.client.BatchRequest- to ensure that there won't be any race condition between the workflow becoming accessible and the signal being sent after a restart.

lmossman · 2022-04-26T22:30:27Z

@benmoriceau to make sure I understand correctly, you are suggesting that in these cases where we restart the workflow before sending a signal, we should instead submit the signal with the start request in the same BatchRequest, like the current deleteConnection implementation is doing here, correct? I think this is a good callout; I hadn't considered that. I can look into that change.

benmoriceau · 2022-04-27T21:09:08Z

@lmossman yes it is exactly that. Temporal will ensure that the workflow is reachable before submitting the signal to it.
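
The ordering problem being discussed can be illustrated with a toy in-memory model. This is only a sketch of why combining start and signal removes the race. It does not use the Temporal SDK, and the class and all of its methods are hypothetical; in real code the combined step would go through the WorkflowClient methods linked above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy stand-in for a workflow server; hypothetical, not the Temporal SDK.
final class SignalWithStartSketch {

  // Workflow id -> signals delivered to the running workflow.
  private final Map<String, List<String>> running = new HashMap<>();

  synchronized void start(String workflowId) {
    running.putIfAbsent(workflowId, new ArrayList<>());
  }

  // A plain signal fails if the workflow is not (yet) running.
  synchronized void signal(String workflowId, String signal) {
    List<String> inbox = running.get(workflowId);
    if (inbox == null) {
      throw new IllegalStateException("workflow " + workflowId + " is not running");
    }
    inbox.add(signal);
  }

  // Start-if-needed and deliver in one step, so no ordering problem can occur.
  synchronized void signalWithStart(String workflowId, String signal) {
    running.computeIfAbsent(workflowId, id -> new ArrayList<>()).add(signal);
  }

  synchronized List<String> signals(String workflowId) {
    return new ArrayList<>(running.getOrDefault(workflowId, new ArrayList<>()));
  }

  public static void main(String[] args) {
    SignalWithStartSketch server = new SignalWithStartSketch();
    try {
      // Simulates the signal arriving before an asynchronous start has taken effect.
      server.signal("connection_manager_1", "cancel");
    } catch (IllegalStateException e) {
      System.out.println("lost signal: " + e.getMessage());
    }
    server.signalWithStart("connection_manager_1", "cancel");
    System.out.println(server.signals("connection_manager_1"));
  }
}
```

With separate `start` and `signal` calls the caller has to guess when the workflow has become reachable; the combined call leaves that decision to the server.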

benmoriceau · 2022-04-29T21:56:02Z

@benmoriceau I was able to finalize the stuff that you and I paired on during sit together - I think it ended up having a decent and fairly DRY structure. Lmk what you think

LGTM, I have approved it.

CLAassistant · 2022-05-05T13:10:24Z

All committers have signed the CLA.

* Repair temporal state when performing manual actions
* refactor temporal client and fix tests
* add unreachable workflow exception
* format
* test repeated deletion
* add acceptance tests for automatic workflow repair
* rename and DRY up manual operation methods in SchedulerHandler
* refactor temporal client to batch signal and start requests together in repair case
* add comment
* remove main method
* fix job id fetching
* only overwrite workflowState if reset flags are true on input
* fix test
* fix cancel endpoint
* Clean job state before creating new jobs in connection manager workflow (#12589)
* first working iteration of cleaning job state on first workflow run
* second iteration, with tests
* undo local testing changes
* move method
* add comment explaining placement of clean job state logic
* change connection_workflow failure origin value to platform
* remove cast from new query
* create static var for non terminal job statuses
* change failure origin value to airbyte_platform
* tweak external message wording
* remove unused variable
* reword external message
* fix merge conflict
* remove log lines
* move cleaning job state to beginning of workflow
* do not clean job state if there is already a job id for this workflow, and add test
* see if sleeping fixes test on CI
* add repeated test annotation to protect from flakiness
* fail jobs before creating new ones to protect from quarantined state
* update external message for cleaning job state error

Repair temporal state when performing manual actions

8375de5

github-actions bot added area/platform issues related to the platform area/scheduler area/server area/worker Related to worker labels Apr 23, 2022

lmossman temporarily deployed to more-secrets April 23, 2022 00:53 Inactive

lmossman temporarily deployed to more-secrets April 23, 2022 00:54 Inactive

lmossman added 2 commits April 25, 2022 18:20

refactor temporal client and fix tests

00939f2

add unreachable workflow exception

721007b

lmossman temporarily deployed to more-secrets April 26, 2022 01:24 Inactive

format

01572ad

lmossman temporarily deployed to more-secrets April 26, 2022 01:30 Inactive

lmossman added 2 commits April 25, 2022 18:50

test repeated deletion

c5ab61f

add acceptance tests for automatic workflow repair

2d3f806

lmossman marked this pull request as ready for review April 26, 2022 01:59

lmossman requested review from benmoriceau and cgardens April 26, 2022 01:59

lmossman temporarily deployed to more-secrets April 26, 2022 02:01 Inactive

lmossman commented Apr 26, 2022

cgardens approved these changes Apr 26, 2022

benmoriceau reviewed Apr 26, 2022

rename and DRY up manual operation methods in SchedulerHandler

16c7a16

lmossman temporarily deployed to more-secrets April 26, 2022 22:49 Inactive

fix job id fetching

95c4bfa

lmossman temporarily deployed to more-secrets May 2, 2022 22:36 Inactive

only overwrite workflowState if reset flags are true on input

925a421

lmossman force-pushed the lmossman/repair-unexpected-temporal-state branch from 37b0fdc to 925a421 Compare May 2, 2022 22:44

lmossman temporarily deployed to more-secrets May 2, 2022 22:46 Inactive

lmossman mentioned this pull request May 4, 2022

Clean job state before creating new jobs in connection manager workflow #12589

Merged

fix test

7fabc1b

lmossman temporarily deployed to more-secrets May 5, 2022 21:35 Inactive

fix cancel endpoint

261bc05

lmossman temporarily deployed to more-secrets May 5, 2022 23:01 Inactive

This was referenced May 10, 2022

Automatically repair terminated connection manager workflows #12746

Closed

Do not allow any operation on deleted connection #10840

Closed

Merge branch 'master' into lmossman/repair-unexpected-temporal-state

1f11152

lmossman temporarily deployed to more-secrets May 10, 2022 19:52 Inactive

github-actions bot added the area/api Related to the api label May 13, 2022

lmossman merged commit e8084c0 into master May 13, 2022

lmossman deleted the lmossman/repair-unexpected-temporal-state branch May 13, 2022 00:43

lmossman temporarily deployed to more-secrets May 13, 2022 00:44 Inactive

octavia-squidington-iii mentioned this pull request May 13, 2022

Bump Airbyte version from 0.38.2-alpha to 0.38.3-alpha #12839

Merged
