Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework cancellation counting #1181

Merged
merged 2 commits into from
Jul 31, 2023

Conversation

Quinn-With-Two-Ns
Copy link
Contributor

Summary

The Go SDK notoriously tries to count commands and this has lead to all sorts of issues in the Go SDK with the counter getting out of sync. At some point we introduced workflowExecutionIsCancelling to try to correct some counting issues when the workflow execution was cancelled. The reason workflow cancellation was messing with the counting is because when the workflow was cancelled and the cancel event was processed by the WFT state machine all the state-machines that were waiting on that context were generating their respective cancellation command. All these cancel commands were being generated before the Go SDK had processed the WFT started event for this WFT. This is problematic because the WFT started event sets the counter to workflowTaskStartedEventID +2 so all those command went uncounted. The solution was to count these cancelled command separately and add them to the total.

The Go SDK approach to cancellation also causes issues when the workflow is cancelled in the same WFT as other events that may resolve other state machines like TimerCompleted since the cancellation happens before those event are processed and state machines are resolved.

Even with the 2+ years of work on this there are still issues with how this correction is implemented #1176.

I explored changing how cancellation is processed so it is delayed until the WFT state machine processes the WFT started event. In the Go SDK the only safe place to generate commands is after a WFT started event is processed. Unfortunately that is a breaking change because if the workflow is cancelled in the same WFT as a event that resolves a state machine is sent the order they are processed matter.

Since we can't change how cancellation is processed I think the next best solution is base the correction on the unsent commands generated before the WFT started event was processed, rather then keep a separate counter.

This should not be a breaking change to any workflow that worked before this change, but I welcome any new tests to prove this

There is a more history here:
#726

There was also a test in this issue (that we couldn't add due to CPU usage?) that I ran as well

@Quinn-With-Two-Ns Quinn-With-Two-Ns requested a review from a team as a code owner July 28, 2023 18:24
Copy link
Member

@Sushisource Sushisource left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me and is definitely cleaner than before

test/workflow_test.go Outdated Show resolved Hide resolved
Copy link
Member

@cretz cretz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If all the tests pass, I support any fixing of counters in any way. Will hold off approval until debug print is removed.

var uncountedCommands int64
for curr := h.orderedCommands.Front(); curr != nil; {
d := curr.Value.(commandStateMachine)
command := d.getCommand()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wish this call were createCommand(). It was unclear to me what was happening here until I understood that's what this is.

@Quinn-With-Two-Ns
Copy link
Contributor Author

@cretz done

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants